Recovering Failed Filesystems

How do you know a filesystem is failing? Typically the surest clue is very slow writes, although you may also see errors at the command line and failures to mount. If your partition table is good but you can’t mount a particular partition, you may have a minor write error, or you may have a physical disk flaw. If the latter, it’s all over for that partition; but if the former, it’s time for a disk check.

Back up the partition first. You’re likely to need to mount a spare hard drive or an external USB hard drive. We’ll assume you’ve mounted it to /mnt/rescue. Command:

dd if=/dev/hda1 of=/mnt/rescue/hda1.img

This assumes the first partition is the bad one; adjust as necessary. If you have to restore, you can command:

dd if=/mnt/rescue/hda1.img of=/dev/hda1
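A failing disk can leave you with a short or damaged image, so it may be worth verifying the copy before you rely on it. The following is a minimal sketch, not part of the procedure above: backup_image is a hypothetical helper name, and it assumes the source can be read end to end without I/O errors.

```shell
# Hypothetical helper: copy a partition (or any file) to an image file,
# then confirm the image is byte-for-byte identical to the source.
backup_image() {
    src=$1
    img=$2
    # bs=65536 only speeds up the copy; it does not change the result.
    dd if="$src" of="$img" bs=65536 2>/dev/null || return 1
    # cmp -s is silent and exits 0 only when the two files match.
    cmp -s "$src" "$img"
}
```

For example: backup_image /dev/hda1 /mnt/rescue/hda1.img && echo "image verified"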

Now run the filesystem check program fsck. Command:

fsck /dev/hda1

You can force a filesystem-type check by specifying it like this:

fsck -t ext3 /dev/hda1 #assuming it’s an ext3 partition

or:

fsck /dev/hda1 -- -f #to force filesystem-specific checks
# -- passes subsequent arguments to the fs-specific check
# -f forces the check even if the filesystem seems clean

Other options:

-c searches for bad blocks on the disk

-p forces automatic repair without asking for confirmation.
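When fsck finishes, its exit status tells you what happened. The status is a sum of flag values listed in the fsck man page. The sketch below decodes it; describe_fsck_status is a hypothetical helper name, and the wording of each message is my own paraphrase of the man page.

```shell
# Hypothetical helper: explain an fsck exit status.
# The status is a sum of bit flags, so several lines may be printed.
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    [ $((status & 1)) -ne 0 ] && echo "errors corrected"
    [ $((status & 2)) -ne 0 ] && echo "system should be rebooted"
    [ $((status & 4)) -ne 0 ] && echo "errors left uncorrected"
    [ $((status & 8)) -ne 0 ] && echo "operational error"
    [ $((status & 16)) -ne 0 ] && echo "usage or syntax error"
    [ $((status & 32)) -ne 0 ] && echo "canceled by user request"
    [ $((status & 128)) -ne 0 ] && echo "shared-library error"
    return 0
}
```

Typical use, right after a check: fsck /dev/hda1; describe_fsck_status $?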

 

If fsck can’t repair the filesystem, it’s time to boot from your distro installation CD. Then, in Red Hat/Fedora, when you get to the Welcome screen, command:

linux rescue

Then it’s time to make a filesystem on your hard disk, and restore a backup of your system’s contents. You did make backups, right?

Recovering A Lost Partition Table

Oops, you messed up with fdisk. Or the actual partition table got corrupted; who cares why.

Once again, you need to boot to a rescue disk. Then command:

fdisk /dev/hda #assuming your hard disk is hda

Now type:

p

to print the partition table. If you get something strange, or complete gibberish, type:

x

to enter expert mode. (Yes, you’re the expert.) Now type:

c

and enter the cylinders value you wrote down from page 1 of this lesson. Then type:

h

and enter the number of heads. Then type:

s

and enter the number of sectors. Now type:

r

to return to the main menu. Type:

p

again and look for gibberish partitions. If you have any, delete them with the d option. Type:

n

and restore your original partitions. Precisely. Then type:

p

to review partitions again. If you still have a mess, it’s likely your disk is shot. Too bad. But if things are good, be sure to compare the current values to your recorded values. If necessary, try to correct them again. Finally, type:

w

to write the changes, then exit fdisk.

Can you mount your partitions now (in your rescue environment)? If not, see the next section. Otherwise, try a reboot now. Good luck; I’ve had this work, and you will too – sometimes.

Recovering From A Lost root Password

Told you to write it down, didn’t I? But if your security nerves cringed at the idea, I don’t blame you. There is a way to recover if you’ve flatly lost it.

You’re going to need one of the emergency systems discussed on the previous page. Have one, and boot to it.

Mount the main system’s filesystem, or at least the root of the filesystem ( / ). Open, in your preferred text editor, the /<your_mount_point>/etc/shadow file (if you’re using shadow passwords) or, in some cases, the /<your_mount_point>/etc/passwd file (if you’re using NIS, for instance). Now you have to find the line holding root’s password:

root:$1$lkjh08jern0<long hash value>…

Note that this line is colon-delimited fields; you want the second field. It usually begins with $1$.

Delete this field. Be careful; leave the colons on either side. Now root can log in with no password. Yes, this is scary.
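If you would rather not hand-edit the file, the same change can be sketched with awk. This is an illustration only: blank_root_password is a hypothetical name, it prints the edited text instead of changing the file, and you should keep a copy of the original shadow file before replacing anything.

```shell
# Hypothetical helper: print a shadow-format file with root's password
# field (the second colon-delimited field) emptied.
# The input file itself is not modified.
blank_root_password() {
    awk -F: 'BEGIN { OFS = ":" } $1 == "root" { $2 = "" } { print }' "$1"
}
```

You might run blank_root_password /<your_mount_point>/etc/shadow > /tmp/shadow.new, inspect the result, and only then copy it into place.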

Disconnect the workstation from the network! Reboot to the workstation’s main file system (not your rescue CD or partition or what have you). Log in as root with a blank password. Change this password immediately.

Your Emergency Toolkit

Carry A Live-CD, DVD, Zip or Floppy Distro (or all of them)

Count on it. You’re going to encounter damaged systems with any OS. You should have a handful of tools ready for these moments.

Use Ghost Or Partimage To Make An Image Of The Workstation

This is the greatest method in the world. You need an external USB hard drive or CD drive to create the image, many of which will come with a copy of Ghost. If you don’t have Ghost, go get Partimage and make an image of the entire workstation disk. If you’re in a true workstation environment, you’re likely using NIS, which allows you to keep your users’ personal files in a /home directory on your NIS/NFS server. Make users keep their own stuff in their own home (this is a very, very good policy for Windows workstations too).

Then, if a machine craters, restore the image and they’re right back where they left off. If the hard disk or some other hardware is destroyed, replace it and restore the image. If the whole workstation goes up in smoke, generally you can still get away with using the image on different hardware, though you’ll have to deal with kudzu’s demands for modifying hardware (no sweat) or reconfiguring Xwindows (tricky, but not impossible at all for a sysadmin like you).

Add a Minimal Side-by-Side Installation

Either before or after the fact, you can add a minimal side-by-side installation (not even the same distro) on your workstations. Most distros will allow you simply to add a separate installation to GRUB or LILO when you set them up. Often a simple command-line installation on a different partition will do the trick.

*Plan for this in advance by setting up an extra partition when you set up the main installation.*

Keep the Original Distro Installation Tools At Hand

Needless to say, the original installation media is worth far more than its weight in gold when there’s a problem.

Most distros also include a Recovery Disk CD; Fedora, Red Hat, SuSE and many others include these with the downloadable distros. Treat these CDs like they’re diamonds! Practice with one on a non-production machine.

Live-CD Distros

I personally prefer Knoppix as a Live-CD distro (http://www.knoppix.org). It doesn’t change your local filesystem in any way when you start your system from the CD, and it does provide a very comprehensive set of standard Unix tools.

System Rescue CD (http://sysresccd.org) is a terrific and flexible tool. Visit the home page and the tools page for excellent discussion of its capabilities. It’s built with the livecd-ng script, written by Daniel Robbins, which allows you to create your own version of the CD. There’s an option to build a great DVD version, complete with backup files and/or an image of the original operating system on the workstation, including the Partimage tool, which is a Ghost-like program for Linux. Amazingly, this CD is available in both English and French, as well as a version for blind users!

The FIRE distro CD (http://biatchux.dmzs.com) is a not-for-dummies system that’s highly useful, though a little behind the curve in development.

The openSuSE Demo DVD (http://www.novell.com/linux/suse) is great for systems with a DVD drive, and particularly valuable if you’re using a SUSE Enterprise or openSUSE distro on your workstation.

Sub-CD Distros

Zip-disk based:

If you’re using a Zip drive you can carry a lot more tools than on a floppy, and still access systems that don’t have CD drives. ZipSlack, a variant of Slackware, is the tool of choice here, but you must be comfortable in a console-only (i.e. command-line) environment. You can do some fancy stuff with LoadLin startup to get to the Zip disk, or use a floppy (an image is supplied) to boot. See the booting page for information.

Floppy-based:

Now you’re at the extreme. If you’ve got a laptop running Linux with no floppy drive (which is becoming more and more common), this is a great way to get it up and running after a system failure, then mount the (hopefully undamaged) hard disk filesystem.

muLinux is an older distro, using a 2.0 kernel, that lets you load add-on floppies after booting.

Tom’s Root/Boot Disk is a flexible system that also lets you build a custom CD if you want a larger, customized system. Check out his wiki for great discussion. Particularly see his list of “Bootable CD Things” for a great list of other CD distros.

Software Problems

Check your logs first

The first place to look for problems is in your logs. Go to /var/log and see if there is a directory named for the application that’s giving you problems. For instance, Apache has a log at /var/log/httpd.

 

Finding and fixing problems using package managers

The Red Hat Package Manager (RPM) handles software installation, verification and removal under both Red Hat family and SUSE family distributions.

One of the first things to do if you suspect software failure, corruption or hacking is to verify the suspect package’s installation and files. First you will likely have to find the correct name for the package. This is probably most easily done using the RPM Search at http://rpm.pbone.net/. Or you can try trial and error:

rpm -q openoffice

rpm -q OpenOffice

rpm -q openoffice_org

rpm -q OpenOffice_org #bingo!

rpm -V OpenOffice_org #the actual verification
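Instead of guessing at capitalization, you can list every installed package and filter the list. A small sketch, assuming an RPM-based system; match_packages is a hypothetical helper that simply wraps a case-insensitive grep.

```shell
# Hypothetical helper: read package names on standard input (one per
# line, as printed by "rpm -qa") and keep those matching a pattern,
# ignoring case.
match_packages() {
    grep -i -- "$1"
}

# On an RPM-based system you would run:
#   rpm -qa | match_packages openoffice
```

Whatever name comes back is the one to hand to rpm -V.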

Under Debian-family distros, dpkg and apt (or Synaptic, the graphical front end to apt) handle similar functionality.

 

Resolving problems with shared libraries

This one is good if you know where your problem lies: Apache won’t run, for instance.

First, you’ll need (again) to know the name of the package: in this case, Apache is installed as the httpd package.

rpm -q httpd #is it installed?

whereis httpd #where is the binary?

ldd /usr/sbin/httpd #shows all libraries used by httpd

whereis libz #one at a time,
#confirm that these libraries are indeed present
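Checking libraries one at a time gets tedious, and ldd itself marks any library it cannot resolve with the words “not found”. A minimal sketch that filters for those lines; missing_libs is a hypothetical name, and the usual “name => not found” layout of ldd output is assumed.

```shell
# Hypothetical helper: read ldd output on standard input and print the
# name of each library that ldd reports as "not found".
missing_libs() {
    awk '/not found/ { print $1 }'
}

# Typical use:
#   ldd /usr/sbin/httpd | missing_libs
```

No output means every library was resolved.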

If one is indeed missing, it’s time to find it. Once again, go to RPM Search at http://rpm.pbone.net/ and search for packages that contain the library for your distribution, your distro release version, and your processor architecture. It’s easier than it sounds:

In this case, for instance, searching on the term “libz” resulted in lots of incorrect results; I had to search on “libz.” to get the correct library.

Once you’ve got the package, install with RPM as usual:

rpm -Uvh libz.rpm

After installing, make sure the cache of shared libraries (/etc/ld.so.cache) is rebuilt from the list of shared library directories (/etc/ld.so.conf) using this command:

ldconfig

Note that when no installer is provided, for instance when you download a simple .tar.gz package, you must copy libraries to their default locations manually, add any new library directory to /etc/ld.so.conf yourself, and run the ldconfig command to rebuild the cache.

 

Running out of filehandles

This is a classic problem for Apache when you are running lots of virtual hosts. See the Apache.org page on this, http://httpd.apache.org/docs/2.0/vhosts/fd-limits.html.

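By default, a process can typically open only 1024 files at a time. On web servers in particular, use the ulimit command to raise this upper limit, i.e. the maximum number of open files. The new limit applies to the current shell and to the programs started from it: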

ulimit -n 5000 #-n sets filehandle limit

You can experience a similar problem with the upper limit of child processes. Use:

ulimit -u 8000 #-u sets user processes limit
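Before raising a limit, it helps to see what is currently in force. Each limit has a soft value (what applies now) and a hard value (the ceiling a non-root user cannot exceed). A small sketch, assuming a shell such as bash or dash whose ulimit builtin accepts -n together with -S and -H; show_fd_limits is a hypothetical name.

```shell
# Hypothetical helper: print the soft and hard limits on open files
# for the current shell. Each value is a number or "unlimited".
show_fd_limits() {
    echo "soft=$(ulimit -S -n) hard=$(ulimit -H -n)"
}
```

If the value you want is above the hard limit, you will need root to raise it.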

 

Boot problems

LILO:
Edit /etc/lilo.conf to change “compact” to “linear”, then rerun /sbin/lilo so the change takes effect.

GRUB:
Typically GRUB errors result from missing files in /boot. (I TOLD you to back up this directory!)

Remember the boot disk I told you to make?

Hardware Problems

Okay, by this stage in your computer technologies career you’re already familiar with the usual litany:

  • Check out any POST errors you get,
  • Make sure SCSI drives are terminated,
  • Make sure it’s plugged in,
  • Make sure it’s turned on,
  • etc. etc. etc. ….

Things get trickier fast once you reach the stage of checking IRQ and I/O addresses. At this point you must remember one thing:

Log files are the ultimate resource for hardware diagnosis.

 

Take a good detailed look at what happened during the last boot. Command:

dmesg > lastboot.txt

then:

less lastboot.txt

This gives you a good view into the boot process, in agonizing detail.

You may also have information (or you may not) in /var/log/boot.log. Check it and /var/log/messages, the main system log file.

less /var/log/boot.log

less /var/log/messages

 

Modems and serial ports look like the same thing to your Linux system. Thus they can sometimes fight for the same IRQs and I/O addresses. Use the setserial command to set IRQ, I/O, and port speed:

setserial /dev/ttyS0 irq 11 port 0x03f8

and check the results:

setserial /dev/ttyS0

The relevant keywords are:

port n        Sets the I/O address to n
irq n         Sets the IRQ to n
auto_irq      Tries to detect the IRQ setting
spd_hi        Uses 57.6 kb/s when an application requests 38.4 kb/s
spd_vhi       Uses 115.2 kb/s when an application requests 38.4 kb/s
spd_normal    Uses 38.4 kb/s when an application requests 38.4 kb/s

Document Your System (Or Die!)

1. Write Down the root Password.

Yes, this is a bad security practice. Yes, do it anyway. Keep a highly-protected notebook, reverse or scramble the password, use simple encryption (letter-shift and number-shift three letters or numbers to the left or right, for instance) – choose a method of keeping this information secure, but keep this information available. You will forget this someday.

2. Write Down Your Partition Layout.

Command:

fdisk -l /dev/hda #assuming your hard disk is hda

and write this information down on paper. You want all of it:

  • the device label,
  • whether it’s the boot partition,
  • the starting and ending sectors,
  • the total size,
  • the filesystem type ID,
  • and the system type of each partition.

3. Copy the Partition Table.

First, write down the filesystem table, which records where each partition is mounted. Command:

cat /etc/fstab

Capture every bit of this information. Yes, every single letter.

Next, take an electronic copy of the partition table. Let’s assume you’re using a USB drive that shows up as /dev/sda and holds a FAT32 partition. Mount it using something like this:

mount -t vfat /dev/sda1 /mnt/custom

Then capture the partition table of your hard disk:

sfdisk -d /dev/hda > /mnt/custom/ptable.bak #assuming your hard disk is hda

Later you can restore it with:

sfdisk /dev/hda < /mnt/custom/ptable.bak
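On disks with a classic DOS partition table, the primary partition entries live in the first 512-byte sector (the master boot record), so a raw copy of that sector makes a useful second backup alongside the sfdisk dump. A minimal sketch; save_mbr is a hypothetical helper, and an MBR-style disk is assumed.

```shell
# Hypothetical helper: copy the first 512-byte sector of a disk
# (the master boot record) to a backup file.
save_mbr() {
    # $1 = disk device (or image file), $2 = backup file
    dd if="$1" of="$2" bs=512 count=1 2>/dev/null
}
```

For example: save_mbr /dev/hda /mnt/custom/mbr.bak. Keep in mind that this sector also holds boot loader code, and that logical partitions inside an extended partition are recorded elsewhere on the disk, which is why the sfdisk dump remains the primary backup.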

 

4. Write Down the Distro and Version.

Sure, you know this now. Will you later? Be doubtful.

5. Print out any instructions that come with your distro or its recovery disk.

6. Back Up Critical Directories

Obviously you want to back up /home, but you should also back up /boot, especially if you use GRUB.

7. Make a boot (rescue) disk

Use the mkbootdisk command with the kernel name as the argument. Obviously you must have this kernel on your system:

uname -r #tells you which kernel you’re running

mkbootdisk 2.4.21-20 #use the version number from the above

Disk Quotas

Disk quotas are limits on disk usage placed on users or groups.

Soft limits are disk usage limits that can be exceeded for only a set period of time (usually seven days).

Hard limits are just that – users/groups simply cannot exceed these disk usage settings.

Limits on the number of blocks (remember, blocks are typically 1 Kilobyte) and the number of inodes (which means, numbers of files – and don’t forget that directories are files) can be placed on either users or groups.

 

Disk quotas are not enabled by default.

You must enable them by adding usrquota, grpquota, or both to the mount options in the /etc/fstab file:

/dev/hda1 / ext2 defaults,usrquota,grpquota 1 2

Then the system must be rebooted, or the partition can be remounted:

mount / -o remount,rw

 

Create aquota.user and aquota.group files using:

touch /aquota.group /aquota.user

 

Create a disk usage table for every quota-enabled partition (in this case, just hda1):

quotacheck -mavug

-m – do not try to remount the filesystem read-only; check it even while it is in use
-a – for all filesystems with usrquota or grpquota specified (so no filesystem argument is given)
-v – verbose
-u – check user quotas
-g – check group quotas

 

Turn on quotas

This is really simple. Use quotaon with the filesystem as its argument:

quotaon /

You can also turn quotas off:

quotaoff /

 

Set up user quotas

Now you’ll be glad you’re familiar with vi. Command:

edquota -u username

Voila – you’re in vi, where you can change a text file to set soft and hard limits for blocks and inodes:

Disk quotas for user glenn (uid 500):
Filesystem   blocks   soft   hard   inodes   soft   hard
/            1000     0      0      400      0      0

Soft or hard limits of 0 mean no limits.

“Filesystem” means just that: the mounted filesystem by name, for instance / .
“Blocks” means blocks currently used by this user.
The first “soft” is the column where soft block limits can be set.
The first “hard” is the hard block limit.
“Inodes” means inodes currently in use by this user.
The second “soft” is the soft inode limit for this user.
The second “hard” is the hard inode limit.
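The rule that a limit of 0 means “no limit” is easy to get wrong when you script around quota figures. A small sketch of the comparison, with exceeds_limit as a hypothetical helper name; the numbers are whatever block or inode counts you have read from a quota report.

```shell
# Hypothetical helper: succeed (exit 0) when usage is over a limit.
# A limit of 0 means "no limit", so it can never be exceeded.
exceeds_limit() {
    used=$1
    limit=$2
    [ "$limit" -ne 0 ] && [ "$used" -gt "$limit" ]
}
```

With the sample table above, exceeds_limit 1000 0 fails (no limit is set), while exceeds_limit 1000 500 would succeed.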

Use:

edquota -g groupname

to modify settings for groups.

 

Edit the default time limit for users exceeding their soft quotas

Use the command:

edquota -u -t

Grace period before enforcing soft limits for users:
Time units may be: days, hours, minutes or seconds
Filesystem   Block grace period   Inode grace period
/            7days                7days

 

Create a quota startup script

For example:

#!/bin/sh
# Check quota and then turn quota on.
if [ -x /sbin/quotacheck ]
then
    echo "Checking quotas. This may take some time."
    /sbin/quotacheck -avug
    echo " Done."
fi
if [ -x /sbin/quotaon ]
then
    echo "Turning on quota."
    /sbin/quotaon -avug
fi

This script will be /etc/init.d/quota.

 

Turn on the quota script upon boot

Issue the commands:

chmod 755 /etc/init.d/quota

chkconfig quota on

 

Staying On Top of Disk Quotas

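Periodically use the

quotacheck

command to keep the recorded disk usage of users and groups current.

This command rescans the filesystem and updates the quota files; the limits themselves are enforced by the kernel once quotas are turned on.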

 

Add this line to root’s crontab file to run quotacheck weekly, at 3:00 a.m. every Sunday:

0 3 * * 0 /sbin/quotacheck -avug

 

Use the

repquota -a

command to read quota data for all users and groups.

 

Checking Disk Usage

Use du to check on specific users’ usage.

Issue:

du -h -c -s /home/user1 /home/user2 /home/user3

This is a handy command to run from cron and pipe to mail to the system admin.
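A sketch of such a report, sorted so the heaviest users come first. The function name usage_report is hypothetical, and the mail line in the comment assumes a working local mail command.

```shell
# Hypothetical helper: print disk usage in kilobytes for each directory
# given, largest first.
usage_report() {
    du -s -k "$@" | sort -rn
}

# From a script run by cron, something like:
#   usage_report /home/* | mail -s "Disk usage report" root
```

The -k option is used instead of -h here because plain kilobyte counts sort correctly with sort -rn.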

Deleting Accounts

To delete a user:

userdel <user name>

To delete a user along with their home directory:

userdel -r <user name>

 

Now that you know this, as a general practice, don’t delete users. (Why is this a bad idea?)

 

To find all files owned by a user (outside their home directory):

find / -user <user name> #use this formulation on a command line

find / -user <user name> -exec rm -i {} \; #use this formulation
#in a script

find / -user <user name> -exec chown <new user> {} \;

The string {} is replaced by the file name as find finds it.

In the last two examples, the ; character marks the end of the command that -exec runs once for each file found. The backslash \ is the escape character: it stops the shell from interpreting the semicolon itself, so that find receives it. The semicolon could have been protected with single quotes just as effectively.
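To try the -exec form safely before pointing it at /, run it on a scratch directory with a harmless command such as ls. A minimal sketch; files_owned_by is a hypothetical name, and -xdev (an addition not used in the commands above) keeps find from crossing into other mounted filesystems.

```shell
# Hypothetical helper: list regular files under a directory that are
# owned by a given user, running ls once per file via -exec.
files_owned_by() {
    # $1 = directory to search, $2 = user name or numeric user ID
    find "$1" -xdev -user "$2" -type f -exec ls -d {} \;
}
```

Once the list looks right, swap ls -d for the rm -i or chown command you actually intend to run.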

Modifying & Locking Accounts

Modifying Accounts

usermod works like useradd, and shares many of the options: you can change their group, home directory or even their user name:

usermod -l new_name old_name

allows you to change a user’s login name.

-e changes the expiration date

-g changes their primary group

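-G sets their secondary (supplementary) groups; on current versions, combine it with -a to add a group without dropping the existing ones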

-d changes their home directory

 

Red Hat User Manager also lets you change any feature of a user account in the GUI.

 

Note all the password operations:

expiration,

change allowed,

change required,

warning before expiration,

days inactive before deactivation.

Also note the “locked” option.

 

Changing Expiration Dates

There is one area usermod can’t configure: password expiration. For this you’ll use the chage command instead. This one is best seen through example:

chage -m 3 -M 30 -W 5 <username>

This results in:

a minimum ( -m ) interval of 3 days before a newly changed password can be changed again,

a maximum ( -M ) interval of 30 days between password changes, and

a warning ( -W ) 5 days before password expiration.
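chage stores these values in /etc/shadow: after the user name and the password hash comes the date of the last change, followed by the minimum, maximum and warning values. A small sketch that reads them back from a shadow-format file; password_aging is a hypothetical name, and on a live system chage -l <username> reports the same information.

```shell
# Hypothetical helper: print the minimum, maximum and warning values
# (fields 4, 5 and 6) for one user from a shadow-format file.
password_aging() {
    # $1 = user name, $2 = shadow-format file
    awk -F: -v user="$1" '$1 == user { print "min=" $4 " max=" $5 " warn=" $6 }' "$2"
}
```

Run as root, password_aging <username> /etc/shadow should echo back the numbers you passed to chage.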

 

Locking an Account

Locking is inherently a temporary thing, so nothing is deleted when an account is locked. The commands:

usermod -L <username>
or
passwd -l <username>

lock the account by placing a ! character in front of the encrypted password in /etc/shadow. The commands:

usermod -U <username>
or
passwd -u <username>

then unlock the account by removing the bang ( ! ) character.
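Because the lock is just a ! at the start of the password field, you can check for it without changing anything. A minimal sketch, assuming the shadow file layout described above; is_locked is a hypothetical helper name.

```shell
# Hypothetical helper: succeed (exit 0) when the named user's password
# field in a shadow-format file begins with "!".
is_locked() {
    # $1 = user name, $2 = shadow-format file
    awk -F: -v user="$1" '$1 == user && $2 ~ /^!/ { found = 1 } END { exit !found }' "$2"
}
```

For example, as root: is_locked <username> /etc/shadow && echo "account is locked"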

 

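One other method of locking an account that you might run into is changing the user’s default shell to a program that is not a usable login shell (shown here as the placeholder /bin/fake; /sbin/nologin and /bin/false are common choices), preventing them from accessing the system: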

chsh -s /bin/fake <username>