File Systems Gone Bad

Copyright(c) Management Analytics, 1995 - All Rights Reserved

Copyright(c), 1990, 1995 Dr. Frederick B. Cohen - All Rights Reserved

Problem:

A little while ago, a cleaning lady sprayed one of my disk drives with cleaning solution and very nearly caused a disk crash. In fact, there were many transient errors and the system reported a disk crash, but through some miracle it did not go down. I immediately made a backup onto tape, and all was well (whew!). The greatest fear of the computer user is not death by fire; it is a disk crash.

Under UNIX, a file system is not stored entirely on the disk at any given moment. To enhance performance, many systems keep portions of the file system in a memory cache, so if the system is simply turned off, the state left on the disk may be inconsistent. Fortunately, UNIX file systems normally contain enough redundant information to recover from many such problems. Recovery is normally performed automatically by the disk checks run at bootup (the fsck program on most UNIX systems), but in some cases systems administrators have to clean up disks during normal operation.
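The memory cache described above is visible to programs as well. The C sketch below, assuming a POSIX system, writes a file and then calls fsync(2), which asks the kernel to transfer the file's modified cached data to the disk before returning; until that call succeeds, a power failure can still lose the data. The helper name write_durably is hypothetical, chosen only for illustration.

```c
#define _XOPEN_SOURCE 700   /* expose fsync() under strict compiler modes */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes from data to path, then ask the
 * kernel to flush the file's cached blocks to the disk with fsync(2).
 * Returns 0 on success and -1 on any failure. */
int write_durably(const char *path, const char *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    size_t done = 0;
    while (done < len) {
        ssize_t n = write(fd, data + done, len - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal: retry */
            close(fd);
            return -1;
        }
        done += (size_t)n;         /* write() may store fewer bytes than asked */
    }

    /* Before this succeeds, the data may exist only in the memory cache. */
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

On Linux, the fsync(2) manual page notes that this does not necessarily make the file's directory entry durable, so a newly created file may also need an fsync() on the directory that contains it.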

Prevention:

File-system failures cannot be completely prevented, but several important techniques reduce how often they occur. The most common cause of file-system failure is a power failure. Because most UNIX systems cache file-system changes in memory to enhance performance, a power failure at the wrong time can leave the disk in an inconsistent state, with catastrophic results. The best defense is an uninterruptible power supply (UPS). In my facility, we experience power failures or serious fluctuations more than 20 times per year. Without the UPS, we would have massive problems, but with it, we haven't lost file information in over 15 years of timesharing under UNIX.

Detection:

File-system crashes are easy to detect, because UNIX systems automatically self-test their file systems at bootup. Crashes also tend to produce obvious and dramatic results.
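On modern Linux systems, the bootup check is performed by fsck, whose fsck(8) manual page defines the exit status as a sum of bit flags (1 = errors corrected, 2 = system should be rebooted, 4 = errors left uncorrected, 8 = operational error, 16 = usage error, 32 = canceled, 128 = shared-library error). The C sketch below shows how a boot script might turn that status into a decision. The after_fsck function, the boot_action values, and the policy itself are hypothetical, and older or non-Linux UNIX systems may use different codes.

```c
/* Exit-status bits as documented in the Linux (util-linux) fsck(8) page.
 * Other UNIX versions may use different values. */
enum {
    FSCK_OK          = 0,
    FSCK_CORRECTED   = 1,
    FSCK_REBOOT      = 2,
    FSCK_UNCORRECTED = 4,
    FSCK_OPERATIONAL = 8,
    FSCK_USAGE       = 16,
    FSCK_CANCELED    = 32,
    FSCK_LIBRARY     = 128
};

/* Hypothetical actions a boot script could take after the check. */
typedef enum { BOOT_CONTINUE, BOOT_REBOOT, BOOT_ADMIN_SHELL } boot_action;

/* Hypothetical policy: any unresolved problem stops the boot for a human,
 * a reboot request is honored next, and otherwise booting continues. */
boot_action after_fsck(int status)
{
    const int needs_admin = FSCK_UNCORRECTED | FSCK_OPERATIONAL |
                            FSCK_USAGE | FSCK_CANCELED | FSCK_LIBRARY;
    if (status & needs_admin)
        return BOOT_ADMIN_SHELL;   /* damage may remain: manual cleanup */
    if (status & FSCK_REBOOT)
        return BOOT_REBOOT;        /* fsck asked for the system to restart */
    return BOOT_CONTINUE;          /* clean, or all errors were corrected */
}
```

Because the flags are summed, a status of 3 means errors were corrected and a reboot was requested, which this sketch maps to BOOT_REBOOT.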

Cure:

The only real cure for otherwise irreparable file-system failures is restoration from backups.