
junk filled files

At EuroBSDCon, Taylor Campbell (of NetBSD) presented on Tricky issues in file systems. Many of the issues discussed were relevant to any filesystem; some were particular to FFS. One of the topics covered was the appearance of junk filled files after a system crash. Ideally, we’d like our filesystem to remain consistent and prevent this from happening.

To recap, in order for a filesystem to remain consistent, or at least be recoverable to a consistent state, multipart operations must be performed in a strict order that allows partial rollbacks after a crash. Performing those rollbacks is what fsck does. Between each step, we must wait for the previous step to complete (be written to disk). Synchronous operations can become a bottleneck, and so there’s a lot of research into either avoiding them or speeding them up. Performing the operations asynchronously would certainly be faster, but more dangerous.

One such operation is writing data to a file. First, we must allocate some data blocks. Second, we must write the data to disk. Third, we must update the inode to point to the newly written data. If we allow the first and third steps to complete, but don’t wait for the second, and then the system crashes, one will observe junk in the file. fsck cannot detect this, because the space looks to be properly allocated from the perspective of both the free block bitmap and the inode block pointers. Depending on the age of the filesystem, you will either see zeroes on a fresh disk or the old contents of previously deleted files as blocks are reused. Obviously, this is undesirable.
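The failure is easy to reproduce in miniature. What follows is a toy model, not filesystem code: struct toy_disk and every toy_ function are names I made up for illustration, and the "disk" is just a struct holding a bitmap, a few data blocks, and one inode pointer. It performs steps one and three, optionally skips step two, and then treats whatever is left in the struct as what the next boot sees.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define BLOCK_SIZE 16
#define NBLOCKS 4

/* Hypothetical toy disk: only what is stored here survives the "crash". */
struct toy_disk {
    bool bitmap[NBLOCKS];            /* free block bitmap, true = in use */
    char data[NBLOCKS][BLOCK_SIZE];  /* data blocks */
    int  inode_block;                /* the file's only block pointer, -1 = none */
};

/* Start with a disk where block blk still holds a deleted file's contents. */
void toy_disk_init(struct toy_disk *d, int blk, const char *stale)
{
    memset(d, 0, sizeof(*d));
    d->inode_block = -1;
    strncpy(d->data[blk], stale, BLOCK_SIZE - 1);
}

/*
 * Write contents into block blk of a new file, then crash.
 * data_reached_disk == false models step 2 still sitting in the
 * buffer cache when the machine goes down.
 */
void toy_write_then_crash(struct toy_disk *d, int blk,
                          const char *contents, bool data_reached_disk)
{
    assert(blk >= 0 && blk < NBLOCKS);
    d->bitmap[blk] = true;                    /* step 1: allocate the block */
    if (data_reached_disk) {                  /* step 2: write the data */
        memset(d->data[blk], 0, BLOCK_SIZE);
        strncpy(d->data[blk], contents, BLOCK_SIZE - 1);
    }
    d->inode_block = blk;                     /* step 3: point the inode at it */
}

/* All this toy fsck can check: the inode's block is marked allocated. */
bool toy_fsck_ok(const struct toy_disk *d)
{
    return d->inode_block < 0 || d->bitmap[d->inode_block];
}

/* Read the file back after reboot. */
const char *toy_read_file(const struct toy_disk *d)
{
    return d->inode_block < 0 ? "" : d->data[d->inode_block];
}
```

Initialize block 2 with some stale text, call toy_write_then_crash with data_reached_disk set to false, and toy_fsck_ok still returns true while toy_read_file hands back the stale text. The metadata is perfectly consistent; the contents are somebody else's.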

In theory, the steps are supposed to be strictly ordered. In practice, this particular operation, writing a file, is one that people really like to go fast, and the synchronous requirement is relaxed. Steps one and three are strictly ordered, since this is necessary to keep the filesystem metadata structures sane, but the data is allowed to be corrupted.

The likelihood of this event happening depends on a few factors, all of which affect the amount of time during which step 3 may have completed but step 2 not. Ironically, journalling may make the problem worse, since step 3 will hit the disk (in the log) at the same time as step 1, but data is not typically journalled. Large disks with large buffer caches and simple elevator algorithms are bad news. If the inode to be written is physically far away from the data block, and there’s a lot of other filesystem activity, the data block can be effectively starved as more and more requests are inserted into the queue between them. (OpenBSD switched to NSCAN a few releases back to address this problem.)
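To see why the queueing policy matters, here is a rough simulation of the starvation. It is a toy model in the spirit of a simple elevator versus N-step SCAN, not OpenBSD's disk sorting code; requests_before_far, struct queue, and the arrival pattern (one new request landing just ahead of the head on every step) are all assumptions I picked to make the effect obvious.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAXQ 64

struct queue { long blk[MAXQ]; size_t n; };

static void q_push(struct queue *q, long b)
{
    assert(q->n < MAXQ);
    q->blk[q->n++] = b;
}

/* Pop the next request for an upward sweep from head: the lowest block
 * at or past head, or the lowest block overall if nothing is ahead. */
static long q_pop_sweep(struct queue *q, long head)
{
    assert(q->n > 0);
    size_t best = 0;
    int best_ahead = q->blk[0] >= head;
    for (size_t i = 1; i < q->n; i++) {
        int ahead = q->blk[i] >= head;
        if ((ahead && !best_ahead) ||
            (ahead == best_ahead && q->blk[i] < q->blk[best])) {
            best = i;
            best_ahead = ahead;
        }
    }
    long b = q->blk[best];
    q->blk[best] = q->blk[--q->n];
    return b;
}

/*
 * Count how many other requests are serviced before the one at block far.
 * batch_size == 0: simple elevator, new requests join the active sweep.
 * batch_size  > 0: batches are frozen, new requests wait for the next one.
 */
long requests_before_far(long far, size_t batch_size, long max_steps)
{
    struct queue active = {0}, waiting = {0};
    struct queue *arrivals = batch_size ? &waiting : &active;
    long head = 0;

    q_push(arrivals, far);
    for (long served = 0; served < max_steps; served++) {
        q_push(arrivals, head + 1);       /* competing activity near the head */
        if (batch_size && active.n == 0) {
            /* Freeze the next batch: oldest requests first. */
            size_t k = waiting.n < batch_size ? waiting.n : batch_size;
            memcpy(active.blk, waiting.blk, k * sizeof(long));
            active.n = k;
            memmove(waiting.blk, waiting.blk + k,
                    (waiting.n - k) * sizeof(long));
            waiting.n -= k;
        }
        head = q_pop_sweep(&active, head);
        if (head == far)
            return served;
    }
    return max_steps;
}
```

With the far request at block 1000, the simple elevator services every block on the way there first, while a batch size of 4 gets to it almost immediately. Real workloads are messier than one arrival per step, but the shape of the problem is the same: in one shared sorted queue, a distant request can always be cut in front of.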

After Taylor’s talk, there was a question about whether softdep made this problem better or worse. A combination of discussion, disagreement, and doubt followed. (softdep wasn’t mentioned during the talk, probably because NetBSD has removed it.) I had the impression that softdep fixed the problem, but it had been a few years since I’d thought about the issue and I couldn’t remember exactly how I came to have this impression, so perhaps one might call it an assumption.

When in doubt, do some research. The Soft Updates paper by McKusick and Ganger from the June 1999 USENIX conference (my, what a good BSD conference that was) explains both the problem and the solution.

When a new block is allocated, its bitmap location is
updated to reflect that it is in use and the block's
contents are initialized with newly written data or zeros.
In addition, a pointer to the new block is added to an
inode or indirect block (see below). To ensure that the
on-disk bitmap always reflects allocated resources, the
bitmap must be written to disk before the pointer.
Also, because the contents of the newly allocated disk
location are unknown, rule #1 specifies an update
dependency between the new block and the pointer to
it. Because enforcing this update dependency with
synchronous writes can reduce data creation throughput
by a factor of two [Ganger & Patt, 1994], many
implementations ignore it for regular data blocks. This
implementation decision reduces integrity and security,
since newly allocated blocks generally contain
previously deleted file data. Soft updates allows all
block allocations to be protected in this way with
near-zero performance reduction.

So, yes. softdep should make things better. The relevant structure in the code is called struct allocdirect, and there’s a comment in softdep.h which, along with the paper, explains how it works. softdep adds a dependency between the inode and the data blocks, ensuring that the inode will not be written out until after the data blocks.
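As a rough illustration of the idea only, here is one way such a dependency could be honored. This is emphatically not the real struct allocdirect: struct toy_dep, its fields, and both functions are invented for this sketch, and the choice to fall back to the old pointer value while the data is still in flight is my assumption about how one might do it, not a description of softdep.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical dependency record for one newly allocated block. */
struct toy_dep {
    int  new_blk;       /* block the in-memory inode now points to */
    int  old_blk;       /* what the on-disk inode pointed to before, -1 = nothing */
    bool data_on_disk;  /* has the new block's data been written yet? */
};

/* Called when the write of the data block completes. */
void toy_data_written(struct toy_dep *dep)
{
    assert(dep);
    dep->data_on_disk = true;
}

/*
 * The block pointer that is safe to put in the on-disk inode right now.
 * Until the data has landed, the new pointer must not reach the disk,
 * so the old value is used instead.
 */
int toy_pointer_to_write(const struct toy_dep *dep)
{
    assert(dep);
    return dep->data_on_disk ? dep->new_blk : dep->old_blk;
}
```

The point is just that no synchronous write is needed anywhere: the inode write can be attempted whenever convenient, and the dependency decides what it is allowed to say.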

Posted 06 Oct 2015 02:27 by tedu Updated: 06 Oct 2015 02:27
Tagged: openbsd software