Btrfs design

Btrfs is implemented with simple and well known constructs. It should perform well, but the long term goal of maintaining performance as the FS system ages and grows is more important than winning a short lived benchmark. To that end, benchmarks are being used to try and simulate performance over the life of a filesystem.

Btree Data structures

The Btrfs btree provides a generic facility to store a variety of data types. Internally it only knows about three data structures: keys, items, and a block header:

struct btrfs_header {
    u8 csum[32];
    u8 fsid[16];
    __le64 blocknr;
    __le64 flags;

    u8 chunk_tree_uid[16];
    __le64 generation;
    __le64 owner;
    __le32 nritems;
    u8 level;
}

struct btrfs_disk_key {
    __le64 objectid;
    u8 type;
    __le64 offset;
}

struct btrfs_item {
    struct btrfs_disk_key key;
    __le32 offset;
    __le32 size;
}

Upper nodes of the trees contain only [ key, block pointer ] pairs. Tree leaves are broken up into two sections that grow toward each other. Leaves have an array of fixed sized items, and an area where item data is stored. The offset and size fields in the item indicate where in the leaf the item data can be found. Example:

Item 0 ...

Item N

Free Space

...

Free Space

Data for itemN ...

Data for Item 0

Item data is variably size, and various filesystem data structures are defined as different types of item data. The type field in struct btrfs_disk_key indicates the type of data stored in the item.

The block header contains a checksum for the block contents, the uuid of the filesystem that owns the block, the level of the block in the tree, and the block number where this block is supposed to live. These fields allow the contents of the metadata to be verified when the data is read. Everything that points to a btree block also stores the generation field it expects that block to have. This allows Btrfs to detect phantom or misplaced writes on the media.

The checksum of the lower node is not stored in the node pointer to simplify the FS writeback code. The generation number will be known at the time the block is inserted into the btree, but the checksum is only calculated before writing the block to disk. Using the generation will allow Btrfs to detect phantom writes without having to find and update the upper node each time the lower node checksum is updated.

The generation field corresponds to the transaction id that allocated the block, which enables easy incremental backups and is used by the copy on write transaction subsystem.

Filesystem Data Structures

Each object in the filesystem has an objectid, which is allocated dynamically on creation. A free objectid is simply a hole in the key space of the filesystem btree; objectids that don't already exist in the tree. The objectid makes up the most significant bits of the key, allowing all of the items for a given filesystem object to be logically grouped together in the btree.

The offset field of the key stores indicates the byte offset for a particular item in the object. For file extents, this would be the byte offset of the start of the extent in the file. The type field stores the item type information, and has extra room for expanded use.

Inodes

Inodes are stored in struct btrfs_inode_item at offset zero in the key, and have a type value of one. Inode items are always the lowest valued key for a given object, and they store the traditional stat data for files and directories. The inode structure is relatively small, and will not contain embedded file data or extended attribute data. These things are stored in other item types.

Files

Small files that occupy less than one leaf block may be packed into the btree inside the extent item. In this case the key offset is the byte offset of the data in the file, and the size field of struct btrfs_item indicates how much data is stored. There may be more than one of these per file.

Larger files are stored in extents. struct btrfs_file_extent_item records a generation number for the extent and a [ disk block, disk num blocks ] pair to record the area of disk corresponding to the file. Extents also store the logical offset and the number of blocks used by this extent record into the extent on disk. This allows Btrfs to satisfy a rewrite into the middle of an extent without having to read the old file data first. For example, writing 1MB into the middle of a existing 128MB extent may result in three extent records:

[ old extent: bytes 0-64MB ], [ new extent 1MB ], [ old extent: bytes 65MB – 128MB]

File data checksums are stored in a dedicated btree in a struct btrfs_csum_item. The offset of the key corresponds to the byte number of the extent. The data is checksummed after any compression or encryption is done and it reflects the bytes sent to the disk.

A single item may store a number of checksums. struct btrfs_csum_items are only used for file extents. File data inline in the btree is covered by the checksum at the start of the btree block.

Directories

Directories are indexed in two different ways. For filename lookup, there is an index comprised of keys:

Directory Objectid

BTRFS_DIR_ITEM_KEY

64 bit filename hash

The default directory hash used is crc32c, although other hashes may be added later on. A flags field in the super block will indicate which hash is used for a given FS.

The second directory index is used by readdir to return data in inode number order. This more closely resembles the order of blocks on disk and generally provides better performance for reading data in bulk (backups, copies, etc). Also, it allows fast checking that a given inode is linked into a directory when verifying inode link counts. This index uses an additional set of keys:

Directory Objectid

BTRFS_DIR_INDEX_KEY

Inode Sequence number

The inode sequence number comes from the directory. It is increased each time a new file or directory is added.

Reference Counted Extents

Reference counting is the basis for the snapshotting subsystems. For every extent allocated to a btree or a file, Btrfs records the number of references in a struct btrfs_extent_item. The trees that hold these items also serve as the allocation map for blocks that are in use on the filesystem. Some trees are not reference counted and are only protected by a copy on write logging. However, the same type of extent items are used for all allocated blocks on the disk.

Extent Block Groups

Extent block groups allow allocator optimizations by breaking the disk up into chunks of 256MB or more. For each chunk, they record information about the number of blocks available. Files and directories will have a preferred block group which they try first for allocations.

Block groups have a flag that indicate if they are preferred for data or metadata allocations, and at mkfs time the disk is broken up into alternating metadata (33% of the disk) and data groups (66% of the disk). As the disk fills, a group's preference may change back and forth, but Btrfs always tries to avoid intermixing data and metadata extents in the same group. This substantially improves fsck throughput, and reduces seeks during writeback while the FS is mounted. It does slightly increase the seeks while reading.

Extent Trees and DM integration

The Btrfs extent trees are intended to divide up the available storage into a number of flexible allocation policies. Each extent tree owns a section of the underlying disk, and they can be assigned to a collection of (or a single) tree roots, directories or inodes. Policies will direct how a given allocation is spread across the extent trees available, allowing the admin to direct which parts of the filesystem are striped, mirrored or confined to a given device.

Btrfs will try to tie in with DM in order to easily manage large pools of storage. The basic idea is to have at least one extent tree per spindle, and then allow the admin to assign those extent trees to subvolumes, directories or files.

Explicit Back References

Back references have three main goals:

Differentiate between all holders of references to an extent so that when a reference is dropped we can make sure it was a valid reference before freeing the extent.
Provide enough information to quickly find the holders of an extent if we notice a given block is corrupted or bad.
Make it easy to migrate blocks for FS shrinking or storage pool maintenance. This is actually the same as #2, but with a slightly different use case.

File Extent Backrefs

File extents can be referenced by:

Multiple snapshots, subvolumes, or different generations in one subvol
Different files inside a single subvolume
Different offsets inside a file

[The remainder of this section refers to the extent_ref_v0 structure, which is not used on current btrfs filesystems]

The extent ref structure has fields for:

Objectid of the subvolume root
Generation number of the tree holding the reference
objectid of the file holding the reference
offset in the file corresponding to the key holding the reference

When a file extent is allocated the fields are filled in:

(root objectid, transaction id inode objectid, offset in file)

When a leaf is cow'd new references are added for every file extent found in the leaf. It looks the same as the create case, but the transaction id will be different when the block is cow'd.

(root objectid, transaction id, inode objectid, offset in file)

When a file extent is removed either during snapshot deletion or file truncation, the corresponding back reference is found by searching for:

(btrfs_header_owner(leaf), btrfs_header_generation(leaf), inode objectid, offset in file)

Btree Extent Backrefs

Btree extents can be referenced by:

Different subvolumes
Different generations of the same subvolume

Storing sufficient information for a full reverse mapping of a btree block would require storing the lowest key of the block in the backref, and it would require updating that lowest key either before write out or every time it changed.

Instead, the objectid of the lowest key is stored along with the level of the tree block. This provides a hint about where in the btree the block can be found. Searches through the btree only need to look for a pointer to that block, and they stop one level higher than the level recorded in the backref.

Some btrees do not do reference counting on their extents. These include the extent tree and the tree of tree roots. Backrefs for these trees always have a generation of zero.

When a tree block is created, back references are inserted:

(root objectid, transaction id or zero, level, lowest objectid)

The level is stored in the objectid slot of the backref to differentiate between Btree back references and file data back references. The highest possible level is 255, and the lowest possible file objectid has been raised to 256. So, if the objectid field in the back reference is less than 256, it corresponds to a Btree block.

When a tree block is cow'd in a reference counted root, new back references are added for all the blocks it points to:

(root objectid, transaction id, level, lowest objectid)

Because the lowest_key_objectid and the level are just hints they are not used when backrefs are deleted. When a snapshot is created a new reference is taken directly on the root block. This means the owner field of the root block may be different from the objectid of the snapshot. So, when dropping references on tree roots, the objectid of the root structure is always used. When a backref is deleted:

if backref was for a tree root:
     root_objectid = root->root_key.objectid
else
     root_objectid = btrfs_header_owner(parent)

(root_objectid, btrfs_header_generation(parent) or zero, 0, 0)

Back Reference Key Construction

Back references have four fields, each 64 bits long. This is hashed into a single 64 bit number and placed into the key offset. The key objectid corresponds to the first byte in the extent, and the key type is set to BTRFS_EXTENT_REF_KEY.

Hash overflows on the offset field are handled by adding one to the calculated hash and searching forward. The searching stops when the correct back reference structure is found or

Snapshots and Subvolumes

Subvolumes are basically a named btree that holds files and directories. They have inodes inside the tree of tree roots and can have non-root owners and groups. Subvolumes can be given a quota of blocks, and once this quota is reached no new writes are allowed. All of the blocks and file extents inside of subvolumes are reference counted to allow snapshotting. Up to 2⁶⁴ subvolumes may be created on the FS.

Snapshots are identical to subvolumes, but their root block is initially shared with another subvolume. When the snapshot is taken, the reference count on the root block is increased, and the copy on write transaction system ensures changes made in either the snapshot or the source subvolume are private to that root. Snapshots are writable, and they can be snapshotted again any number of times. If read only snapshots are desired, their block quota is set to one at creation time.

Btree Roots

Each Btrfs filesystem consists of a number of tree roots. A freshly formatted filesystem will have roots for:

The tree of tree roots
The tree of allocated extents
The default subvolume tree

The tree of tree roots records the root block for the extent tree and the root blocks and names for each subvolume and snapshot tree. As transactions commit, the root block pointers are updated in this tree to reference the new roots created by the transaction, and then the new root block of this tree is recorded in the FS super block.

The tree of tree roots acts as a directory of all the other trees on the filesystem, and it has directory items recording the names of all snapshots and subvolumes in the FS. Each snapshot or subvolume has an objectid in the tree of tree roots, and at least one corresponding struct btrfs_root_item. Directory items in the tree map names of snapshots and subvolumes to these root items. Because the root item key is updated with every transaction commit, the directory items reference a generation number of (u64)-1, which tells the lookup code to find the most recent root available.

The extent trees are used to manage allocated space on the devices. The space available can be divided between a number of extent trees to reduce lock contention and give different allocation policies to different block ranges.

The diagram below depicts a collection of tree roots. The super block points to the root tree, and the root tree points to the extent trees and subvolumes. The root tree also has a directory to map subvolume names to struct btrfs_root_items in the root tree. This filesystem has one subvolume named 'default' (created by mkfs), and one snapshot of 'default' named 'snap' (created by the admin some time later). In this example, 'default' has not changed since the snapshot was created and so both point tree to the same root block on disk.

File:Design-roots.png

Copy on Write Logging

Data and metadata in Btrfs are protected with copy on write logging (COW). Once the transaction that allocated the space on disk has committed, any new writes to that logical address in the file or btree will go to a newly allocated block, and block pointers in the btrees and super blocks will be updated to reflect the new location.

Some of the btrfs trees do not use reference counting for their allocated space. This includes the root tree, and the extent trees. As blocks are replaced in these trees, the old block is freed in the extent tree. These blocks are not reused for other purposes until the transaction that freed them commits.

All subvolume (and snapshot) trees are reference counted. When a COW operation is performed on a btree node, the reference count of all the blocks it points to is increased by one. For leaves, the reference counts of any file extents in the leaf are increased by one. When the transaction commits, a new root pointer is inserted in the root tree for each new subvolume root. The key used has the form:

Subvolume inode number

BTRFS_ROOT_ITEM_KEY

Transaction ID

The updated btree blocks are all flushed to disk, and then the super block is updated to point to the new root tree. Once the super block has been properly written to disk, the transaction is considered complete. At this time the root tree has two pointers for each subvolume changed during the transaction. One item points to the new tree and one points to the tree that existed at the start of the last transaction.

Any time after the commit finishes, the older subvolume root items may be removed. The reference count on the subvolume root block is lowered by one. If the reference count reaches zero, the block is freed and the reference count on any nodes the root points to is lowered by one. If a tree node or leaf can be freed, it is traversed to free the nodes or extents below it in the tree in a depth first fashion.

The traversal and freeing of the tree may be done in pieces by inserting a progress record in the root tree. The progress record indicates the last key and level touched by the traversal so the current transaction can commit and the traversal can resume in the next transaction. If the system crashes before the traversal completes, the progress record is used to safely delete the root on the next mount.

Ohad Rodeh presented this reference counted snapshot algorithm at the 2007 Linux Filesystem and Storage Workshop:

Slides: http://www.cs.huji.ac.il/~orodeh/papers/LinuxFS_Workshop.pdf

Paper: http://www.cs.tau.ac.il/~ohadrode/papers/btree_TOS.pdf

The Btrfs snapshotting implementation is based on the ideas he presented.

Btrfsck

The filesystem checking utility is a crucial tool, but it can be a major bottleneck in getting systems back online after something has gone wrong. Btrfs aims to be tolerant of invalid metadata, and will avoid using metadata it determines to be incorrect. The disk format allows Btrfs to deal with most corruptions at run time, without crashing the system and without requiring offline filesystem checking.

An offline btrfsck is being developed, in part to help verify the filesystem during testing, and as an emergency tool to make sure the filesystem is safe for mounting. The existing tool only verifies the extent allocation maps, making sure that reference counts are correct and that all extents are accounted for. If the extent maps are correct, there is no risk of incorrectly writing over existing data or metadata as blocks are allocated for new use.

btrfsck is able to read metadata in roughly disk order. As it scans the btrees on disk, it collects the locations of nodes and leaves and pulls them from the disk in large sequential batches. For the most part, btrfsck is bound by the sequential read throughput of the storage, and it is able to take advantage of multi-spindle arrays. The price paid for the extra speed is more ram. Btrfsck uses about 3x more ram than ext2fsck.