Jay Taylor's notes

How ZFS continues to be better than btrfs - Rudd-O.com in English

Tags: linux zfs filesystem btrfs rudd-o.com

Clipped on: 2013-07-24

How ZFS continues to be better than btrfs

ZFS keeps winning over btrfs on many fronts. Here's a short list that explains exactly how.

We keep hearing about the wonders of btrfs. btrfs is algorithmically better, btrfs has features that ZFS does not have, btrfs is going to win over ZFS at some unspecified point in the future.

The reality is that, today, ZFS is way better than btrfs in a number of areas, in very concrete ways that make using ZFS a joy and make using btrfs a pain, and make ZFS the only choice for many workloads. Even in single-disk laptop or desktop configurations, ZFS has a number of advantages over btrfs that will make your life much easier.

Here, I will examine in detail all the ways in which ZFS is better than btrfs. Fundamentally, all of these advantages are practical consequences of user-centered design decisions that -- while technically possible to implement in btrfs -- were completely absent during the conception of btrfs. We can only hope that, with a bit of luck, the btrfs developers will eventually fix these problems and produce a superior file system.

For the record, everything I discuss here is available for Linux, through both the ZFS on Linux project and the ZFS-FUSE project.

ZFS organizes file systems as a flexible tree

btrfs does not allow you to organize subvolumes. Subvolumes appear where you made them (whether by creating or by snapshotting), and you can't move or rename them. That's it.

ZFS organizes file systems within a pool as a flexible tree. Of course -- just like with btrfs -- when you create a file system, the file system you created is always a child file system of another file system. But you can rename and move them around freely, and you can reconfigure how they attach to your VFS anywhere, including points outside of the mount point where the pool was attached originally. Also, snapshots do not attach themselves to the VFS automatically.

What do these facts allow you to do?

You can organize your ZFS file systems in a tree, separate from snapshots themselves. This comes more naturally because the VFS is already a tree. You can't do that with btrfs.
You can mount a ZFS file system without mounting its children. What's more, mounting a ZFS file system with its children, whether they are file systems or snapshots, doesn't force you to see the children mounted underneath -- you can have the children be mounted somewhere else. With ZFS, you choose where a file system is attached to the VFS tree; the initial mount point of the file system does not bind your wishes in any way.
You can apply operations to an entire tree of file systems at once. With btrfs, you have to apply those operations separately to each subvolume.
You can set policy for an entire subtree of file systems at once. With btrfs, you have to manually set policy for each subvolume.
You can manage usage of disk space hierarchically. You can't even view per-subvolume disk space usage on btrfs.
You can view the entire tree of all active pools on your system with a single command. With btrfs, you have to first discover which btrfs file systems are mounted, and then use one command per tree, which may not show you the entire tree (e.g. if you have mounted a subvolume rather than the root subvolume).
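As a rough sketch of the tree operations listed above, the two helper functions below wrap zfs list and zfs rename. The function names are hypothetical and exist only for illustration; the zfs subcommands and column names are the standard ones.

```shell
# Hypothetical helpers; the function names are not part of ZFS.

# Show every dataset in every imported pool as one tree, with its
# space usage and mount point.
show_tree() {
    zfs list -o name,used,avail,refer,mountpoint
}

# Move (or rename) a file system within its pool. Its children and
# snapshots move with it.
# $1 = current dataset name, $2 = new dataset name
move_dataset() {
    zfs rename "$1" "$2"
}
```

For example, assuming a parent dataset pool/media already exists, move_dataset pool/shared/Music pool/media/Music would graft Music (and anything below it) under pool/media.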

At first glance, these may sound like whining about "the way btrfs does things". To see that they are not, all we have to do is ask ourselves: why are these ZFS affordances important?

Because they allow you to discover, manage and tune data storage and organization with very little effort. Sure, if you have two subvolumes vs. two ZFS file systems, it's not that important to have them organized in a tree, but if you have two hundred, then suddenly the ability to operate on whole swaths of file systems becomes vital. And there are very good reasons you might want to have lots of file systems:

Separating data according to security policies.
Giving different quotas depending on the directory.
Segregating data according to policy, churn, backup lifecycle, et cetera.
Giving each user his own home directory.

These are all things you might want to do, but you'd have to ponder whether they are worth the added effort. ZFS makes it effortless to use these features, without adding any work for you. Thus, ZFS makes it possible to do what you want, whereas with btrfs you would have said "naaah, that's too much work for very little benefit". In this sense, ZFS offers advantages that make btrfs look as cumbersome as LVM in some scenarios.

File system operations in ZFS can apply recursively

Want to snapshot a ZFS file system and all its children together? It's one command. You can snapshot fifty or five thousand file systems this way. Not possible with btrfs -- the snapshot operation applies exclusively to the file system you snapshotted. If you have / and /var is a subvolume of /, snapshotting of / does not snapshot /var with it.

Want to relocate a subtree of twenty file systems into a new mountpoint? Again, one command changing one property accomplishes it. Not possible with btrfs.

Want to backup a specific subtree or mirror it to another machine? No problem. Again, one command.

You can't do any of these things with btrfs. Which means you will think twice about creating many btrfs subvolumes. Which means you won't benefit from the advantages of btrfs subvolumes as much.
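The recursive snapshot mentioned in this section can be sketched as a one-line wrapper. The function name snapshot_tree is an assumption made for illustration; the -r flag of zfs snapshot is what does the work.

```shell
# Hypothetical helper: snapshot a file system and all of its
# descendants under a single snapshot name.
# $1 = dataset at the top of the subtree, $2 = snapshot name
snapshot_tree() {
    zfs snapshot -r "$1@$2"
}
```

Calling snapshot_tree pool/shared nightly would create pool/shared@nightly plus a @nightly snapshot on every file system below pool/shared, whether there are five of them or five thousand.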

Policy set on ZFS file systems is inherited by their children

One powerful feature of ZFS file systems is that any property set on a file system -- including the mount point -- is inherited by its children by default.

Say you want compression on all your file systems but a specific one. No problem -- enable it on the root and disable it on the one whose data you don't want compressed. You can even set different compression algorithms, and the policy will inherit properly. You can't do any of that with btrfs.
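A minimal sketch of that "compress everything but one subtree" policy follows. The helper name compress_all_but and the dataset names in the usage note are hypothetical.

```shell
# Hypothetical helper: compress a whole tree except one subtree.
# $1 = pool (or parent dataset), $2 = child that should stay uncompressed
compress_all_but() {
    zfs set compression=on "$1"     # inherited by every descendant
    zfs set compression=off "$2"    # local override for this subtree only
    # Show the effective value per dataset and where it comes from
    # (local, inherited from a parent, or default).
    zfs get -r -o name,value,source compression "$1"
}
```

For instance, compress_all_but pool pool/shared/Movies would leave already-compressed video alone while compressing everything else. The source column of the final command shows which datasets inherit the setting and which override it.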

You want to relocate a specific subtree of file systems somewhere else? No problem -- change the mountpoint property on the parent of all those file systems, and ZFS will remount the parent in the new location, and all its children inside the new location too.

ZFS auto-mounts file systems by default

btrfs does not automatically mount file systems. For each subvolume you create, you have to register it in the fstab so it is mounted on boot, unless it's a direct child of a mounted subvolume -- in which case you can't change its mount point!

This is not needed with ZFS. ZFS automatically mounts each file system based on the mountpoint property assigned to it. Since the mountpoint property is inherited too (but overridable), all child file systems are mounted at the right place as well.

This means you can forget about having to change fstab with ZFS, if you so choose. Creating a file system? No problem, ZFS will mount it in the right place. Destroying a file system or a whole subtree? No problem, ZFS will unmount them for you. Relocating a file system subtree to another graft point in your VFS? Easy peasy -- change one property in one file system and you're good to go. At no point will you be required to change fstab or issue many mount and umount commands.

Of course, you can override that. You can relocate specific portions of your ZFS tree of file systems to a different mountpoint. Simply adjust the mountpoint property on the file system you want to move, and ZFS will unmount it and remount it in the new destination directory. If the file system has any children, they will also be unmounted and remounted as subdirectories of the new location.
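Relocating a subtree comes down to setting one property. The wrapper below is only a sketch; relocate_tree is a hypothetical name, and the paths in the usage note are made up.

```shell
# Hypothetical helper: move a file system, and every child that
# inherits its mount point, to a new place in the VFS.
# $1 = dataset at the top of the subtree, $2 = new absolute mount point
relocate_tree() {
    zfs set mountpoint="$2" "$1"
}
```

With relocate_tree pool/shared /srv/media, a child such as pool/shared/Music would reappear at /srv/media/Music, provided it has no mountpoint override of its own.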

Heck, with the built-in dracut and systemd support in ZFS (my tree contains it), you can even boot your operating system on a ZFS file system root, and you won't have to register any file system (except perhaps for the root one) in fstab. Using my tree, systemd will automatically take care of discovering all pools and mounting all file systems on boot, in the correct order (even if you have other file systems in fstab) and in parallel. And yes, root file system on ZFS works fine. You can kiss maintenance of fstab goodbye. I have written a guide to do exactly this.

And, if you don't like this, you can even tell ZFS that certain file systems are legacy and should not be mounted automatically, then use fstab for them.
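Opting a file system out of auto-mounting might look like the sketch below. The helper name mount_legacy is hypothetical, and mount -t zfs is the Linux form of the command; other platforms spell it differently.

```shell
# Hypothetical helper: hand one dataset back to traditional mounting.
# $1 = dataset, $2 = directory to mount it on
mount_legacy() {
    zfs set mountpoint=legacy "$1"   # ZFS stops auto-mounting this dataset
    mount -t zfs "$1" "$2"           # mount it the traditional way
}
```

To make such a mount persistent you would then add an fstab entry with the dataset name as the device and zfs as the file system type, for example: pool/scratch /mnt/scratch zfs defaults 0 0 (dataset and directory names assumed).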

Not having the auto-mount feature or inheritable mountpoint properties in btrfs, again, means that managing many subvolumes is cumbersome at best. Which means you won't take advantage of subvolumes in practice.

ZFS tracks used space per file system

btrfs cannot show you how much disk space is being referred to by each subvolume, as its used and free space tracking is only per pool. The only way to see how much space a particular subvolume is taking is to use du, which gets slower the bigger the subvolume gets.

ZFS can, and it will do so instantaneously, regardless of the size or number of your file systems. The command zfs list will show you how much disk space is being referred to by each dataset, by its snapshots, and by all children of the dataset.

This is good because you can designate different areas or categories, mounted at different points of the VFS structure, and then rapidly see which ones are taking the most space and how intensive the "churn" (change over time) is, simply by comparing the used space with the used space of previous snapshots of your datasets. Want to know how much space your /var is taking? As long as you created it as a ZFS file system, no problem -- it will be instantaneous. Want to know how much its data has changed since the last snapshot? zfs list will tell you that.

Suppose you have:

pool/shared
pool/shared/Movies
pool/shared/Music
pool/shared/TV shows

A single zfs list command will let you know, instantaneously:

How much disk space you have free
How much disk space Movies, Music and TV shows take
How much disk space the whole of pool/shared takes
How much disk space is being used exclusively by pool/shared but not by any of its children
How much disk space the snapshots of each one of those file systems are taking

In a very easy to read list.
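One way to get that breakdown is the space column set of zfs list, sketched here. The wrapper name space_report is hypothetical.

```shell
# Hypothetical helper: per-dataset space breakdown for a subtree.
# $1 = dataset at the top of the subtree to report on
space_report() {
    zfs list -r -o space "$1"
}
```

Running space_report pool/shared prints one row per file system with the columns AVAIL, USED, USEDSNAP (held by snapshots), USEDDS (held by the dataset itself), USEDREFRESERV and USEDCHILD (held by children), which map onto the questions listed above.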

This is made possible because of the "virtual memory-like" DMU abstraction layer in ZFS.

Oh, I almost forgot: df actually does work properly with ZFS file systems.

ZFS distinguishes snapshots from file systems

btrfs pollutes the file system namespace by keeping snapshots and file systems in the same location. A snapshot appears as a "copy-on-write" (and writable!) directory which is a sibling of the directory containing the subvolume you just snapshotted. What happens if you don't like that, or you would like the snapshot to be invisible to users? What happens is that you're screwed -- you can't prevent that from happening. Cue ten /home directories!

In contrast, ZFS intelligently separates snapshots from file systems, which makes it possible for ZFS not to list them by default, or to list them separately from file systems. The name of the snapshot is distinct and separate from the name of the file system that was snapshotted. ZFS also doesn't auto-mount snapshots or allow modifications to them, unless you request otherwise. To get the snapshot to be writable (which you presumably might want in certain circumstances), you have to explicitly clone it into the file system namespace.
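The snapshot-then-clone workflow described above can be sketched as follows. The function name writable_copy and the dataset names in the usage note are assumptions.

```shell
# Hypothetical helper: take a snapshot, list snapshots explicitly, then
# clone the snapshot into a writable file system.
# $1 = file system, $2 = snapshot name, $3 = name for the writable clone
writable_copy() {
    zfs snapshot "$1@$2"            # read-only point-in-time copy
    zfs list -t snapshot -r "$1"    # snapshots are shown only on request
    zfs clone "$1@$2" "$3"          # the clone is a normal, writable file system
}
```

For example, writable_copy pool/home before-upgrade pool/home-rollback-test leaves pool/home@before-upgrade untouched while giving you a separate file system to experiment in.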

This lack of clutter makes it easier for you to manage large numbers of snapshots across large numbers of file systems, and makes it less likely that you will touch data you snapshotted for backup purposes.

ZFS lets you specify compression and other properties per file system subtree

btrfs only lets you specify compression for the whole pool, or for individual files, or for specific file systems (and only as a mount option). ZFS lets you specify compression as an intrinsic property of a file system. Of course, coupled with property inheritance, this means you can compress a whole subtree, or compress the entire pool but for a specific subtree, with one or two commands.

Same goes for every other file system tuning option, like the block size for I/O or mount options like noatime.

ZFS is more stable

ZFS has simply had orders of magnitude more testing. It was written as a user-space program to begin with, and it's runnable for testing purposes as a user-space program, which means that fast automated testing was there from day one.

ZFS has RAIDZ

btrfs has nothing equivalent to RAID5. You can only do RAID10, RAID1 and RAID0. ZFS has them all, plus a RAID5 implementation called RAIDZ that is invulnerable to the write hole problem (a RAID5 failure mode that can make you lose your entire array under certain circumstances).
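Creating a RAIDZ pool is a single command, sketched here in a wrapper. The name create_raidz_pool and the device names in the usage note are hypothetical, and running it for real destroys whatever is on the named disks.

```shell
# Hypothetical helper: build a single-parity RAIDZ pool.
# $1 = pool name, remaining arguments = the member disks
create_raidz_pool() {
    pool=$1
    shift
    zpool create "$pool" raidz "$@"
}
```

A call such as create_raidz_pool tank /dev/sdb /dev/sdc /dev/sdd would yield a pool that survives the loss of any one of the three disks.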

ZFS has send and receive

ZFS lets you mirror an entire pool or subtrees of that pool by incrementally transferring changed data between snapshots. btrfs does not have that yet, though it's in the works. It remains to be seen if btrfs's implementation of send and receive will be as easy to use as ZFS's implementation is -- judging from current facts, it is unlikely.
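An incremental mirror between two snapshots might look like the sketch below. The function name mirror_incremental, the host name and the dataset names are all assumptions; it also presumes the older snapshot already exists on the receiving side.

```shell
# Hypothetical helper: send only the changes between two snapshots of a
# subtree to a pool on another machine.
# $1 = source dataset, $2 = older snapshot, $3 = newer snapshot,
# $4 = remote host, $5 = destination dataset on the remote pool
mirror_incremental() {
    zfs send -R -i "$1@$2" "$1@$3" | ssh "$4" zfs receive -F "$5"
}
```

For example, mirror_incremental pool/shared monday tuesday backuphost backup/shared would transfer only what changed between @monday and @tuesday, for pool/shared and all of its children.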

ZFS is better documented

There is a wealth of documentation on ZFS. Its man pages are impeccable and explain very well what ZFS is, the core concepts, how they relate with each other, and details on the behavior of each command. The documentation is written in such a way that you don't have to piece facts together to get a comprehensive view of the whole subsystem, and you won't have as many doubts as to the effect of each action you take.

ZFS uses atomic writes and barriers

Every write in ZFS is an atomic transaction, because ZFS makes use of barriers to complete transactions. This prevents reordering of writes that might cause inconsistencies due to incomplete writes. This also makes it unnecessary to disable the disk write cache, an operation that would reduce your disk subsystem write performance substantially.

You can yank the power cord of your machine -- you will never lose a single byte of anything committed to the disk. ZFS is always consistent on disk, and never trusts any faulty data (that might have become damaged because of hardware issues). This is why people say ZFS requires no fsck to check for consistency, and faulty data never causes kernel panics.

We do not yet know with certainty if that is the case with btrfs, but -- unlike with ZFS -- we do know of many btrfs pools that have gone bad, and of btrfs inconsistencies that have caused kernel panics.

ZFS will actually tell you what went bad in no uncertain terms, and help you fix it

ZFS includes an administrative command, zpool status, that lets you check the status of your pool and its component devices.

This command will list any damage that your pool has sustained, the type of damage (read, write or checksum error), and which device suffered the damage. In addition to that, ZFS will tell you if you actually lost any data due to the damage, and which files were lost. If no data was lost as a consequence of the damage, because ZFS managed to repair the damage, ZFS will tell you "everything is okay, I have repaired the data, but this device is still faulty".

The same command will also tell you which devices are online, offline, faulted, and spares for your pool. If any device is offline or faulted, ZFS will explain why (because of absence of the device, deadlock, or too many errors detected from the device).

Finally, ZFS will give you helpful hints right there in the command line, informing you of the best course of action and linking you to an extended explanation of what happened.

btrfs has nothing of the sort. You are forced to stalk the kernel ring buffer if you want to find out about such things.
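On the ZFS side, a health check can be as short as the sketch below. The wrapper name check_pool is hypothetical; the two zpool status flags are standard.

```shell
# Hypothetical helper: quick health check, then full detail for one pool.
# $1 = pool name
check_pool() {
    zpool status -x         # short summary, detail only for troubled pools
    zpool status -v "$1"    # per-device error counters and damaged files
}
```

Invoked as check_pool tank (pool name assumed), the first command stays terse while everything is healthy, and the second lists read, write and checksum error counts per device along with any files that could not be repaired.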

ZFS increases random read performance with advanced memory and disk caches

Unlike btrfs, which is limited to the system's cache, ZFS will take advantage of fast SSDs or other fast memory technology devices, as a second level cache (L2ARC). This cache is very effective because it is geared toward serving random reads at very high performance (so large streaming reads won't evict or trash other hot objects in the cache).

The L2ARC also cooperates with the very effective ARC (adaptive replacement cache) in main memory, to prevent unnecessary data duplication and to keep the hottest data in the fastest location. Finally, the L2ARC won't destroy your cache device with unlimited writes -- write speeds to the L2ARC are judiciously limited so the memory cells in your cache device won't sustain inordinate wear.

btrfs has nothing even remotely close to this. You want to cache a large working set being consumed by random reads, but you don't have a machine that will accept 256 GB RAM? Sorry, you're out of luck.
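Attaching an SSD as L2ARC to an existing ZFS pool is one command, sketched here. The function name add_cache_device and the device path in the usage note are hypothetical.

```shell
# Hypothetical helper: add a second-level read cache to a pool.
# $1 = pool name, $2 = fast device to use as L2ARC
add_cache_device() {
    zpool add "$1" cache "$2"
}
```

For example, add_cache_device tank /dev/sdg. A cache device holds only copies of data that already lives in the pool, so it can fail or be removed later without putting the pool at risk.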

ZFS increases random and synchronous write performance with log devices

ZFS can designate a fast SSD as a SLOG device. The SLOG device lets synchronous writes complete almost instantly, so control returns to the client immediately, without bogging down the main disks (which would be slow under any copy-on-write file system because of random seeks). After a short period, the transactions committed to the SLOG device are committed to the rotating disks in streaming fashion. In effect, the SLOG device turns a huge volume of small synchronous transactions, which would otherwise destroy IOPS performance, into sequential writes.
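Designating the log device is again a single zpool add, shown as a sketch. The helper name add_log_device and the device path in the usage note are assumptions.

```shell
# Hypothetical helper: add a separate intent-log (SLOG) device to a pool.
# $1 = pool name, $2 = fast device to use for the log
add_log_device() {
    zpool add "$1" log "$2"
}
```

For example, add_log_device tank /dev/sdh. Only synchronous writes go through the log device; asynchronous writes are unaffected.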

btrfs? Nothing like that there, move along.

ZFS supports thin-provisioned virtual block devices

Virtualizing a large number of machines? Testing other file systems? Consolidating many machines in a single storage unit?

No problem. ZFS (in its ZFS on Linux incarnation) lets you allocate block devices called ZVOLs backed by portions of your pool. ZVOLs are thin-provisioned, so you can create any number of them, and create any kind of file systems on top of them. TRIM commands from clients using those volumes release unused space on the ZVOLs back to the pool, so you can continue enjoying the advantages of thin provisioning. What's even better, ZFS will let you share ZVOLs using iSCSI, no extra configuration or mucking around with configuration files required.

Of course, all of the benefits of ZFS (including deduplication, snapshotting, and incremental send/receive) are available to be used with ZVOLs. A client of a ZVOL writes a sector -- whether iSCSI, or a VM, or a local file system created on top of the volume -- and that sector will get compressed, deduplicated, and backed up using your established snapshot and send / receive policy.
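Creating a thin-provisioned ZVOL could be sketched like this. The function name create_thin_volume and the size in the usage note are hypothetical; -s asks for a sparse volume and -V gives its virtual size.

```shell
# Hypothetical helper: create a sparse (thin-provisioned) block device
# backed by the pool.
# $1 = volume name, $2 = virtual size, e.g. 20G
create_thin_volume() {
    zfs create -s -V "$2" "$1"
}
```

After create_thin_volume pool/virtualmachines/fedorarootfilesystem 20G, ZFS on Linux exposes the volume under /dev/zvol/, where it can be formatted or handed to a VM like any other disk. It consumes pool space only as data is written.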

To even come close to this kind of thing on btrfs, you would have to create big-ass qcow or other block storage backing files, then turn them into block devices using iSCSI or other technologies like qemu. Which means that every disk read or write incurs two extra context switches. And don't forget the editing of configuration files...

ZFS helps you share

Sharing data?

No matter the mechanism, ZFS will help you there. ZFS will share, upon your request, file systems over CIFS (SAMBA), NFS or iSCSI without any special kind of configuration:

zfs set sharesmb=on pool/shared/Movies
zfs set sharenfs=on pool/shared/Movies
zfs set shareiscsi=on pool/virtualmachines/fedorarootfilesystem

btrfs can't do that. You must manually alter daemon configuration and reload services if you want to share something. ZFS does it automatically for you.

Don't be a misanthrope. Share!

ZFS can save you terabytes by deduplicating your data

Yes, it's memory-hungry, and yes, it can reduce your write performance a bit. But at least you can use it.

btrfs? Nope.
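For completeness, turning deduplication on in ZFS and checking what it saves can be sketched as below. The helper name enable_dedup and the dataset names in the usage note are hypothetical.

```shell
# Hypothetical helper: enable deduplication on one dataset and report
# the pool-wide deduplication ratio.
# $1 = dataset to deduplicate, $2 = pool that contains it
enable_dedup() {
    zfs set dedup=on "$1"
    zpool get dedupratio "$2"
}
```

For example, enable_dedup pool/virtualmachines pool. Only data written after the property is set gets deduplicated, so the ratio rises gradually as new data arrives.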

Some of this information was culled from the ZFS Gentoo overlay FAQ. Many thanks to Richard Yao.
