Jay Taylor's notes

back to listing index

Nex7's Blog: ZFS: Read Me 1st

[web search]
Original source (nex7.blogspot.com)
Tags: zfs filesystem nex7.blogspot.com
Clipped on: 2015-09-13

Random storage industry dude's blog, mostly about ZFS and related technology.

Thursday, March 21, 2013

ZFS: Read Me 1st

Things Nobody Told You About ZFS

Yes, it's back. You may also notice it is now hosted on my Blogger page - just don't have time to deal with self-hosting at the moment, but I've made sure the old URL redirects here.

So, without further adieu..


I will be updating this article over time, so check back now and then.

Latest update 9/12/2013 - Hot Spare, 4K Sector and ARC/L2ARC sections edited, note on ZFS Destroy section, minor edit to Compression section.

There are a couple of things about ZFS itself that are often skipped over or missed by users/administrators. Many deploy home or business production systems without even being aware of these gotchya's and architectural issues. Don't be one of those people!

I do not want you to read this and think "ugh, forget ZFS". Every other filesystem I'm aware of has many and more issues than ZFS - going another route than ZFS because of perceived or actual issues with ZFS is like jumping into the hungry shark tank with a bleeding leg wound, instead of the goldfish tank, because the goldfish tank smelled a little fishy! Not a smart move.

ZFS is one of the most powerful, flexible, and robust filesystems (and I use that word loosely, as ZFS is much more than just a filesystem, incorporating many elements of what is traditionally called a volume manager as well) available today. On top of that it's open source and free (as in beer) in some cases, so there's a lot there to love.

However, like every other man-made creation ever dreamed up, it has its own share of caveats, gotchya's, hidden "features" and so on. The sorts of things that an administrator should be aware of before they lead to a 3 AM phone call! Due to its relative newness in the world (as compared to venerable filesystems like NTFS, ext2/3/4, and so on), and its very different architecture, yet very similar nomenclature, certain things can be ignored or assumed by potential adopters of ZFS that can lead to costly issues and lots of stress later.

I make various statements in here that might be difficult to understand or that you disagree with - and often without wholly explaining why I've directed the way I have. I will endeavor to produce articles explaining them and update this blog with links to them, as time allows. In the interim, please understand that I've been on literally 1000's of large ZFS deployments in the last 2+ years, often called in when they were broken, and much of what I say is backed up by quite a bit of experience. This article is also often used, cited, reviewed, and so on by many of my fellow ZFS support personnel, so it gets around and mistakes in it get back to me eventually. I can be wrong - but especially if you're new to ZFS, you're going to be better served not assuming I am. :)

1. Virtual Devices Determine IOPS

IOPS (I/O per second) are mostly a factor of the number of virtual devices (vdevs) in a zpool. They are not a factor of the raw number of disks in the zpool. This is probably the single most important thing to realize and understand, and is commonly not. 

ZFS stripes writes across vdevs (not individual disks). A vdev is typically IOPS bound to the speed of the slowest disk within it. So if you have one vdev of 100 disks, your zpool's raw IOPS potential is effectively only a single disk, not 100. There's a couple of caveats on here (such as the difference between write and read IOPS, etc), but if you just put as a rule of thumb in your head that a zpool's raw IOPS potential is equivalent to the single slowest disk in each vdev in the zpool, you won't end up surprised or disappointed.

2. Deduplication Is Not Free

Another common misunderstanding is that ZFS deduplication, since its inclusion, is a nice, free feature you can enable to hopefully gain space savings on your ZFS filesystems/zvols/zpools. Nothing could be farther from the truth. Unlike a number of other deduplication implementations, ZFS deduplication is on-the-fly as data is read and written. This creates a number of architectural challenges that the ZFS team had to conquer, and the methods by which this was achieved lead to a significant and sometimes unexpectedly high RAM requirement.

Every block of data in a dedup'ed filesystem can end up having an entry in a database known as the DDT (DeDupe Table). DDT entries need RAM. It is not uncommon for DDT's to grow to sizes larger than available RAM on zpools that aren't even that large (couple of TB's). If the hits against the DDT aren't being serviced primarily from RAM or fast SSD, performance quickly drops to abysmal levels. Because enabling/disabling deduplication within ZFS doesn't actually do anything to data already on disk, do not enable deduplication without a full understanding of its requirements and architecture first. You will be hard-pressed to get rid of it later.

3. Snapshots Are Not Backups

This is critically important to understand. ZFS has redundancy levels from mirrors and raidz. It has checksums and scrubs to help catch bit rot. It has snapshots to take lightweight point-in-time captures of data to let you roll back or grab older versions of files. It has all of these things to help protect your data. And one 'zfs destroy' by a disgruntled employee, one fire in your datacenter, one random chance of bad luck that causes a whole backplane, JBOD, or a number of disks to die at once, one faulty HBA, one hacker, one virus, etc, etc, etc -- and poof, your pool is gone. I've seen it. Lots of times. MAKE BACKUPS.

4. ZFS Destroy Can Be Painful

(9/12/2013) A few illumos-based OS are now shipping ZFS with "async destroy" feature. That has a significant mitigating impact on the below text, and ZFS destroys, while they still have to do the work, do so in the background in a less performance and stability damaging manner. However, not all shipping OS have this code in them yet (for instance, NexentaStor 3.x does not). If your ZFS has feature flag support, it might have async destroy, if it still is using the old 'zpool version' method, it probably doesn't.

Something often waxed over or not discussed about ZFS is how it presently handles destroy tasks. This is specific to the "zfs destroy" command, be it used on a zvol, filesystem, clone or snapshot. This does not apply to deleting files within a ZFS filesystem (unless that file is very large - for instance, if a single file is all that a whole filesystem contains) or on the filesystem formatted onto a zvol, etc. It also does not apply to "zpool destroy". ZFS destroy tasks are potential downtime causers, when not properly understood and treated with the respect they deserve. Many a SAN has suffered impacted performance or full service outages due to a "zfs destroy" in the middle of the day on just a couple of terabytes (no big deal, right?) of data. The truth is a "zfs destroy" is going to go touch many of the metadata blocks related to the object(s) being destroyed. Depending on the block size of the destroy target(s), the number of metadata blocks that have to be touched can quickly reach into the millions, even the hundreds of millions.

If a destroy needs to touch 100 million blocks, and the zpool's IOPS potential is 10,000, how long will that zfs destroy take? Somewhere around 2 1/2 hours! That's a good scenario - ask any long-time ZFS support person or administrator and they'll tell you horror stories about day long, even week long "zfs destroy" commands. There's eventual work that can be done to make this less painful (a major one is in the works right now) and there's a few things that can be done to mitigate it, but at the end of the day, always check the actual used disk size of something you're about to destroy and potentially hold off on that destroy if it's significant. How big is too big? That is a factor of block size, pool IOPS potential, extenuating circumstances (current I/O workload of the pool, deduplication on or off, a few other things).

5. RAID Cards vs HBA's

ZFS provides RAID, and does so with a number of improvements over most traditional hardware RAID card solutions. ZFS uses block-level logic for things like rebuilds, it has far better handling of disk loss & return due to the ability to rebuild only what was missed instead of rebuilding the entire disk, it has access to more powerful processors than the RAID card and far more RAM as well, it does checksumming and auto-correction based on it, etc. Many of these features are gone or useless if the disks provided to ZFS are, in fact, RAID LUN's from a RAID card, or even RAID0 single-disk entities offered up. 

If your RAID card doesn't support a true "JBOD" (sometimes referred to as "passthrough") mode, don't use it if you can avoid it. Creating single-disk RAID0's (sometimes called "virtual drives") and then letting ZFS create a pool out of those is better than creating RAID sets on the RAID card itself and offering those to ZFS, but only about 50% better, and still 50% worse than JBOD mode or a real HBA. Use a real HBA - don't use RAID cards.

6. SATA vs SAS

This has been a long-standing argument in the ZFS world. Simple fact is, the majority of ZFS storage appliances, most of the consultants and experts you'll talk to, and the majority of enterprise installations of ZFS are using SAS disks. To be clear, "nearline" SAS (7200 RPM SAS) is fine, but what will often get you in trouble is the use of SATA (including enterprise-grade) disks behind bad interposers (which is most of them) and SAS expanders (which almost every JBOD is going to be utilizing).

Plan to purchase SAS disks if you're deploying a 'production' ZFS box. In any decent-sized deployment, they're not going to have much of a price delta over equivalent SATA disks. The only exception to this rule is home and very small business use-cases -- and for more on that, I'll try to wax on about it in a post later.

7. Compression Is Good (Even When It Isn't)

It is the very rare dataset or use-case that I run into these days where compress=on (lzjb) doesn't make sense. It is on by default on most ZFS appliances, and that is my recommendation. Turn it on, and don't worry about it. Even if you discover that your compression ratio is nearly 0% - it still isn't hurting you enough to turn it off, generally speaking. Other compression algorithms such as gzip are another matter entirely, and in almost all cases should be strongly avoided. I do see environments using gzip for datasets they truly do not care about performance on (long-term archival, etc). In my experience if that is the case, go with gzip-9, as the performance difference between gzip-1 and gzip-9 is minimal (when then compared to lzjb or off). You're going to get the pain, so you may as well go for the best compression ratio.

8. RAIDZ - Even/Odd Disk Counts

Try (and not very hard) to keep the number of data disks in a raidz vdev to an even number. This means if its raidz1, the total number of disks in the vdev would be an odd number. If it is raidz2, an even number, and if it is raidz3, an odd number again. Breaking this rule has very little repercussion, however, so you should do so if your pool layout would be nicer by doing so (like to match things up on JBOD's, etc).

9. Pool Design Rules

I've got a variety of simple rules I tell people to follow when building zpools:
  • Do not use raidz1 for disks 1TB or greater in size.
  • For raidz1, do not use less than 3 disks, nor more than 7 disks in each vdev (and again, they should be under 1 TB in size, preferably under 750 GB in size) (5 is a typical average).
  • For raidz2, do not use less than 6 disks, nor more than 10 disks in each vdev (8 is a typical average).
  • For raidz3, do not use less than 7 disks, nor more than 15 disks in each vdev (13 & 15 are typical average).
  • Mirrors trump raidz almost every time. Far higher IOPS potential from a mirror pool than any raidz pool, given equal number of drives. Only downside is redundancy - raidz2/3 are safer, but much slower. Only way that doesn't trade off performance for safety is 3-way mirrors, but it sacrifices a ton of space (but I have seen customers do this - if your environment demands it, the cost may be worth it).
  • For >= 3TB size disks, 3-way mirrors begin to become more and more compelling.
  • Never mix disk sizes (within a few %, of course) or speeds (RPM) within a single vdev.
  • Never mix disk sizes (within a few %, of course) or speeds (RPM) within a zpool, except for l2arc & zil devices.
  • Never mix redundancy types for data vdevs in a zpool (no raidz1 vdev and 2 raidz2 vdevs, for example)
  • Never mix disk counts on data vdevs within a zpool (if the first data vdev is 6 disks, all data vdevs should be 6 disks).
  • If you have multiple JBOD's, try to spread each vdev out so that the minimum number of disks are in each JBOD. If you do this with enough JBOD's for your chosen redundancy level, you can even end up with no SPOF (Single Point of Failure) in the form of JBOD, and if the JBOD's themselves are spread out amongst sufficient HBA's, you can even remove HBA's as a SPOF.
If you keep these in mind when building your pool, you shouldn't end up with something tragic.

10. 4KB Sector Disks

(9/12/2013) The likelihood of this being an issue for you is presently very up in the air, very dependent on OS choice at the moment. There are more 4K disks out there, including some SSD's, and still some that are lying and claiming 512. However, there is also work being done to hard-code in recognition of these disks in illumos and so on. My blog post on here talking about my home BSD-based ZFS SAN has instructions on how to manually force recognition of 4K sector disks if they're not reporting on BSD, but it is not as easy on illumos derivatives as they do not have 'geom'. All I can suggest at the moment is Googling about zfs and "ashift" and your chosen OS and OS version -- not only does that vary the answer, but I myself am not spending any real time keeping track, so all I can suggest is do your own homework right now. I also do not recommend mixing -- if your pool started off with one sector size, keep it that way if you grow it or replace any drives. Do not mix/match.

There are a number of in-the-wild devices that are 4KB sector size instead of the old 512-byte sector size. ZFS handles this just fine if it knows the disk is 4K sector size. The problem is a number of these devices are lying to the OS about their sector size, claiming it is 512-byte (in order to be compatible with ancient Operating Systems like Windows 95); this will cause significant performance issues if not dealt with at zpool creation time.

11. ZFS Has No "Restripe"

If you're familiar with traditional RAID arrays, then the term "restripe" is probably in your vocabulary. Many people in this boat are surprised to hear that ZFS has no equivalent function at all. The method by which ZFS delivers data to the pool has a long-term equivalent to this functionality, but not an up-front way nor a command that can be run to kick off such a thing. 

The most obvious task where this shows up is when you add a vdev to an existing zpool. You could be forgiven to expect that the existing data in the pool would slide over and all your vdevs would end up of roughly equal used size (rebalancing is another term for this), since that's what a traditional RAID array would do. ZFS? It won't. That data balancing will only come as an indirect result of rewrites. If you only ever read from your pool, it'll never happen. Bear this in mind when designing your environment and making initial purchases. It is almost never a good idea, performance wise, to start off with a handful of disks if within a year or two you expect to grow that pool to a significant larger size, adding in small numbers of disks every X weeks/months.

12. Hot Spares

Don't use them. Pretty much ever. Warm spares make sense in some environments. Hot spares almost never make sense. Very often it makes more sense to include the disks in the pool and increase redundancy level because of it, than it does to leave them out and have a lower redundancy level.

For a bit of clarification, the main reasoning behind this has to do with the present method hot spares are handled by ZFS & Solaris FMA and so on - the whole environment involved in identifying a failed drive and choosing to replace it is far too simplistic to be useful in many situations. For instance, if you create a pool that is designed to have no SPOF in terms of JBOD's and HBA's, and even go so far as to put hot spares in each JBOD, the code presently in illumos (9/12/2013) has nothing in it to understand you did this, and it's going to be sheer chance if a disk dies and it picks the hot spare in the same JBOD to resilver to. It is more likely it just picks the first hot spare in the spares list, which is probably in a different JBOD, and now your pool has a SPOF.

Further, it isn't intelligent enough to understand things like catastrophic loss -- say you again have a pool setup where the HBA's and JBOD's are set up for no SPOF, and you lose an HBA and the JBOD connected to it - you had 40 drives in mirrors, and now you are only seeing half of each mirror -- but you also have a few hot spares in that JBOD, say 2. Now, obviously, picking 2 random mirrors and starting to resilver them from the hot spares still visible is silly - you lost a whole JBOD, all your mirrors have gone to single drive, and the only logical solution is getting the other JBOD back on (or if it somehow went nuts, a whole new JBOD full of drives and attach them to the existing mirrors). Resilvering 2 of your 20 mirror vdevs to hot spares in the still-visible JBOD is just a waste of time at best, and dangerous at worst, and it's GOING to do it.

What I tend to tell customers when the hot spare discussion comes up is actually to start with a question. The multi-part question is this: how many hours could possibly pass before your team is able to remotely login to the SAN after receiving an alert that there's been a disk loss event, and how many hours could possibly pass before your team is able to physically arrive to replace a disk after receiving an alert that there's been a disk loss event?

The idea, of course, is to determine if hot spares are seemingly required, or if warm spares would do, or if cold spares are acceptable. Here's the ruleset in my head that I use after they tell me the answers to that question (and obviously, this is just my opinion on the numbers to use):

  • Under 24 hours for remote access, but physical access or lack of disks could mean physical replacement takes longer
    • Warm spares
  • Under 24 hours for remote access, and physical access with replacement disks is available by that point as well
    • Pool is 2-way mirror or raidz1 vdevs
      • Warm spares
    • Pool is >2-way mirror or raidz2-3 vdevs
      • Cold spares
  • Over 24 hours for remote or physical access
    • Hot spares start to become a potential risk worth taking, but serious discussion about best practices and risks has to be had - often is it's 48-72 hours as the timeline, warm or cold spares may still make sense depending on pool layout; > 72 hours to replace is generally where hot spares become something of a requirement to cover those situations where they help, but at that point a discussion needs to be had on customer environment that there's a > 72 hour window where a replacement disk isn't available
I'd have to make one huge bullet list to try to cover every possible contingency here - each customer is unique, but this is some general guidelines. Remember, it takes a significant amount of time to resilver a disk, and so adding in X amount of additional hours is not adding a lot of risk, especially for 3-way or higher mirrors and raidz2-3 vdevs which can already handle multiple failures.

13. ZFS Is Not A Clustered Filesystem

I don't know where this got started, but at some point, something must have been said that has led some people to believe ZFS is or has clustered filesystem features. It does not. ZFS lives on a single set of disks in a single system at a time, period. Various HA technologies have been developed to "seamlessly" move the pool from one machine to another in case of hardware issues, but they move the pool - they don't offer up the storage from multiple heads at once. There is no present (9/12/2013) method of "clustered ZFS" where the same pool is offering up datasets from multiple physical machines. I'm aware of no work to change this.

14. To ZIL, Or Not To ZIL

This is a common question - do I need a ZIL (ZFS Intent Log)? So, first of all, this is the wrong question. In almost every storage system you'll ever build utilizing ZFS, you will need and will have a ZIL. The first thing to explain is that there is a difference between the ZIL and a ZIL (referred to as a log or slog) device. It is very common for people to call a log device a "ZIL" device, but this is wrong - there is a reason ZFS' own documentation always refers to the ZIL as the ZIL, and a log device as a log device. Not having a log device does not mean you do not have a ZIL!

So with that explained, the real question is, do you need to direct those writes to a separate device from the pool data disks or not? In general, you do if one or more of the intended use-cases of the storage server are very write latency sensitive, or if the total combined IOPS requirement of the clients is approaching say 30% of the raw pool IOPS potential of the zpool. In such scenarios, the addition of a log vdev can have an immediate and noticeable positive performance impact. If neither of those is true, it is likely you can just skip a log device and be perfectly happy. Most home systems, for example, have no need of a log device and won't miss not having it. Many small office environments using ZFS as a simple file store will also not require it. Larger enterprises or latency-sensitive storage will generally require fast log devices.

15. ARC and L2ARC

(9/12/2013) There are presently issues related to memory handling and the ARC that have me strongly suggesting you physically limit RAM in any ZFS-based SAN to 128 GB. Go to > 128 GB at your own peril (it might work fine for you, or might cause you some serious headaches). Once resolved, I will remove this note.

One of ZFS' strongest performance features is its intelligent caching mechanisms. The primary cache, stored in RAM, is the ARC (Adaptive Replacement Cache). The secondary cache, typically stored on fast media like SSD's, is the L2ARC (second level ARC). Basic rule of thumb in almost all scenarios is don't worry about L2ARC, and instead just put as much RAM into the system as you can, within financial realities. ZFS loves RAM, and it will use it - there is a point of diminishing returns depending on how big the total working set size really is for your dataset(s), but in almost all cases, more RAM is good. If your use-case does lend itself to a situation where RAM will be insufficient and L2ARC is going to end up being necessary, there are rules about how much addressable L2ARC one can have based on how much ARC (RAM) one has.

16. Just Because You Can, Doesn't Mean You Should

ZFS has very few limits - and what limits it has are typically measured in megazillions, and are thus unreachable with modern hardware. Does that mean you should create a single pool made up of 5,000 hard disks? In almost every scenario, the answer is no. The fact that ZFS is so flexible and has so few limits means, if anything, that proper design is more important than in legacy storage systems. It is a truism that in most environments that need lots of storage space, it is likely more efficient and architecturally sound to find a smaller-than-total break point and design systems to meet that size, then build more than one of them to meet your total space requirements. There is almost never a time when this is not true.

It is very rare for a company to need 1 PB of space in one filesystem, even if it does need 1 PB in total space. Find a logical separation and build to meet it, not go crazy and try to build a single 1 PB zpool. ZFS may let you, but various hardware constraints will inevitably doom this attempt or create an environment that works, but could have worked far better at the same or even lower cost.

Learn from Google, Facebook, Amazon, Yahoo and every other company with a huge server deployment -- they learned to scale out, with lots of smaller systems, because scaling up with giant systems not only becomes astronomically expensive, it quickly ends up being a negative ROI versus scaling out.

17. Crap In, Crap Out

ZFS is only as good as the hardware it is put on. Even ZFS can corrupt your data or lose it, if placed on inferior components. Examples of things you don't want to do if you want to keep your data intact include using non-ECC RAM, using non-enterprise disks, using SATA disks behind SAS expanders, using non-enterprise class motherboards, using a RAID card (especially one without a battery), putting the server in a poor environment for a server to be in, etc.


  1. Image (Asset 1/9) alt=