Jay Taylor's notes

People ask: where are we with ZFS performance ? (Bizarre ! Vous avez dit Bizarre ?)

Original source (blogs.oracle.com)
Tags: zfs filesystem blogs.oracle.com
Clipped on: 2013-05-10

By user13278091 on Nov. 04, 2008

The standard answer to any computer performance question is almost always "it depends", which is semantically equivalent to "I don't know". The better answer is to state the dependencies.

I would certainly like to see every performance issue studied with a scientific approach. OpenSolaris and DTrace are just incredible enablers when trying to reach root cause, and finding those causes is really the best way to work toward delivering improved performance. More generally though, people use common wisdom or possibly faulty assumptions to match their symptoms with those of other, similar reported problems. And, as human nature has it, we'll easily blame the component we're least familiar with for problems. So we often end up with a lot of reports of ZFS performance problems that, once drilled down, turn out to be either totally unrelated to ZFS (say, HW problems), or the result of misconfiguration, departure from Best Practices or, at times, unrealistic expectations.

That does not mean there are no issues. But it's important that users can more easily identify known issues and plan for fixes, workarounds, etc. So anyone deploying ZFS should really be familiar with these two sites: ZFS Best Practices and Evil Tuning Guide.

That said, what are the real, commonly encountered performance problems I've seen, and where do we stand?

Writes overrunning memory

That is a real problem that was fixed last March, and the fix is integrated in the Solaris U6 release. Running out of memory causes many different types of complaints and erratic system behavior. This can happen any time a lot of data is created and streamed at a rate greater than that at which it can be written to the pool. Solaris U6 will be an important shift for customers running into this issue. ZFS will still try to use memory to cache your data (a good thing), but the competition this creates for memory resources will be much reduced. The way ZFS is designed to deal with this contention (ARC shrinking) will need a new evaluation from the community. The lack of throttling was a great impairment to the ability of the ARC to give back memory under pressure. In the meantime, lots of people are capping their ARC size with success, as per the Evil Tuning Guide.

For more on this topic, check out: The new ZFS write throttle
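
As a minimal sketch of the ARC cap mentioned above, the snippet below formats an /etc/system line for the `zfs_arc_max` tunable, which is the mechanism the Evil Tuning Guide describes for Solaris. The helper name `arc_cap_line` and the 4 GiB figure are illustrative assumptions, not recommendations; the right cap depends on the system's memory and workload, and /etc/system changes only take effect after a reboot.

```python
GIB = 1024 ** 3


def arc_cap_line(cap_bytes: int) -> str:
    """Return an /etc/system line that caps the ZFS ARC at cap_bytes.

    Hypothetical helper: it only formats the line, it does not touch
    the system. The value is written in hex, as is conventional.
    """
    if cap_bytes <= 0:
        raise ValueError("ARC cap must be a positive number of bytes")
    return f"set zfs:zfs_arc_max = {hex(cap_bytes)}"


# Example: cap the ARC at 4 GiB (an arbitrary figure for illustration).
print(arc_cap_line(4 * GIB))
```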

Cache flushes on SAN storage

This is a common issue we hit in the enterprise. Although it will cause ZFS to be totally underwhelming in terms of performance, it's interestingly not a sign of any defect in ZFS. Sadly, this touches the customers that are the most performance minded. The issue is somewhat related to ZFS and somewhat to the storage. As is well documented elsewhere, ZFS will, at critical times, issue "cache flush" requests to the storage elements on which it is layered. This is to take into account the fact that storage can be layered on top of _volatile_ caches whose contents do need to be committed to stable storage for ZFS to reach its consistency points. Enterprise storage arrays do not use _volatile_ caches to store data and so should ignore the request from ZFS to "flush caches". The problem is that some arrays don't. This misunderstanding between ZFS and storage arrays leads to underwhelming performance. Fortunately, we have an easy workaround that can be used to quickly identify whether this is indeed the problem: setting zfs_nocacheflush (see the Evil Tuning Guide). The best fix here is to configure the storage to ignore "cache flush" requests. We also have the option of tuning sd.conf on a per-array basis. Refer again to the Evil Tuning Guide for more detailed information.

NFS slow over ZFS (Not True)

This is just not generally true and is often a side effect of the previous cache flush problem. People have used storage arrays to accelerate NFS for a long time but failed to see the expected gains with ZFS. Many sightings of NFS problems are traced to this.

Other sightings involve common disks with volatile caches. Here the performance deltas observed are rooted in the stronger semantics that ZFS offers in this operational model. See NFS and ZFS for a more detailed description of the issue.

While I don't consider ZFS to be generally slow at serving NFS, we did identify in recent months a condition that affects workloads with a high thread count of synchronous writes (such as a DB). This issue is fixed in Solaris 10 Update 6 (CR 6683293).

I would encourage you to be familiar with where we stand regarding ZFS and NFS because I know of no big gaping ZFS over NFS problems (if there were one, I think I would know). People just need to be aware that NFS is a protocol that needs some type of acceleration (such as NVRAM) in order to deliver a user experience close to what a direct-attached filesystem provides.

ZIL is a problem (Not True)

There is a wide perception that the ZIL is the source of performance problems. This is just a naive interpretation of the facts. The ZIL is a very fundamental component of the filesystem and does its job admirably well. Disabling the synchronous semantics of a filesystem will necessarily lead to higher performance, in a way that is totally misleading to the outside observer. So while we are looking at further ZIL improvements for large-scale problems, the ZIL is just not, today, the source of common problems. So please don't disable it unless you know what you're getting into.

Random read from Raid-Z

Raid-Z is a great technology that allows blocks to be stored on top of common JBOD storage without being subject to raid-5 write hole corruption (see: http://blogs.sun.com/bonwick/entry/raid_z). However, the performance characteristics of raid-z depart significantly enough from those of raid-5 to surprise first-time users. Raid-Z as currently implemented spreads each block across the full width of the raid group and so creates extra IOPS during random reading. At lower loads, the latency of operations is not impacted, but sustained random read loads can suffer. However, workloads that end up with frequent cache hits will not be subject to the same penalty as workloads that access vast amounts of data more uniformly. This is where one truly needs to say, "it depends".

Interestingly, the same problem does not affect Raid-Z streaming performance and won't affect workloads that commonly benefit from caching. That said, both random and streaming performance can be improved, and we are looking at a number of different ways to improve on this situation. To better understand Raid-Z, see one of my very first ZFS entries on this topic: Raid-Z
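
To make the "it depends" concrete, here is a rough back-of-the-envelope model of the random read behavior described above. It assumes, as a simplification, that every uncached small read in a raid-z group touches the full width of the group (so the group delivers about one disk's worth of random read IOPS), whereas in a raid-5-like layout each disk can serve an independent read. The function names and the 100 IOPS per disk figure are hypothetical and only there for illustration.

```python
def raidz_random_read_iops(groups: int, disk_iops: float) -> float:
    """Uncached random read IOPS when each read spans a whole raid-z group.

    Simplifying assumption: one logical read keeps every disk of its
    group busy, so each group contributes roughly one disk's IOPS.
    """
    return groups * disk_iops


def independent_random_read_iops(disks: int, disk_iops: float) -> float:
    """Uncached random read IOPS when each disk serves reads independently."""
    return disks * disk_iops


def logical_iops_with_cache(backend_iops: float, hit_ratio: float) -> float:
    """Logical read rate sustainable when a fraction of reads hit cache."""
    if not 0.0 <= hit_ratio < 1.0:
        raise ValueError("hit_ratio must be in [0, 1)")
    return backend_iops / (1.0 - hit_ratio)


# Example: 10 disks at an assumed 100 random IOPS each.
DISK_IOPS = 100.0
raidz = raidz_random_read_iops(groups=2, disk_iops=DISK_IOPS)   # 2 groups of 5
flat = independent_random_read_iops(disks=10, disk_iops=DISK_IOPS)
print(raidz, flat, logical_iops_with_cache(raidz, hit_ratio=0.5))
```

Under these assumptions the gap narrows quickly as the cache hit ratio rises, which is why cache-friendly workloads do not see the same penalty.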

CPU consumption, scalability and benchmarking

This is an area where we will need to do more studies. With today's very capable multicore systems, there are many workloads that won't suffer from the CPU consumption of ZFS. Most systems do not run 100% CPU bound (being more generally constrained by disk, network or application scalability), and the user-visible latency of operations is not strongly impacted by extra cycles spent in, say, ZFS checksumming.

However, this view breaks down when it comes to system benchmarking. Many benchmarks I encounter (the most crafted ones to boot) end up as host CPU efficiency benchmarks: how many operations can I do on this system, given a large amount of disk and network resources, while preserving some level X of response time? The answer to this question is simply the inverse of the cycles spent per operation.
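
The inverse relationship can be sketched with simple arithmetic. The model below assumes the benchmark is purely CPU bound and scales perfectly across cores, which real systems rarely do; the core count, clock rate and cycle counts are made-up figures for illustration only.

```python
def max_ops_per_second(cores: int, clock_hz: int, cycles_per_op: int) -> float:
    """Upper bound on throughput when CPU is the only bottleneck.

    Assumes perfect scaling across cores and a fixed cycle cost per
    operation; both are simplifications.
    """
    if cycles_per_op <= 0:
        raise ValueError("cycles_per_op must be positive")
    return cores * clock_hz / cycles_per_op


# Hypothetical host: 8 cores at 3 GHz.
baseline = max_ops_per_second(8, 3_000_000_000, 100_000)
# Doubling the cycles spent per operation halves the achievable rate.
heavier = max_ops_per_second(8, 3_000_000_000, 200_000)
print(baseline, heavier)
```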

This concern is more relevant when the CPU cycles spent managing direct-attached storage and the filesystem are in direct competition with cycles spent in the application. This is also why database benchmarking is often associated with using raw devices, a practice much less often encountered in common deployments.

Root-causing scalability limits and efficiency problems is just part of the never-ending performance optimisation of filesystems.

Direct I/O

Direct I/O has been a great enabler of database performance in other filesystems. The problem for me is that Direct I/O is a group of improvements, each with its own contribution to the end result. Some want the concurrent writes, some want to avoid a copy, some want to avoid double caching, and some don't know but see performance gains when it is turned on (some also see a degradation). I note that concurrent writes have never been a problem in ZFS and that the extra copy used when managing a cache is generally cheap considering common DB rates of access. Achieving greater CPU efficiency is certainly a valid goal, and we need to look into what is impacting this in common DB workloads. In the meantime, ZFS in OpenSolaris got a new feature to manage the cacheability of data in the ZFS ARC. The per-filesystem "primarycache" property will allow users to decide whether blocks should actually linger in the ARC or just be transient. This will allow DBs deployed on ZFS to avoid any form of double caching that might have occurred in the past.
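
As a small sketch of how the "primarycache" property might be applied, the helper below builds (but does not run) a `zfs set` command line. The property accepts the values all, none and metadata; the helper name and the dataset name `tank/db` are hypothetical, and which value suits a given database is workload dependent.

```python
VALID_PRIMARYCACHE = ("all", "none", "metadata")


def primarycache_command(dataset: str, mode: str) -> list[str]:
    """Build the argv for setting primarycache on a ZFS dataset.

    Hypothetical helper: it returns the command as a list so the caller
    can review it (or pass it to subprocess.run) rather than running it.
    """
    if mode not in VALID_PRIMARYCACHE:
        raise ValueError(f"mode must be one of {VALID_PRIMARYCACHE}")
    if not dataset:
        raise ValueError("dataset name must not be empty")
    return ["zfs", "set", f"primarycache={mode}", dataset]


# Example: keep only metadata in the ARC for a dataset named tank/db
# (an assumed name), leaving data caching to the database itself.
print(" ".join(primarycache_command("tank/db", "metadata")))
```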

ZFS performance is and will be a moving target for some time to come. Solaris 10 Update 6, with a new write throttle, will be a significant change, and OpenSolaris offers additional advantages. But generally, just be skeptical of any performance issue that is not root caused: the problem might not be where you expect it.


It'd be interesting to see how PostgreSQL fares in tandem with ZFS.

As for Oracle, there need not be any discussion.
If one wants maximum I/O performance, the only logical choice is Oracle ASM, which is interestingly enough similar to ZFS in some respects, like disk pools.

It'll be tough, if not next to impossible, to beat Oracle at their own game as far as DB I/O workloads are concerned. People are afraid of ASM, but the truth is that it's a wonderful system to use for Oracle databases, and quite similar to ZFS.

Posted by UX-admin on November 04, 2008 at 10:47 AM MET #

Yes this probably makes good sense. I think people drawn to ZFS for DB are probably on a different agenda than pure performance. Price / performance and consolidation might be on their mind also.

Posted by Roch on November 05, 2008 at 01:54 PM MET #

Please, if you can scrape together some time, write an article about PostgreSQL I/O performance in tandem with ZFS; also, an article on how to get the most I/O performance from PostgreSQL on ZFS would be a killer and a thriller!

You might need to team up with your colleagues from the PostgreSQL dept., but such an article would be really useful for those of us who also plan to run PostgreSQL side by side with Oracle... who better to write such an article, than the experts on the subject?

Posted by UX-admin on November 07, 2008 at 12:44 AM MET #

PostgreSQL can't even get close to using full disk I/O on moderately speedy arrays on Linux -- it is too CPU inefficient. A single 'select count(1) from table' query on a very large table will be pinned by CPU at about 200 to 350 MB/sec off disk with a 3 GHz CPU on Linux -- and a decent direct-attached storage array can easily do 3x to 4x that if tuned well. ZFS + OpenSolaris may do better for sequential reads or it may not. For random I/O loads with a flash L2ARC I'm sure ZFS will do much better.
Because PostgreSQL has such a CPU limit per (single threaded) query, the real world question on large systems is about I/O behavior with many concurrent queries. How the file system and OS handle concurrent I/O and the memory pressure that comes with it from both the I/O side (buffers) and the DB side (sort and aggregate space) is the most important here.
The I/O scheduler and read-ahead algorithms will have the largest effect here -- and have to be optimized for concurrent throttled access not single threaded loads. Any benchmark that doesn't have enough I/O concurrency won't get to the bottom of the question for real world use on fast I/O subsystems for databases -- and especially Postgres. ZFS could have higher CPU overhead and show slower single threaded results, but handle concurrency much better than the competition, or it could be worse all around.

Since every read is the same size in Postgres, and small (8k), a lot of the CPU overhead associated with high I/O load on PostgreSQL is due to large numbers of small read() calls -- and much more of it is internal to Postgres. If it has to scan 8GB of sequential data, it will call read() about 1 million times -- not the most efficient for sure, so file systems and OSes that have the most optimized small-read paths will probably shine.

Posted by Scott on November 11, 2008 at 10:59 PM MET #
