
ZFS Evil Tuning Guide


[edit] Overview

[edit] Tuning is Evil

Tuning is often evil and should rarely be done.

First, consider that the default values are set by the people who know the most about the effects of the tuning on the software that they supply. If a better value exists, it should be the default. While an alternative value might help a given workload, it could quite possibly degrade some other aspects of performance, occasionally catastrophically so.

Over time, tuning recommendations might become stale at best or might lead to performance degradations. Customers are leery of removing a tuning that is already in place, and the net effect is a worse product than it could be. Moreover, a tuning enabled on a given system might spread to other systems, where it is not warranted at all.

Nevertheless, it is understood that customers who carefully observe their own system may understand aspects of their workloads that cannot be anticipated by the defaults. In such cases, the tuning information below may be applied, provided that one works to carefully understand its effects.

If you must implement a ZFS tuning parameter, please reference the URL of this document:

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide

[edit] Review ZFS Best Practices Guide

On the other hand, ZFS best practices are things we encourage people to use. They are a set of recommendations that have been shown to work in different environments and are expected to keep working in the foreseeable future. So, before turning to tuning, make sure you've read and understood the best practices around deploying a ZFS environment that are described here:

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

[edit] Identify ZFS Tuning Changes

The syntax for enabling a given tuning recommendation has changed over the life of ZFS releases. So, when upgrading to newer releases, make sure that the tuning recommendations are still effective. If you decide to use a tuning recommendation, reference this page in the /etc/system file or in the associated script.

[edit] The Tunables

In no particular order:

[edit] Tuning ZFS Checksums

End-to-end checksumming is one of the great features of ZFS. It allows ZFS to detect and correct many kinds of errors that other products can't detect or correct. Disabling checksums is, of course, a very bad idea. Having file-system-level checksums enabled can alleviate the need for application-level checksums. In this case, using the ZFS checksum becomes a performance enabler.

The checksums are computed asynchronously to most application processing and should normally not be an issue. However, each pool currently has a single thread computing the checksums (RFE below) and it is possible for that computation to limit pool throughput. So, if the disk count is very large (>> 10) or a single CPU is weak (< 1 GHz), then this tuning might help. If a system is close to CPU saturation, the checksum computations might become noticeable. In those cases, do a run with checksums off to verify whether checksum calculation is the problem.

If you tune this parameter, please reference this URL in shell script or in an /etc/system comment.

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Tuning_ZFS_Checksums

Verify the type of checksum used:

zfs get checksum <filesystem>

Tuning is achieved dynamically by using:

zfs set checksum=off <filesystem>

And reverted:

zfs set checksum='on | fletcher2 | fletcher4 | sha256' <filesystem>

The Fletcher2 checksum has been observed to consume roughly 1 GHz worth of CPU when checksumming 500 MB per second.
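
If you suspect checksum computation is hurting a CPU-bound system, a quick check before disabling anything is to sample kernel profiling with lockstat(1M) and look for the checksum routines. This is only a sketch; the exact kernel function names vary by release:

# lockstat -I sleep 30 | grep -i fletcher
# lockstat -I sleep 30 | grep -i sha256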

[edit] RFEs
  • 6533726 single-threaded checksum & raidz2 parity calculations limit write bandwidth on thumper (Fixed in Nevada, build 79 and Solaris 10 10/08)

[edit] Limiting the ARC Cache

The ARC is where ZFS caches data from all active storage pools. The ARC grows and consumes memory on the principle that no need exists to return data to the system while there is still plenty of free memory. When the ARC has grown and outside memory pressure exists, for example, when a new application starts up, then the ARC releases its hold on memory. ZFS is not designed to steal memory from applications. A few bumps appeared along the way, but the established mechanism works reasonably well for many situations and does not commonly warrant tuning.

However, review the following situations:

  • If a future memory requirement is significantly large and well defined, then it can be advantageous to prevent ZFS from growing the ARC into it. For example, if we know that a future application requires 20% of memory, it makes sense to cap the ARC such that it does not consume more than the remaining 80% of memory.
  • Some applications include free-memory checks and refuse to start if not enough RAM appears to be available, even though the ARC would release its memory in response to applications' requests to the OS kernel for memory. Sometimes the ARC is too slow to release memory, and even better-behaved applications (those without such preliminary checks) can experience longer delays when requesting memory.
  • If the application is a known consumer of large memory pages, then again limiting the ARC prevents ZFS from breaking up the pages and fragmenting the memory. Limiting the ARC preserves the availability of large pages.
  • If dynamic reconfiguration of a memory board is needed (supported on certain platforms), then it is a requirement to prevent the ARC (and thus the kernel cage) from growing onto all boards.
  • If an application's demand for memory fluctuates, the ZFS ARC caches data during periods of weak demand and then shrinks during periods of strong demand. However, on large-memory systems, ZFS does not shrink the ARC below the value of arc_c_min, currently approximately 12% of memory. If an application's peak memory usage requires more than 88% of system memory, tuning arc_c_min is currently required until a better default is selected as part of 6855793.

For these cases, you might consider limiting the ARC. Limiting the ARC will, of course, also limit the amount of cached data, and this can have adverse effects on performance. No easy way exists to foretell whether limiting the ARC degrades performance.

If you tune this parameter, please reference this URL in shell script or in an /etc/system comment.

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Limiting_the_ARC_Cache

As with many other Solaris tunables, ARC size limits can be configured via /etc/system to be applied at every boot (on newer Solaris and OpenSolaris releases), or reconfigured dynamically on a live system with the mdb debugger. The methods for doing so have changed during the development of OpenSolaris and later Solaris 10 releases, with specifics provided below. ARC size configuration via mdb was the only option for initial OS releases and was wrapped in scripts like those provided below.

Several parameters actually control the ARC size, all based on a single zfs_arc_max limit value set by the system administrator (or, by default, derived by ZFS from the system RAM size). When Solaris boots, ARC parameters such as p, c, c_min, and c_max are initialized from it; subsequent changes to zfs_arc_max have no direct effect.

On a running system you can only change the ARC maximum size by using the mdb command. Because the system is already booted, the ARC init routine has already executed and other ARC size parameters have already been set based on the default c_max size. Therefore, you should tune the arc.c and arc.p values, along with arc.c_max, using the formula:

arc.c = arc.c_max
arc.p = arc.c / 2
[edit] Current Solaris 10 Releases, Solaris Nevada Releases and newer OpenSolaris (OpenIndiana, Illumos) Releases
[edit] Static change via /etc/system

This syntax is provided starting in the Solaris 10 8/07 release and Nevada (OpenSolaris build 51) release.

For example, if an application needs 5 GB of available memory on a system with 36 GB of memory, you could set the ARC maximum to 30 GB (0x780000000 or 32212254720 bytes). Set the zfs:zfs_arc_max parameter in the /etc/system file:

set zfs:zfs_arc_max = 0x780000000

or

set zfs:zfs_arc_max = 32212254720

You have to reboot the system for this option to take effect.
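
After the reboot, you can confirm that the new limit took effect by checking the ARC kstats; this assumes the usual zfs:0:arcstats kstat module, and c_max should report the value you configured:

# kstat -p zfs:0:arcstats:c_max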

[edit] Dynamic change via mdb

You can only change the ARC maximum size dynamically by using the mdb command.

For example, to set the ARC parameters to small values complying with the formula above (arc.c and arc.c_max to 512 MB, and arc.p to 256 MB) on Current Solaris Releases, use the following syntax:

# mdb -kw
 > arc_stats::print -a arcstat_p.value.ui64 arcstat_c.value.ui64 arcstat_c_max.value.ui64
ffffffffc00df578 arcstat_p.value.ui64 = 0xb75e46ff
ffffffffc00df5a8 arcstat_c.value.ui64 = 0x11f51f570
ffffffffc00df608 arcstat_c_max.value.ui64 = 0x3bb708000

 > ffffffffc00df578/Z 0x10000000
arc_stats+0x500:0xb75e46ff        = 0x10000000
 > ffffffffc00df5a8/Z 0x20000000
arc_stats+0x530:0x11f51f570        = 0x20000000
 > ffffffffc00df608/Z 0x20000000
arc_stats+0x590:  0x11f51f570        = 0x20000000 

You should verify the values have been set correctly by examining them again in mdb (using the same print command in the example). You can also monitor the actual size of the ARC to ensure it has not exceeded the configured limit. For example, to display the current ARC size in decimal:

# echo "arc_stats::print -d arcstat_size.value.ui64" | mdb -k
arcstat_size.value.ui64 = 0t239910912
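
If you prefer not to attach mdb to a production kernel just to watch the ARC, the same counters are also exported as kstats (again assuming the zfs:0:arcstats module):

# kstat -p zfs:0:arcstats:size
# kstat -p zfs:0:arcstats:c zfs:0:arcstats:c_max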

Here is a perl script that you can call from an init script to configure your ARC on boot or otherwise on-demand with the above guidelines on Current Solaris Releases:

#!/bin/perl

### http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Limiting_the_ARC_Cache
### arc_tune.pl updated for OpenSolaris post-b51 and Solaris 10U4+, renamed arc_tune_new.pl
### updated by Jim Klimov; tested in OpenIndiana oi_148a, Solaris 10 10/09 SPARC, Solaris 10 10/08 x86_64

use strict;
my $arc_max = shift @ARGV;
my $testmode = shift @ARGV;
if ( !defined($arc_max) ) {
        print STDERR "usage: arc_tune_new.pl <arc max> [-n]\n";
        print STDERR "  arc_max ZFS ARC c_max (in bytes)\n";
        print STDERR "  -n      Don't change kernel params, test only\n";
        exit -1;
}
if ( !defined($testmode) ) {
        $testmode = 0;
} else { $testmode = 1; }

$| = 1;
use IPC::Open2;
my %syms;
my $mdb = "/usr/bin/mdb";
open2(*READ, *WRITE,  "$mdb -kw") || die "cannot execute mdb";
printf STDOUT "Requested arc_max: %s bytes = 0x%x\n", $arc_max, $arc_max;
printf STDOUT "Test mode: %d\n", $testmode;

print WRITE "arc_stats::print -a arcstat_p.value.ui64 arcstat_c.value.ui64 arcstat_c_max.value.ui64\n";
print WRITE "arc_stats/P\n";    ### Have MDB output paddinf - a line different
                                ### from the expected ADDR NAME = VAL pattern
while(<READ>) {
        my $line = $_;

        if ( $line =~ /^ *([a-f0-9]+) (.*\.?.*) =/ ) {
                print STDERR "=== FOUND:  @ $1\t= $2\n";
                $syms{"$2"} = $1;
        } else { last; }
}
<READ>; ### Buffer the second line of padding output
print STDERR "=== Done listing vars\n";

printf STDOUT "Checking ".($testmode?"":"and replacing ")."kernel variables:\n";
# set c & c_max to our max; set p to max/2
if ( $syms{"arcstat_p.value.ui64"} ne "" ) {
        printf STDOUT "p\t @ %s\t= ", $syms{"arcstat_p.value.ui64"};
        printf WRITE "%s/P\n", $syms{"arcstat_p.value.ui64"};
        print scalar <READ>;
        if (!$testmode) {
                printf WRITE "%s/Z 0x%x\n", $syms{"arcstat_p.value.ui64"}, ( $arc_max / 2 );
                print scalar <READ>;
        }
}

if ( $syms{"arcstat_c.value.ui64"} ne "" ) {
        printf STDOUT "c\t @ %s\t= ", $syms{"arcstat_c.value.ui64"};
        printf WRITE "%s/P\n", $syms{"arcstat_c.value.ui64"};
        print scalar <READ>;
        if (!$testmode) {
                printf WRITE "%s/Z 0x%x\n", $syms{"arcstat_c.value.ui64"}, $arc_max;
                print scalar <READ>;
        }
}

if ( $syms{"arcstat_c_max.value.ui64"} ne "" ) {
        printf STDOUT "c_max\t @ %s\t= ", $syms{"arcstat_c_max.value.ui64"};
        printf WRITE "%s/P\n", $syms{"arcstat_c_max.value.ui64"};
        print scalar <READ>;
        if (!$testmode) {
                printf WRITE "%s/Z 0x%x\n", $syms{"arcstat_c_max.value.ui64"}, $arc_max;
                print scalar <READ>;
        }
}
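
Based on the usage message in the script, a typical invocation passes the desired ARC maximum in bytes and can first be run with -n to test without changing kernel parameters; the 512 MB value below is only an illustration:

# ./arc_tune_new.pl 536870912 -n
# ./arc_tune_new.pl 536870912
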
[edit] Earlier Solaris Releases

You can only change the ARC maximum size by using the mdb command.

For example, to set the ARC parameters to small values complying with the formula above (arc.c and arc.c_max to 512 MB, and arc.p to 256 MB) on Earlier Solaris Releases, use the following syntax:

# mdb -kw
 > arc::print -a p c c_max
ffffffffc00b3260 p = 0xb75e46ff
ffffffffc00b3268 c = 0x11f51f570
ffffffffc00b3278 c_max = 0x3bb708000

 > ffffffffc00b3260/Z 0x10000000
ffffffffc00b3260:  0xb75e46ff        = 0x10000000
 > ffffffffc00b3268/Z 0x20000000
ffffffffc00b3268:  0x11f51f570        = 0x20000000
 > ffffffffc00b3278/Z 0x20000000
ffffffffc00b3278:  0x11f51f570        = 0x20000000 

You should verify the values have been set correctly by examining them again in mdb (using the same print command in the example). You can also monitor the actual size of the ARC to ensure it has not exceeded the configured limit:

# echo "arc::print -d size" | mdb -k

The above command displays the current ARC size in decimal.

You can also use the arcstat script available at http://blogs.sun.com/realneel/entry/zfs_arc_statistics to check the arc size as well as other arc statistics.

Here is an older version of the perl script above that you can call from an init script to configure your ARC on Earlier Solaris Releases (on boot or on-demand) with the above guidelines:

#!/bin/perl

use strict;
my $arc_max = shift @ARGV;
if ( !defined($arc_max) ) {
        print STDERR "usage: arc_tune <arc max>\n";
        exit -1;
}
$| = 1;
use IPC::Open2;
my %syms;
my $mdb = "/usr/bin/mdb";
open2(*READ, *WRITE,  "$mdb -kw") || die "cannot execute mdb";
print WRITE "arc::print -a\n";
while(<READ>) {
        my $line = $_;

        if ( $line =~ /^ +([a-f0-9]+) (.*) =/ ) {
                $syms{$2} = $1;
        } elsif ( $line =~ /^\}/ ) {
                last;
        }
}
# set c & c_max to our max; set p to max/2
printf WRITE "%s/Z 0x%x\n", $syms{p}, ( $arc_max / 2 );
print scalar <READ>;
printf WRITE "%s/Z 0x%x\n", $syms{c}, $arc_max;
print scalar <READ>;
printf WRITE "%s/Z 0x%x\n", $syms{c_max}, $arc_max;
print scalar <READ>;
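
As with the newer script, the ARC maximum is passed in bytes; for example, to cap the ARC at 512 MB:

# ./arc_tune 536870912
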
[edit] RFEs
  • 6488341 ZFS should avoiding growing the ARC into trouble (Fixed in Nevada, build 107)
  • 6522017 The ARC allocates memory inside the kernel cage, preventing DR
  • 6424665 ZFS/ARC should cleanup more after itself
  • 6429205 Each zpool needs to monitor its throughput and throttle heavy writers (Fixed in Nevada, build 87 and Solaris 10 10/08). For more information, see this link: New ZFS write throttle
  • 6855793 ZFS minimum ARC size might be too large
[edit] Further Reading

http://blogs.sun.com/roch/entry/does_zfs_really_use_more

http://blogs.sun.com/realneel/entry/zfs_arc_statistics

[edit] Determining ARC memory consumption and other related stats
  • Observe with MDB and KStats (checked as of Current Solaris Releases defined above):
    • Tunable ZFS parameters, most of these can be set in /etc/system:
      # echo "::zfs_params" | mdb -k
    • Some settings and mostly statistics on ARC usage:
      # echo "::arc" | mdb -k
    • Solaris memory allocation; "Kernel" memory includes ARC:
      # echo "::memstat" | mdb -k
    • Stats of VDEV prefetch - how many (metadata) sectors were used from low-level prefetch caches:
      # kstat -p zfs:0:vdev_cache_stats

[edit] File-Level Prefetching

ZFS implements a file-level prefetching mechanism called zfetch. This mechanism looks at the patterns of reads to files and anticipates some reads, issuing them ahead of time to reduce application wait times. The current code needs attention (RFE below) and suffers from two drawbacks:

  • Sequential read patterns made of small reads very often hit in the cache. In this case, the current code consumes a significant amount of CPU time trying to find the next I/O to issue, even though performance is governed more by CPU availability than by I/O.
  • The zfetch code has been observed to limit scalability of some loads.

So, if CPU profiling, using lockstat(1M) with the -I argument or er_kernel as described here:

http://developers.sun.com/prodtech/cc/articles/perftools.html

shows significant time in zfetch_* functions, or if lock profiling (lockstat(1M)) shows contention around zfetch locks, then disabling file-level prefetching should be considered.
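
For example, a rough check on a live system (the zfetch function names vary by release) is to sample kernel profiling and lock statistics and grep for zfetch:

# lockstat -I sleep 60 | grep zfetch
# lockstat sleep 60 | grep zfetch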

Disabling prefetching can be achieved dynamically or through a setting in the /etc/system file.

If you tune this parameter, please reference this URL in shell script or in an /etc/system comment.

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#File-Level_Prefetching

[edit] Current Solaris 10 Releases and Solaris Nevada Releases

This syntax is provided starting in the Solaris 10 8/07 release and Solaris Nevada build 51 release.

Set dynamically:

echo zfs_prefetch_disable/W0t1 | mdb -kw

Revert to default:

echo zfs_prefetch_disable/W0t0 | mdb -kw

Set the following parameter in the /etc/system file:

set zfs:zfs_prefetch_disable = 1
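
You can confirm the current setting at any time (0 means prefetching is enabled, 1 means it is disabled):

echo zfs_prefetch_disable/D | mdb -k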

[edit] Earlier Solaris Releases

Set dynamically:

echo zfetch_array_rd_sz/Z0x0 | mdb -kw

Revert to default:

echo zfetch_array_rd_sz/Z0x100000 | mdb -kw

Set the following parameter in the /etc/system file:

set zfs:zfetch_array_rd_sz = 0

[edit] RFEs
  • 6412053 zfetch needs some love
  • 6579975 dnode_new_blkid should first check as RW_READER (Fixed in Nevada, build 97)

[edit] Device-Level Prefetching

ZFS does device-level read-ahead in addition to file-level prefetching. When ZFS reads a block from a disk, it inflates the I/O size, hoping to pull interesting data or metadata from the disk. This data is stored in a 10MB LRU per-vdev cache, which can short-cut the ZIO pipeline if present in cache.

Prior to Solaris Nevada build snv_70, the code caused problems for systems with lots of disks because the extra prefetched data could cause congestion on the channel between the storage and the host. Tuning down the size by which I/O is inflated (zfs_vdev_cache_bshift, described below) had been effective for OLTP-type loads in the past. The code now prefetches only metadata (fixed by bug 6437054) and thus is not expected to require any tuning.

This parameter can be important for workloads when ZFS is instructed to cache only metadata by setting the primarycache property per file system.

For workloads that have an extremely wide random reach into 100s of TB with little locality, even metadata is not expected to be cached efficiently. Setting primarycache to metadata or even none should be investigated. In conjunction, device-level prefetch tuning can help reduce the number of 64K IOPS done on behalf of the vdev cache for metadata.

If you tune this parameter, please reference this URL in shell script or in an /etc/system comment.

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device-Level_Prefetching

No tuning is required for Solaris Nevada releases, build 70 and after.

[edit] Previous Solaris 10 and Solaris Nevada Releases

Setting this tunable might only be appropriate in the Solaris 10 8/07 and Solaris 10 5/08 releases and Nevada releases from build 53 to build 69.

Set the following parameter in the /etc/system file:

set zfs:zfs_vdev_cache_bshift = 13

* Comments:
* Setting zfs_vdev_cache_bshift with mdb crashes a system.
* zfs_vdev_cache_bshift is the base 2 logarithm of the size used to read disks.
* The default value of 16 means reads are issued in sizes of 1 << 16 = 64K.
* A value of 13 means disk reads are padded to 8K.
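
To judge whether the vdev cache is doing useful work before and after tuning, you can watch its hit and miss counters with the kstat referenced later in this guide:

# kstat -p zfs:0:vdev_cache_stats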

For earlier releases, see: http://blogs.sun.com/roch/entry/tuning_the_knobs

[edit] RFEs
  • 6437054 vdev_cache wises up: increase DB performance by 16% (Fixed in Nevada, build 70 and Solaris 10 10/08)
[edit] Further Reading

http://blogs.sun.com/erickustarz/entry/vdev_cache_improvements_to_help

[edit] Device I/O Queue Size (I/O Concurrency)

ZFS controls the I/O queue depth for a given LUN with the zfs_vdev_max_pending parameter.

In Solaris Nevada, build 127, the zfs_vdev_max_pending default value has been changed to 10, which is good for write workloads and good for reads from disk. This default value might not be good for read queries from an array LUN that might be comprised of 10-15 disks or more.

In previous Solaris releases, the default is 35, which allows common SCSI and SATA disks to reach their maximum throughput under ZFS. However, having 35 concurrent I/Os means that the service times can be inflated for read workloads.

For NVRAM-based storage, it is not expected that a 35-deep queue is reached, nor does it play a significant role for write workloads, since writes interact with the array caches and not with disk spindles. In a storage array where LUNs are made of a large number of disk drives, the ZFS queue can become a limiting factor on read IOPS. This behavior is one of the underlying reasons for the best practice of presenting as many LUNs as there are backing spindles to the ZFS storage pool. That is, if you work with LUNs from a 10-disk-wide array-level RAID group, then using 5 to 10 LUNs to build a storage pool allows ZFS to manage enough of an I/O queue without the need to set this specific tunable.

However, when no separate intent log is in use and the pool is made of JBOD disks, using a small zfs_vdev_max_pending value, such as 10, can improve the synchronous write latency, since synchronous writes compete with the rest of the queue for the disk resource.

The Solaris release now has the option of storing the ZIL on separate devices from the main pool. Using separate intent log devices can alleviate the need to tune this parameter for loads that are synchronously write intensive since those synchronous writes are not competing with a deep queue of non-synchronous writes.

If you tune this parameter, please reference this URL in shell script or in an /etc/system comment.

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device_I.2FO_Queue_Size_.28I.2FO_Concurrency.29

Tuning is not expected to be effective for NVRAM-based storage arrays in the case where volumes are made of a small number of spindles. However, when ZFS is presented with a volume made of a large (> 10) number of spindles, this parameter can limit the read throughput obtained on the volume. The reason is that with a maximum of 10 or 35 queued I/Os per LUN, this can translate into less than 1 I/O per storage spindle, which is not enough for the individual disks to deliver their IOPS. This issue would show up in iostat as the actv queue approaching the value of zfs_vdev_max_pending.

[edit] Previous Solaris 10 and Solaris Nevada Releases

Set dynamically:

echo zfs_vdev_max_pending/W0t10 | mdb -kw

Revert to default:

echo zfs_vdev_max_pending/W0t35 | mdb -kw

Set the following parameter in the /etc/system file:

set zfs:zfs_vdev_max_pending = 10

For earlier Solaris releases, see:

http://blogs.sun.com/roch/entry/tuning_the_knobs
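
To see whether the queue limit is actually being reached, watch the actv column reported by iostat for the pool's LUNs; values hovering near zfs_vdev_max_pending suggest the queue depth, not the devices, is the limiting factor:

# iostat -xnz 10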

[edit] Device Driver Considerations

Device drivers may also limit the number of outstanding I/Os per LUN. If you are using LUNs on storage arrays that can handle large numbers of concurrent IOPS, then the device driver constraints can limit concurrency. Consult the configuration for the drivers your system uses. For example, the limit for the QLogic ISP2200, ISP2300, and SP212 family FC HBA (qlc) driver is described as the execution-throttle parameter in /kernel/drv/qlc.conf.
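
For example, an entry in /kernel/drv/qlc.conf looks roughly like the following; this is only a sketch, the value is purely illustrative, and the exact form may differ by driver version, so check your HBA and array documentation before raising it:

execution-throttle=256;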

[edit] RFEs
  • 6471212 need reserved I/O scheduler slots to improve I/O latency of critical ops
[edit] Further Reading

http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on

[edit] Cache Flushes

If you've noticed terrible NFS or database performance on a SAN storage array, the problem is not with ZFS, but with the way the disk drivers interact with the storage devices.

ZFS is designed to work with storage devices that manage a disk-level cache. ZFS commonly asks the storage device to ensure that data is safely placed on stable storage by requesting a cache flush. For JBOD storage, this works as designed and without problems. For many NVRAM-based storage arrays, a performance problem might occur if the array takes the cache flush request and actually does something with it, rather than ignoring it. Some storage arrays flush their large caches despite the fact that the NVRAM protection makes those caches as good as stable storage.

ZFS issues infrequent flushes (every 5 seconds or so) after the uberblock updates. The problem here is fairly inconsequential. No tuning is warranted here.

ZFS also issues a flush every time an application requests a synchronous write (O_DSYNC, fsync, NFS commit, and so on). The completion of this type of flush is waited upon by the application and impacts performance. Greatly so, in fact. From a performance standpoint, this neutralizes the benefits of having an NVRAM-based storage.

[edit] Tuning the Write Cache

Starting in the Solaris 10 5/09 release, ZFS is better able to adapt to the characteristics of SAN storage devices, and some configurations will not experience any problem with ZFS. For example:

  • Sun StorEdge 9990 array (based on HDS storage devices) cache is mirrored for writes and backed up by batteries if site power fails. The 9990 requires no special settings to support ZFS besides the normal ones required for Solaris. No prerequisite microcode version is required, although it is always a good idea to be on the latest RGA code (currently 50-09-88 for 9990).
  • Hitachi Data System (HDS) midrange storage devices might need to be configured to ignore cache flushes.

If you experimentally observe that setting zfs_nocacheflush with mdb has a dramatic effect on performance, such as a 5 times or more difference when extracting small tar files over NFS or dd'ing 8 KB to a raw zvol, then this indicates your storage does not handle cache flushes in a ZFS-friendly way.

Contact your storage vendor for instructions on how to tell the storage devices to ignore the cache flushes sent by ZFS. For Santricity-based storage devices, instructions are documented in CR 6578220.

If you are not able to configure the storage device in an appropriate way, the preferred mechanism is to tune sd.conf specifically for your storage. See the instructions below.

As a last resort, when all LUNs exposed to ZFS come from an NVRAM-protected storage array and procedures ensure that no unprotected LUNs will be added in the future, ZFS can be tuned not to issue the flush requests by setting zfs_nocacheflush. If some LUNs exposed to ZFS are not protected by NVRAM, then this tuning can lead to data loss, application-level corruption, or even pool corruption. In some NVRAM-protected storage arrays, the cache flush command is a no-op, so tuning in this situation makes no performance difference.

NOTE: Cache flushing is commonly done as part of the ZIL operations. While disabling cache flushing can, at times, make sense, disabling the ZIL does not.

If you tune this parameter, please reference this URL in shell script or in an /etc/system comment.

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Cache_Flushes

NOTE: If you are carrying forward an /etc/system file, please verify that any changes made still apply to your current release. Help us rid the world of /etc/system viruses.

[edit] How to Tune Cache Sync Handling Per Storage Device

Since ZFS is not aware of the nature of the storage or whether NVRAM is present, the best way to fix this issue is to tell the storage to ignore the requests.


A recent fix qualified the flush request semantics so that storage devices are instructed to ignore the requests if they have the proper protection. This change required a fix to the Solaris disk drivers and support for the updated semantics in the storage.

If the storage device does not recognize this improvement, here are instructions to tell the Solaris OS not to send any synchronize cache commands to the array. If you use these instructions, make sure all targeted LUNS are indeed protected by NVRAM.

Caution: All cache sync commands are ignored by the device. Use at your own risk.

  1. Use the format utility to run the inquiry subcommand on a LUN from the storage array. For example:
    # format
    .
    .
    .
    Specify disk (enter its number): x
    format> inquiry
    Vendor:   ATA     
    Product:  Super Duper      
    Revision: XXXX
    format>
  2. Select one of the following based on your architecture:
    • ssd driver (many SPARC FC drivers): Add similar lines to the /kernel/drv/ssd.conf file
    ssd-config-list = "ATA     Super Duper     ", "nvcache1";
    nvcache1=1, 0x40000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1;
    
    • sd driver (X64 and a few SPARC FC drivers): Add similar lines to the /kernel/drv/sd.conf file
    sd-config-list = "ATA     Super Duper     ", "nvcache1";
    nvcache1=1, 0x40000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1;
    

    Note: In the above examples, nvcache1 is just a token in sd.conf. You could use any similar token.

  3. Add whitespace to make the vendor ID (VID) 8 characters long (here "ATA ") and Product ID (PID) 16 characters long (here "Super Duper ") in the sd-config-list entry as illustrated above.
  4. After the sd.conf or ssd.conf modifications and a reboot, you can tune zfs_nocacheflush back to its default value of 0 with no adverse effect on performance (see the check below).
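
To confirm the current value of zfs_nocacheflush (0 is the default, meaning cache flush requests are still sent):

echo zfs_nocacheflush/D | mdb -k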


For more cache tuning resource information, see:

http://blogs.digitar.com/jjww/?itemid=44.

http://forums.hds.com/index.php?showtopic=497.

[edit] Current Solaris 10 Releases and Solaris Nevada Releases

Starting in the Solaris 10 5/08 release and Solaris Nevada build 72 release, the sd and ssd drivers should properly handle the SYNC_NV bit, so no changes should be needed.

[edit] Previous Solaris 10 and Solaris Nevada Releases

Set dynamically:

 echo zfs_nocacheflush/W0t1 | mdb -kw

Revert to default:

 echo zfs_nocacheflush/W0t0 | mdb -kw

Set the following parameter in the /etc/system file:

 set zfs:zfs_nocacheflush = 1

Risk: Some storage arrays might revert to working like JBOD disks when their batteries are low, for instance. Disabling cache flushing can have adverse effects in that case. Check with your storage vendor.

[edit] Earlier Solaris Releases

Set the following parameter in the /etc/system file:

 set zfs:zil_noflush = 1

Set dynamically:

 echo zil_noflush/W0t1 | mdb -kw

Revert to default:

 echo zil_noflush/W0t0 | mdb -kw

Risk: Some storage arrays might revert to working like JBOD disks when their batteries are low, for instance. Disabling cache flushing can have adverse effects in that case. Check with your storage vendor.

[edit] RFEs

  • 6462690 sd driver should set SYNC_NV bit when issuing SYNCHRONIZE CACHE to SBC-2 devices (Fixed in Nevada, build 74 and Solaris 10 5/08)

[edit] Disabling the ZIL (Don't)

ZIL stands for ZFS Intent Log. It is used during synchronous write operations. The ZIL is an essential part of ZFS and should never be disabled. Significant performance gains can be achieved by not having the ZIL, but that would be at the expense of data integrity. One can be infinitely fast, if correctness is not required.

One reason to disable the ZIL is to check whether a given workload is significantly impacted by it. A while ago, a workload that was a heavy consumer of ZIL operations was shown to be unaffected by disabling the ZIL, which convinced us to look elsewhere for improvements. If the ZIL is shown to be a factor in the performance of a workload, more investigation is necessary to see if the ZIL can be improved.

The OpenSolaris 2008 releases, the Solaris 10 10/08 release, and Solaris Nevada build 68 and later releases have the option of storing the ZIL on separate log devices from the main pool. Using separate, possibly low-latency, devices for the intent log is a great way to improve ZIL-sensitive loads. This feature is not currently supported on a root pool.

In general, negative ZIL performance impacts are worse on storage devices that have high write latency. HDD write latency is on the order of 10-20 ms. Many hardware RAID arrays have nonvolatile write caches where the write latency can be on the order of 1-10 ms. SSDs have write latency on the order of 0.2 ms. As the write latency decreases, the negative performance effects are diminished, which is why using an SSD as a separate ZIL log device is a good thing. For hardware RAID arrays with nonvolatile cache, the decision to use a separate log device is less clear. YMMV.

The size of the separate log device may be quite small. A rule of thumb is that you should size the separate log to be able to handle 10 seconds of your expected synchronous write workload. It would be rare to need more than 100 MB in a separate log device, but the separate log must be at least 64 MB.

Caution: Disabling the ZIL on an NFS server can lead to client side corruption. The ZFS pool integrity itself is not compromised by this tuning.

If you tune this parameter, please reference this URL in shell script or in an /etc/system comment.

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29

[edit] Current Solaris Releases

In current versions of Solaris, there is a per-filesystem ZFS property to disable/enable the ZIL.

If you must, then:

 zfs set sync=disabled <filesystem>

The developer of this feature has an explanation on his blog.
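
You can check the current setting and revert it with the same property; valid values are standard (the default), always, and disabled:

 zfs get sync <filesystem>
 zfs set sync=standard <filesystem>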

[edit] Older Solaris Releases

Before Oracle Solaris 11 Express, build 140, there is a system parameter that can be modified to disable the ZIL.

If you must, then:

echo zil_disable/W0t1 | mdb -kw

Revert to default:

echo zil_disable/W0t0 | mdb -kw

Note!: The zil_disable tunable is only evaluated when a dataset is mounted. While it can be set dynamically, to reap the benefits you must zfs umount and then zfs mount the file system (or reboot, or export and import the pool).
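
For example, to pick up a changed zil_disable value on a mounted file system without rebooting:

zfs umount <filesystem>
zfs mount <filesystem>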


[edit] Disabling Metadata Compression

Caution: This tuning needs more research; it is now apparent that the tunable applies only to indirect blocks, leaving a lot of metadata compressed anyway.

With ZFS, compression of data blocks is under the control of the file system administrator and can be turned on or off by using the command "zfs set compression ...".

On the other hand, ZFS internal metadata is always compressed on disk by default. For metadata-intensive loads, this default is expected to save some amount of space (a few percent) at the expense of a little extra CPU computation. However, a bigger motivation exists to keep metadata compression on: for directories that grow to millions of objects and then shrink to just a few, metadata compression saves large amounts of space (>>10X).

In general, metadata compression can be left as is. If your workload is CPU intensive (say, > 80% load), kernel profiling shows metadata compression is a significant contributor, and you do not expect to create and shrink huge directories, then disabling metadata compression can be attempted with the goal of providing more CPU to handle the workload.

I/O on flash storage devices is aligned on 4 KB boundaries. If metadata compression is enabled, the I/O on flash storage devices might become unaligned. You might consider disabling metadata compression to resolve this alignment problem if you are using flash devices for primary storage. Separate log devices on flash are not affected by the alignment problem.

If you tune this parameter, please reference this URL in shell script or in an /etc/system comment.

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_Metadata_Compression

[edit] Current Solaris 10 Releases and Solaris Nevada Releases

This syntax is available starting in the Solaris 10 11/06 release and Solaris Nevada build 52 release.

Set dynamically:

echo zfs_mdcomp_disable/W0t1 | mdb -kw

Revert to default:

echo zfs_mdcomp_disable/W0t0 | mdb -kw

Set the following parameter in the /etc/system file:

set zfs:zfs_mdcomp_disable = 1

[edit] Earlier Solaris Releases

Not tunable.

[edit] RFEs
  • 6391873 metadata compression should be turned back on (Fixed in Nevada, build 36)

[edit] Tuning ZFS for Database Performance

In Solaris Nevada build 122, as well as in the 2009.Q3 release of Fishworks, you can customize the separate intent log behavior per dataset by setting the logbias property. Instructing ZFS to manage database data files and index files for throughput rather than latency triggers the optimization described below. In addition to bypassing the log device for such datasets, ZFS handles the ZIL blocks with the leaner protocol, making the tuning described below unnecessary.

In an Oracle database environment, all writes are synchronous and are first handled by the ZFS ZIL layer. Within the ZIL there are two important paths: either the data for a write is copied into a ZIL block, or the I/O for the block is done in the main storage pool with the ZIL block merely referencing the data. The first path generally leads to lower latency because the ZIL can commit transactions using a single I/O. However, it also means that the data is committed twice through the pool, once to the ZIL and once to the main pool as part of the transaction group.

When ZFS is operating on top of NVRAM-based storage, the latency is usually good and is less of a concern. However, the need to preserve storage throughput can be important, especially if such storage is shared between groups. It is possible to avoid the double write done by the ZIL by setting the zfs_immediate_write_sz parameter to be lower than the database block size. This tuning ensures that all writes issued by the database go through the indirect path and can lead to a 2X reduction in the total storage throughput required to serve a workload.

  • The tuning is ineffective in a storage pool in which there is a separate intent log. In this instance, the data is only committed once to the main storage pool.
  • Beware that in a non-database environment, this tuning can have the opposite effect and can lead to a highly inflated amount of total I/O.

What makes this tuning suitable for database environments is that many of the writes are full record overwrites. The inflation comes when doing a partial record rewrite: a synchronous write system call of size greater than zfs_immediate_write_sz to a file with 128K records causes a full 128K record to be written out. This needs to be considered with regard to the redo log files. If the average size of writes to redo log files is greater than zfs_immediate_write_sz but many times smaller than the recordsize used for the redo logs, then some redo log inflation is expected from this tuning. To avoid this inflation, the redo logs can be placed in a storage pool that has a separate intent log.

Set the following parameter in the /etc/system file:

* See http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#zfs_immediate_write_sz
* Reduce write throughput required by Oracle data files with potential impact on redo logs
* To be used in a pure database environment (full block overwrites) with average redo log writes either
* smaller than the tunable or greater than the redo log recordsize.
set zfs:zfs_immediate_write_sz = 8191

For x86 systems, when the db_block_size and the recordsize are aligned to the system page size of 4 KB, it is better to set zfs_immediate_write_sz to a little less than 4096, such as 4000.


For more information about optimizing ZFS for database performance, see ZFS for Databases.

[edit] Tuning ZFS When Using PCIe Flash Accelerator Cards

  • Consider using LUNs or low-latency disks managed by a controller with persistent memory for the ZFS intent log, if available. This option can be more cost effective than using flash for low-latency commits. The log devices need only be large enough to hold 10 seconds of maximum write throughput. Examples include a storage-array-based LUN or a disk connected to an HBA with a battery-protected write cache.
  • If no such device is available, segment a separate pool of F40 modules and use them as log devices in a ZFS storage pool.
  • The F40 supports 4 independent flash modules. Flash modules may be used as ZFS log devices to reduce commit latency, particularly if used in an NFS server. A single flash module of an F40 used as a ZFS log device can reduce the latency of a single, lightly threaded operation by an order of magnitude (~10X). More F40 flash modules can be striped together to achieve higher throughput for large amounts of synchronous operations, for example, when ZFS is exporting iSCSI LUNs.
  • Log devices should be mirrored for reliability. For maximum protection, the mirrors should be set up on separate F40 cards.
  • Some of the F20 DOMs or F5100 FMODs that are not used as log devices can be used as second level cache devices to both offload IOPS from primary disk storage and to improve read latency for commonly used data recently evicted from the level one ARC cache.

[edit] Adding Flash Accelerators as ZFS Log or Cache Devices

  • Be very careful with zpool add commands. Mistakenly adding a device intended as a log device as a normal pool device will require you to destroy and restore the pool from scratch. Properly added log devices can themselves be removed from a pool.
  • We recommend that you become familiar with the zpool command before attempting this on active storage. For practice, scratch files can be used as a replacement for real disks.
  • Use one such command for every log or cache device. If multiple devices are specified, they are striped together.
  • For more information, see zpool.1m.

An F40 flash module, c4t1d0, can be added as a ZFS log device:

   # zpool add pool log c4t1d0

If 2 F40 flash modules are available, you can add mirrored log devices:

   # zpool add pool log mirror c4t1d0 c4t2d0

Available F20 DOMs or F5100 FMODs can be added as a cache device for reads.

  # zpool add pool cache c4t3d0

You can't mirror cache devices; if multiple cache devices are specified, they are striped together.

  # zpool add pool cache c4t3d0 c4t4d0
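
You can verify the resulting layout with zpool status, and log or cache devices can later be removed if needed; the device name below is the one used in the examples above:

  # zpool status pool
  # zpool remove pool c4t3d0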

[edit] Disabling Metadata Compression for Flash Accelerator Performance

Note: If you are running the latest Solaris 10 patches or Solaris 11, this step is no longer necessary. ZFS now runs as a 4k-native file system on F20 and F5100 devices. Leaving metadata compression enabled (the default Solaris setting) is fine, and has run better in the cases I have tested.

The F40 Flash Accelerator achieves optimal performance when subjected to I/O requests that are restricted in size and alignment to multiples of 8k. For this reason, it is recommended to disable metadata compression by adding the following entry to /etc/system:

   set zfs:zfs_mdcomp_disable = 1

[edit] Record Size

Large performance gains can be realized by reducing the default recordsize used by ZFS, particularly when running database workloads. The ZFS recordsize should match the database block size. Note that the recordsize setting must be in place before data is loaded on ZFS. Additional reading on this subject may be found at http://blogs.sun.com/roch/entry/tuning_zfs_recordsize. On the F40, the recordsize should be no smaller than 8k. The following example illustrates how to set the recordsize to 16k:

   zfs set recordsize=16k mypool/myfs
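
You can confirm the property took effect (and remember it only applies to data written afterwards) with:

   zfs get recordsize mypool/myfs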

[edit] Other Notes

To provide optimal data protection, it is important to ensure that both checksums and cache flushing are enabled; by default, both are enabled when zpools or zvols are created.

[edit] Additional ZFS References

  • ZFS Best Practices

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

  • ZFS Dynamics

http://blogs.oracle.com/roch/entry/the_dynamics_of_zfs

  • ZFS Links

http://opensolaris.org/os/community/zfs/links/

  • Er_kernel profiling

http://developers.sun.com/prodtech/cc/articles/perftools.html

http://blogs.oracle.com/roch/entry/nfs_and_zfs_a_fine

  • ZFS and Direct I/O

http://blogs.oracle.com/roch/entry/zfs_and_directio

  • ZFS Separate Intent Log (SLOG)

http://blogs.oracle.com/perrin/entry/slog_blog_or_blogging_on

[edit] Integrated RFEs that introduced or changed tunables
  • snv_51 : 6477900 want more /etc/system tunables for ZFS performance analysis
  • snv_52 : 6485204 more tuneable tweakin
  • snv_53 : 6472021 vdev knobs can not be tuned