
Tuesday, July 6, 2010

ZFS Revisited

Understanding ZFS & ZFS ARC/L2ARC

It's been almost a year that I've been working on ZFS filesystem administration, and I've just finished migrating all of our Solaris servers from UFS root to ZFS root, so both OS data and application data now reside on ZFS. My intention in writing this document is to have some handy notes about ZFS and its cache mechanism, which always plays a vital role in system performance!

This is just a revisit of the features/techniques of this great filesystem known as ZFS.

Let's start with assumptions. Over the last year I've noticed some assumptions that I and others were carrying, so before we jump into anything else, let's clear the assumption queue.

Some assumptions about ZFS –

• Tuning ZFS is Evil (TRUE)
• ZFS doesn’t require tuning (FALSE)
• ZFS is a memory hog (TRUE)
• ZFS is slow (FALSE)
• ZFS won’t allow corruption (FALSE)

Alright then, now that we are clear on a few assumptions and know the facts, let's take a look at a few important features of ZFS –

• ZFS is the world’s first 128-bit file system and as such has a huge capacity.
• Capacity-wise: single file systems of 512 TB+ (theoretically 2^64 devices * 2^64 bytes)
• Trillions of files in a single file system (theoretically 2^48 files per fileset)
• ZFS is a transaction-based, copy-on-write filesystem, so no "fsck" is required
• High internal data redundancy (a new RAID level called RAID-Z)
• End-to-end checksums of all data/metadata with a strong algorithm (SHA-256), which can be CPU-consuming at times! So checksums can be turned off on data and left ON on metadata. Checksums are used to validate blocks.
• Online integrity verification and reconstruction
• Snapshots, filesets, compression, encryption facilities and much more!
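The checksum behavior from the list above can be adjusted per dataset; a hedged sketch (the pool/dataset name "tank/data" is hypothetical). Note that metadata stays checksummed even when data checksums are off:

```shell
# See which checksum algorithm a dataset currently uses
zfs get checksum tank/data

# Use SHA-256 explicitly (stronger, but more CPU)
zfs set checksum=sha256 tank/data

# Turn checksums off for user data only; metadata remains checksummed
zfs set checksum=off tank/data
```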

What is the ZFS filesystem capacity?

2^64 — Number of snapshots of any file system
2^48 — Number of entries in any individual directory
16 EB (2^64 bytes) — Maximum size of a file system
16 EB — Maximum size of a single file
16 EB — Maximum size of any attribute
256 ZB (2^78 bytes) — Maximum size of any zpool
2^56 — Number of attributes of a file (actually constrained to 2^48 for the number of files in a ZFS file system)
2^64 — Number of devices in any zpool
2^64 — Number of zpools in a system
2^64 — Number of file systems in a zpool

How about data redundancy in ZFS?

ZFS supports the following RAID configurations:


> Stripes (RAID-0)
> Mirroring (RAID-1)
> RAID-Z (Similar to RAID-5)
> RAID-Z2 (Double parity, similar to RAID-6)
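Each of these layouts is chosen at pool creation time; a hedged sketch with hypothetical pool and device names:

```shell
# RAID-0: plain striping across devices
zpool create tank c1t0d0 c1t1d0

# RAID-1: two-way mirror
zpool create tank mirror c1t0d0 c1t1d0

# RAID-Z: single parity, similar to RAID-5
zpool create tank raidz c1t0d0 c1t1d0 c1t2d0

# RAID-Z2: double parity, similar to RAID-6
zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0
```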

Hey, what is it…WOW no FSCK needed? HOW?

Yes, it’s true. ZFS does not need fsck to correct filesystem errors or corruption, thanks to its copy-on-write (COW) behavior. ZFS maintains its records as a tree of blocks, and every block is reachable from a single root block called the "uberblock". When you change an existing block, instead of it being overwritten, a copy of the data is made and then modified before being written to disk: this is copy-on-write. It ensures ZFS never overwrites live data, which guarantees the integrity of the file system, since a system crash still leaves the on-disk data in a completely consistent state. There is no need for fsck. Ever.
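The "online integrity verification" mentioned earlier is driven by scrubbing; a minimal sketch, assuming a pool named tank (the pool name is hypothetical):

```shell
# Kick off a scrub, which walks the block tree and verifies every checksum
zpool scrub tank

# Check progress; repaired blocks and any errors show up here
zpool status tank
```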

Does ZFS have a self-healing data feature?

Yes, provided that –


• If a Bad Block is found ZFS can repair it so long as it has another copy
• RAID-1 - ZFS can “heal” bad data blocks using the mirrored copy
• RAID-Z/Z2 - ZFS can “heal” bad data blocks using parity


Also note that on a non-redundant pool, self-healing is still available for ZFS metadata (which is always stored with extra "ditto" copies) but not for actual application data.
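Even on a single-disk pool you can give ZFS something to heal user data from by keeping extra copies of each block; a hedged sketch (pool/dataset names are hypothetical):

```shell
# Store two copies of every data block in this dataset, so a bad block
# can be healed even without mirror or RAID-Z redundancy
zfs set copies=2 tank/important

# List any files affected by unrecoverable errors
zpool status -v tank
```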

ZFS Best Practices

• Tune recordsize only for DB files with fixed-size records
• Mirror for performance
• 64-bit kernel (allows greater ZFS caches)
• configure swap (don't be scared by low memory)
• Don't slice up devices (confuses I/O scheduler)
• Isolate DB log writer if that is critical (use few devices)
• Separate root pool (system's identity) and data pools (system's function)
• Keep pool below 80% full (helps COW)
• Usage of snapshots/clones for backup/DR purpose
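The recordsize advice in the first bullet above can be applied per dataset; a sketch assuming a database doing fixed 8 KB I/O (dataset name is hypothetical):

```shell
# Match the dataset recordsize to the database's fixed record size (8 KB here)
zfs set recordsize=8k tank/oradata
zfs get recordsize tank/oradata
```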

Okay then, let’s talk about ARC first.

ARC stands for “Adaptive/Adjustable Replacement Cache”. The ARC is a very fast cache located in the server’s main memory (RAM). The amount of memory available to the ARC is usually all of physical memory except 1 GB, or 3/4 of main memory, whichever is greater; a simple calculation gives 3 GB on a 4 GB machine.
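That sizing rule is easy to check with shell arithmetic; a sketch using the 4 GB machine from the text:

```shell
#!/bin/sh
# Default ARC ceiling: the larger of (all memory minus 1 GB) and 3/4 of memory
mem_mb=4096                          # example machine with 4 GB of RAM
minus_1g=$((mem_mb - 1024))          # all memory except 1 GB: 3072 MB
three_quarters=$((mem_mb * 3 / 4))   # 3/4 of main memory: 3072 MB
if [ "$minus_1g" -gt "$three_quarters" ]; then
    arc_mb=$minus_1g
else
    arc_mb=$three_quarters
fi
echo "${arc_mb} MB"                  # 3072 MB, i.e. the 3 GB from the text
```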

The ARC concept was first described and invented by two IBM researchers, Megiddo and Modha, in November 2003, but the ZFS ARC is a significantly modified version of the original design.

The major differences of the ZFS ARC are –

• The ZFS ARC is variable in size and can react to the available memory; maybe that is why it is called an "Adjustable Replacement Cache". It can grow when memory is available, or shrink when memory is needed by other processes/jobs.
• The design proposed by Megiddo and Modha assumes a fixed block size, but the ZFS ARC works with multiple block sizes.
• In the ZFS ARC you can lock pages in the cache to exempt them from eviction. This prevents the cache from removing pages that are currently in use. This feature is not in the original ARC design.

By default, the ZFS ARC stores ZFS data and metadata from all active storage pools in physical memory (RAM), using as much as possible: all but 1 GB of RAM, or 3/4 of main memory. But I would say this is just a rule of thumb, and depending on the environment, tuning needs to be done for better system performance. Consider limiting the maximum ARC memory footprint in the following situations:

• When a known amount of memory is always required by an application. Databases often fall into this category.
• On platforms that support dynamic reconfiguration of memory boards, to prevent ZFS from growing the kernel cage onto all boards.
• A system that requires large memory pages might also benefit from limiting the ZFS cache, which tends to break down large pages into base pages.
• Finally, if the system is running another non-ZFS file system, in addition to ZFS, it is advisable to leave some free memory to host that other file system's caches.

The ARC grows and consumes memory on the theory that there is no need to return data to the system while there is still plenty of free memory. When the ARC has grown and memory pressure arises, for example when a new application starts up, the ARC releases its hold on memory. Lastly, and very important to note: ZFS is not designed to steal memory from applications; ZFS is very innocent!

By default, UFS uses the page cache managed by the virtual memory system; ZFS, however, does not use the page cache (except for a few types of files) and uses the ARC instead. There can be only one ARC per system, but the caching policy can be changed on a per-dataset basis.
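That per-dataset caching policy is controlled by the primarycache property; a hedged sketch with hypothetical dataset names:

```shell
# Cache only metadata in the ARC for this dataset
# (useful e.g. for a database that maintains its own buffer cache)
zfs set primarycache=metadata tank/db

# Default behavior: cache both data and metadata
zfs set primarycache=all tank/home

zfs get primarycache tank/db
```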

As I said before, to make sure applications/databases have enough dedicated memory available, you need to perform tuning, or capping, on the ARC.

# prtconf | grep Mem
Memory size: 98304 Megabytes

# grep zfs /etc/system
set zfs:zfs_arc_min = 1073741824
set zfs:zfs_arc_max = 17179869184

So here I have 96 GB of total physical memory, and the ARC is capped at 16 GB: roughly 17% of memory for arc_max, and 1 GB for arc_min.
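The two /etc/system values are plain byte counts; a quick sanity check of the arithmetic:

```shell
#!/bin/sh
# zfs_arc_min: 1 GB expressed in bytes
echo $((1 * 1024 * 1024 * 1024))     # 1073741824
# zfs_arc_max: 16 GB expressed in bytes
echo $((16 * 1024 * 1024 * 1024))    # 17179869184
```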

The following command gives the current memory size in bytes that is used by ZFS cache:

# kstat zfs::arcstats:size
module: zfs                             instance: 0
name:   arcstats                        class:    misc
        size                            15325351536
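To relate that kstat value back to the 16 GB cap, convert it from bytes to gigabytes:

```shell
#!/bin/sh
# arcstats "size" from the kstat output above, in bytes
size=15325351536
echo $((size / 1024 / 1024 / 1024))  # prints 14, comfortably under the 16 GB cap
```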

The rules of thumb for tuning the ARC are –

• Know your future applications' memory requirements. If an application will require, say, 20% of memory overall, it makes sense to cap the ARC so that it does not consume more than the remaining 80% of memory.
• Understand your existing applications well. If an application uses large memory pages, capping the ARC actually helps it: limiting the ARC prevents ZFS from breaking up large pages and fragmenting memory, and so preserves the availability of large pages.
• It's certainly not easy to tune the ARC, and it needs a fairly deep understanding of the applications and their needs, though it always helps to have these scripts handy in your tools/script depot: arc_summary.pl (by Ben Rockwood) and arcstat.pl (by Neelakanth Nadgir).

Now let’s have a small discussion about L2ARC

L2ARC is a new layer between disk and the cache (ARC) in main memory for ZFS. It uses dedicated storage devices to hold cached data. The main role of this cache is to boost the performance of random read workloads. The intended L2ARC devices include short-stroked 10K/15K RPM disks, solid state disks (SSDs), and other media with substantially faster read latency than disk.

L2ARC, or Level 2 caching in ARC makes it possible to use a disk in addition to RAM for caching reads. This improves read performance, which can otherwise be slow because of the fragmentation caused by the copy-on-write model used by ZFS.

Officially, the L2ARC is supported from Solaris 10 Update 6; however, as far as I know, it is well supported and fully functional in Solaris 10 Update 8 and onwards.
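Adding an L2ARC device is a one-liner; a sketch with hypothetical pool and SSD device names:

```shell
# Attach an SSD as a cache (L2ARC) device to the pool
zpool add tank cache c2t0d0

# The device shows up under a "cache" section in the status output
zpool status tank
```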

The L2ARC attempts to cache data from the ARC before it is evicted, so it populates its cache by periodically reading data from the tail of the ARC. The data at the tail of the ARC is the data that hasn't been used for a while. I'm still reading up on how exactly the L2ARC works and how it relates to the ARC.

Do I really need L2ARC?

It all depends on the situation. If you have a system where a lot of data is being read frequently and you want better performance, then yes. If not, it still won't hurt performance, only improve it.

There are a lot more things to talk about and to share; however, you can also find this information where I found it: official Sun books, solarisinternals.com, and some great blogs!

It was a great ZFS year, and I'm glad the Sun (now Oracle) folks have worked, and keep working, hard to make ZFS a legend in the filesystem arena!

2 comments:

  1. typo: http://www.solarisinternals.com

  2. after server rebooted, swap is gone. So I have tried to configure the swap device in ZFS, it is showing below error message

    #swap -a /dev/zvol/dsk/rpool/swap
    /dev/zvol/dsk/rpool/swap:too defragemented

    zpool status -x command shows " zpool upgrade error".

    What I will do?

