Thursday, July 8, 2010

UFS to ZFS using LiveUpgrade, all went well but it made me sad when I deleted the snapshots/clones for zonepath

A few weeks ago I made a mistake; it was the height of stupidity for an experienced and sensible system administrator. While migrating my UFS-based systems to ZFS I landed in a situation that left me upset for a long time. Below is a brief history of the problem.

I'm currently working on UFS to ZFS migration on systems with zones whose zonepath/zone root has already been migrated to ZFS.

The mistake I made in the last upgrade/migration is this - I deleted the snapshots/clones made by LiveUpgrade for the zones' zonepath/zone root.

The story behind it is - when we create an ABE that will boot to ZFS, the LiveUpgrade program creates a snapshot of each zone's zonepath and from that snapshot creates a clone, as shown below -


[...]
Creating snapshot for zone1/zonepath on [zone1/zonepath@s10s_u8wos_08a].
Creating clone for [zone1/zonepath@s10s_u8wos_08a] on zone1/zonepath-s10s_u8wos_08a.
Creating snapshot for zone2/zonepath on [zone2/zonepath@s10s_u8wos_08a].
Creating clone for [zone2/zonepath@s10s_u8wos_08a] on zone2/zonepath-s10s_u8wos_08a.
[...]

After creating the ZFS ABE we can luactivate it and boot into ZFS.


# luactivate s10s_u8wos_08a

Up to this point everything worked fine.


# init 6
# lustatus
Boot Environment Is Active Active Can Copy
Name Complete Now On Reboot Delete Status
-------------------------- -------- ------ --------- ------ ----------
Sol10u8 yes no no yes -
s10s_u8wos_08a yes yes yes no -

From the above output you can see I'm now in the ZFS boot environment.

Now the real stupidity stunt starts from the portion below. Unfortunately, for lack of study before executing the task, I deleted the snapshots/clones using -


# zfs list -H -t snapshot | cut -f 1
zone1/zonepath-Sol10u8@s10s_u8wos_08a
zone2/zonepath-Sol10u8@s10s_u8wos_08a
zone3/zonepath-Sol10u8@s10s_u8wos_08a

# for snapshot in `zfs list -H -t snapshot | cut -f 1`; do zfs destroy -R -f $snapshot; done

From this point onward I had lost my zonepaths and could not boot my zones. WOW... Horrible Sunday starts here!
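In hindsight, a quick look at which snapshots and clones actually belong to the new boot environment would have stopped me; a minimal sketch of such a check, assuming the BE name used above:

# zfs list -H -t snapshot -o name | grep s10s_u8wos_08a   # snapshots created by lucreate for the new BE
# zfs list -H -o name,origin | grep s10s_u8wos_08a        # clones (zonepaths) that still depend on those snapshots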

A few weeks later I found the answer, as described below -

Solution -

Thanks to Alex and Julien for their suggestions and help.

I had removed/destroyed the snapshots/clones meant for the new ZFS BE.

After that I tried mounting the UFS BE, but due to the unclean zonepaths it failed and left stale mountpoints behind. To clear those mount points I rebooted the server.
From that point, after analyzing the ICF* files, I decided to re-create the zonepaths using the old UFS BE. The zpool history command helped me here.


# zfs clone zone1/zonepath-d100@s10s_u8wos_08a zone1/zonepath-s10s_u8wos_08a
# zfs set mountpoint=/zone1/zonepath-s10s_u8wos_08a zone1/zonepath-s10s_u8wos_08a
# zfs set zpdata:rbe=s10s_u8wos_08a zone1/zonepath-s10s_u8wos_08a
# zfs set zpdata:zn=zone1 zone1/zonepath-s10s_u8wos_08a
# zfs set mountpoint=/zone1/zonepath-s10s_u8wos_08a zone1/zonepath-s10s_u8wos_08a
# zfs set canmount=off zone1/zonepath-s10s_u8wos_08a
# zfs set canmount=on zone1/zonepath-s10s_u8wos_08a
# zfs rename zone1/zonepath-s10s_u8wos_08a zone1/zonepath
# zfs set mountpoint=/zone1/zonepath zone1/zonepath

Likewise, one by one, I re-created/recovered the zonepath snapshots/clones for all the zones that were affected.
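The same sequence can be scripted for the remaining zones; a rough sketch, assuming they follow the same dataset naming pattern as zone1:

# for z in zone2 zone3; do zfs clone $z/zonepath-d100@s10s_u8wos_08a $z/zonepath-s10s_u8wos_08a; zfs set zpdata:rbe=s10s_u8wos_08a $z/zonepath-s10s_u8wos_08a; zfs set zpdata:zn=$z $z/zonepath-s10s_u8wos_08a; zfs rename $z/zonepath-s10s_u8wos_08a $z/zonepath; zfs set mountpoint=/$z/zonepath $z/zonepath; done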

This way the zonepaths were back again and the fallback to the UFS BE went smoothly; the main catch, however, was that the patching had to be re-done, since the zonepaths were re-created from the unpatched UFS BE.

After the backout I deleted the ZFS BE, and from that point onwards I created a new ZFS BE and patched it once again.

I hope no one is stupid enough like me to make such a mistake, but if you have made it, here you go!

Tuesday, July 6, 2010

ZFS Revisited

Understanding ZFS & ZFS ARC/L2ARC

It's been almost a year that I've been working on ZFS filesystem administration, and I've just finished migrating all of our Solaris servers from UFS root to ZFS root, so now both OS data and application data reside on ZFS. The intention of this document is to have some handy notes about ZFS and its cache mechanism, which always plays a vital role in system performance!

This is just a revisit of the features/techniques of this great filesystem known as ZFS.

Let's start with assumptions. In the last year I've found there were some assumptions that I and others were carrying, but before we jump into anything the assumption queue should get cleared, so let's do that.

Some assumptions about ZFS –

• Tuning ZFS is Evil (TRUE)
• ZFS doesn’t require tuning (FALSE)
• ZFS is a memory hog (TRUE)
• ZFS is slow (FALSE)
• ZFS won't allow corruption (FALSE)

Alright then, now we are clear on a few of the assumptions and know the facts! Let's take a look at a few important features of ZFS –

• ZFS is the world's first 128-bit file system and as such has a huge capacity.
• Capacity-wise, single filesystems of 512TB+ (theoretically 2^64 devices * 2^64 bytes)
• Trillions of files in a single file system (theoretically 2^48 files per fileset)
• ZFS is a transaction-based, copy-on-write filesystem, so no "fsck" is required
• High internal data redundancy (new RAID level called RAID-Z)
• End-to-end checksums of all data/metadata with a strong algorithm (SHA-256), though sometimes CPU consuming! So it can be turned off for data and left on for metadata (see the sketch after this list). Checksums are used to validate blocks.
• Online integrity verification and reconstruction
• Snapshots, filesets, compression, encryption facilities and much more!
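The checksum algorithm is an ordinary dataset property, so the data/metadata trade-off above is a one-line change; a minimal sketch, assuming a hypothetical pool named tank with a data dataset:

# zfs get checksum tank/data          # defaults to on
# zfs set checksum=sha256 tank/data   # use the stronger SHA-256 algorithm
# zfs set checksum=off tank/data      # stop checksumming file data; metadata is always checksummed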

What is the ZFS filesystem capacity?

2^64 — Number of snapshots of any file system
2^48 — Number of entries in any individual directory
16 EB (2^64 bytes) — Maximum size of a file system
16 EB — Maximum size of a single file
16 EB — Maximum size of any attribute
256 ZB (2^78 bytes) — Maximum size of any zpool
2^56 — Number of attributes of a file (actually constrained to 2^48 for the number of files in a ZFS file system)
2^64 — Number of devices in any zpool
2^64 — Number of zpools in a system
2^64 — Number of file systems in a zpool

How about data redundancy in ZFS?

ZFS supports the following RAID configurations (example pool layouts are sketched after the list) -


> Stripes (RAID-0)
> Mirroring (RAID-1)
> RAID-Z (Similar to RAID-5)
> RAID-Z2 (Double parity, similar to RAID-6)
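For illustration, each of these maps onto a zpool create layout; a rough sketch with hypothetical device names:

# zpool create tank c1t0d0 c1t1d0                        # stripe (RAID-0)
# zpool create tank mirror c1t0d0 c1t1d0                 # mirror (RAID-1)
# zpool create tank raidz c1t0d0 c1t1d0 c1t2d0           # RAID-Z, single parity
# zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0   # RAID-Z2, double parity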

Hey, what is it…WOW no FSCK needed? HOW?

Yes, it's true. ZFS does not need fsck to correct filesystem errors/corruption because of its COW behavior. ZFS maintains its records as a tree of blocks. Every block is accessible via a single block called the "uber-block". When you change an existing block, instead of it being overwritten, a copy of the data is made and then modified before being written to disk; this is Copy on Write (COW). This ensures ZFS never overwrites live data, which guarantees the integrity of the file system, as a system crash still leaves the on-disk data in a completely consistent state. There is no need for fsck. Ever.

Does ZFS have a self-healing data feature?

Yes, provided that –


• If a Bad Block is found ZFS can repair it so long as it has another copy
• RAID-1 - ZFS can “heal” bad data blocks using the mirrored copy
• RAID-Z/Z2 - ZFS can “heal” bad data blocks using parity


Also note that, even without the redundancy above, self healing is available for ZFS metadata (ZFS keeps extra copies of it automatically) but not for actual application data.
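Online verification and repair can also be kicked off by hand; a small sketch, again assuming a hypothetical pool named tank:

# zpool scrub tank       # read and verify every block, repairing from redundancy where possible
# zpool status -v tank   # shows scrub progress and any checksum errors that were found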

ZFS Best Practices

• Tune recordsize only on fixed-record DB files (see the sketch after this list)
• Mirror for performance
• 64-bit kernel (allows greater ZFS caches)
• configure swap (don't be scared by low memory)
• Don't slice up devices (confuses I/O scheduler)
• Isolate DB log writer if that is critical (use few devices)
• Separate root pool (system's identity) and data pools (system's function)
• Keep pool below 80% full (helps COW)
• Usage of snapshots/clones for backup/DR purpose
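As an example of the recordsize item, matching a dataset's recordsize to the database block size is a single property change; a sketch with a hypothetical pool/dataset and an 8K DB block size:

# zfs set recordsize=8k datapool/oradata   # match the database's fixed record size
# zfs get recordsize datapool/oradata      # verify; only files created after the change use the new value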

Okay then, let’s talk about ARC first.

ARC stands for "Adaptive/Adjustable Replacement Cache". The ARC is a very fast cache located in the server's memory (RAM). The maximum amount of memory available to the ARC is usually all of physical memory minus 1GB, or 3/4 of main memory, whichever is greater, so a simple calculation gives 3GB on a 4GB machine.

The ARC concept was originally described and invented by two IBM researchers, Megiddo and Modha, in November 2003, but the ZFS ARC is a significantly modified version of the original ARC design.

The major differences of the ZFS ARC are –

• The ZFS ARC is variable in size and can react to the available memory; maybe that's the reason it is called the "Adjustable Replacement Cache". It can grow when memory is available, or shrink when memory is needed by other processes/jobs.
• The design proposed by Megiddo and Modha assumes a fixed block size, but the ZFS ARC works with multiple block sizes.
• Under the ZFS ARC you can lock pages in the cache to exempt them from removal. This prevents the cache from removing pages that are currently in use. This feature is not in the original ARC design.

By default the ZFS ARC stores ZFS data and metadata from all active storage pools in physical memory (RAM), as much as possible except for 1 GB of RAM or 3/4 of main memory, BUT I would say this is just a rule of thumb or a theoretical rule, and depending on the environment, tuning needs to be done for better system performance. Consider limiting the maximum ARC memory footprint in the following situations:

• When a known amount of memory is always required by an application. Databases often fall into this category.
• On platforms that support dynamic reconfiguration of memory boards, to prevent ZFS from growing the kernel cage onto all boards.
• A system that requires large memory pages might also benefit from limiting the ZFS cache, which tends to break down large pages into base pages.
• Finally, if the system is running another non-ZFS file system in addition to ZFS, it is advisable to leave some free memory to host that other file system's caches.

The ARC grows and consumes memory on the theory that there is no need to return data to the system while there is still plenty of free memory. When the ARC has grown and memory pressure exists, for example when a new application starts up, the ARC releases its hold on memory. Lastly, and very important to note: ZFS is not designed to steal memory from applications; ZFS is very innocent!

By default, UFS uses page caching managed by the virtual memory system, whereas ZFS does not use page caching except for a few types of files! ZFS uses the ARC. There can be only one ARC per system, but the caching policy can be changed on a per-dataset basis, as sketched below.
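That per-dataset policy is exposed through the primarycache and secondarycache properties (on releases that have them); a minimal sketch, assuming a hypothetical pool/dataset:

# zfs get primarycache,secondarycache tank/db   # both default to "all"
# zfs set primarycache=metadata tank/db         # cache only metadata in the ARC for this dataset
# zfs set secondarycache=none tank/db           # keep this dataset's blocks out of the L2ARC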

As I said before, to make sure applications/databases have enough dedicated memory available, you need to perform tuning or capping of the ARC.

# prtconf | grep Mem
Memory size: 98304 Megabytes

# grep zfs /etc/system
set zfs:zfs_arc_min = 1073741824
set zfs:zfs_arc_max = 17179869184

So here I've got 96GB of total physical memory and the ARC has been capped at 16GB, so roughly ~17% of memory is set for arc_max and 1GB for arc_min.

The following command gives the current memory size in bytes that is used by ZFS cache:

# kstat zfs::arcstats:size
module: zfs instance: 0
name: arcstats class: misc
size 15325351536

The rules of thumb for tuning the ARC are –

• Know your future application memory requirements. Say an application requires 20% of memory overall; then it makes sense to cap the ARC such that it does not consume more than the remaining 80% of memory.
• Understand and get a feel for your existing applications. If an application is known to use (or indeed uses) large memory pages, then capping the ARC helps that application, as limiting the ARC prevents ZFS from breaking up the pages and fragmenting memory. Limiting the ARC preserves the availability of large pages.
• It's certainly not easy to tune the ARC, and it needs a fairly deep understanding of the applications and their needs, though it always helps to have the following scripts handy and added to your tools/script depot - arc_summary.pl (by Ben Rockwood) & arcstat.pl (by Neelakanth Nadgir)
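Even without those scripts, the raw ARC counters are only a kstat away; a quick sketch of the statistics I check first:

# kstat -p zfs:0:arcstats:size zfs:0:arcstats:c_max    # current ARC size and its configured ceiling, in bytes
# kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses   # cache hits vs. misses since boot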

Now let’s have a small discussion about L2ARC

L2ARC is a new layer between Disk and the cache (ARC) in main memory for ZFS. It uses dedicated storage devices to hold cached data. The main role of this cache is to boost the performance of random read workloads. The intended L2ARC devices include 10K/15K RPM disks like short-stroked disks, solid state disks (SSD), and other media with substantially faster read latency than disk.

L2ARC, or Level 2 caching in ARC makes it possible to use a disk in addition to RAM for caching reads. This improves read performance, which can otherwise be slow because of the fragmentation caused by the copy-on-write model used by ZFS.

The L2ARC cache is actually supported in Solaris 10 Update 6; however, as far as my knowledge goes, it is well supported and functional only in Solaris 10 Update 8 and onwards.

The L2ARC attempts to cache data from the ARC before it is evicted, so the L2ARC populates its cache by periodically reading data from the tail of the ARC. The data at the tail of the ARC is the data that hasn't been used for a while. I'm still reading through and understanding how exactly the L2ARC works and how it relates to the ARC and all that stuff.
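Attaching an L2ARC is just a matter of adding a cache vdev to a pool; a sketch, assuming a hypothetical pool named tank and an SSD seen as c2t0d0:

# zpool add tank cache c2t0d0   # the SSD becomes an L2ARC device for this pool
# zpool iostat -v tank          # the device shows up under its own "cache" section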

Do I really need L2ARC?

It all depends on the situation: if you have a system where a lot of data is being read frequently and you want better performance, then yes. If not, it still won't hurt performance, only improve it.

There are a lot more things to talk about and to share; however, you can also find this information where I found it: the official Sun books, solarisinternals.com and some great blogs!

It has been a great ZFS year, and I'm glad that the Sun (now Oracle) folks have worked, and keep working, hard to make it a legend in the filesystem arena!

Saturday, July 3, 2010

Swap is in use by Live Upgrade!!!

I'm just now working on a few post-ZFS-migration cleanup tasks: creating new ZFS filesystems for data currently on UFS SVM devices, rsyncing them over, ludelete'ing the UFS BE and finally adding the emptied disk to the rpool. While working on this I encountered an error while deleting the swap space held by one of the soft partitions.


I was in the process of clearing the metadevices; one of the soft partitions was holding 32G of swap space and the system was not allowing me to delete it. The error I was getting was –

# swap -l
swapfile dev swaplo blocks free
/dev/md/dsk/d34 85,34 16 67108848 67108848

# swap -d /dev/md/dsk/d34
/dev/md/dsk/d34: Not enough space

This was obvious, because the system was using this swap as its active swap.

# top -c
last pid: 5639; load avg: 3.78, 3.40, 3.41; up 25+15:39:04
1327 processes: 1258 sleeping, 64 zombie, 5 on cpu
CPU states: 91.0% idle, 2.5% user, 6.5% kernel, 0.0% iowait, 0.0% swap
Memory: 96G phys mem, 14G free mem, 32G swap, 32G free swap

Then I was wondering: where has my ZFS swap volume gone? Why isn't the system using that volume? So I tried making it active using swap -a, but I failed and the system gave me the message below.

# swap -a /dev/zvol/dsk/rpool/swap
/dev/zvol/dsk/rpool/swap is in use for live upgrade -. Please see ludelete(1M).

Okay, so this was the first time that had ever happened to me. Well, after scratching my head against the wall for a while, I got the answer.

The swap -a attempt might fail if the swap area is already listed in /etc/vfstab or is in use by Live Upgrade. In this case, use the swapadd feature instead.

# /sbin/swapadd


# swap -l
swapfile dev swaplo blocks free
/dev/zvol/dsk/rpool/swap 256,1 16 2097136 2097136

# top -c
last pid: 13969; load avg: 2.34, 2.66, 2.84; up 25+16:07:28
1321 processes: 1255 sleeping, 64 zombie, 2 on cpu
CPU states: % idle, % user, % kernel, % iowait, % swap
Memory: 96G phys mem, 12G free mem, 64G swap, 64G free swap
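With the ZFS swap volume active again, the SVM swap could finally be removed and its metadevice cleared, roughly like this (using the device from above):

# swap -d /dev/md/dsk/d34   # now succeeds, since swap has moved to the ZFS volume
# metaclear d34             # remove the soft partition metadevice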

All right then, sometimes it's good to scratch your head against the wall for a while... :) isn't it?

HTH