Sunday, December 26, 2010
After a long time... yes, really after a long time, but there is a reason for that. I recently started working on Cloud Computing. I'm fairly new to this area, so I have been busy understanding the concepts and preparing for some POCs and possible deployments.
In my opinion, overall getting started with Cloud Administration is easy if your UNIX, Virtualization & Networking concepts are clear & strong.
For my first cloud assignment I started working with an open-source tool known as Eucalyptus (Elastic Utility Computing Architecture for Linking Your Programs to Useful Systems).
I found Eucalyptus very useful and thought I would share my "Eucalyptus Installation & Configuration" documentation with the community for future reference.
Note: This is a private document, hence printing, downloading and copying of it are restricted.
Before jumping into the "How-To", let's get introduced to the tool -
Eucalyptus is an open-source software platform that implements IaaS-style cloud computing using the existing Linux-based infrastructure found in the modern data center. It is interface compatible with Amazon's AWS making it possible to move workloads between AWS and the data center without modifying the code that implements them. Eucalyptus also works with most of the currently available Linux distributions including Ubuntu, Red Hat Enterprise Linux (RHEL), CentOS, SUSE Linux Enterprise Server (SLES), openSUSE, Debian and Fedora. Similarly, Eucalyptus can use a variety of virtualization technologies including VMware, Xen, and KVM to implement the cloud abstractions it supports.
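Since the interface is AWS-compatible, the usual EC2-style client tooling works against a Eucalyptus cloud once the cloud credentials are sourced. Here is a minimal sketch using euca2ools; the image ID, keypair name and instance type below are placeholders rather than values from my setup:
# . ~/.euca/eucarc                          # source the credentials obtained from the cloud controller
# euca-describe-availability-zones verbose  # show clusters and free resources
# euca-add-keypair mykey > mykey.priv       # create an SSH keypair for instance login
# euca-run-instances emi-12345678 -k mykey -t m1.small
# euca-describe-instances                   # watch the instance go from pending to running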
Eucalyptus Feature Highlights:
Support for Amazon AWS (EC2, S3, and EBS)
Includes Walrus: an Amazon S3 interface-compatible storage manager
Added support for elastic IP assignment
Web-based interface for cloud configuration
Image registration and image attribute manipulation
Configurable scheduling policies and SLAs
Support for multiple hypervisor technologies within the same cloud
Benefits of Eucalyptus:
Build a private cloud that enables you to “cloud-burst” into Amazon AWS
Allows a cloud to be easily deployed on all types of legacy hardware and software
Customers can leverage the development strength of our worldwide user community
Eucalyptus is compatible with multiple distributions of Linux
Eucalyptus also supports the commercial Linux distributions: Red Hat Enterprise Linux (RHEL) and SUSE Linux Enterprise Server (SLES)
Benefits of Eucalyptus for IT administrators:
Delivers a self-service provisioning IT infrastructure to end users that require IT resources quickly
Maintains existing infrastructure with no additional capital expense and reduces operating expense
Keeps critical data behind the firewall
Technology is an overlay to the existing hardware and software infrastructure, not a replacement
Avoids lock-in to a 3rd party public cloud vendor
Enables easy transitions back and forth between private and public clouds
Well, I hope everyone now knows what the tool is all about, why to make use of it, and the benefits of deploying it. Now let's jump into the document and have a happy Eucalyptus configuration.
Direct URL to Documentation - http://www.scribd.com/full/45951224?access_key=key-10u1jf5m7ld06ys640uf
Here you go, an embedded version -
Installing & Configuring Eucalyptus 2.0.0 on CentOS 5.5
Hope this helps!
Currently, I'm working on a POC for building a private cloud using a tool known as "OpenQRM", and I've almost completed it... I'm hoping to share a nice document on this subject soon! Till then enjoy, and have a happy Christmas!!!
Tuesday, October 19, 2010
Moving to POWER7 from SPARC
Last weekend I was reading a very interesting thread about moving to POWER7 from SPARC, and I thought I would share it with you.
There is an attention-grabbing discussion under way on LinkedIn. The subject is –
"Moving to POWER7 from SPARC"
I think this is quite an exciting topic, or rather a very strategic decision to make, so I decided to write a brief summary note on it.
The discussion starts with the question –
“My employer is considering seriously moving to POWER7 from SPARC as we retire EOSL (End of Service Life) hardware. Has anyone considered such a move or made the move from POWER to SPARC? “
Very difficult to answer isn’t it? Yes, it is!
I'm just trying to summarize the comments that came from individual experts.
- A very valid point I agree with - I would suggest not changing everything unless there is a very compelling reason. A point to ponder: if there is a particular need to run AIX, you can just integrate an AIX server or cluster into your data centre without changing everything else, which is less overhead and a sensible decision. Though this is not a technical point, it would be my first thought if my management asked me to do so.
- Datacenter power, cooling and space point of view - If you are struggling with power, cooling and space issues at the datacenter, then consider mixing T-series and M-series SPARC systems. The T-series are very power efficient and work best with highly parallel applications, but they do not do well on single-threaded applications; for those you need to stick with the M-series, which works great on single-threaded apps.
- Another point - the cost factor - Solaris runs on x86 (Intel/AMD architecture), which is an option we don't have with AIX. For raw processing power and large memory footprints, Solaris 10 on Nehalem Intel is very compelling. You don't get all the RAS features of SPARC hardware, but if you have load-balanced applications or an edge layer you can move there, it can be a great fit. Solaris support for x64 CPUs also gives excellent price/performance.
- The very vital point - With the recent SPARC T3 servers, aka "Rainbow Falls", announced last week at Oracle OpenWorld, POWER7 isn't as desirable a platform. Considering that a SPARC T3-4 can perform as well as, and in many benchmarks better than, a 4-socket POWER7 box at considerably lower TCO, I don't see the point in the IBM/AIX/POWER route.
The SPARC T3 processor has the following specifications:
- 16 Cores x 8 Threads = 128 Threads running at 1.65 GHz
- 2 x Execution Units per Core with 4 x Threads Each
- 1 x Floating Unit and 1 x Crypto Unit per Core
- 6MB of Shared L2 Cache
- 2 x DDR3 Memory Controllers
- 6 x Coherency Links for Glue-Less SMP up to 4 Sockets
- 2 x 10GbE NIU Controllers
- 2 x PCI-E 2.0 Controllers -> 8GB/s Bi-Directional I/O Each
Not done yet! Here are more details of the various flavors of T3.
- Licensing factor - IBM will charge you for licenses left and right for each feature, especially on the virtualization front (LPARs, MPARs, and WPARs), not to mention all the external components (HMCs, etc.). Whereas you can use Oracle VM Server for SPARC (LDoms) for free with the server and only pay for an RTU and support for S10 or S11 once for the whole machine (you can have hundreds or thousands of guests for no additional charge)! And don't forget that Solaris Containers are free and available on x86 and SPARC. Plus they can be used in LDoms (T-series) and Dynamic System Domains (M-series) for free! The Oracle core licensing factor on SPARC T3 is 0.25.
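To make the "free with the hardware" point concrete, here is a minimal sketch of carving out an LDom guest and a Solaris Container; the names and sizes are placeholders, and a real guest would still need virtual disk and network devices added before it is usable:
# ldm add-domain ldg1
# ldm set-vcpu 8 ldg1
# ldm set-memory 8G ldg1
# ldm bind-domain ldg1
# ldm start-domain ldg1
# zonecfg -z webzone "create; set zonepath=/zones/webzone; commit"
# zoneadm -z webzone install
# zoneadm -z webzone boot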
- AIX has no equivalent technology to ZFS.
- Solaris can scale to >64 CPUs to solve extremely large problems.
- Point to be noted - AIX on POWER is a good platform, I don't want to badmouth AIX. But here are the two biggest issues with the platform:
1. Costs
2. Finding enough AIX folks to support you! [That's not to say there are lots of resources out there for supporting Solaris either - I mean "GOOD RESOURCES"…]
- Punch line - SPARC is not dead; actually it is more alive than ever, and Solaris is the most advanced OS on the market: why change?
- Virtualization point of view – AIX has got LiveMotion, but Sun has not. BUT there are ways. Let's discuss this in detail.
You can migrate your LDoms with a cold or warm migration; see the LDoms 1.3 Admin Guide, in the chapter "Migrating Logical Domains", in particular the section "Migrating an Active Domain" on page 129 (Oracle VM Server for SPARC 2.0). With a warm migration, the LDom is suspended and moved within seconds to minutes, depending on its size and the network bandwidth. As for rebooting service domains, if you're on something like a T5240 or T5440, you can use external boot storage (SAN/JBOD/iSCSI) to keep the second service domain up and have enough PCI-E slots for redundancy. By splitting your redundancy (network and storage) between them, your LDom guests will continue running with IPMP and MPxIO, so no downtime. FYI, even if you reboot the primary domain without a secondary service domain, the guests will continue to run and wait for the primary to return, so all is not lost. The only way you'd lose all of your LDoms is if you lose power or reset the SC. You can also use Solaris Cluster to automate the migration or fail-over of LDom guests.
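For the record, the migration itself is a single ldm command; a sketch, assuming a guest named ldg1 and a target control domain reachable as target-host (the -n form does a dry run that only checks whether the migration would succeed):
# ldm migrate-domain -n ldg1 root@target-host   # dry run: verify requirements only
# ldm migrate-domain ldg1 root@target-host      # perform the actual migration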
The bad things about the IBM PowerVM setup are the following:
1. High overhead! The VIO overhead can easily consume 40%-50% of your resources! Whereas on a T-series with LDoms, it's one core for the primary domain and less than 10% overhead for the networking and storage I/O virtualization. The CPU threads are partitioned at the hypervisor and CPU level. RAM is virtualized at the hypervisor and MMU level. And on the M-series with Dynamic Domains you have 0% overhead, because the CPU, memory and I/O are electrically partitioned on the centerplane. Even Solaris Containers have extremely low overhead, usually less than 5%. So I can get more out of my Solaris servers than you can in the IBM PowerVM world.
2. If there is a fault on a CPU or memory module, it can take out multiple LPARs/WPARs. Whereas on the M-series with Dynamic Domains, faults like that would only affect the domain the CMU module was on. You can even do a mirrored memory configuration on the M-series to be fully fault tolerant. Not to mention that the Dynamic Domains are electrically isolated, something you don't get anywhere else.
3. Costs - you have to buy licenses to enable PowerVM features, and they add up quickly, whereas LDoms and Dynamic Domains are free with the hardware.
NOTE: In AIX, a WPAR (the equivalent of a Sun Container) can be migrated on the fly, but Solaris Containers cannot. Not to worry - it is something Sun/Oracle is working on and we'll probably see it down the road. For now, if I need to live-migrate something, I would use an LDom!
I'm not biased against the AIX/POWER architecture, as this is a very vital decision to make and at the same time a very difficult one to conclude on, but taking the above points into account, here is what I think: I'll stick with the SPARC architecture!
Your comments on this will be very much appreciated.
Thursday, October 14, 2010
Solaris Flash Archives
Flash (flar) images are very useful in situations where you need cloning/imaging or crashed-server recovery. The flarcreate command creates a flash archive. A flash archive can be created on a system that is running a UFS root file system or a ZFS root file system. A flash archive of a ZFS root pool contains the entire pool hierarchy except for the swap and dump volumes and any excluded datasets; the swap and dump volumes are created when the flash archive is installed.
NOTE: By default, the flarcreate command ignores items that are located in "swap" partitions.
Let's see how we can work with flar image creation.
Create the archive:
For UFS:
# flarcreate -n "Solaris 10 10/09 build" -S -c -x /var/tmp/ /var/tmp/S10-1009.ufs.archive.sun4u-`date +'%Y%m%d%H%M'`
For ZFS:
# flarcreate -n "Solaris 10 10/09 build" -S -c /var/tmp/S10-1009.zfs.archive.sun4u-`date +'%Y%m%d%H%M'`
Where -
The "-n Solaris 10 10/09 build" implants a name into the FLAR image. The name should be something unique and meaningful to better identify it as the FLAR image for the system.
The "-x /var/tmp/" option causes the /var/tmp/ directory and its contents to be excluded from the FLAR image since it will not be needed in the FLAR image.
The -S option skips the disk space check and does not write archive size data to the archive. Without -S, flarcreate builds a compressed archive in memory before writing the archive to disk, in order to determine the size of the archive. The result of using -S is a significant decrease in the time it takes to create an archive.
The -c option tells flarcreate to compress the archive as it is written.
E.g. -
# time flarcreate -n "Solaris 10 10/09 build" -S -c /var/tmp/S10-1009.zfs.archive.sun4u-`date '+%m-%d-%y'`
Full Flash
Checking integrity...
Integrity OK.
Running precreation scripts...
Precreation scripts done.
Creating the archive...
Archive creation complete.
Running postcreation scripts...
Postcreation scripts done.
Running pre-exit scripts...
Pre-exit scripts done.
real 19m58.57s
user 13m42.99s
sys 1m55.48s
# ls -l /var/tmp/S10-1009.zfs.archive.sun4u*
-rw-r--r-- 1 root root 5339709933 Oct 14 04:54 /var/tmp/S10-1009.zfs.archive.sun4u-10-14-10
# flar info /var/tmp/S10-1009.zfs.archive.sun4u-10-14-10
archive_id=2f27a01690ce4fcaf398e638fcdcb66e
files_archived_method=cpio
creation_date=20101014093417
creation_master=XXXXXX
content_name=Solaris 10 10/09 build
creation_node=XXXXXXXX
creation_hardware_class=sun4u
creation_platform=SUNW,Sun-Fire-V240
creation_processor=sparc
creation_release=5.10
creation_os_name=SunOS
creation_os_version=Generic_142900-09
rootpool=rpool
bootfs=rpool/ROOT/s10s_u8wos_08a_Pre-patch
snapname=zflash.101014.04.10
files_compressed_method=compress
content_architectures=sun4c,sun4d,sun4m,sun4u,sun4s,sun4us
type=FULL
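For completeness, the usual way to deploy such an archive is over the network with JumpStart and a flash-install profile. A minimal sketch is below; the profile path, NFS server name, archive path and disk names are placeholders, and the pool line applies to a ZFS-root archive:
# cat > /jumpstart/profiles/s10-flash.profile <<EOF
install_type     flash_install
archive_location nfs install-server:/export/flar/S10-1009.zfs.archive.sun4u-10-14-10
partitioning     explicit
pool rpool auto auto auto mirror c0t0d0s0 c0t1d0s0
EOF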
We can also use a small shell script to create a flar image -
#!/bin/sh
echo
echo "Enter image name, e.g. S10-1009.ufs.archive.sun4v"
read ANS
echo "Image Name: ${ANS}" > /etc/image_catalog
echo "Image Created on: `date`" >> /etc/image_catalog
echo "Image Created by: `/usr/ucb/whoami` on `hostname`" >> /etc/image_catalog
#
# Clean up wtmpx so that new machine won't have last logins
#
cat /dev/null > /var/adm/wtmpx
#
# Now create flar, excluding -x /var/tmp/
#
flarcreate -n ${ANS} -c -a `/usr/ucb/whoami` -x /var/tmp/ /var/tmp/${ANS}_`date +'%Y%m%d%H%M'`
Thursday, September 9, 2010
Oracle Solaris 10 9/10 Release = Oracle Solaris 10 Update 9 released on Sept 8, 2010
Yesterday, September 8, 2010, Oracle officially announced Oracle Solaris 10 9/10, Oracle Solaris Cluster 3.3 and Oracle Solaris Studio 12.2. For now we will concentrate on Oracle Solaris 10 9/10.
Here is a history of Solaris 10 Update releases –
Solaris 10 3/05 = Solaris 10 (FCS == First Customer Ship)
Solaris 10 01/06 = Solaris 10 Update 1
Solaris 10 06/06 = Solaris 10 Update 2
Solaris 10 11/06 = Solaris 10 Update 3
Solaris 10 8/07 Release = Solaris 10 Update 4
Solaris 10 5/08 Release = Solaris 10 Update 5
Solaris 10 10/08 Release = Solaris 10 Update 6
Solaris 10 5/09 Release = Solaris 10 Update 7
Solaris 10 10/09 Release = Solaris 10 Update 8 >>>> Currently we are here.
Oracle Solaris 10 9/10 Release = Oracle Solaris 10 Update 9 >>>> New Release
So what does Solaris 10 U9 include? Let's take a quick tour – there are some notable changes in this update:
- The most awaited: Oracle Solaris Containers now provide enhanced "P2V" (Physical to Virtual) capabilities, allowing customers to seamlessly move from existing Oracle Solaris 10 physical systems to virtual containers quickly and easily. On our project we developed a custom method to perform P2V from Solaris 8/9 to Solaris 10 Containers, and to be honest we were really looking forward to this feature.
- Host ID emulation - migration of a physical Solaris 10 machine into a zone with support for the host ID will allow more network management platforms to be virtualized while still retaining their licensing features.
- Oracle 11g Release 2 support
- Networking and database optimizations for Oracle Real Application Clusters (Oracle RAC)
- Increased reliability for virtualized Solaris instances when deployed using Oracle VM for SPARC, also known as Logical Domains
- ZFS device replacement enhancements - namely autoexpand
- Some changes to the zpool list command
- Holding ZFS snapshots
- Triple-parity RAID-Z (raidz3)
- The logbias property
- Log device removal - at last
- ZFS storage pool recovery
- New ZFS system process - in this release, each storage pool has an associated process, zpool-poolname
- Splitting a mirrored ZFS storage pool (zpool split)
For more information - http://dlc.sun.com/pdf/821-1840/821-1840.pdf
So what are you waiting for? Grab a copy for testing - http://www.oracle.com/technetwork/server-storage/solaris/downloads/index.html
HTH,
Tx,
--Nilesh--
Tuesday, August 31, 2010
The most annoying error - cannot mount '/home': directory is not empty
I was working on migrating a few SVM metadevice-based filesystems to ZFS on Solaris 10 U8, and for that reason I created the ZFS filesystems with a temporary mountpoint under /mnt, like:
# zfs create -o mountpoint=/mnt/home -o quota=2g rpool/home
After creating this filesystem, I sync the data from the metadevice-based filesystem to the ZFS filesystem residing under /mnt, like:
# rsync -axP --delete /home/ /mnt/home/
Once the data is migrated, I unmount the metadevice for /home and set the mountpoint of the ZFS filesystem back to /home using:
# zfs set mountpoint=/home rpool/home
While performing the above command I get the error –
# zfs mount -a
cannot mount '/home': directory is not empty
This error may occur for various reasons; one, but not the only one, is the automount daemon, and there may be others I'm not aware of… :)
I got rid of this pesky issue by performing an "overlay mount":
# zfs mount -O rpool/home
BTW, what is an overlay mount? – An overlay mount allows the file system to be mounted over an existing mount point, making the underlying file system inaccessible. If a mount is attempted on such a pre-existing mount point without setting this flag, the mount will fail, producing the error "device busy" or, in ZFS terms, "directory is not empty".
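To double-check the end result afterwards, a quick look at the dataset properties is enough (a generic check, not part of the original fix):
# zfs get mountpoint,mounted rpool/home
# df -h /home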
This solution worked just fine for me, but I cannot guarantee the same will work for you; as I said before, there are various factors that lead to such an issue.
HTH
Monday, August 9, 2010
alloc: /: file system full
Today I came across a strange file-system-full issue, with the error -
Aug 9 00:17:41 server1 ufs: [ID 845546 kern.notice] NOTICE: alloc: /: file system full
When I looked at the top disk space consumers I found nothing useful.
# df -h | sort -rnk 5
/dev/md/dsk/d0 3.0G 2.9G 0K 100% /
/dev/md/dsk/d3 2.0G 1.5G 404M 80% /var
/dev/md/dsk/d30 469M 330M 93M 79% /opt
/dev/md/dsk/d6 992M 717M 215M 77% /home
/dev/md/dsk/d33 752M 494M 198M 72% /usr/local/install
[...]
After doing a du on the whole filesystem I could see it showing only 2.5G, while df showed 2.9G of consumed space.
# du -shd /
2.5G
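As an aside, when du and df disagree like this I first rule out ordinary large files before blaming unlinked ones; a generic sketch (not output from this incident), where -d and -mount keep both commands from crossing into other filesystems:
# du -dk / | sort -n | tail -20                    # biggest directories on / only
# find / -mount -size +204800 -exec ls -l {} \;    # files larger than ~100MB on / only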
I realized that a few days back I had come across the same issue on a ZFS filesystem hosting an Oracle DB, and the understanding below helped me there.
Normally, if a filesystem is full, look around in the directories that will be hidden by mounted filesystems in higher init states, or see if any files are eating up the disk space. If you get nothing useful from this exercise, then one of the things to check is open files, and consider what has been cleaned up. Sometimes, if an open file is emptied or unlinked from the directory tree, the disk space is not de-allocated until the owning process has been terminated or restarted. The result is an unexplainable loss of disk space. If this is the cause, a reboot would clear it up. If you can't reboot, consider any process that would be logging to that partition as a suspect, and check all of your logs for any entries that imply rapid errors in a process.
In my case, a reboot was not possible, so I went looking for unlinked open files on the full filesystem with lsof:
# lsof +aL1 /
lsof: WARNING: access /.lsof_server1: No such file or directory
lsof: WARNING: created device cache file: /.lsof_server1
lsof: WARNING: can't write to /.lsof_server1: No space left on device
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
scp 16472 root 4r VREG 85,0 238616064 0 119696 / (/dev/md/dsk/d0)
scp 22154 root 4r VREG 85,0 238213120 0 119677 / (/dev/md/dsk/d0)
Where:
``+L1'' will select open files that have been unlinked. A specification of the form ``+aL1'' will select unlinked open files on the specified file system.
I got the process IDs via lsof; after verifying the processes I killed them, and suddenly ~450MB of space was released.
# df -kh | sort -rnk 5
/dev/md/dsk/d0 3.0G 2.5G 418M 86% /
/dev/md/dsk/d3 2.0G 1.5G 406M 80% /var
/dev/md/dsk/d30 469M 331M 91M 79% /opt
/dev/md/dsk/d6 992M 717M 215M 77% /home
/dev/md/dsk/d33 752M 494M 198M 72% /usr/local/install
Hope this helps.
Thursday, July 8, 2010
UFS to ZFS using LiveUpgrade, all went well but it made me sad when I deleted the snapshots/clones for zonepath
A few weeks ago I made a mistake; it was the height of stupidity for an experienced and sensible system administrator. While migrating my UFS-based system to ZFS, I landed in a situation that kept me upset for a long time. Below is a brief history of the problem.
I'm currently working on UFS to ZFS migration on systems with zones whose zonepath/zone root is already migrated to ZFS.
In the last upgrade/migration, the mistake I made was this - I deleted the snapshots/clones made by LiveUpgrade for the zones' zonepath/zone root.
The story behind it is this - when we create an ABE that will boot to ZFS, the LiveUpgrade program creates a snapshot of /[zonename]/zonepath and, from that snapshot, a clone, as shown below -
[...]
Creating snapshot for zone1/zonepath on [zone1/zonepath@s10s_u8wos_08a].
Creating clone for [zone1/zonepath@s10s_u8wos_08a] on zone1/zonepath-s10s_u8wos_08a.
Creating snapshot for zone2/zonepath on [zone2/zonepath@s10s_u8wos_08a].
Creating clone for [zone2/zonepath@s10s_u8wos_08a] on zone2/zonepath-s10s_u8wos_08a.
[...]
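(For reference, the ABE that produces log messages like the above comes from a lucreate run along these lines; a sketch, assuming the current BE is named Sol10u8 and the new ZFS root pool is rpool:)
# lucreate -c Sol10u8 -n s10s_u8wos_08a -p rpool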
After creating the ABE for ZFS, we can luactivate it and boot into ZFS.
# luactivate s10s_u8wos_08a
Up to this point everything worked fine.
# init 6
# lustatus
Boot Environment Is Active Active Can Copy
Name Complete Now On Reboot Delete Status
-------------------------- -------- ------ --------- ------ ----------
Sol10u8 yes no no yes -
s10s_u8wos_08a yes yes yes no -
From the above output you can see I'm in the ZFS boot environment.
Now the real stupidity starts from the portion below. Unfortunately, due to a lack of study before task execution, I deleted the snapshots/clones using -
# zfs list -H -t snapshot | cut -f 1
zone1/zonepath-Sol10u8@s10s_u8wos_08a
zone2/zonepath-Sol10u8@s10s_u8wos_08a
zone3/zonepath-Sol10u8@s10s_u8wos_08a
# for snapshot in `zfs list -H -t snapshot | cut -f 1`; do zfs destroy -R -f $snapshot; done
From this point onward I lost my zonepaths and was not able to boot my zones. WOW... a horrible Sunday starts here!
A few weeks later I found the answer, as described below -
Solution -
Thanks to Alex and Julien for their suggestions and help.
I had removed/destroyed the snapshots/clones meant for the new ZFS BE.
After this point I tried mounting the UFS BE, but due to the unclean zonepaths it failed and left stale mountpoints. To clear those mount points I rebooted the server.
From this point onward, after analyzing the ICF* files, I decided to re-create the zonepaths using the old UFS BE. The zpool history command helped me here.
# zfs clone zone1/zonepath-d100@s10s_u8wos_08a zone1/zonepath-s10s_u8wos_08a
# zfs set mountpoint=/zone1/zonepath-s10s_u8wos_08a zone1/zonepath-s10s_u8wos_08a
# zfs set zpdata:rbe=s10s_u8wos_08a zone1/zonepath-s10s_u8wos_08a
# zfs set zpdata:zn=zone1 zone1/zonepath-s10s_u8wos_08a
# zfs set mountpoint=/zone1/zonepath-s10s_u8wos_08a zone1/zonepath-s10s_u8wos_08a
# zfs set canmount=off zone1/zonepath-s10s_u8wos_08a
# zfs set canmount=on zone1/zonepath-s10s_u8wos_08a
# zfs rename zone1/zonepath-s10s_u8wos_08a zone1/zonepath
# zfs set mountpoint=/zone1/zonepath zone1/zonepath
Likewise, one by one, I re-created/recovered the zonepath snapshots/clones for all the zones that were at fault.
This way the zonepaths were back again, and the fallback to the UFS BE went smoothly. The main thing, however, was that the patching needed to be re-done, as the zonepaths had been re-created from the unpatched UFS BE.
After the backout I deleted the ZFS BE, and from that point onwards I created a new ZFS BE and patched it once again.
I hope no one is as careless as me to make such a mistake, but if you do, here you go!
Tuesday, July 6, 2010
ZFS Revisited
Understanding ZFS & ZFS ARC/L2ARC
It has been almost a year that I've been working on ZFS filesystem administration, and I just finished migrating all of our Solaris servers from UFS root to ZFS root, so now OS data and application data reside on ZFS. The intention of this document is to have some handy notes about ZFS and its cache mechanism, which always plays a vital role in system performance!
This is just a revisit of the features/techniques of this great filesystem known as ZFS.
Let's start with assumptions. Over the last year I've noticed some assumptions that I and others were carrying, and before we jump into anything, that assumption queue should be cleared, so let's do that.
Some assumptions about ZFS –
• Tuning ZFS is Evil (TRUE)
• ZFS doesn’t require tuning (FALSE)
• ZFS is a memory hog (TRUE)
• ZFS is slow (FALSE)
• ZFS won't allow corruption (FALSE)
Alright then, now we are clear on a few of the assumptions and know the facts! Let's take a look at a few important features of ZFS –
• ZFS is the world’s first 128-bit file system and as such has a huge capacity.
• Capacity-wise: single filesystems of 512TB+ (theoretically 2^64 devices * 2^64 bytes)
• Trillions of files in a single file system (theoretically 2^48 files per fileset)
• ZFS is a transaction-based, copy-on-write filesystem, so no "fsck" is required
• High Internal data redundancy (New RAID level called RAIDz)
• End-to-end checksums of all data/metadata with a strong algorithm (SHA-256), though it can be CPU-consuming at times, so it can be turned OFF for data and left ON for metadata. Checksums are used to validate blocks.
• Online integrity verification and reconstruction
• Snapshots, filesets, compression, encryption facilities and much more!
What is the ZFS filesystem capacity?
2^64 — Number of snapshots of any file system
2^48 — Number of entries in any individual directory
16 EB (2^64 bytes) — Maximum size of a file system
16 EB — Maximum size of a single file
16 EB — Maximum size of any attribute
256 ZB (2^78 bytes) — Maximum size of any zpool
2^56 — Number of attributes of a file (actually constrained to 2^48 for the number of files in a ZFS file system)
2^64 — Number of devices in any zpool
2^64 — Number of zpools in a system
2^64 — Number of file systems in a zpool
How about data redundancy in ZFS?
ZFS supports the following RAID configurations; a few pool-creation examples follow the list.
> Stripes (RAID-0)
> Mirroring (RAID-1)
> RAID-Z (Similar to RAID-5)
> RAID-Z2 (Double parity, similar to RAID-6)
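A quick sketch of what creating each of these layouts looks like (pool and disk names are placeholders):
# zpool create tank c0t0d0 c0t1d0                        # stripe (RAID-0)
# zpool create tank mirror c0t0d0 c0t1d0                 # mirror (RAID-1)
# zpool create tank raidz c0t0d0 c0t1d0 c0t2d0           # RAID-Z, single parity
# zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0   # RAID-Z2, double parity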
Hey, what is it…WOW no FSCK needed? HOW?
Yes, it's true. ZFS does not need fsck to correct filesystem errors/corruption, thanks to its COW behavior. ZFS maintains its records as a tree of blocks. Every block is accessible via a single block called the "uber-block". When you change an existing block, instead of it being overwritten, a copy of the data is made and then modified before being written to disk; this is Copy on Write (COW). This ensures ZFS never overwrites live data, which guarantees the integrity of the file system, as a system crash still leaves the on-disk data in a completely consistent state. There is no need for fsck. Ever.
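On a related note, even though fsck is never needed, you can still verify on-disk integrity online with a scrub; a small example with a placeholder pool name:
# zpool scrub tank
# zpool status -v tank    # shows scrub progress plus any repaired or unrecoverable errors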
Does ZFS have a self-healing data feature?
Yes, provided that –
• If a Bad Block is found ZFS can repair it so long as it has another copy
• RAID-1 - ZFS can “heal” bad data blocks using the mirrored copy
• RAID-Z/Z2 - ZFS can “heal” bad data blocks using parity
Also note that, in a non-redundant pool, self-healing is only available for ZFS metadata (which is stored with extra copies); healing actual application data requires one of the redundant configurations above.
ZFS Best Practices –
• Tune recordsize only on fixed records DB files
• Mirror for performance
• 64-bit kernel (allows greater ZFS caches)
• configure swap (don't be scared by low memory)
• Don't slice up devices (confuses I/O scheduler)
• Isolate DB log writer if that is critical (use few devices)
• Separate the root pool (the system's identity) from data pools (the system's function)
• Keep pool below 80% full (helps COW)
• Use snapshots/clones for backup/DR purposes (a small send/receive sketch follows this list)
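A minimal sketch of that snapshot-based backup/DR idea; the dataset, snapshot and remote host names are placeholders:
# zfs snapshot rpool/export/home@backup-20101226
# zfs send rpool/export/home@backup-20101226 | ssh drhost "zfs receive backup/home"
# zfs clone rpool/export/home@backup-20101226 rpool/export/home-test   # writable copy for testing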
Okay then, let’s talk about ARC first.
ARC stands for "Adaptive/Adjustable Replacement Cache". The ARC is a very fast cache located in the server's memory (RAM). The amount of memory available to the ARC is usually all of main memory minus 1GB, or 3/4 of main memory, whichever is greater; a simple calculation shows 3GB on a 4GB machine.
Originally, the ARC concept was described and invented by two IBM researchers, Megiddo and Modha, in November 2003, but the ZFS ARC is a significantly modified version of the original ARC design.
The major differences of the ZFS ARC are –
• The ZFS ARC is variable in size and can react to the available memory; maybe that's the reason it is called an "Adjustable Replacement Cache". It can grow when memory is available, or shrink when memory is needed for other processes/jobs.
• The design proposed by Megiddo and Modha assumes a fixed block size, but the ZFS ARC works with multiple block sizes.
• Under the ZFS ARC you can lock pages in the cache to exempt them from removal. This prevents the cache from removing pages that are currently in use. This feature is not in the original ARC design.
The ZFS ARC caches ZFS data and metadata from all active storage pools in physical memory (RAM), by default as much as possible except for 1 GB of RAM or 3/4 of main memory, BUT I would say this is just a rule of thumb, and depending on the environment, tuning needs to be done for better system performance. Consider limiting the maximum ARC memory footprint in the following situations:
• When a known amount of memory is always required by an application. Databases often fall into this category.
• On platforms that support dynamic reconfiguration of memory boards, to prevent ZFS from growing the kernel fence in onto all boards.
• A system that requires large memory pages might also benefit from limiting the ZFS cache, which tends to break down large pages into base pages.
• Finally, if the system is running another non-ZFS file system, in addition to ZFS, it is advisable to leave some free memory to host that other file system's caches.
The ARC grows and consumes memory on the theory that there is no need to return memory to the system while there is still plenty free. When the ARC has grown and memory pressure exists, for example when a new application starts up, the ARC releases its hold on memory. Lastly, it is very important to note that ZFS is not designed to steal memory from applications; ZFS is very innocent!
By default, UFS uses page caching managed by the virtual memory system, whereas ZFS does not use page caching except for a few types of files; ZFS uses the ARC. There can be only one ARC per system, however the caching policy can be changed on a per-dataset basis.
As I said before, to make sure applications/databases have enough dedicated memory available, you need to tune or cap the ARC.
# prtconf | grep Mem
Memory size: 98304 Megabytes
# grep zfs /etc/system
set zfs:zfs_arc_min = 1073741824
set zfs:zfs_arc_max = 17179869184
So here, I have 96GB of total physical memory and the ARC is capped at 16GB, so roughly ~17% of memory for arc_max and 1GB for arc_min.
The following command gives the current memory size in bytes that is used by ZFS cache:
# kstat zfs::arcstats:size
module: zfs instance: 0
name: arcstats class: misc
size 15325351536
The rules of thumb for tuning the ARC are –
• Know your future application memory requirements; say an application requires 20% of overall memory, then it makes sense to cap the ARC so that it does not consume more than the remaining 80%.
• Understand and get a feel for your existing applications. If an application uses large memory pages, limiting the ARC actually helps that application, as it prevents ZFS from breaking up the large pages and fragmenting memory; limiting the ARC preserves the availability of large pages.
• It's certainly not easy to tune the ARC, and it needs a fairly deep understanding of applications and their needs, though it always helps to have the following scripts handy in your tools/script depot - arc_summary.pl (by Ben Rockwood) & arcstat.pl (by Neelakanth Nadgir).
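Even without those scripts, the raw ARC counters are exposed through kstat; a quick sketch of the values I tend to glance at (these are standard arcstats fields):
# kstat -p zfs:0:arcstats:size zfs:0:arcstats:c zfs:0:arcstats:c_max
# kstat -p zfs:0:arcstats | egrep 'hits|misses'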
Now let’s have a small discussion about L2ARC
L2ARC is a new layer between Disk and the cache (ARC) in main memory for ZFS. It uses dedicated storage devices to hold cached data. The main role of this cache is to boost the performance of random read workloads. The intended L2ARC devices include 10K/15K RPM disks like short-stroked disks, solid state disks (SSD), and other media with substantially faster read latency than disk.
L2ARC, or Level 2 caching in ARC makes it possible to use a disk in addition to RAM for caching reads. This improves read performance, which can otherwise be slow because of the fragmentation caused by the copy-on-write model used by ZFS.
The L2ARC cache actually appeared in Solaris 10 Update 6; however, as far as I know, L2ARC is well supported and functional from Solaris 10 Update 8 onwards.
The L2ARC attempts to cache data from the ARC before it is evicted, so the L2ARC populates its cache by periodically reading data from the tail of the ARC. The data at the tail of the ARC is the data that hasn't been used for a while. I'm still reading up on how exactly the L2ARC works and how it relates to the ARC.
Do I really need L2ARC?
It all depends on the situation: if you have a system where a lot of data is being read frequently and you want better performance, then yes. If not, it still won't hurt performance, only improve it.
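For reference, an L2ARC device is just added to a pool as a cache vdev; a sketch with placeholder pool and disk names:
# zpool add tank cache c1t5d0
# zpool iostat -v tank        # the cache device shows up in its own section
# zpool remove tank c1t5d0    # cache devices can also be removed at any time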
There are a lot more things to talk about and share; you can also find this information where I found it - the official Sun books, solarisinternals.com and some great blogs!
It was a great ZFS year, and I'm glad that the Sun (now Oracle) folks have worked, and keep working, hard to make ZFS a legend in the filesystem arena!
It's being almost a year that I’m working on ZFS Filesystem Administration and just finished migrating all of our Solaris servers from UFS root to ZFS root, so now OS data and Application data resides on ZFS. The intention to write this document is to have some handy notes about ZFS & it’s cache mechanism which always plays a vital role in terms of system performance!
This is just a revisit the features/techniques of this great filesystem known as ZFS.
Let’s start from assumptions. In last one year I’ve experienced that there were some assumption that I and others were carrying but before we jump into anything assumption queue should get clear so let’s do that.
Some assumptions about ZFS –
• Tuning ZFS is Evil (TRUE)
• ZFS doesn’t require tuning (FALSE)
• ZFS is a memory hog (TRUE)
• ZFS is slow (FLASE)
• ZFS won’t allow corruption (FLASE)
Alright then, now we clear on few of assumptions and now known to the facts! Let’s take a look at few important features of ZFS –
• ZFS is the world’s first 128-bit file system and as such has a huge capacity.
• Capacity wise Single filesystems 512TB+ (theoretical 264 devices * 264 bytes)
• Trillions of files in a single file system (theoretical 248 files per fileset)
• ZFS is Transaction based, copy-on-write filesystem, so no "fsck" is required
• High Internal data redundancy (New RAID level called RAIDz)
• End-to-end checksum of all data/metadata with Strong algorithm (SHA-256) but CPU consuming sometimes! So can be turn off on data and ON on metadata. Checksums are used to validate blocks.
• Online integrity verification and reconstruction
• Snapshots, filesets, compression, encryption facilities and much more!
What is the ZFS filesystem capacity?
264 — Number of snapshots of any file system
248 — Number of entries in any individual directory
16 EB (264 bytes) — Maximum size of a file system
16 EB — Maximum size of a single file
16 EB — Maximum size of any attribute
256 ZB (278 bytes) — Maximum size of any zpool
256 — Number of attributes of a file (actually constrained to 248 for the number of files in a ZFS file system)
264 — Number of devices in any zpool
264 — Number of zpools in a system
264 — Number of file systems in a zpool
How about data redundancy in ZFS?
ZFS supports following RAID configurations
> Stripes (RAID-0)
> Mirroring (RAID-1)
> RAID-Z (Similar to RAID-5)
> RAID-Z2 (Double parity, similar to RAID-6)
Hey, what is it…WOW no FSCK needed? HOW?
Yes, it’s true. ZFS does not need fsck to correct filesystem errors/corruptions as due to COW behavior. ZFS maintains its records as a tree of blocks. Every block is accessible via a single block called the “uber-block”. When you change an existing block, instead of getting overwritten a copy of the data is made and then modified before being written to disk this is Copy on Write (COW). This ensures ZFS never overwrites live data. This guarantees the integrity of the file system as a system crash still leaves the on disk data in a completely consistent state. There is no need for fsck. Ever.
ZFS is self healing data feature?
Yes, provided that –
• If a Bad Block is found ZFS can repair it so long as it has another copy
• RAID-1 - ZFS can “heal” bad data blocks using the mirrored copy
• RAID-Z/Z2 - ZFS can “heal” bad data blocks using parity
Also note that self healing is avail for ZFS metadata but not to actual application data.
ZFS Best Practices –
• Tune recordsize only on fixed records DB files
• Mirror for performance
• 64-bit kernel (allows greater ZFS caches)
• configure swap (don't be scared by low memory)
• Don't slice up devices (confuses I/O scheduler)
• Isolate DB log writer if that is critical (use few devices)
• Separate Root pool (system's identify) and data pools (system's function)
• Keep pool below 80% full (helps COW)
• Usage of snapshots/clones for backup/DR purpose
Okay then, let’s talk about ARC first.
ARC stands for “Adaptive/Adjustable Replacement Cache” - ARC is a very fast cache located in the server’s memory (RAM). The amount of ARC available in a server is usually all of the memory except for 1GB or 3/4th of main memory whichever is greater so simple calculation shows 3GB on your 4GB machine.
Originally, ARC concepts has been first described and invented by two IBM researchers Megiddo and Modha in November 2003 but ZFS ARC is significantly modified version of original ARC design.
The major differences of the ZFS ARC are –
• The ZFS ARC is variable in size and can react to the available memory; maybe that’s the reason it is called an “Adjustable Replacement Cache”. It can grow when memory is available, or shrink when memory is needed by other processes/jobs.
• The design proposed by Megiddo and Modha assumes a fixed block size, but the ZFS ARC works with multiple block sizes.
• With the ZFS ARC you can lock pages in the cache to exclude them from removal. This prevents the cache from evicting pages that are currently in use. This feature is not in the original ARC design.
By default the ZFS ARC caches ZFS data and metadata from all active storage pools in physical memory (RAM), up to all of RAM except 1 GB, or 3/4th of main memory, BUT I would say this is just a thumb rule or theoretical rule, and depending on the environment tuning needs to be done for better system performance. Consider limiting the maximum ARC memory footprint in the following situations:
• When a known amount of memory is always required by an application. Databases often fall into this category.
• On platforms that support dynamic reconfiguration of memory boards, to prevent ZFS from growing the kernel cage onto all boards.
• A system that requires large memory pages might also benefit from limiting the ZFS cache, which tends to break down large pages into base pages.
• Finally, if the system is running another non-ZFS file system in addition to ZFS, it is advisable to leave some free memory to host that other file system's caches.
The ARC grows and consumes memory on the theory that there is no need to return data to the system while there is still plenty of free memory. When the ARC has grown and memory pressure exists, for example when a new application starts up, the ARC releases its hold on memory. Lastly, and very important to note, ZFS is not designed to steal memory from applications; ZFS is very innocent!
By default, UFS uses page caching managed by the virtual memory system; however, ZFS does not use page caching, except for a few types of files! ZFS uses the ARC. There can be only one ARC per system, however the caching policy can be changed on a per-dataset basis.
As I said before, to make sure applications/databases have enough dedicated memory available, you need to perform tuning or capping of the ARC.
# prtconf | grep Mem
Memory size: 98304 Megabytes
# grep zfs /etc/system
set zfs:zfs_arc_min = 1073741824
set zfs:zfs_arc_max = 17179869184
So here I have 96 GB of total physical memory and the ARC is capped at 16 GB. That is, roughly ~17% of memory is set as arc_max and 1 GB as arc_min.
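If you want to derive those byte values yourself, it is simple powers-of-two arithmetic; bc is used here just as a calculator:
# echo '16*1024^3' | bc <<< zfs_arc_max: 16 GB in bytes
17179869184
# echo '1*1024^3' | bc <<< zfs_arc_min: 1 GB in bytes
1073741824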
The following command gives the current memory size in bytes that is used by ZFS cache:
# kstat zfs::arcstats:size
module: zfs instance: 0
name: arcstats class: misc
size 15325351536
The thumb rules for tuning the ARC are –
• Know your future application memory requirements. Say an application requires 20% of overall memory; then it makes sense to cap the ARC so that it does not consume more than the remaining 80% of memory.
• Understand and get a feel for your existing applications. If an application is known to use large memory pages, then capping the ARC helps that application, because limiting the ARC prevents ZFS from breaking up the pages and fragmenting memory. Limiting the ARC preserves the availability of large pages.
• It’s certainly not easy to tune the ARC, and it needs quite a deep understanding of applications and their needs, though it always helps to have the following scripts handy and added to your tools/script depot - arc_summary.pl (by Ben Rockwood) & arcstat.pl (by Neelakanth Nadgir); a raw kstat one-liner follows this list.
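If those scripts are not at hand, the same counters can be pulled straight from kstat; this is just a quick sketch of the raw statistics the scripts parse:
# kstat -p zfs:0:arcstats | egrep 'size|hits|misses' <<< Current ARC size plus hit/miss counters (and related statistics)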
Now let’s have a small discussion about L2ARC
L2ARC is a new layer between Disk and the cache (ARC) in main memory for ZFS. It uses dedicated storage devices to hold cached data. The main role of this cache is to boost the performance of random read workloads. The intended L2ARC devices include 10K/15K RPM disks like short-stroked disks, solid state disks (SSD), and other media with substantially faster read latency than disk.
L2ARC, or Level 2 caching in ARC makes it possible to use a disk in addition to RAM for caching reads. This improves read performance, which can otherwise be slow because of the fragmentation caused by the copy-on-write model used by ZFS.
The L2ARC cache is actually supported from Solaris 10 Update 6; however, as far as I know, L2ARC is well supported and fully functional from Solaris 10 Update 8 onwards.
The L2ARC attempts to cache data from the ARC before it is evicted, so the L2ARC populates its cache by periodically reading data from the tail of the ARC. The data in the tail of the ARC is the data that hasn’t been used for a while. I’m still reading up on how exactly the L2ARC works and how it relates to the ARC and all that stuff.
Do I really need L2ARC?
It all depends on the situation: if you have a system where a lot of data is being read frequently and you want better performance, then yes. If not, it still won’t hurt performance, only improve it.
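For completeness, adding an L2ARC device to an existing pool is a one-liner; the pool and device names below are only placeholders:
# zpool add datapool cache c2t0d0 <<< Attach an SSD (or other fast device) as L2ARC for datapool
# zpool status datapool <<< The device shows up under a "cache" section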
There are a lot more things to talk about & to share; however, you can also find this information where I found it: Sun official books, Solarisinternal.com and some great blogs!
It was a great ZFS year, and I’m glad that the Sun (now Oracle) folks have worked, and are still working, hard to make it a legend in the filesystem arena!
Saturday, July 3, 2010
Swap is in use by Live Upgrade!!!
I was just working on a few ZFS post-migration cleanup tasks: creating new ZFS filesystems for data currently on UFS SVM devices, rsync'ing them over, ludelete'ing the UFS BE and finally adding the freed-up disk to the rpool. While doing this I ran into an error while deleting the swap space held by one of the soft partitions.
I was in the process of clearing the metadevices; one of the soft partitions was holding 32G of swap space and the system was not allowing me to delete it. The error I was getting was –
# swap -l
swapfile dev swaplo blocks free
/dev/md/dsk/d34 85,34 16 67108848 67108848
# swap -d /dev/md/dsk/d34
/dev/md/dsk/d34: Not enough space
This was obvious because the system was using this device as the active swap.
# top -c
last pid: 5639; load avg: 3.78, 3.40, 3.41; up 25+15:39:04
1327 processes: 1258 sleeping, 64 zombie, 5 on cpu
CPU states: 91.0% idle, 2.5% user, 6.5% kernel, 0.0% iowait, 0.0% swap
Memory: 96G phys mem, 14G free mem, 32G swap, 32G free swap
Then I was wondering: where has my ZFS swap volume gone? Why isn't the system using this volume? So I tried making it active using swap -a, but I failed to do so & the system gave me the message below.
# swap -a /dev/zvol/dsk/rpool/swap
/dev/zvol/dsk/rpool/swap is in use for live upgrade -. Please see ludelete(1M).
Okay, so this was the first time ever that this had happened to me. Well, after scratching my head against the wall for a while I got the answer.
The swap -a attempt might fail if the swap area is already listed in /etc/vfstab or is in use by Live Upgrade. In this case, use the swapadd feature instead.
# /sbin/swapadd
# swap -l
swapfile dev swaplo blocks free
/dev/zvol/dsk/rpool/swap 256,1 16 2097136 2097136
# top -c
last pid: 13969; load avg: 2.34, 2.66, 2.84; up 25+16:07:28
1321 processes: 1255 sleeping, 64 zombie, 2 on cpu
CPU states: % idle, % user, % kernel, % iowait, % swap
Memory: 96G phys mem, 12G free mem, 64G swap, 64G free swap
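With the ZFS swap volume active again, the original goal of deleting the SVM swap device should now go through; a short sketch of the remaining cleanup, using the metadevice name from the listing above:
# swap -d /dev/md/dsk/d34 <<< Succeeds now that it is no longer the only active swap
# metaclear d34 <<< Remove the soft partition itself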
All right then, sometimes it’s good to scratch your head against the wall for a while… :) isn’t it?
HTH
Friday, June 25, 2010
Oracle 11.1.0.7.0 bug
This week, two days back I hit a bug in Oracle 11.1.0.7.0
If your database crashed due to SAN unavailability, an unexpected server crash, a power outage, etc., and you get the following errors while starting the database up:
ORA-27167: Attempt to determine if Oracle binary image is stored on remote server failed
ORA-27300: OS system dependent operation:parse_df failed with status: 2
ORA-27301: OS failure message: No such file or directory
ORA-27302: failure occurred at: parse failed
$ sqlplus '/as sysdba'
SQL*Plus: Release 11.1.0.7.0 - Production on Thu Jun 24 07:27:29 2010
Copyright (c) 1982, 2008, Oracle. All rights reserved.
Connected to an idle instance.
SQL> startup
ORA-00445: background process "MMNL" did not start after 120 seconds
This error indicates that you have hit bug 6813883 on Oracle version 11.1.0.7 [Metalink article ID: 784754.1].
After installing patch 6813883 using the OPatch utility, the database startup went very well and now the databases are running like a piece of cake!
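In case it helps, the apply/verify steps are roughly as below; the patch directory path is a placeholder, and the patch README for your exact environment is the real authority:
$ cd /path/to/6813883 <<< Directory where the downloaded patch was unzipped
$ $ORACLE_HOME/OPatch/opatch apply <<< Apply the patch (database must be shut down)
$ $ORACLE_HOME/OPatch/opatch lsinventory | grep 6813883 <<< Confirm the patch is now in the inventory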
Hope this helps someone...
Thursday, June 3, 2010
Modify number of CPUs from a pool/pset while it is running
Sometimes you need to modify the number of CPUs in a particular pool. In this case you only need to transfer CPUs from the default processor set (pset_default) to the pset of your pool.
In my case I've two psets available with me -
# poolcfg -dc info
[... Long Lines of Output ...]
pset oracle_pset
int pset.sys_id 1
boolean pset.default false
uint pset.min 4
uint pset.max 4
string pset.units population
uint pset.load 5563
uint pset.size 4
string pset.comment
pset pset_default
int pset.sys_id -1
boolean pset.default true
uint pset.min 1
uint pset.max 65536
string pset.units population
uint pset.load 574
uint pset.size 12
string pset.comment
[... Long Lines of Output ...]
So here you can see that I have 2 processor sets: oracle_pset with 4 CPUs & pset_default with 12 CPUs. Now the situation is that the application/DB demands more CPU capacity than it currently has, so you can modify the number of CPUs of the running pool/pset.
Here is a method to do so -
Save your current configuration:
# pooladm -s
Modify the CPU counts using "-d" (-d operates directly on the kernel state):
# poolcfg -dc 'modify pset oracle_pset ( uint pset.min = 6 ; uint pset.max = 6)'
Transfer 2 CPUs from pset_default to oracle_pset:
# poolcfg -dc 'transfer 2 from pset pset_default to oracle_pset'
Or, if you want to move a specific processor instead:
# poolcfg -dc 'transfer to pset oracle_pset ( cpu 5 )'
Update the configuration in the /etc/pooladm.conf file:
# pooladm -c
# poolcfg -dc info
[... Long Lines of Output ...]
pset oracle_pset
int pset.sys_id 1
boolean pset.default false
uint pset.min 6
uint pset.max 6
string pset.units population
uint pset.load 2009
uint pset.size 6
string pset.comment
pset pset_default
int pset.sys_id -1
boolean pset.default true
uint pset.min 1
uint pset.max 65536
string pset.units population
uint pset.load 498
uint pset.size 10
string pset.comment
[... Long Lines of Output ...]
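If you want to keep an eye on the psets while the transfer happens, poolstat also gives a rolling view; this is just a supplementary check, not a required step (interval in seconds):
# poolstat -r pset 5 <<< Print pset size/load statistics every 5 seconds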
Runtime example –
$vmstat 5
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr sd sd sd sd in sy cs us sy id
0 0 0 36536976 6040672 244 1343 0 0 0 0 0 0 0 0 0 1105 17598 1840 16 14 70
0 0 0 36529296 6040216 96 341 0 0 0 0 0 2 2 0 0 1034 12900 1268 13 8 79
0 0 0 36531896 6039864 367 1412 0 0 0 0 0 0 0 0 0 1145 33910 1625 16 15 69
processors removed: 4, 5
3 0 0 39565360 8833624 597 2343 0 0 0 0 0 7 7 0 0 1379 33522 5416 24 14 62
0 0 0 36533816 6040080 516 1999 0 0 0 0 0 0 1 0 0 823 15979 1893 21 17 62
7 0 0 36500632 6013496 788 4785 0 0 0 0 0 3 3 0 0 1029 69589 1848 37 24 39
16 0 0 36370632 5920912 2258 14106 0 0 0 0 0 3 6 0 0 1819 165593 3616 62 38 0
15 0 0 36466944 5978160 911 4854 0 0 0 0 0 9 9 0 0 2095 298114 3968 67 32 0
8 0 0 36579424 6058944 511 3664 0 0 0 0 0 0 0 0 0 1690 238234 4102 57 30 13
processors added: 4, 5
[see the performance boost, look at the processor idle column]
3 0 0 39565240 8833520 626 2467 0 0 0 0 0 7 7 0 0 9437 34315 7514 16 11 73
1 0 0 36645344 6103968 163 630 0 0 0 0 0 0 0 0 0 1243 14485 1989 15 9 76
1 0 0 36612504 6079352 205 1985 0 0 0 0 0 0 0 0 0 1393 86756 1705 28 15 57
0 0 0 36648152 6102368 199 1015 0 0 0 0 0 0 0 0 0 1247 17551 1909 18 10 72
0 0 0 36659264 6114368 46 672 0 0 0 0 0 6 6 0 0 1154 11101 1670 13 8 80
1 0 0 36668248 6118136 297 1124 0 0 0 0 0 2 1 0 0 1493 37868 3793 18 16 66
1 0 0 36674216 6121112 48 430 0 0 0 0 0 0 0 0 0 1268 12972 2075 14 9 77
2 0 0 36677736 6124456 317 1402 0 0 0 0 0 0 0 0 0 1414 18159 2336 17 10 73
1 0 0 36673120 6120576 365 1361 0 0 0 0 0 0 0 0 0 1360 17413 2494 14 10 75
1 0 0 36678104 6124032 217 713 0 0 0 0 0 1 1 0 0 1107 12223 1781 14 12 74
Hope this helps!
Friday, May 14, 2010
Migrating zones between sun4u and sun4v systems
I've recently started off with a new project which is a mixture of UFS --> ZFS migration, zones/containers migration from one host to another, & patching. The real challenge is that I have to do it with minimum downtime & have to be "real fast & accurate" in execution.
As I've already started with this project, before jumping in I did some detailed study of a few subjects related to it, so I thought of publishing my findings on my blog.
The first question that came to my mind was: if the zone is residing on a V890, i.e. sun4u arch, & I have to move it to a SPARC-Enterprise-T5120, i.e. sun4v arch, is that supported & if yes, how can it be done? The paragraphs below talk about it.
A recent (not that recent) RFE to make attach work across sun4u and sun4v - 6576592 RFE: zoneadm detach/attach should work between sun4u and sun4v architecture.
Starting with the Solaris 10 10/08 release, zoneadm attach with the -u option also enables migration between machine classes, such as from sun4u to sun4v.
Note for Solaris 10 10/08: If the new host has later versions of the zone-dependent packages and their associated patches, using zoneadm attach with the -u option updates those packages within the zone to match the new host. The update on attach software looks at the zone that is being migrated and determines which packages must be updated to match the new host. Only those packages are updated. The rest of the packages, and their associated patches, can vary from zone to zone.
This option also enables automatic migration between machine classes, such as from sun4u to sun4v.
Okay, now that I'm all clear on this doubt, let's move ahead and look at how to do the migration & what steps are involved.
Overview -
Migrating a zone from one system to another involves the following steps:
1. Detach the zone. This leaves the zone on the originating system in the "configured" state. Behind the scenes, the system generates a "manifest" of the information needed to validate that the zone can be successfully attached to a new host machine.
2. Migrate the data, or if your zones are on SAN then re-zone those LUNs. At this stage we may choose to move the data, or rezone the storage LUNs which hold the zone, to the new host system.
3. Create the zone configuration. At this stage we have to create the zone configuration on the new host using the zonecfg command.
4. Attach and, if required, update (-u) the zone. This validates that the host is capable of supporting the zone before the attach can succeed. The zone is left in the "installed" state.
5. Boot the zone & have fun, as this completes the zone migration. (The corresponding commands are sketched right after this list.)
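To put the steps above in command form, here is a bare-bones run; the zone name, zonepath and host names are only illustrative and match the trial-run example further down:
gz1_source# zoneadm -z zone1 halt <<< Step 1: stop the zone
gz1_source# zoneadm -z zone1 detach <<< Step 1: detach, zone goes to "configured"
(move or re-zone /zone1/zonepath so it is visible on gz1_dest) <<< Step 2
gz1_dest# zonecfg -z zone1 create -a /zone1/zonepath <<< Step 3: build the config from the detached zone
gz1_dest# zoneadm -z zone1 attach -u <<< Step 4: attach and update packages/patches to match the new host
gz1_dest# zoneadm -z zone1 boot <<< Step 5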
Let's talk more about point #2.
How to Move the zonepath to a new Host?
There are several ways to create an archive of the zonepath. You can use the cpio or pax commands/utilities to archive your zonepath.
There are also several ways to transfer the archive to the new host. The mechanism used to transfer the zonepath from the source host to the destination depends on the local configuration. One can go for SCP, FTP or if it's on ZFS then zfs send/receive etc.
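For the ZFS case, a minimal send/receive sketch would look roughly like this; the dataset names rpool/zones/zone1 and datapool/zones/zone1 are placeholders for whatever actually holds your zonepath:
gz1_source# zfs snapshot rpool/zones/zone1@migrate <<< Consistent point-in-time copy of the zonepath dataset
gz1_source# zfs send rpool/zones/zone1@migrate | ssh gz1_dest zfs receive datapool/zones/zone1 <<< Stream it to the new host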
In some cases, such as a SAN, the zonepath data might not actually move. The SAN might simply be reconfigured so the zonepath is visible on the new host. This is what we do in our environment & that's the reason I prefer to have zoneroot on SAN.
Try before you do
Starting with Solaris 10 5/08, you can perform a trial run before the zone is moved to the new machine by using the “no execute” option, -n.
Here are the details of how it actually works -
The zoneadm detach subcommand is used with the -n option to generate a manifest on a running zone without actually detaching the zone. The state of the zone on the originating system is not changed. The zone manifest is sent to stdout.
Then we can direct this output to a file or pipe it to a remote command to be immediately validated on the target host. The zoneadm attach subcommand is used with the -n option to read this manifest and verify that the target machine has the correct configuration to host the zone without actually doing an attach.
The zone on the target system does not have to be configured on the new host before doing a trial-run attach.
E.g.
gz1_source:/
# uname -m
sun4u
gz1_dest:/
# uname -m
sun4v
gz1_source:/
# zoneadm list -icv
ID NAME STATUS PATH BRAND IP
0 global running / native shared
7 zone1 running /zone1/zonepath native shared
gz1_source:/
# zoneadm -z zone1 detach -n | ssh gz1_dest zoneadm attach -n -
The validation is output to the source host screen, which is stdout.
I hope this information will help me to get started with project work.
Wednesday, May 12, 2010
Amazon EC2 pricing models
Cloud computing is changing the way IT resources are utilized. Nowadays cloud computing is one of the most innovative technology platforms & an emerging trend for the IT industry, & certainly not limited to it.
Cloud computing is a simple idea, but it can have a huge impact on business.
There are many vendors for enabling/providing cloud computing solutions like Amazon, VMWare, Rackspace and many more.
Today we will focus on Amazon EC2 (Amazon Elastic Compute Cloud).
Amazon is a leading public cloud computing provider; AWS (Amazon Web Services) falls in the infrastructure-as-a-service (IaaS) space, providing on-demand service using virtual server instances with unique IP addresses and blocks of storage.
It's news for me, and maybe for many of us, that AWS is opening a data center in Singapore to make its entry into the Asia Pacific region, which has created a lot of interest about Amazon Elastic Compute Cloud (Amazon EC2) in India.
So just to take a quick tour of AWS offerings - AWS provides various components/services like Amazon EC2, Amazon Simple Storage Service (Amazon S3), Amazon SimpleDB, Amazon Relational Database Service (Amazon RDS).
As AWS is stepping into the Asia Pacific region, I'm very excited to know what the pricing model for their service offerings will be. So let's take a look at it.
Amazon EC2 pricing follows a "pay-as-you-go" model (as in any other cloud computing model); however, the flexibility is at its best in the case of the Amazon EC2 pricing model.
When it comes to Amazon EC2 pricing models, the instances are grouped into three families:
1. Standard,
2. High-Memory
3. High-CPU.
Amazon EC2 pricing for each of these instances is as follows:
1. Standard Instances -- This family has memory-to-CPU ratios suitable for most general purpose applications. This Amazon EC2 pricing model ranges from $0.12 per hour to $0.96 per hour for services running on Windows infrastructure. For infrastructure running on Linux and UNIX, Amazon EC2 pricing ranges from $0.095 per hour to $0.76 per hour.
2. High-Memory Instances -- This Amazon EC2 pricing model offers larger memory sizes for high-throughput applications, including database and memory caching applications. It is priced at $0.62 per hour to $2.88 per hour for Windows based infrastructure. For Linux/UNIX based instances in this pricing model, prices range from $0.57 per hour to $2.68 per hour.
3. High-CPU Instances -- In this Amazon EC2 pricing model, proportionally more CPU resources are provided than memory (RAM), and as a result it targets compute-intensive applications. This Amazon EC2 suite is priced at $0.29 per hour to $1.16 per hour for Windows based infrastructure. For Linux or UNIX based infrastructure, this Amazon EC2 pricing model charges from $0.19 per hour to $0.76 per hour.
When choosing Amazon EC2 pricing types, organizations should consider characteristics of their application with regards to resource utilization. Accordingly, they should select the optimal instance family and size.
Now that we have understood the pricing model, the next question that comes to mind is: what about the support model? So here we go -
There are two AWS Premium Support offerings: Gold and Silver. The following is the breakup of the services.
Gold Support includes:
-Business day support (6 a.m. to 6 p.m.)
-24x7x365 coverage.
-One-on-one support via web-based ticketing system.
-One-on-one support via telephone.
-1 hour maximum response time for "Urgent issues".
-Guaranteed response time for non-urgent issues.
-Client-side diagnostic tools.
-Named support contacts.
Silver Support includes:
-Business day support (6 a.m. to 6 p.m.).
-One-on-one support via web-based ticketing system.
-Guaranteed response time for non-urgent issues.
-Client-side diagnostic tools.
-Named support contacts.
I hope it's worth knowing all this information as an IT infrastructure professional.
Also, as we all know, pictures talk louder than words, so for those who agree, here is a very good video for understanding cloud computing in simple language.
http://www.youtube.com/watch?v=XdBd14rjcs0
Hope this article helps...
Monday, May 10, 2010
Creating CPU resource pool & Processor set in Solaris 10
Resource pools are used for partitioning server resources. It's a workload management framework.
I mostly work on Solaris servers with Oracle databases hosted on them. Keeping to the subject, for my requirement I create CPU pools to support Oracle licensing. All global zone servers running Oracle containers should have at least one Oracle CPU pool, and the containers should be "bound" to this pool.
Oracle licensing is offered in two forms as far as I know - CPU based & user based licenses.
How to create pools?
# pooladm -e <<< The pools facility is not active by default when Solaris starts. pooladm -e explicitly activates the pools facility.
OPTION: If you wish to enable the resource pools function via SMF then just execute -
# /usr/sbin/svcadm enable svc:/system/pools:default
# pooladm -s <<< Save the current configuration to /etc/pooladm.conf
# pooladm <<< Shows current running pools configuration
Create Processor Set -
# poolcfg -c 'create pset oracle_pset (uint pset.min=4; uint pset.max=4)' <<< pset.min & pset.max are counted in hardware threads.
Create resource pool -
# poolcfg -c 'create pool oracle_pool' <<< Create pool
Associate resource pool & Processor Set -
# poolcfg -c 'associate pool oracle_pool (pset oracle_pset)' <<< Associate pool with pset
NOTE: The global zones scheduler should be set to use FSS.
Set the default scheduling class to FSS:
# dispadmin -d FSS
# poolcfg -c 'modify pool oracle_pool (string pool.scheduler="FSS")' <<< Enable FSS on pool
# pooladm -c <<< Activate the configuration (After executing this command you can see /etc/pooladm.conf has been modified with current configuration)
Now, once you're done creating & associating the pset & resource pool, the next thing to do is configure the non-global zone.
# zonecfg -z zone1
zonecfg:zone1> set pool=oracle_pool
zonecfg:zone1> exit
Once the zone configuration has been altered, you can bind the resource pool to the zone using -
# poolbind -p oracle_pool -i zoneid <zoneid> <<< The numeric zone ID is shown by zoneadm list -v
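A quick way to check that the binding took effect (just what I'd look at, nothing authoritative): log in to the zone and count the CPUs it can see; with the pool bound it should only see the pset's processors.
# zlogin zone1 psrinfo <<< Should list only the 4 CPUs of oracle_pset
# poolstat <<< Pool-level view from the global zone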
This procedure demonstrates how to create CPU/processor sets and resource pools, and how to bind them to containers.
Hope this will help.
Friday, May 7, 2010
FATAL: system is not bootable, boot command is disabled
In a system administration job, normally no news is good news... Today I was working with Solaris MPxIO on a V890 server model &, due to some unknown MPxIO misconfiguration under /kernel/drv/fp.conf, the whole system got messed up & I was left with no option but to rebuild the whole system.
Here is the crash pattern after MPxIO misconfiguration -
Rebooting with command: boot
Boot device: /pci@8,600000/SUNW,qlc@2/fp@0,0/disk@w21000014c3dbf465,0:a File and args:
SunOS Release 5.10 Version Generic_142900-01 64-bit
Copyright 1983-2009 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
NOTICE: error reading device label
NOTICE:
***************************************************
* This device is not bootable! *
* It is either offlined or detached or faulted. *
* Please try to boot from a different device. *
***************************************************
NOTICE: spa_import_rootpool: error 19
Cannot mount root on /pci@8,600000/SUNW,qlc@2/fp@0,0/disk@w21000014c3dbf465,0:a
fstype zfs
panic[cpu7]/thread=180e000: vfs_mountroot: cannot mount root
000000000180b950 genunix:vfs_mountroot+358 (800, 200, 0, 1872800, 189b400, 18cbc00)
Thank god that the system was not rolled out to production yet.
When I realized the system had crashed & that I might need to boot it into single-user mode for maintenance, I logged onto the SC, got to my console, and typed boot, as one does.
{1} ok boot
FATAL: system is not bootable, boot command is disabled
Ohh no... what a mess...
There are many errors which you never imagine or never see throughout your tiny professional life, & I hit this one today, for the first time in my 6 years of professional life.
Just in case you happen to hit this horrible error, here is the fix -
Set auto-boot? to false, reset the box, then set it back to true and finally boot, as shown below -
{1} ok setenv auto-boot? false
auto-boot? = false
{1} ok reset-all
SC Alert: Host System has Reset
Sun Fire V890, No Keyboard
Copyright 2007 Sun Microsystems, Inc. All rights reserved.
OpenBoot 4.22.34, 65536 MB memory installed, Serial #XXXXX.
Ethernet address X:XX:XX:XX:XX:XX, Host ID: XXXXX.
{1} ok setenv auto-boot? true
auto-boot? = true
{1} ok boot net - install nowin
.......... lots of output ........
This will rebuild your system now.
One advice - BE ALWAYS CAREFUL WHILE WORKING WITH MPxIO. YOUR BEST FRIEND MAY TURN INTO WORST ENEMY IF YOU HURT HIM...
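One more thought, offered only as a pointer (check the docs for your release before relying on it): on Solaris 10 the stmsboot utility is the supported way to toggle MPxIO, instead of hand-editing fp.conf, which is what got me into trouble here.
# stmsboot -e <<< Enable MPxIO (prompts for a reboot)
# stmsboot -d <<< Disable it again
# stmsboot -L <<< List the mapping between non-MPxIO and MPxIO device names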