
Monday, September 14, 2009

Solaris Crash dump stuff

This morning I decided to concentrate on learning more about crash/core dumps. While driving my car I was recalling what I already knew about this subject, and I finally realized that what I know is not enough!!! So I decided to take a deep look at crash dumps, core files, and core file management.

Before moving ahead, let us first understand: what is a crash dump, and what is a core dump?

Crash dump --> A crash dump is a dump of kernel memory. It is taken when the system crashes (a kernel panic); a crashing kernel produces a crash dump. It is configured using the dumpadm utility.

Core dump --> A core dump is a dump of the memory of a single process; a crashing application can produce a core file. It is configured using the coreadm utility.
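
Both are easy to inspect from the shell. A minimal sketch of checking (and, for core files, changing) the configuration - the global core pattern and path below are just my examples, not anything this host actually uses:

# dumpadm                                    # show current crash dump configuration
# coreadm                                    # show current core file configuration
# coreadm -e global -g /var/core/core.%f.%p  # example: also write a global core file
                                             # named after the program (%f) and PID (%p)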

Okay... Let's start with "how to generate a crash dump, or in fact how to force one".

Generating a crash dump on Solaris - This section will walk us through the ways available to force a crash dump.

There are 4 methods that I am aware of for getting crash dumps generated on a Solaris host.

1. OK> sync

This is the most common method for generating a crash dump on Solaris.
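
A minimal sketch of the whole sequence, assuming you have console access (how you send the break depends on your console setup):

# (send a break: Stop-A on a local Sun keyboard, or the break sequence
#  of your console server or tip session)
ok sync

The sync word syncs the file systems and then deliberately panics the kernel, which produces the crash dump before the system reboots.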

2. # reboot -d

This will reboot the host, and will generate a crash dump as part of the reboot.

3. # savecore -Lv

Where -

L - Save a crash dump of the live running Solaris system, without actually rebooting or altering the system in any way. This option forces savecore to save a live snapshot of the system to the dump device, and then immediately retrieve the data and write it out to a new set of crash dump files in the specified directory. Live system crash dumps can only be performed if you have configured your system with a dedicated dump device using dumpadm.

NOTE: savecore -L does not suspend the system, so the contents of memory continue to change while the dump is being saved. Dumps taken by this method are therefore not fully self-consistent.

v - verbose mode.

4. Using the uadmin administrative command

# uadmin 5 0
Sep 14 06:01:06 slabinfra4 savecore: saving system crash dump in /var/crash/XXXX/*.0

panic[cpu1]/thread=300024a60a0: forced crash dump initiated at user request

000002a100aa1960 genunix:kadmin+4a4 (b4, 0, 0, 125ec00, 5, 0)
%l0-3: 0000000001815000 00000000011cb800 0000000000000004 0000000000000004
%l4-7: 0000000000000438 0000000000000010 0000000000000004 0000000000000000
000002a100aa1a20 genunix:uadmin+11c (60016057208, 0, 0, ff390000, 0, 0)
%l0-3: 0000000000000000 0000000000000000 0000000078e10000 00000000000078e1
%l4-7: 0000000000000001 0000000000000000 0000000000000005 00000300024a60a0

syncing file systems... 2 1 done
dumping to /dev/md/dsk/d1, offset 859701248, content: kernel
100% done: 82751 pages dumped, compression ratio 3.05, dump succeeded
Program terminated
{1} ok
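
For reference, the 5 here is the A_DUMP command from uadmin(2), and the second argument selects what happens after the dump; the values below are as I recall them from /usr/include/sys/uadmin.h:

# uadmin 5 0    # A_DUMP + AD_HALT: force a crash dump, then halt (as above)
# uadmin 5 1    # A_DUMP + AD_BOOT: force a crash dump, then reboot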

===============================================


Creating a crash dump file using savecore, also known as "LIVE CRASH DUMPS" -

Okay.. While trying to generate the crash dump on the fly without suspending the system, I ran into the issue shown below -

# savecore -Lv
savecore: dedicated dump device required

I checked whether my dump device was configured or not,

# dumpadm
Dump content: kernel pages
Dump device: /dev/md/dsk/d1 (swap)
Savecore directory: /var/crash/XXXXXX
Savecore enabled: yes

Well, it is configured, so what is the issue? The catch is visible in the output above: the dump device is the swap device, and savecore -L will not write a live dump over active swap - it needs a dedicated dump device.

After a few minutes of searching I found Sun Document ID: 3284 - "How to capture a live system core dump without having a dedicated dump device" & I decided to go with this.

This solution talks about creating a separate dedicated device (a metadevice, or a blank file) and configuring dumpadm to use it.

Cool, no issues, we will now execute the steps...

1. Check the current dump device configuration.

# dumpadm
Dump content: kernel pages
Dump device: /dev/md/dsk/d1 (swap)
Savecore directory: /var/crash/XXXXXX
Savecore enabled: yes

2. Try to create a crash dump of the live system using the savecore command.

# savecore -L
savecore: dedicated dump device required

3. To get around this, create a new metadevice (or a blank file using mkfile; see the lofi sketch below).

# metainit d43 1 1 c4t60050768018A8023B800000000000132d0s0
d43: Concat/Stripe is setup
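
If you have no spare disk for a metadevice, the blank-file route from the same document can be sketched roughly as below, using lofi to present the file as a block device. The path and size here are my examples only, and the file must be large enough to hold a (compressed) kernel dump:

# mkfile 2g /export/dumpfile     # create a blank backing file
# lofiadm -a /export/dumpfile    # attach it as a block device
/dev/lofi/1
# dumpadm -d /dev/lofi/1         # point dumpadm at the lofi device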

4. Change the dump device to point to the newly created metadevice. Also configure dumpadm to dump only the kernel memory pages; use -c all instead if you want all memory pages dumped.

# dumpadm -c kernel -d /dev/md/dsk/d43
Dump content: kernel pages
Dump device: /dev/md/dsk/d43 (dedicated)
Savecore directory: /var/crash/slabinfra4
Savecore enabled: yes

5. We can now dump the live system's kernel memory to the new dedicated dump device.

# savecore -L
dumping to /dev/md/dsk/d43, offset 65536, content: kernel
100% done: 80868 pages dumped, compression ratio 3.23, dump succeeded
System dump time: Mon Sep 14 06:00:52 2009
Constructing namelist /var/crash/XXXXXX/unix.0
Constructing corefile /var/crash/XXXXX/vmcore.0
100% done: 80868 of 80868 pages saved

6. We have now saved the crash dump files in the /var/crash/SystemName directory. savecore writes a namelist (unix.N) and a core image (vmcore.N); N comes from the bounds file in the same directory.

# cd /var/crash/SystemName
# ls -lrt
total 1312738
-rw-r--r-- 1 root root 1699974 Sep 14 06:01 unix.0
-rw-r--r-- 1 root root 670072832 Sep 14 06:01 vmcore.0

7. Cool... we are done with our job, so let us revert the dump device to the original.

# dumpadm -c kernel -d /dev/md/dsk/d1
Dump content: kernel pages
Dump device: /dev/md/dsk/d1 (swap)
Savecore directory: /var/crash/XXXXX
Savecore enabled: yes

8. If you don't want the newly created metadevice, you can remove it to reclaim your storage.

# metastat -p
d0 -m d10 d20 1
d10 1 1 c2t0d0s0
d20 1 1 c2t1d0s0
d3 -m d13 d23 1
d13 1 1 c2t0d0s3
d23 1 1 c2t1d0s3
d1 -m d11 d21 1
d11 1 1 c2t0d0s1
d21 1 1 c2t1d0s1
d43 1 1 /dev/dsk/c4t60050768018A8023B800000000000132d0s0
d42 1 1 /dev/dsk/c4t60050768018A8023B800000000000137d0s0
d41 1 1 /dev/dsk/c4t60050768018A8023B800000000000136d0s0
d40 1 1 /dev/dsk/c4t60050768018A8023B800000000000135d0s0
d30 -p d4 -o 2097216 -b 2097152
d4 -m d14 d24 1
d14 1 1 c2t0d0s4
d24 1 1 c2t1d0s4
d31 -p d4 -o 32 -b 2097152

# metaclear d43
d43: Concat/Stripe is cleared

=================================================

How to panic your own system?


First we will see how we can crash our system in a pretty sophisticated way -

We'll start by crashing the system. Is savecore ready? Okay, then, let's panic your system!

Ok.. adb is a very old tool and has now been replaced by mdb (the modular debugger). I am giving examples of crashing your system using both adb and mdb.

# mdb -kw
Loading modules: [ unix krtld genunix specfs dtrace cpu.generic uppc pcplusmp ufs ip hook neti sctp arp usba uhci s1394 fctl nca lofs audiosup zfs random cpc crypto fcip ptm sppp nfs ipc ]

> rootdir/W 123
> $q


Now do an ls or something similar and watch your system panic.....

BTW, one good book for getting to know panics well is the "Panic! UNIX System Crash Dump Analysis Handbook" by Chris Drake and Kimberley Brown.

# adb -k -w /dev/ksyms /dev/mem
physmem 1e05
rootdir/X
rootdir:
rootdir: fc109408
rootdir/W 0
rootdir: 0xfc109408 = 0x0
$q

How does this procedure crash your system? Solaris keeps track of the address of the root vnode structure in a symbol called rootdir. If this vnode pointer is zero, the next time the system tries to do anything that would require walking down a directory path, it will fall over trying to read location zero looking for the root directory's vnode. Reading memory location zero is an illegal operation which results in a bad trap, data fault.

Using adb we will write a zero into rootdir and the system will quickly panic.

If your system doesn't panic immediately, just use the UNIX ls command to get a directory listing of the root directory, /. That will surely do the trick!
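
By the way, on Solaris 10 there is an even shorter way to run the same experiment, using DTrace's destructive panic() action (the -w flag is what permits destructive actions, so treat this with the same respect):

# dtrace -w -n 'BEGIN { panic(); }'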

==================================================

Fine, now let us learn a few little tricks for analysing a dump; we are going to use adb for now. There are many tools/debuggers available; mdb and SCAT are the best ones!

# adb -k unix.0 vmcore.0
physmem 7d8b4

NOTE: adb returns the number of pages of physical memory in hexadecimal and then waits for your first command. Note that most versions of adb do not offer the user any prompt at this point. Don't be fooled by this!

$<utsname
{
sysname = [ "SunOS" ]
nodename = [ "XXXXXXX" ]
release = [ "5.10" ]
version = [ "Generic_139555-08" ]
machine = [ "sun4u" ]
}
hw_provider/s
hw_provider:
hw_provider: Sun_Microsystems
architecture/s
architecture:
architecture: sparcv9
srpc_domain/s
srpc_domain:
srpc_domain: uu.XXXX.com
$q

Fine, now we will check out the mdb debugger.

# mdb unix.0 vmcore.0
Loading modules: [ unix genunix specfs cpu.generic uppc scsi_vhci ufs ip hook neti sctp arp usba nca lofs zfs random nsctl sdbc rdc sppp ]
> ::ps
S PID PPID PGID SID UID FLAGS ADDR NAME
R 0 0 0 0 0 0x00000001 00000000018398b0 sched
R 3 0 0 0 0 0x00020001 00000600118fb848 fsflush
R 2 0 0 0 0 0x00020001 00000600118fc468 pageout
R 1 0 0 0 0 0x4a004000 00000600118fd088 init
R 24662 1 24662 24662 0 0x52010400 00000600155a5128 vasd
[.....]
> 00000600157b5138::pfiles
FD TYPE VNODE INFO
0 FIFO 00000600170ac300
1 FIFO 00000600170ac200
2 SOCK 00000600200921c0 socket: AF_UNIX /var/run/zones/slabzone1.console_sock
3 DOOR 0000060014c47600 [door to 'zoneadmd' (proc=600157b5138)]
4 CHR 0000060014b13a00 /devices/pseudo/zconsnex@1/zcons@1:zoneconsole
5 DOOR 000006001334d6c0 /var/run/name_service_door [door to 'nscd' (proc=6001217c020)]
6 CHR 000006001538d7c0 /devices/pseudo/zconsnex@1/zcons@1:masterconsole
{Here above I am trying to look up which files or sockets were open at the moment of the crash dump for a particular process}
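
In case you are wondering where that long address came from: it is the proc_t address from the ADDR column of the ::ps output. A small sketch, assuming the ::pgrep dcmd is available in your mdb, which saves grepping the ::ps listing:

> ::pgrep zoneadmd        # print only the ::ps line(s) for zoneadmd
> <addr>::pfiles          # feed the ADDR column back in, as above
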
> ::msgbuf
MESSAGE
sd1 is /pci@1e,600000/ide@d/sd@0,0
pseudo-device: llc10
llc10 is /pseudo/llc1@0
pseudo-device: tod0
tod0 is /pseudo/tod@0
pseudo-device: lofi0
lofi0 is /pseudo/lofi@0
pseudo-device: fcode0
fcode0 is /pseudo/fcode@0
[....]
IP Filter: v4.1.9, running.
/pseudo/zconsnex@1/zcons@0 (zcons0) online
/pseudo/zconsnex@1/zcons@1 (zcons1) online
/pseudo/zconsnex@1/zcons@2 (zcons2) online
[....]
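
A few other first-look dcmds worth knowing when starting on a kernel crash dump (just pointers, not full coverage; these all ship with the Solaris 10 mdb):

> utsname::print     # kernel identification, same data as the adb macro earlier
> ::status           # summary of the dump: panic message, OS release
> ::stack            # stack trace of the panicking thread
> ::panicinfo        # registers and other details captured at panic time
> ::cpuinfo -v       # what each CPU was running when the dump was taken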

There is plenty more to learn in mdb, however I am not covering it right now in this article.

There is a lot more, however I am running short on time; I have to get back to work now! I will try appending SCAT knowledge to this same post soon...

Cool... finally I got some time to write about SCAT - the Solaris Crash Analysis Tool.
Nowadays there are several versions of SCAT available, like SCAT 4.1, SCAT 5.0, 5.1 & the very fresh release SCAT 5.2 - I am going with Solaris CAT 5.2. Let's install it quickly and start working with it.

Download SCAT 5.2 from the Sun website.

# gunzip SUNWscat5.2-GA-combined.pkg.gz
# pkgadd -G -d ./SUNWscat5.2-GA-combined.pkg

Here we go! SCAT is ready to use at /opt/SUNWscat. If required, get the scat command into your PATH as shown below -

# export PATH=$PATH:/opt/SUNWscat/bin

Now navigate to the crash dump location.

# cd /var/crash/XXXX
# ls -lrt
total 2655650
-rw-r--r-- 1 root root 1699974 Sep 14 06:01 unix.0
-rw-r--r-- 1 root root 670072832 Sep 14 06:01 vmcore.0
-rw-r--r-- 1 root root 1699974 Sep 15 01:57 unix.1
-rw-r--r-- 1 root root 685514752 Sep 15 01:59 vmcore.1

Ok, now let us execute scat to analyze the crash dump; the argument 0 picks up the unix.0/vmcore.0 pair from the current directory.

# scat 0

Solaris[TM] CAT 5.2 for Solaris 10 64-bit UltraSPARC
SV4990M, Aug 26 2009
[.........]
core file: /var/crash/XXXXX/vmcore.0
user: Super-User (root:0)
release: 5.10 (64-bit)
version: Generic_139555-08
machine: sun4u
node name: XXXXXXX
domain: whois.XXXX.com
hw_provider: Sun_Microsystems
system type: SUNW,Sun-Fire-V240 (UltraSPARC-IIIi)
hostid: XXXXXXX
dump_conflags: 0x10000 (DUMP_KERNEL) on /dev/md/dsk/d43(15.9G)
time in kernel: Mon Sep 14 06:01:04 CDT 2009
age of system: 10 days 17 hours 13 minutes 33.78 seconds
CPUs: 2 (4G memory, 1 nodes)
panicstr:

sanity checks: settings...
NOTE: /etc/system: module nfssrv not loaded for "set nfssrv:nfs_portmon=0x1"
vmem...CPU...sysent...misc...
WARNING: 1 severe kstat errors (run "kstat xck")
WARNING: DF_LIVE set in dump_flags
NOTE: system has 2 non-global zones
done
SolarisCAT(vmcore.0/10U)>

Look at the output carefully. It shows how many non-global zones are installed on the system, and it flags /etc/system parameters whose modules are not loaded (see the nfssrv NOTE above), among other sanity checks...

The available commands are broken down into categories, which you can see using the "help" command. The first group is for "Initial Investigation:" and includes: analyze, coreinfo, msgbuf, panic, stack, stat, and toolinfo.
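
A quick sketch of how a first pass could look, using only the command names from that help output (the exact output of each varies from dump to dump):

SolarisCAT(vmcore.0/10U)> coreinfo    # summary of this crash dump
SolarisCAT(vmcore.0/10U)> panic       # panic string and panicking thread
SolarisCAT(vmcore.0/10U)> analyze     # automated first-pass analysis
SolarisCAT(vmcore.0/10U)> msgbuf      # kernel message buffer, as in mdb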

There are a lot of things to write about SCAT, however I will end this entry with my all-time favorite ZFS stuff -

SolarisCAT(vmcore.0/10U)> zfs -e
ZFS spa @ 0x60010cf4080
Pool name: zone1-zp00
State: ACTIVE
VDEV Address State Aux Description
0x60012379540 UNKNOWN - root

READ WRITE FREE CLAIM IOCTL
OPS 0 0 0 0 0
BYTES 0 0 0 0 0

EREAD 0
EWRITE 0
ECKSUM 0

VDEV Address State Aux Description
0x60012379000 UNKNOWN - /dev/dsk/
c4t60050768018A8023B800000000000134d0s0

READ WRITE FREE CLAIM IOCTL
OPS 68803 2107043 0 0 0
BYTES 5.39G 14.3G 0 0 0

EREAD 0
EWRITE 0
ECKSUM 0

ZFS spa @ 0x60011962fc0
Pool name: zone2-zp00
State: ACTIVE
VDEV Address State Aux Description
0x60011962a80 UNKNOWN - root

READ WRITE FREE CLAIM IOCTL
OPS 0 0 0 0 0
BYTES 0 0 0 0 0

EREAD 0
EWRITE 0
ECKSUM 0

VDEV Address State Aux Description
0x60011962540 UNKNOWN - /dev/dsk/
c4t60050768018A8023B800000000000133d0s0

READ WRITE FREE CLAIM IOCTL
OPS 5367 166547 0 0 0
BYTES 252M 795M 0 0 0

EREAD 0
EWRITE 0
ECKSUM 0

SCAT is a very powerful tool and I am really impressed with it. Using SCAT you can learn a lot of details about your system...

Hope it will help someone, somewhere! Wishing you a happy debug!

5 comments:

  1. thank, that's a great start

  2. Hi,

    I'm trying to download SCAT, but Oracle(Sun...) is no longer giving it to customers.

    Do you happen to have a package of SCAT for Solaris by any chance?

    Best regards,
    Ofir.
    ofiraz@gmail.com

  3. Great.
    Thanks Nilesh for sharing such wonderful information.
    I am going through your blogs. Your Blog is wonderful.
    Do keep updating the blog

    Thanks
    Nitin

  4. Hi Thanks for this great information...
    Its really help me how to understanding crush dump working and analyzing it.

    2 thumbs up :D
    Regards,
    ochi
