Find it

Monday, August 9, 2010

alloc: /: file system full

Today I came across a strange 'file system full' issue, reported with this error -


Aug 9 00:17:41 server1 ufs: [ID 845546 kern.notice] NOTICE: alloc: /: file system full

When I looked at the top disk space consumers I found nothing useful.


# df -h | sort -rnk 5

/dev/md/dsk/d0 3.0G 2.9G 0K 100% /
/dev/md/dsk/d3 2.0G 1.5G 404M 80% /var
/dev/md/dsk/d30 469M 330M 93M 79% /opt
/dev/md/dsk/d6 992M 717M 215M 77% /home
/dev/md/dsk/d33 752M 494M 198M 72% /usr/local/install
[...]

After running du across the whole file system, I could see it reporting only 2.5G, while df showed 2.9G of consumed space.

# du -shd /
2.5G
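When du and df disagree like this, it helps to see how big the gap is on each file system. Below is a rough sketch of such a comparison, assuming a ksh or bash shell; note that the -d flag of Solaris du (do not cross mount points) corresponds to -x on GNU du, and the mount points listed are just the ones from the df output above.

# Compare the used space df reports with what du can actually see.
# Column 3 of 'df -k' is the used kilobytes on both Solaris and GNU df.
for fs in / /var /opt /home
do
    df_used=$(df -k "$fs" | awk 'NR == 2 {print $3}')
    du_used=$(du -skd "$fs" | awk '{print $1}')   # Solaris: -d = stay on this file system
    echo "$fs: df=${df_used}KB du=${du_used}KB unaccounted=$((df_used - du_used))KB"
done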

I remembered that a few days earlier I had hit the same issue on a ZFS file system hosting an Oracle database, and the reasoning below had helped me there.

Normally, if a file system is full, first look around in the directories that get hidden by file systems mounted at higher init states, and check whether any files are eating up the disk space.

If that exercise turns up nothing useful, the next thing to check is open files, and to consider what has recently been cleaned up. Sometimes, when an open file is emptied or unlinked from the directory tree, the disk space is not de-allocated until the owning process has been terminated or restarted. The result is an unexplainable loss of disk space. If this is the cause, a reboot will clear it up. If you can't reboot, treat any process that logs to that partition as a suspect and check all of your logs for entries that imply a process is failing rapidly.
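For the first case, the directories hidden underneath mount points can be inspected without dropping to a lower init state by loop-back mounting the root file system somewhere else. A minimal sketch, assuming Solaris lofs and a free /mnt (on Linux, mount --bind does the same job):

# Loop back / onto /mnt. Under /mnt, directories that are normally
# covered by other mounts (/var, /opt, /home, ...) show only what
# actually lives on the root file system itself.
mount -F lofs / /mnt
du -sk /mnt/var /mnt/opt /mnt/home
umount /mnt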

In my case a reboot was not possible, so I went looking for open but unlinked files on the full file system with lsof:

# lsof +aL1 /


lsof: WARNING: access /.lsof_server1: No such file or directory
lsof: WARNING: created device cache file: /.lsof_server1
lsof: WARNING: can't write to /.lsof_server1: No space left on device
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
scp 16472 root 4r VREG 85,0 238616064 0 119696 / (/dev/md/dsk/d0)
scp 22154 root 4r VREG 85,0 238213120 0 119677 / (/dev/md/dsk/d0)

Where:


"+L1" will select open files that have been unlinked. A specification of the form "+aL1 <file_system>" will select unlinked open files on the specified file system.
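When there are more than a couple of entries, the owning PIDs can be pulled straight out of the lsof output; a quick sketch, assuming the column layout shown above, where the PID is the second field and the first line is the header:

# Unique PIDs of processes holding unlinked-but-open files on /
lsof +aL1 / | awk 'NR > 1 {print $2}' | sort -u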

I got the process IDs via lsof; after verifying what the processes were, I killed them, and suddenly it released ~450MB of space.
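The verification and cleanup looked roughly like this; a sketch using standard Solaris process tools, with the PIDs being the ones lsof reported above:

# Check what the suspect processes actually are before killing them.
ptree 16472            # show the process in its process tree
pargs 16472            # show its full argument list
ptree 22154
pargs 22154
kill 16472 22154       # then re-check df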

# df -kh | sort -rnk 5
/dev/md/dsk/d0 3.0G 2.5G 418M 86% /
/dev/md/dsk/d3 2.0G 1.5G 406M 80% /var
/dev/md/dsk/d30 469M 331M 91M 79% /opt
/dev/md/dsk/d6 992M 717M 215M 77% /home
/dev/md/dsk/d33 752M 494M 198M 72% /usr/local/install

Hope this helps.

Comments:

  1. Hi Nilesh,

    It reminds me of the very same problem; I used the 'find' command to help in that case though, since I didn't have access to 'lsof' at the time.

    http://blog.thilelli.net/post/2008/10/18/Discrepancies-Between-df-And-du-Outputs

    --
    Best regards,
    Julien Gabel.

  2. This is a known issue on Linux as well.

    The solution is to run lsof | grep deleted, as shown in the example below. Note the following information:
    The first column reports which process is holding this file descriptor open.
    The seventh column reports the file size in bytes.
    The final column reports which file is being held open.
    # lsof | grep deleted
    nmbd 16408 root cwd DIR 9,1 0 163846 /var/log/samba (deleted)

    nmbd 16408 root 13w REG 9,1 924442067 163964 /var/log/samba/nmbd.log (deleted)

    You can see that the nmbd process is holding open around 900 MB (924442067 bytes) of deleted log data. Kill that process or, to be on the safer side, restart the service and you will be good.

    -Vaibhav

  3. " suddenly it has released ~450MB space." - nice...
