Mount point at 100%, but cannot see what is filling up

Hi

I need some help, please. I have a system running Solaris 10 with a file system at 100%:

df -h /nikira
Filesystem             size   used  avail capacity  Mounted on
/dev/dsk/c5t500A09818DE3E799d0s0
                       226G   223G     0K   100%    /nikira

but when I look inside to see what is filling it up:

[nikira@nikira-app1 ~]$ pwd
/nikira
[nikira@nikira-app1 ~]$ du -sh *
   2K   corefiles
  48M   COUNT_SCRIPTS
  10M   downloads
   1K   Entity Exports
   1K   Entity Notes
   4K   fr
   0K   lost+found
 141K   mahesh
  12G   NIKIRACLIENT
  28G   NIKIRAROOT
 393M   NIKIRATOOLS
 3.6G   oradiag_nikira
 277K   oradiag_root
  89M   QFE582
  25K   script
 987M   spark_server
 539M   subex_working_area
 439M   Task Logs
You have new mail in /var/mail//nikira
[nikira@nikira-app1 ~]$

If you add up those numbers, they come nowhere near 223G.

Can you please help?

This can be caused by deleted file space not being reclaimed because the file(s) are still held open by an application.

What type of filesystem is it? UFS or ZFS?

Are there any NFS shares on this filesystem?

Have you tried rebooting (or is that not possible)?

You have Oracle directories on the filesystem. Do you have a large Oracle database on this filesystem using pre-allocated (reserved) space?
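
One quick check (a sketch; nothing application-specific is assumed): compare what df reports as used with what du can actually account for in the directory tree. A large gap usually means space held by deleted-but-open files.

# Compare the space df says is used with the space du can actually see.
# A big difference points at deleted files that are still held open by a process.
df -k /nikira
du -sk /nikira        # run as root so no directories are skipped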

1 Like

Hi

Thanks for the reply. It is a

ufs

file system. There are no NFS shares on this particular file system, but the whole file system is mounted via NFS on another server.
I have not rebooted yet. The complete mount list is as follows:

bash-3.00# df -h
Filesystem             size   used  avail capacity  Mounted on
/dev/md/dsk/d10        9.8G   926M   8.8G    10%    /
/devices                 0K     0K     0K     0%    /devices
ctfs                     0K     0K     0K     0%    /system/contract
proc                     0K     0K     0K     0%    /proc
mnttab                   0K     0K     0K     0%    /etc/mnttab
swap                   197G   1.7M   197G     1%    /etc/svc/volatile
objfs                    0K     0K     0K     0%    /system/object
sharefs                  0K     0K     0K     0%    /etc/dfs/sharetab
/dev/md/dsk/d40        9.8G   3.9G   5.9G    40%    /usr
/platform/sun4v/lib/libc_psr/libc_psr_hwcap2.so.1
                       9.8G   926M   8.8G    10%    /platform/sun4v/lib/libc_psr.so.1
/platform/sun4v/lib/sparcv9/libc_psr/libc_psr_hwcap2.so.1
                       9.8G   926M   8.8G    10%    /platform/sun4v/lib/sparcv9/libc_psr.so.1
fd                       0K     0K     0K     0%    /dev/fd
/dev/md/dsk/d20        9.8G   3.1G   6.7G    32%    /var
swap                   197G    30M   197G     1%    /tmp
swap                   197G    56K   197G     1%    /var/run
/dev/dsk/c5t500A09818DE3E799d1s0
                        25G   4.7G    20G    20%    /oracle
/dev/md/dsk/d50        9.8G   109M   9.6G     2%    /opt
/dev/dsk/c5t500A09818DE3E799d0s0
                       226G   223G    88K   100%    /nikira
/dev/md/dsk/d60        112G   6.1G   105G     6%    /internaldisk
bash-3.00#

Most likely you (or someone else) has deleted a file which is still open (and probably being written to) by a process. As long as that process holds the file open, it will occupy its space. Only when the process ends will it really relinquish the held space.

The most surefire solution is to reboot the system, because this will end the process in every case. If you must not interrupt the system's operation, you can also identify the process by using the "fuser" (or similar) utility on the mounted device. Then kill the respective process (or, as a minimum measure, send it a "kill -1" (SIGHUP) so that it reinitializes and reopens its files).
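
For example (a minimal sketch; the PID shown is only a placeholder):

# List the PIDs (with owning login names) that are using the /nikira filesystem,
# then send a chosen process SIGHUP so it reinitializes and reopens its files.
fuser -cu /nikira
kill -HUP 12345        # 12345 stands for a PID identified in the fuser output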

I hope this helps.

bakunin

2 Likes

There is a common problem where a large open file has been deleted. When a file is created, an entry is written in the relevant directory so that you can find it, but the file itself is really a collection of disk blocks; the entry you can read in a directory is just a pointer to those disk blocks. Each file also has what is called an i-node, which holds information about the file such as access time, creation time, modification time, permissions and so on. While the file is being written, the blocks allocated to it grow as the file needs them.

If the file is still open for output by a program and someone issues a delete, all that happens is that the directory entry that lets you see the file gets removed. The blocks are not freed until the file is closed; indeed, the process can keep writing for as long as there is space left to write to.

Check your manual page for fuser to be sure, but you may be able to list them with:-

fuser -duV /nikira

This will hopefully give you the processes that have open-but-deleted files in the filesystem. You can then choose whether you want to terminate them, which will release the space back to the filesystem.

If that is not correct, you may need to use lsof to list all open files in /nikira and then loop through them to see which ones are files, directories or other items you can list, and which are just an i-node reference, something like:-

lsof | grep "/nikira$" | while read cmd pid userid fd type device offset inum fs
do
   # Look for a directory entry on this filesystem with that i-node number
   file=$(find /nikira -xdev -inum "$inum")
   if [ "$file" = "" ]
   then
      # No directory entry left, so this is a deleted-but-still-open file
      echo "I-node $inum has no directory entry (deleted but still open)"
   fi
done

It will probably take a long time to run with such a loop. Perhaps this will give better performance:-

# Capture a recursive long listing with i-node numbers once, then check each
# open file's i-node against it
ls -laiR /nikira > /tmp/nikira_ls-laiR
lsof | grep "/nikira$" | while read cmd pid userid fd type device offset inum fs
do
   # The i-node number is right-aligned in ls output, so allow leading spaces
   grep -q "^ *$inum " /tmp/nikira_ls-laiR
   if [ $? -ne 0 ]
   then
      echo "I-node $inum has no directory entry (deleted but still open)"
   fi
done

..... but if there are sub-mounted filesystems or perhaps symbolic links, that may be a problem, as the i-node you are chasing may be in use in the sub-mounted filesystem and will therefore still provide a listing in /tmp/nikira_ls-laiR.

Robin

1 Like

@rbatte1 ..... yes, but since this filesystem is NFS-shared and mounted by another node, that node could be doing the writing. If that's the case and you don't mind interrupting the remote users, unsharing and resharing will do it.
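
On Solaris that would be something like the following (a rough sketch; the real share options for /nikira must be re-applied, the -o rw here is only a placeholder):

# Drop and re-establish the NFS export so remote clients release their handles.
unshare /nikira
share -F nfs -o rw /nikira    # re-apply the original share options here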

2 Likes

Hi
Thanks again!

I am afraid I don't have

lsof

installed, and the webpage that used to provide free utilities no longer does.
But using

fuser

it results in:

 fuser -c /nikira
/nikira:     4079com    4077c    3532co    3522com    1009c    1008c     995c   73128c   59575c   22574tom   64065tom   61756tom   59148tom   83983tom   56666c   43760c    1157om    1148o    1311ctom   75924ctom   56224ctm   56223ctm   56222ctm   56220ctm   56219ctm   56218ctm   56216ctm   56215ctm   56214ctm   56212ctm   56211ctm   56210ctom   56208cm   56056cm   56055com   55899ctm   26522com   26518co   26085com   26080co   25747com   25578co   24857com   24852co   78834ctom   74990ctom   72309ctom   98652tom   98605tom   31367ctm   43009com

So I am not sure how to identify all these PIDs.

Dear hickd8,

Good point!

I never like NFS unless it is read-only, for publishing code for instance; it just seems to introduce too much complication to an operation otherwise. I always have files written to a common shared location with FTP (or SFTP, of course). It might be less efficient, but it is far easier to control and trace.

Robin

---------- Post updated at 03:15 PM ---------- Previous update was at 03:12 PM ----------

As a rough and ready listing, you could:-

# fuser -c writes the PIDs to stdout and the use codes to stderr, hence 2>/dev/null
for pid in `fuser -c /nikira 2>/dev/null`
do
   ps -fp $pid | grep -v PPID     # one ps line per process, header suppressed
done

Robin

The output provides a long list of commands belonging to the application...

bash-3.00# for pid in `fuser -c /nikira 2>/dev/null`
> do
> ps -fp $pid | grep -v PPID
> done
  nikira  61140  48413   0 16:17:54 ?           0:00 sleep 10
  nikira  48413      1   0 16:15:04 ?           0:00 /bin/bash ./record_hashsplitter.sh 7
    root   1009   1008   0 16:04:31 pts/6       0:00 more -s /tmp/mpzPagFa
    root   1008    995   0 16:04:31 pts/6       0:00 sh -c more -s /tmp/mpzPagFa
    root    995  43760   0 16:04:31 pts/6       0:00 man fuser
  nikira  73128  56055   0 15:35:32 ?           0:00 sleep 10000
    root  59575  58160   0 15:32:44 pts/8       0:00 bash
  nikira  22574  98605   0 15:25:40 ?           0:00 /nikira/NIKIRATOOLS/APACHE64/bin/httpd -f /nikira/NIKIRATOOLS/APACHE64/conf/htt
  nikira  64065  98605   0 14:51:04 ?           0:00 /nikira/NIKIRATOOLS/APACHE64/bin/httpd -f /nikira/NIKIRATOOLS/APACHE64/conf/htt
  nikira  61756  98605   0 14:50:33 ?           0:00 /nikira/NIKIRATOOLS/APACHE64/bin/httpd -f /nikira/NIKIRATOOLS/APACHE64/conf/htt
  nikira  59148  98605   0 14:49:58 ?           0:00 /nikira/NIKIRATOOLS/APACHE64/bin/httpd -f /nikira/NIKIRATOOLS/APACHE64/conf/htt
  nikira  83983  98605   0 14:33:13 ?           0:00 /nikira/NIKIRATOOLS/APACHE64/bin/httpd -f /nikira/NIKIRATOOLS/APACHE64/conf/htt
  nikira  56666  56175   0 10:28:00 pts/9       0:00 -bash
    root  43760  43601   0 09:52:21 pts/6       0:02 bash
  nikira   1157   1148   0 06:53:01 ?           0:03 sqlplus -s /nolog
  nikira   1148   1144   0 06:53:01 ?           0:00 /bin/bash /nikira/NIKIRAROOT/bin/spark_tables_cleanup.sh 7
  nikira   1311  98652   0   Jul 11 ?           1:14 /nikira/NIKIRATOOLS/RUBYROOT/bin/ruby /nikira/NIKIRACLIENT/src/public/dispatch.
  nikira  75924      1   1   Jul 08 ?        8311:38 dbwriter -r /nikira/NIKIRAROOT/RangerData/DBWriterData/ -s /nikira/NIKIRAROOT/R
  nikira  56224  55899   0   Jul 08 ?           2:20 recorddispatcher -r RFF:///nikira/NIKIRAROOT/RangerData/AICumulativeGPRSData/ -
  nikira  56223  55899   0   Jul 08 ?           2:20 recorddispatcher -r RFF:///nikira/NIKIRAROOT/RangerData/AICumulativeVoiceData/
  nikira  56222  55899   0   Jul 08 ?           2:21 recorddispatcher -r RFF:///nikira/NIKIRAROOT/RangerData/SubscriberDataRecord/ -
  nikira  56220  55899   0   Jul 08 ?         740:24 recorddispatcher -r RFF:///nikira/NIKIRAROOT/RangerData/DataRecord_7/ -p TCP://
  nikira  56219  55899   0   Jul 08 ?         735:49 recorddispatcher -r RFF:///nikira/NIKIRAROOT/RangerData/DataRecord_6/ -p TCP://
  nikira  56218  55899   0   Jul 08 ?         754:20 recorddispatcher -r RFF:///nikira/NIKIRAROOT/RangerData/DataRecord_5/ -p TCP://
  nikira  56216  55899   0   Jul 08 ?         748:54 recorddispatcher -r RFF:///nikira/NIKIRAROOT/RangerData/DataRecord_4/ -p TCP://
  nikira  56215  55899   0   Jul 08 ?         744:29 recorddispatcher -r RFF:///nikira/NIKIRAROOT/RangerData/DataRecord_3/ -p TCP://
  nikira  56214  55899   0   Jul 08 ?         748:40 recorddispatcher -r RFF:///nikira/NIKIRAROOT/RangerData/DataRecord_2/ -p TCP://
  nikira  56212  55899   0   Jul 08 ?         755:14 recorddispatcher -r RFF:///nikira/NIKIRAROOT/RangerData/DataRecord_1/ -p TCP://
  nikira  56211  55899   0   Jul 08 ?           0:00 recordprocessor -r TCP://10.100.48.73:30002 -w TCP://10.100.48.75:20000 -c 1:TC
  nikira  56210  55899   0   Jul 08 ?           0:03 recordprocessor -r TCP://10.100.48.73:30001 -w TCP://10.100.48.75:20000 -c 1:TC
  nikira  56208  55899   0   Jul 08 ?           0:00 ssh -l nikira 10.100.48.75 /nikira_data01/NIKIRAROOT/bin/runprogram.sh programm
  nikira  56056  55899   0   Jul 08 ?           0:00 ssh -l nikira 10.100.48.74 /nikira_data01/NIKIRAROOT/bin/runprogram.sh programm
  nikira  56055  55899   0   Jul 08 ?           0:00 /bin/bash /nikira/NIKIRAROOT/sbin/licenseengine
  nikira  55899      1   0   Jul 08 ?           0:00 programmanager -f programmanager.conf
  nikira  26522  26518   0   Jul 08 ?          10:27 java -server -classpath ../config/:../lib/cereports.jar:../lib/itext-1.2.jar:..
  nikira  26518      1   0   Jul 08 ?           0:00 sh ./server.sh
  nikira  26085  26080   0   Jul 08 ?          10:31 java -server -classpath ../config/:../lib/cereports.jar:../lib/itext-1.2.jar:..
  nikira  26080      1   0   Jul 08 ?           0:00 sh ./server.sh
  nikira  25747  25578   1   Jul 08 ?        17863:38 java -server -classpath ../config/:../lib/cereports.jar:../lib/itext-1.2.jar:..
  nikira  25578      1   0   Jul 08 ?           0:01 sh ./tc.sh
  nikira  24857  24852   0   Jul 08 ?         440:39 java -server -classpath ../config/:../lib/cereports.jar:../lib/itext-1.2.jar:..
  nikira  24852      1   0   Jul 08 ?           0:00 sh ./sc.sh
  nikira  78834      1   0   Jun 30 ?          11:02 ./mcelrater -d nrtrde
  nikira  74990      1   0   Jun 30 ?        3904:46 ./mcelrater -d gprs
  nikira  72309      1   0   Jun 30 ?        1440:33 ./mcelrater -d gsm
  nikira  98652  98605   0   May 23 ?           2:58 /nikira/NIKIRATOOLS/APACHE64/bin/httpd -f /nikira/NIKIRATOOLS/APACHE64/conf/htt
  nikira  98605      1   0   May 23 ?           4:16 /nikira/NIKIRATOOLS/APACHE64/bin/httpd -f /nikira/NIKIRATOOLS/APACHE64/conf/htt
  nikira  31367      1   0   May 23 ?           7:08 memcached -p 12321
  nikira  43009      1   0   Oct 13 ?        106677:56 /bin/bash ./Nrtrde_VpmnID_populate_cdr.sh
bash-3.00#

That's some serious CPU time being clocked up. Can you stop/start each of these? I presume that they are services. Process 43009 would be the first one to consider; it has been doing whatever it does for nine months. Could the culprit be the log file for that script?
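
A quick way to see what that process has open (a sketch using the standard Solaris proc tools):

# Show every file descriptor process 43009 has open, with size and offset;
# a very large descriptor with no pathname reported is a good deleted-log suspect.
pfiles 43009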

Robin

I will consult the application admin to see if he can stop all processes related to the application.

---------- Post updated 15-07-14 at 09:37 AM ---------- Previous update was 14-07-14 at 04:36 PM ----------

After stopping all the processes and then doing a reboot, it is now back to normal. Thank you to all.

---------- Post updated at 09:39 AM ---------- Previous update was at 09:37 AM ----------

How can I get

lsof

installed?

1 Like

Search for unix lsof with your favourite internet search engine. I think it's written & held on SourceForge and is free to download & use (observing any copyright requirements).

You will need to pick the appropriate version for your OS. Probably easier to do that than to download the source code and compile it.

Robin

As a side note, the shell pattern * won't match filenames that start with a dot (unless some non-standard shell option similar to bash's dotglob is active), so the above command won't show you the disk usage of hidden files and directories.
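
For example (a small illustration; note that the .[!.]* pattern still misses names that begin with two dots):

# Include dot-files and dot-directories in the per-entry totals as well
du -sh /nikira/* /nikira/.[!.]*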

1 Like

I'm late to this thread, but anyway: you don't need lsof, as it is relatively easy to identify with standard Solaris commands which processes have deleted files open, and on which file descriptors. Simply run as root:

find /proc/*/fd -links 0 ! -size 0 -ls
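
To turn that into a process listing, something along these lines should work (a sketch building on the one-liner above):

# Each match is a path of the form /proc/<pid>/fd/<n>; pull out the PID
# and show the owning process.
find /proc/*/fd -links 0 ! -size 0 2>/dev/null |
awk -F/ '{ print $3 }' |
sort -u |
while read pid
do
   ps -fp "$pid"
done
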
5 Likes

Something slightly similar also works on Linux, though you can't get the file size from the listing itself:

$ echo asdf > file # Create dummy file
$ exec 5<file # Open file
$ rm file # Delete file
$ ls -l /proc/self/fd # Show it in /proc/.../fd

total 0
lrwx------ 1 user user 64 Jul 16 11:43 0 -> /dev/pts/1
lrwx------ 1 user user 64 Jul 16 11:43 1 -> /dev/pts/1
lrwx------ 1 user user 64 Jul 16 11:43 2 -> /dev/pts/1
lr-x------ 1 user user 64 Jul 16 11:43 200 -> /home/user/.ssh-agent
lr-x------ 1 user user 64 Jul 16 11:43 3 -> /proc/23358/fd
lr-x------ 1 user user 64 Jul 16 11:43 5 -> /home/user/file (deleted)

$ exec 5<&- # Close file

Deleted files just show up as dangling symlinks with (deleted) appended to the link target.
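
While such a descriptor is still open, the data is not lost. On Linux the /proc entry can be used to copy the contents back out, or to truncate the deleted file and reclaim its space without killing the process (a sketch reusing the hypothetical fd 5 from the demo above; for another process you would use /proc/<pid>/fd/<n>):

cp /proc/$$/fd/5 /tmp/file.recovered   # copy the deleted file's contents back out
: > /proc/$$/fd/5                      # or truncate it in place to free the space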

2 Likes

As for the file size, use stat (if available on your system) with the -L option to dereference the link:

stat -L /proc/self/fd/5
  File: ‘/proc/self/fd/5’
  Size: 5             Blocks: 8          IO Block: 4096   regular file
Device: 809h/2057d    Inode: 260099      Links: 0
Access: (0664/-rw-rw-r--)  Uid: ( 1000/ coerdtr)   Gid: ( 1000/ coerdtr)
Access: 2014-07-16 20:04:12.253665423 +0200
Modify: 2014-07-16 20:04:12.253665423 +0200
Change: 2014-07-16 20:04:39.729799747 +0200
 Birth: -
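
For completeness, the Solaris find idea from earlier can be approximated on Linux too (a sketch assuming GNU find and stat; run it as root to see other users' processes):

# List open-but-deleted regular files with their sizes and /proc paths.
find -L /proc/[0-9]*/fd -maxdepth 1 -type f -links 0 -exec stat -L -c '%s %n' {} + 2>/dev/null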
4 Likes

Interesting, I didn't think that would work; ls shows the link as broken. I guess the link's displayed destination and the link's actual "contents" are different things.

I would think it references the inode, which in turn contains the relevant information.