Not able to clear a log file that is in use

On my Solaris 10 box, there are two log files that are being written to by an application. They are huge files of 3 GB and 5 GB. I tried to nullify them, but each shows zero size and then, after a few seconds, is back to its original size. "ls -ltr" shows a big size, but "du -sh file" reports only kilobytes. We do not want to stop the application to clear these files. Is there any way to clear them without stopping the application? The output below should describe the problem more clearly.

root@prd_db07:/oimDomain/logs# ls -l soa.out admin.out
-rw-r--r--   1 iamswl1  dggess     330644050 Oct 17 16:53 admin.out
-rw-r--r--   1 iamswl1  dggess     5163031517 Oct 17 16:53 soa.out
root@prd_db07:/oimDomain/logs# du -sh soa.out admin.out
 134K   soa.out
 260K   admin.out
root@prd_db07:/oimDomain/logs# fuser soa.out
soa.out:    21268o   21243o   21242o
root@prd_db07:/oimDomain/logs# fuser admin.out
admin.out:    19903o   19878o
root@prd_db07:/oimDomain/logs#
root@prd_db07:/oimDomain/logs# ptree 21268
3118  zsched
  21242 /bin/sh /dggess/envs/stage/domains/oimDomain/bin/startManagedWebLogic.sh soa_serv
    21243 /bin/sh /dggess/envs/stage/domains/oimDomain/bin/startWebLogic.sh nodebug noderby
      21268 /dggess/apps/jdk/jdk1.6.0_29/bin/sparcv9/java -server -Xmx4096m -Xms4096m -XX:Per
root@prd_db07:/dggess/apps/logs/oimDomain#
root@prd_db07:/dggess/apps/logs/oimDomain# ptree 19903
3118  zsched
  19878 /bin/sh /dggess/envs/stage/domains/oimDomain/bin/startWebLogic.sh
    19903 /dggess/apps/jdk/jdk1.6.0_29/bin/sparcv9/java -server -Xms2048m -Xmx2048m -XX:Max
root@prd_db07:/oimDomain/logs# >soa.out
root@prd_db07:/oimDomain/logs# >admin.out
root@prd_db07:/oimDomain/logs# ls -l soa.out admin.out
-rw-r--r--   1 iamswl1  dggess           0 Oct 17 16:54 admin.out
-rw-r--r--   1 iamswl1  dggess           0 Oct 17 16:54 soa.out
root@prd_db07:/oimDomain/logs#
root@prd_db07:/oimDomain/logs# ls -l soa.out admin.out   ---> after a wait of 3-4 seconds, the files are back to their original sizes
-rw-r--r--   1 iamswl1  dggess     330651280 Oct 17 16:54 admin.out
-rw-r--r--   1 iamswl1  dggess     5163034962 Oct 17 16:54 soa.out
root@prd_db07:/oimDomain/logs#

The commands:

>soa.out
>admin.out

deallocate all blocks allocated to those files at that time, but they do not close the file descriptors, and they do not reset the file offset that determines where in the file the processes writing to those files will place the next data they write.

The next time a process writes something to one of those files, it will write at the spot just after the last place it wrote in that file. That does not allocate any disk blocks for the bytes you previously deallocated, so what you end up with is known as a sparse (or "holey") file, which contains unallocated blocks that have never been written. If you try to read data from those blocks (for example, by running cat soa.out), the unallocated blocks appear as though null bytes had been written there.
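As a quick illustration, here is a minimal sketch (the scratch path /var/tmp/holey is hypothetical, not one of your logs) that builds a sparse file on a filesystem that supports holes, such as UFS or ZFS, and compares the logical size shown by ls with the space actually allocated, as shown by du:

# seek ~5 GB into a new file and write a single byte, leaving a hole behind it
dd if=/dev/zero of=/var/tmp/holey bs=1 count=1 seek=5368709119
ls -l /var/tmp/holey       # apparent size: ~5 GB
du -k /var/tmp/holey       # space actually allocated: a few KB at most
rm /var/tmp/holey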

If you change the program(s) writing those log files to add the O_APPEND flag to the oflag argument of the open() call that opens them, every write will be positioned at the current end of the file. So, if you clear the log file using >logfile, the next write to the log file will land at the start of the file instead of leaving a huge hole there.
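You can see the same effect from the shell without touching the application: under a POSIX shell such as /usr/xpg4/bin/sh, the >> redirection opens its file with O_APPEND, so a writer started this way follows the truncation instead of leaving a hole. The toy writer loop and the scratch file /var/tmp/append.log below are only stand-ins for your application:

# start a toy writer whose log is opened with O_APPEND (via >>)
/usr/xpg4/bin/sh -c 'while :; do date; sleep 1; done >> /var/tmp/append.log' &
sleep 5
> /var/tmp/append.log        # truncate it while the writer is still running
sleep 5
ls -l /var/tmp/append.log    # stays small: new writes land at the new end of file
kill $!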

Thanks Don for explaining it so well. It will take me a long time to get this change made in the program that is writing these files.
Is there something I can do from the OS side now?
Is there any better way from the OS side that I should adopt going forward? Otherwise this will fill up again and the application people will come back to the system admins.

You have cleared the files' contents. The size reported by ls is irrelevant except to processes reading the files. As far as disk usage is concerned, the space has been recovered and the disk will only fill up again with newly written data.

jlliagre, when I run "cat soa.out", it takes a very long time to produce output, so I pressed Ctrl+C and gave up. This made me think the data is still that big, even after nullifying the file. Am I wrong here?
In the future, what will be the best way to nullify it again without taking application downtime?

If you run cat filename, you have a huge file and cat will read every byte of it. If you look at du filename, the number of disk blocks actually used to store the contents of the file may be small (as I explained before).

You haven't told us anything about what this application does, nor why it is writing gigabytes into a file that the people using the application don't want to see. If you have a process that your users say must run continuously, and it writes gigabytes of logs that no one wants to see, you can choose one of several options (including, but not limited to):

  1. Get the people who wrote the application to change it.
  2. Restart it regularly (and rotate or delete the log files while it is stopped).
  3. Buy bigger disks to hold all of the data that nobody wants.
  4. Patch the kernel to set the O_APPEND flag in the kernel table entry for the file descriptor involved and truncate it the way you're doing it now.
  5. Refuse to run an application that fills up your filesystems with huge amounts of unwanted data until your users provide a version of the application that works correctly.
  6. Try changing the log file to a symlink pointing to /dev/null the next time you reboot your system, before the application starts. (If the application removes or rotates its log files when it restarts, this won't work.) A sketch follows this list.
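For option 6, a minimal sketch might look like the following, run while the application is down (the file names come from your listing, and this assumes the application simply keeps appending to the same names and never rotates them):

cd /oimDomain/logs
rm soa.out
ln -s /dev/null soa.out     # everything the application writes to soa.out is now discarded
# repeat for admin.out if that log is equally unwanted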

Also, many start/stop scripts use #!/bin/sh, which on Solaris 10 is the classic Bourne shell. If those log files are created by redirecting stdout with the >> operator, the file can normally be truncated safely, but not when the classic Bourne shell is used: it opens >> redirections without O_APPEND and merely seeks to the end of the file once. Try running the startup script with #!/usr/xpg4/bin/sh instead...
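For instance, you could check and adjust the interpreter line of the scripts seen in the ptree output above (treat this as a sketch, and verify first that the scripts do not depend on Bourne-shell-only behaviour):

head -1 /dggess/envs/stage/domains/oimDomain/bin/startWebLogic.sh
# if it prints "#!/bin/sh", change that first line of the script to:
#!/usr/xpg4/bin/sh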

Yes. There is no real data on disk beyond what was written after you cleared the existing data.
As I wrote, the size reported by ls is irrelevant except to processes reading the files. Your cat command is precisely an example of a process reading the log file.

With the redirection, you have turned the logs into sparse files. The filesystem layer is "inventing" ephemeral null data on the fly for the missing part, and displaying this null data is what takes so long. There is no disk space issue. If you are interested in reading the new logs, you can simply run tail -f on the log file.
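For example, to follow only the data written after the truncation, without paging through the hole:

tail -f /oimDomain/logs/soa.out     # prints new lines as the application writes them; Ctrl+C to stop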