Gzip behavior on open files?

Just a quick question: how does gzip behave under Linux if its source is a file that is currently being written to by a different process? Basically, in my code below I want to make sure that there is ultimately no loss of data; i.e. the gzip command runs until it "catches up" with the end of the file while the file is expanding, then the cat /dev/null clears the file immediately, so the next write to the file happens when it is empty and all prior data in the file is safely preserved in the gzipped archive. How does my code look?

CAPDIR=/data/capture
KEEPDIR=/data/capture/keep

for FILE in `find $CAPDIR -maxdepth 1 -not -type d | awk -F/ '{print $NF}'`
do
   echo Processing $CAPDIR/$FILE --\> $KEEPDIR/$FILE.GZ
   gzip -c /$CAPDIR/$FILE  > $KEEPDIR/$FILE.GZ
   cat /dev/null > $CAPDIR/$FILE
done
echo
echo Done. 
echo

I know that in some OSes, when a file handle is locked for reading, you get the file contents up to the EOF at the time of the lock, not up to the EOF at the current time.

I guess another way to put my question would be: is there a way to "atomize" these commands:

   gzip -c /$CAPDIR/$FILE  > $KEEPDIR/$FILE.GZ
   cat /dev/null > $CAPDIR/$FILE

...such that I can be guaranteed that no other process gets a chance to write data to $CAPDIR/$FILE in between the call to gzip and the call to cat /dev/null?

Do you really need to blank out the file after archiving it? That seems a little dangerous. If something goes wrong with the archive process, you might lose data.

What if you:
1) rename the file first
2) touch the original file name and set permissions
3) archive the renamed file
4) delete the renamed file

That way the file you are archiving is not being appended to.
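
Something along these lines, perhaps -- just a rough sketch reusing the variables from the script above; the .rotating suffix and the chmod mode are invented for illustration:

mv "$CAPDIR/$FILE" "$CAPDIR/$FILE.rotating" &&               # 1) rename the live file
touch "$CAPDIR/$FILE" &&                                     # 2) recreate the original name
chmod 640 "$CAPDIR/$FILE" &&                                 #    and set permissions (example mode)
gzip -c "$CAPDIR/$FILE.rotating" > "$KEEPDIR/$FILE.GZ" &&    # 3) archive the renamed file
rm "$CAPDIR/$FILE.rotating"                                  # 4) delete the renamed file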

Thanks for your help -
That's a fair enough alternative approach, but wouldn't there be a potential problem if I renamed the capture file while it was being written to by another process? Would the process continue writing to the now-renamed file?
I've been told that I cannot remove the capture file (hence the cat /dev/null) because otherwise it will break the other process that writes to the capture file. Considering this, renaming it would have the same effect as removing it, wouldn't it?

It seems that the best possible solution is to somehow prevent any data being flushed to the file between the completion of archiving and the emptying of the file, but I'm not sure that this is even possible?

I agree there is a potential problem with renaming the logfile: some data might be lost, or something might go wrong with the process writing to the logfile. I was just trying to suggest an alternative to blanking out the file, but I would do neither.

What if you just gzip the file and let it keep growing? You will be 100% sure you have a valid gzip file, 100% sure the log file will not be damaged, and 100% sure that no data are lost. It's blanking out the log file or renaming it that introduces the potential for data loss.

The gzip file might not capture all of the data in the log file, but that doesn't matter: the data are still in the log file. The archive is complete as of some point in time.

Maybe there is some reason you need to blank out (truncate) the log file?

If all it's doing is writing to the file, it should continue seamlessly. You can rename a file that is in use and nothing happens, as long as the inode -- the file's unique ID on the filesystem -- remains the same. You can even delete it while it's in use and the program continues; but everything except the processes that already have it open loses the ability to access the file...

You could move it out of the folder as long as it remains on the same partition. The process would continue writing uninterrupted because the file always exists somewhere; the unique ID of the file, its inode, would remain unchanged.
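
A quick throwaway demonstration of that point (the /tmp filenames are just for the example):

# Background writer that keeps appending through one open descriptor
( while :; do echo "still writing"; sleep 1; done ) > /tmp/capture.log &
WRITER=$!

sleep 3
mv /tmp/capture.log /tmp/capture.log.old   # same filesystem, same inode

sleep 3
kill "$WRITER"

# Everything written before and after the rename ends up in capture.log.old;
# /tmp/capture.log no longer exists.
ls -i /tmp/capture.log.old
cat /tmp/capture.log.old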

I'd use ln and rm instead of mv, to guarantee mv doesn't decide to create a new file for whatever reason, and to guarantee that you're not trying to move it to a different partition. If /path/to/dest is not on the same partition as /path/to/source, ln will fail.

if ! ln /path/to/source /path/to/dest
then
        echo "Couldn't link" >&2
        exit 1
fi

# They share the same inode -- they are literally the same file
# You can delete one of the names without deleting the file itself now.
ls -i /path/to/source /path/to/dest

# Delete the original location, and the new location still exists
rm /path/to/source

Actually, it wouldn't affect the process writing the file -- but it would affect you. The file would still be on disk but not listed in any folder, and it stays that way until the process quits, at which point it is gone for good.

People often delete logfiles expecting to free up disk space, but because the files were still being written to, no space is freed -- and they can't even truncate the files any more, since they're no longer listed in any folder. You have to restart the log daemon, or the system itself, to free that space.
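
If you ever need to see where that "invisible" space went, lsof can list files that are deleted but still held open (assuming lsof is installed; the PID below is just a placeholder):

# Open files whose link count has dropped to zero, i.e. deleted but still open
lsof +L1

# Or inspect one process directly; deleted files are marked "(deleted)"
ls -l /proc/$PID/fd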

Indeed, that'd be ideal. Cooperative locking is possible, but note the word 'cooperative': the writing process has to play along. If it doesn't ask to lock the file, it will never be stopped from writing.

Check the program's options; you might be able to tell it to do so.
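
For what it's worth, if the writer could be made to cooperate, flock(1) from util-linux is one way to express it. Purely a sketch -- the lock file name is invented, and it only helps if the writing process takes the same lock before each append:

(
        flock -x 9 || exit 1
        gzip -c "$CAPDIR/$FILE" > "$KEEPDIR/$FILE.GZ" &&
        > "$CAPDIR/$FILE"
) 9> "$CAPDIR/$FILE.lock"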

The way this is typically done during log rotation is to rename the logfile, then send a signal to the logging process to inform it that it needs to close its file descriptor and create a new logfile, and finally, compress.

Another safe alternative, though more brutish, is to shut down the logging process during rotation.

If your system has logrotate or a similar tool, and if the logging process uses a signal for rotation, then you don't even need to write the shell script. Just add a section to the config file to handle your files.
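
Something like this in /etc/logrotate.d/ would do it -- the path, schedule, and PID file here are only examples, not taken from your setup:

/data/capture/*.log {
        daily
        rotate 7
        compress
        postrotate
                kill -HUP "$(cat /var/run/capture.pid)"
        endscript
}

logrotate also has a copytruncate option for writers that cannot be signalled, but it copies and then truncates, so a small window for data loss remains.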

On an unrelated note, there's no need to use cat to truncate a file. The shell can do it with a simpler, less expensive redirection:

> "$CAPDIR/$FILE

Regards,
Alister

gzip certainly gives an error when another process is writing to the file.
Chain the next command with &&, i.e. run it only if gzip succeeded.

find "$CAPDIR" -maxdepth 1 -type f |
 awk -F/ '{print $NF}' |
while read FILE
do
   echo "Processing $CAPDIR/$FILE --> $KEEPDIR/$FILE.GZ"
   gzip -c "$CAPDIR/$FILE"  > "$KEEPDIR/$FILE.GZ" &&
   > "$CAPDIR/$FILE"
done

A while loop is more appropriate.
Variables in command arguments should be quoted.

Yes this is what I thought. The problem is that the logging process is basically a black box as far as I'm concerned. I don't know if it holds a file descriptor, or just keeps atomically opening-appending-closing the file, nor do I know if it would respond to signals. It's also not possible to shut down the logging process, for various reasons but mainly because the traffic is real-time and needs to be logged as such.

Looks like there's no way to do what I need without the logging process explicitly cooperating. I guess the next best approach would be to strip out of the capture file only the data that definitely appears in the .gz after archiving is complete. And that's something I have no clue how to do yet... stay tuned for another thread :rolleyes:

FWIW - most "logging" processes that follow UNIX standards will close the current output file and then open another on receipt of a SIGHUP signal. Check your documentation.
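
If it does turn out to honour SIGHUP, the whole rotation reduces to something like this (assuming, purely for illustration, that the writer records its PID in /var/run/capture.pid):

mv "$CAPDIR/$FILE" "$CAPDIR/$FILE.rotating" &&
kill -HUP "$(cat /var/run/capture.pid)" &&
gzip -c "$CAPDIR/$FILE.rotating" > "$KEEPDIR/$FILE.GZ" &&
rm "$CAPDIR/$FILE.rotating"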

You could try moving the file out of the way, as you considered earlier and as I explained might actually work...

Yeah, but if it does happen to crash the process(es) dumping to the capture files, worst-case scenario it could result in a temporary nationwide SMS outage costing my company millions (and I can safely assume I will be sacked) :eek:

It's probably not worth experimenting with... :wink:

Yikes. Point taken. Is there any way you can just ignore the files?

What if you copy the log file and run gzip on the static copy? That way you avoid any possibility of a glitch.

How? It could be copied in an incomplete state...

If you mean there is an incomplete record at the end of the log file, I see that as an inconsequential problem. In my mind, the priorities are: 1) the archive is valid, 2) the log files are not damaged, and 3) the process writing to the log files does not get confused. The archive is a backup; if it ever needed to be used, an incomplete record at the tail seems a very minor problem, if it's a problem at all. No solution here is perfect. I'm just suggesting what seems practical and safe.

Now, if the log file is a binary file, and would be corrupted by an incomplete write, then that's even more reason not to mess with it by trying to move or truncate it.
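
In script form, the copy approach would be something like this (again reusing the variables from the original script; the .snapshot suffix is made up):

cp "$CAPDIR/$FILE" "$CAPDIR/$FILE.snapshot" &&
gzip -c "$CAPDIR/$FILE.snapshot" > "$KEEPDIR/$FILE.GZ" &&
rm "$CAPDIR/$FILE.snapshot"

The original file is never touched; only the throwaway snapshot is archived and removed.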

Ah, I understand. You mean making a copy of the files, then making an archive of those, not the originals.

Yes, that's what I meant. Thank you for taking another look.