Is data held in memory between a tar and gzip cmd

Hello,
I am working on a process to copy directories from an AIX server to a NFS shared drive for Cohesity to backup.

I'm using a command like this:
tar cf - /opt/ftp | gzip > /mnt/cohesity/opt.tar.gz 2>/tmp/dailybkup.err

There will be about 200GB of data processed through the tar cmd and then the gzip cmd.
The question is whether all of that data is held in memory while the tar cmd runs, before it is compressed by the gzip command and written to disk.

We are concerned that processing too much data through the 2 commands may affect system memory.

This is on an IBM 8284-22A Power 8 server with 64GB memory.

Thanks to anyone that can shed some light on how data is handled when piping the output of a tar cmd through a gzip command!

@KeithBH, hi, welcome to the community; we hope you find it friendly and helpful.

These commands all buffer their input/output internally; the exact amounts vary, but both are designed to stream massive amounts of data efficiently. I would reckon no more than a few tens of megabytes in total.
You could try it with a representative subset of the data and monitor the memory usage of the two processes.
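To make that concrete, here is a rough sketch of such a test on a small stand-in tree (paths are examples, and `ps` output columns may differ slightly between AIX and Linux):

```shell
# Build a small test tree as a stand-in for a subset of /opt/ftp.
mkdir -p /tmp/subset
dd if=/dev/zero of=/tmp/subset/file bs=1k count=512 2>/dev/null

# Run the pipeline in the background; $! is the PID of gzip,
# the last command in the pipeline.
tar cf - /tmp/subset 2>/dev/null | gzip > /tmp/test.tar.gz &
gzpid=$!

# Sample the resident set size (RSS, in KB) while gzip is alive.
while kill -0 "$gzpid" 2>/dev/null; do
    ps -o pid,rss,comm -p "$gzpid"
    sleep 1
done
wait
ls -l /tmp/test.tar.gz
```

On a subset this small the pipeline finishes almost instantly; against a few GB of real data you would see the RSS of both processes stay small and flat regardless of input size.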

If you have rsync on your system, you may want to look at it as a potential alternative.

Pipes are a general, system-wide mechanism: they merely join the stdout of a writer process to the stdin of a reader process. There is nothing special about tar and gzip in this scenario. The pipe buffer itself takes insignificant space -- generally 64 kilobytes on Linux.

Both processes are started at once when the pipe connection is made. Whenever the reader has nothing available to read, the kernel makes it wait for input (just as it would wait for a disk read). Whenever the pipe is nearly full, the kernel makes the writer wait until the reader drains some of it and thereby frees up space. These waits operate on the order of milliseconds. When the writer process ends, the reader gets an EOF once it has received all the data.
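That backpressure is easy to demonstrate with a toy pipeline (illustrative numbers, nothing AIX-specific):

```shell
# The writer pushes 1 MiB into the pipe immediately, but the reader
# sleeps for 2 seconds before draining it. Because the pipe buffer
# holds only tens of KB, the writer blocks until the reader wakes up,
# so the whole pipeline takes about 2 seconds -- and the 1 MiB never
# sits in RAM beyond that small buffer.
start=$(date +%s)
dd if=/dev/zero bs=1024k count=1 2>/dev/null | { sleep 2; cat >/dev/null; }
end=$(date +%s)
echo "elapsed: $((end - start))s"
```

The same throttling is what keeps your 200GB run bounded: tar can never get more than one pipe-buffer's worth ahead of gzip.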

So you have both processes in memory most of the time, but the whole piped data stream is transient, and independent of the actual data volume being transferred.

You might notice that tar takes compression options on its command line, like -j, -J, -z, -Z and others. These are somewhat cosmetic: under the hood, tar merely opens an internal pipe to or from the relevant compress command, resulting in exactly the same mechanism as does your shell invocation.
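For example, with GNU tar the following two commands go through the same pipe-to-gzip mechanism. The compressed files may differ byte-for-byte (gzip stores a timestamp in its header), but the underlying tar streams come out identical:

```shell
mkdir -p /tmp/demo
echo "data" > /tmp/demo/f

# Explicit shell pipe, as in the question:
tar cf - -C /tmp demo | gzip > /tmp/a.tar.gz
# GNU tar's built-in option; internally tar opens a pipe to gzip:
tar czf /tmp/b.tar.gz -C /tmp demo

# The decompressed tar streams are identical:
gzip -dc /tmp/a.tar.gz > /tmp/a.tar
gzip -dc /tmp/b.tar.gz > /tmp/b.tar
cmp /tmp/a.tar /tmp/b.tar && echo "identical tar streams"
```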


According to this, the pipe buffer is 32 KB on AIX. So it will use very little RAM.

Only GNU tar has built-in compression.

Thank you for the answers!!!
