Converting a huge archive into smaller ones

I have a 13G gzipped tar archive... The problem is that when I expand it, it grows to about 300G, and I don't have that much HDD space. The archive is one huge file: rrc.tar.gz. What I want to do is extract the archive, but gzip each resulting file as soon as it is extracted.

So, if

gunzip -c rrc00.tar.gz | tar -xvf -

gives me an uncompressed directory tree, I want each file to be gzipped as and when it is extracted. So, for example, if the resulting directory looks something like

2007/fileA.txt
2007/fileB.txt

I want fileA.txt to be gzipped into fileA.txt.gz before it goes on to extract fileB.txt. Is there a way to do this?

I was successful with this:
gunzip -c rr.tar.gz | tar -tf - > contents
for f in `cat contents`; do gunzip -c rr.tar.gz | tar -xf - $f; gzip $f; done

but it has a huge drawback: it gunzips the whole archive only to extract a single file, once per file.
It would be much better to run "gunzip -c" once and then parse the output as it streams.

Google two interesting tar options: --to-stdout (-O) and --to-command=
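
For instance (a rough, untested sketch assuming GNU tar, which exports each member's name to the command as TAR_FILENAME), the --to-command hook can pipe every extracted file straight into gzip, so no uncompressed copy ever lands on disk:

# Sketch only: compress each member on the fly while extracting (GNU tar assumed).
# mkdir -p recreates the member's directory in case tar doesn't create it itself.
gunzip -c rrc00.tar.gz | tar -xf - --to-command='mkdir -p "$(dirname "$TAR_FILENAME")"; gzip -c > "$TAR_FILENAME.gz"'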
I need to go now, but I'll be glad if you share the solution with us. It's interesting.

Thank you for the advice. There is a heavy resource constraint, so I will try to explore more. I could think of one solution, and I would appreciate it if someone could provide a better one...

I would do a

 gunzip -c rrc00.tar.gz | tar -xvf - 

And then set up a crontab to look for new files with a certain extension in the current directory. If there are any, I would gzip them. This is the simplest approach I could think of. Please let me know if it is the best one, though :)
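
Purely for illustration (the path and the extension are placeholders, and the problem of files that are still being written is discussed just below), the crontab entry could look something like this:

# hypothetical crontab line: every 2 minutes, gzip any new *.txt files in the extraction directory
*/2 * * * * cd /path/to/extraction/dir && find . -type f -name '*.txt' -exec gzip {} +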

Come to think of it, I am now facing another problem. If a file is still in the middle of being extracted, there is a chance that the cron job will pick it up and run gzip on it, which could be a problem. Is there a way to tell the find command to find only those files that are not being accessed by any other process?

I never imagined I would face so many problems with a directory archived the wrong way :) In any case, I was able to convert a complete directory archive into a directory of archives. Here's the solution, for those interested:

Problem:
The directory was archived and gzipped as a whole, so it is almost impossible to extract data from particular files efficiently.

Constraint:
The archive is 13G and expands to 250G, but the disk capacity is only 50G.

Conventional Answer:
13G Directory Archive --> Expands to 250G --> Converted into 13G Directory of Archives

Answer:
Convert the directory archive into a directory of archives.

Solution:

Step 1:

Prepare a shell script, checkAndGzip.sh, and place it in the directory where the archive is to be extracted:

#!/bin/bash

for FILE in `find ./ -name "*.extension"`
do
        temp=`lsof $FILE | awk '{if(NR>1) if($4 ~ "w") print $4}'`;
        if [ "$temp" = "" ]; then
                #Implies that the file is not in use
                #Initiate gzip on file
                gzip $FILE;
        fi
done;

Note: Observe the usage of lsof, a nice utility that tells you whether a file is in use.
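
For reference, this is what the check keys off (illustrative output only; the PID and sizes are made up). lsof prints one line per open file descriptor, and the fourth column (FD) ends in "w" when the file is open for writing, so an empty result from the awk filter means nothing is still writing to the file and it is safe to gzip:

$ lsof 2007/fileA.txt
COMMAND  PID USER   FD   TYPE DEVICE SIZE/OFF   NODE NAME
tar     1234 user    4w   REG    8,1  1048576 123456 2007/fileA.txt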

Step 2:

Setup a cron as

*/2 * * * * /path/to/checkAndGzip.sh > output

Note: Cron runs every two minutes

Step 3:

Run this command in the directory:

gunzip -c archive.tar.gz | tar -xvf -

Logic: The logic is pretty simple. On the one hand, the extraction takes place; on the other, the cron job executes a shell script that checks whether a new file has been generated and, if so, gzips it. The reason we use lsof is to verify whether a file is still being extracted (gzip doesn't seem to care about partial files); if a file is in use, it is skipped during this run.

If anyone has a better solution, or an improvement to the one above, kindly suggest :)

Nice solution ...

and a few suggestions:

  • You can consolidate all the commands into one script,
  • and you can use the sleep command within the script, instead of setting up a cron job that runs every 2 minutes.

checkAndGzip.sh

#!/bin/bash

#set -x


# Provide full path, so you can run the script from every dir.
cd /full/path/to/zipped_files

# Start unzipping the files; run it in the background so the file checking can start.
gunzip -c archive.tar.gz | tar -xvf -   &

# Start checking for the files, while the unzipping is happening. 
# Use a find ... | while read ... construction, because it doesn't break if the file name has whitespace in it.

 find . -name '*.extension' 2>/dev/null | while read FILE
  do
        temp=`lsof "$FILE" | awk 'NR>1 && $4 ~ "w"{ print $4 }'` ;
        
        if [ "$temp" = "" ]; then
                #Implies that the file is not in use
                #Initiate gzip on file
                gzip "$FILE";
        fi
    
        # Wait for 2 minutes.
        sleep 120

  done > output

Modify the code to fit any other requirement.

Thanks for the improvement :) Actually, on my system, for some reason, the find part doesn't work. I mean, the extraction takes place, but the gzipping part doesn't seem to happen.

The first time find runs, it doesn't find any files (or finds only a few files, which are still in use), so it exits the loop... is that correct, by any chance?

That's correct; a second loop is needed:

#!/bin/bash
#set -x

cd /full/path/to/zipped_files

gunzip -c archive.tar.gz | tar -xvf -   &


# Instead of an infinite loop, you can use your own condition to break out of the loop, say a count of files, a certain disk usage being reached, ...
# while [ condition ] ; do ...

while :
 do
 find . -name '*.extension' 2>/dev/null | while read FILE
  do
        temp=`lsof "$FILE" | awk 'NR>1 && $4 ~ "w"{ print $4 }'` ;
        
        if [ "$temp" = "" ]; then
                #Implies that the file is not in use
                #Initiate gzip on file
                gzip "$FILE";
        fi
    
        
  done > output
        
 # Wait for 2 minutes.
 sleep 120
 
 # If you use an infinite loop, test the condition here and break out of the loop when it becomes true.
 if [ condition ]     
  then 
      break
  fi

done

Again, the code is more of a guideline than an actual working script, so modifications might be needed, but the main logic is there.

Yes... that is it... I used a while true loop instead. Thanks a lot... :)

Wouldn't it be better to skip already-gzipped files in the "find ... | while" construction?

find -type f ! -name "*.gz" finds regular files that don't have a .gz extension, so the following

find . -type f ! -name "*.gz" 2>/dev/null | while read FILE
do
...stuff...
done

should attempt gzip only on files that are not already gzipped.

PS
I like the solution you have developed for its readability, but I still believe the whole task can be done more simply, somehow with parentheses maybe...