How to compare data from 2 zip files and capture the new records from file2 to a new file

I have 2 zip files with about 20 million records in each file. File 2 will have additional records compared to file 1. I want to compare the records in both files and capture the new records from file 2 into another file, file 3. Please help me with a command/script that produces the desired result quickly.

Example:

File1 - a.zip

1,2,3

File 2 - b.zip

2,3,4
1,2,3

Required output

2,3,4

Faster than what? What have you already tried?

What do the contents of the zip file look like? As in, what filenames?

The contents of the zip file look like what I have mentioned in the example.

To get a file out of a .zip, you need a filename, not just the zip file, because a zip can hold more than one file.

If the .zip files don't contain filenames, they're not really .zip, so what are they?

I am able to compare uncompressed files using awk, but I want a way to compare compressed files.
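(For reference, the comparison I use on uncompressed files is something like the standard awk idiom below; `old.txt`/`new.txt` are placeholder names.)

```shell
# Print lines of new.txt that do not appear in old.txt.
# NR==FNR is true only while awk reads the first file: those lines are
# stored as keys in the array "seen", then skipped with "next".
# For the second file, a line is printed only if it is not in "seen".
printf '1,2,3\n' > old.txt
printf '2,3,4\n1,2,3\n' > new.txt
awk 'NR==FNR { seen[$0]; next } !($0 in seen)' old.txt new.txt
```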

Each .zip file has one text file within it, and the two archives reside in different directories. Both .zip files have the same name, a.zip, and both text files within them have the same name, a.txt.

zcat /home/20111201/a.zip

1,2,3

zcat /home/20111202/a.zip

2,3,4
1,2,3

Those should be .gz or .z, not .zip.

$ man zcat

GZIP(1)                                                                GZIP(1)



NAME
       gzip, gunzip, zcat - compress or expand files

SYNOPSIS
       gzip [ -acdfhlLnNrtvV19 ] [-S suffix] [ name ...  ]
       gunzip [ -acfhlLnNrtvV ] [-S suffix] [ name ...  ]
       zcat [ -fhLV ] [ name ...  ]

DESCRIPTION
       Gzip  reduces  the  size  of  the  named  files using Lempel-Ziv coding
       (LZ77).  Whenever possible, each file  is  replaced  by  one  with  the
       extension .gz, while keeping the same ownership modes, access and modi-
       fication times.  (The default extension is -gz for VMS,  z  for  MSDOS,
       OS/2  FAT, Windows NT FAT and Atari.)

An actual zip archive is difficult to use in a pipe chain.

Knowing what you have, I'd do this:

mkfifo data1 data2
zcat < a.gz > data1 &
zcat < b.gz > data2 &

awk 'BEGIN { while(getline <"data1") L[$0]=1 }; !L[$0]' data2

wait
rm -f data1 data2


A simplified version:

mkfifo data1
zcat < a.zip > data1 &
zcat < b.zip | awk 'BEGIN { while(getline <"data1") L[$0]=1 }; !L[$0]'
wait
rm data1

See this as well, maybe it will be helpful:

The Power of Z Commands – Zcat, Zless, Zgrep, Zdiff Examples (link removed)

Thanks!! This code is working as expected. I tried it with less than a million records in each file, and it gave me the output in less than a minute. But when I tried with 27 million records in each file, it has been running for over an hour. Will this consume a lot of disk space? Is there a way to get the output faster?

It has to hold the complete, uncompressed contents of "a" in memory to tell whether any lines from "b" exist in it. How else would it know, when it can't make any assumptions like ordering? This doesn't take disk space, but it takes as much memory as it needs to hold "a" uncompressed.

That can't be simplified or sped up without sorting the input files first -- which takes time itself, and would alter the order of output.
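(If the changed output order is acceptable, the sort-based route could look something like this sketch, using `comm` on the two sorted, decompressed streams; `old.gz`/`new.gz` are placeholder names for your two compressed files.)

```shell
# Decompress and sort both files, then print only the lines unique to
# the second (newer) file. sort uses temporary disk space instead of
# holding everything in memory; comm itself streams line by line.
zcat old.gz | sort > old.sorted
zcat new.gz | sort > new.sorted
comm -13 old.sorted new.sorted > file3   # -13: suppress columns 1 and 3
```

Note that `comm` requires its inputs to be sorted, and the lines in file3 will come out in sorted order rather than in the order they appeared in the original file.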