I have two zip files with about 20 million records each. File 2 will contain additional records beyond those in file 1. I want to compare the records in both files and capture the new records from file 2 into another file, file3. Please help me with a command/script that produces the desired result quickly.
I am able to compare uncompressed files using awk, but I want a way to compare the compressed files directly.
Each .zip file contains a single text file, and the archives reside in different directories. Both .zip files have the same name, a.zip, and both text files within them have the same name, a.txt.
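The answer being discussed below isn't quoted in this excerpt; a minimal sketch of the usual awk approach, assuming bash (for process substitution) and `unzip -p` to stream each archive's single member without extracting it to disk (`dir1`/`dir2` are placeholder paths for the two directories):

```shell
# compare_new FILE_OLD FILE_NEW
# First pass (NR==FNR): store every record of the old stream as an
# array key. Second pass: print only records absent from that array.
compare_new() {
    awk 'NR==FNR { seen[$0]; next } !($0 in seen)' "$1" "$2"
}

# With the two archives (dir1/dir2 stand in for the real directories);
# unzip -p streams the a.txt member to stdout without extracting it:
#   compare_new <(unzip -p dir1/a.zip a.txt) \
#               <(unzip -p dir2/a.zip a.txt) > file3
```

Note this preserves the order of file 2 in the output, but holds every distinct record of file 1 in memory.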
$ man zcat
GZIP(1)                                                          GZIP(1)

NAME
       gzip, gunzip, zcat - compress or expand files

SYNOPSIS
       gzip [ -acdfhlLnNrtvV19 ] [-S suffix] [ name ... ]
       gunzip [ -acfhlLnNrtvV ] [-S suffix] [ name ... ]
       zcat [ -fhLV ] [ name ... ]

DESCRIPTION
       Gzip reduces the size of the named files using Lempel-Ziv coding
       (LZ77). Whenever possible, each file is replaced by one with the
       extension .gz, while keeping the same ownership modes, access and
       modification times. (The default extension is -gz for VMS, z for
       MSDOS, OS/2 FAT, Windows NT FAT and Atari.)
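The relevance of the man page here is that zcat streams decompressed data to stdout, so the archives never need to be expanded on disk (GNU gzip's zcat can also read a zip archive that holds a single deflated member). A small illustrative helper (the `line_count` name is my own, not from the thread) for checking a compressed file's record count:

```shell
# line_count FILE.gz
# Decompress to stdout and count records without ever writing the
# uncompressed text to disk.
line_count() {
    zcat "$1" | wc -l
}

# e.g. line_count dir1/a.zip   # dir1 is a placeholder path
```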
Thanks!! This code is working as expected. I tried it with less than a million records in each file and it gave me the output in under a minute. But when I tried it with 27 million records in each file, it has been running for over an hour. Will this consume a lot of disk space? Is there a way to get the output faster?
It has to hold the complete, uncompressed contents of "a" in memory to tell whether any line from "b" exists in it. How else would it know, when it can't make any assumptions such as the input being ordered? This doesn't take disk space, but it takes as much memory as it needs to hold "a" uncompressed.
That can't be simplified or sped up without sorting the input files first -- which takes time itself, and would alter the order of output.
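For completeness, a sketch of that sort-based alternative, assuming bash and GNU sort/comm: it trades the in-memory array for external sorting, uses bounded memory, but reorders the output. `LC_ALL=C` forces plain byte ordering so sort is faster and comm agrees with it.

```shell
# new_records FILE_OLD FILE_NEW
# Sort both streams on the fly, then keep lines unique to the second
# stream: comm -13 suppresses lines only in file 1 and lines in both.
new_records() {
    LC_ALL=C comm -13 <(LC_ALL=C sort "$1") <(LC_ALL=C sort "$2")
}

# e.g. (dir1/dir2 are placeholder paths):
#   new_records <(unzip -p dir1/a.zip a.txt) \
#               <(unzip -p dir2/a.zip a.txt) > file3
```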