Compare huge files

Hi,
I have two files with roughly 4,000,000 and 3,900,000 records, and I want to find the content:

1. which exists in file1 but not in file2.
2. which exists in file2 but not in file1.

The format of the files is like:

404ABCDEFGHIJK|CDEFGHIJK|1234567890|1

For smaller files I used to do egrep -f, but that does not scale to files of this size.

Need your help to sort it out.

If your machine has enough memory (I would hope 2 GB is enough), you should be able to do something like this:

sort f1 >f1.$$          # sort both files so diff compares them line by line
sort f2 >f2.$$
diff f1.$$ f2.$$        # "<" lines exist only in f1, ">" lines only in f2
# rm f1.$$ f2.$$        # clean up the temporary files when done
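Once the files are sorted, comm(1) will also give you the two "only in" lists directly, without having to pick apart diff output. A minimal sketch (the apple/banana data is made up for illustration; f1/f2 are the file names from the post above):

```shell
# Sample data -- already in sorted order, as comm requires
printf '%s\n' apple banana cherry > f1
printf '%s\n' banana cherry date  > f2

sort f1 > f1.sorted
sort f2 > f2.sorted

comm -23 f1.sorted f2.sorted   # lines only in f1 (suppress cols 2 and 3)
comm -13 f1.sorted f2.sorted   # lines only in f2 (suppress cols 1 and 3)

rm f1 f2 f1.sorted f2.sorted   # clean up
```

Here the first comm prints "apple" and the second prints "date", which answers both of your questions in one pass over the sorted files.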

Here's a grep -f method that doesn't use a lot of memory, but takes a long time:

while read -r f1rec; do
  grep -F -x -- "$f1rec" f2 >/dev/null || printf '%s\n' "$f1rec"
  # -F matches fixed strings (no regex), -x matches whole lines only,
  # and -- ensures records beginning with "-" are not taken as options.
  # printf is used because "echo --" would print the "--" itself on
  # some systems.
done < f1

That will find all the records in f1 not in f2. Just swap the two file names to get the reverse.
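The loop above rescans all of f2 for every record of f1, which is why it takes so long. If your memory can hold the smaller file, a single awk pass over each file does the same job much faster (this is an alternative to the grep loop, not part of the original suggestion; the a/b/c data is made up for illustration):

```shell
# Sample data in hypothetical files f1 and f2
printf '%s\n' a b c > f1
printf '%s\n' b c d > f2

# NR==FNR is true only while reading the first file (f2 here):
# store its records as keys of the "seen" array, then print every
# f1 record that is not a key. No sorting needed.
awk 'NR==FNR {seen[$0]; next} !($0 in seen)' f2 f1   # prints: a

rm f1 f2   # clean up
```

Swap the two file arguments (`f1 f2`) to get the records only in f2.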

If all you want to do is merge the files, and no duplicates are allowed, here you go:

sort -u f1 f2 >merged