File comparisons for huge data files (around 60G) - need the most optimized way to do this

I have 2 large files (.dat), around 70 G, 12 columns, but the data is not sorted in either file. I need your inputs on the best optimized method/command to compare them and redirect the non-matching lines to a third file (diff.dat).

File 1 - 15 columns
File 2 - 15 columns

Data is not in sorted order.

What is the best method/command to achieve this?
Sample files and the desired output would help as well...

Sample line looks like this:

2036|001|021|92|570|2|422|1|0|0|0|570|0|0|12

Field separator - "|"

File 1 size - 60 G
File 2 size - 61 G
Note - the data is not in sorted order (neither file1 nor file2)

Requirement: I need to find the non-matching lines and redirect those to a new file, "difference.dat".

What constitutes "non-matching" lines?
The entire line, or some key fields in file1 and file2 to match on?
You have to be clearer with your requirement statements.

Also, please use code tags when posting code/data samples.

Thanks for the quick reply. Entire line...

Look into man grep with options -F and -f.
Or man fgrep.

grep -F -x -v -f file2 file1 ?? Or is there any other, more optimized command?

Sounds about right.
Just remember - whatever you do, comparing 60G files will be slow...
Test this on smaller chunks first to see if you're getting the desired results.
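For example, a rough sketch of how a small-chunk test might look (file names and the chunk size here are placeholders, not from the thread):

# take the first 100000 lines of each file as a test sample
head -n 100000 file1.dat > file1.sample
head -n 100000 file2.dat > file2.sample

# lines of file1.sample that do not appear verbatim in file2.sample
grep -F -x -v -f file2.sample file1.sample > diff.sample

# eyeball the result before committing to the full 60G run
wc -l diff.sample

One caveat: grep -f loads the whole pattern file into memory, so using a ~60G file2 as the pattern file may simply run out of RAM; the sort/comm route suggested below sidesteps that.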

Hi kartikirans,

I'd be tempted to look at comm -3 ${file1} ${file2}; this will suppress lines common to ${file1} and ${file2}. (Later GNU versions of comm accept a --nocheck-order option, but the results are only reliable if the files have been sorted first.)

Regards

Gull04
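A rough sketch of how that could look at this scale (just a sketch; file names follow the thread, and it assumes GNU sort/comm plus enough temporary disk space for the external sort):

# sort both files first (sort does an external merge sort, spilling to temp files)
sort file1.dat > file1.sorted
sort file2.dat > file2.sorted

# -3 suppresses lines common to both files;
# lines unique to file2.sorted come out prefixed with a tab
comm -3 file1.sorted file2.sorted > difference.dat

With files this size, point sort -T (or TMPDIR) at a filesystem with enough free space for the temporary files.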

One additional question: what does "non-matching lines" mean?

  • only lines in file1 which are not in file2? or
  • plus lines in file2 which are not in file1?

bakunin
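For what it's worth, once both files are sorted (as in the comm sketch above), the two cases can be split out explicitly; again just a sketch reusing the placeholder sorted file names:

# lines that are in file1 but not in file2
comm -23 file1.sorted file2.sorted > only_in_file1.dat

# lines that are in file2 but not in file1
comm -13 file1.sorted file2.sorted > only_in_file2.dat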