Find text that is different in two files

cmccabe · May 27, 2015, 11:16am

In the attached files, I am trying to use import.txt to find what is missing in all.txt and print the missing lines in missing.txt. I used SQL to import a list into a database and got errors and need to figure out what didn't import correctly. The below script is close, I think, but doesn't result in the desired output (all the lines that do not have a match with import.txt)

So if the text in import.txt is in all.txt that line is not printed, however if the text in import.txt is not in all.txt, then the entire line is printed in missing.txt. Thank you :).

 awk '{ h[$0] = ! h[$0] } END { for (k in h) if (h[k]) print k }' import.txt all.txt > missing.txt

Also tried:

 diff import.txt all.txt | perl -lne 'if(/^[<>]/){s/^..//g;print}' > missing.txt

mjf · May 27, 2015, 1:54pm

cmccabe,
I'm having trouble making out the lines on the smaller file you attached (the other one is 87.7mb). Can you attach smaller exmaple files? Also, can you show your expected results and the undesired results you are getting?
If you are comparing the entire lines on the files (e.g. not matching on specific keys) and are only looking for lines in import.txt not in all.txt, another way you can try is the below if files are sorted (or you can pre-sort):

comm -23 import.txt all.txt > missing.txt

Re: using SQL to import into a database and you need to figure out what didn't import correctly, are you not capturing the records that failed to load at that point? e.g. using Oracle SQL Loader and a .bad file, you can store the records that failed to load during the insert.

cmccabe · May 27, 2015, 2:37pm

I have attached smaller files of each. Basically, the desired output.txt would be all the lines that do not match import.txt (should be 6 out of the 10) - All the PXL- do not match so they are written to output.txt. Thank you :).

import.txt

ADAMTS10E10for
ADAMTS10E10rev
ADAMTS10E11for
ADAMTS10E11rev
ADAMTS10E20for
ADAMTS10E20rev

all.txt

NULL	NULL	NULL	NULL	20152005	630	admin	Imported	PXL-A0285435ref	26950850	NULL	Y	NULL	37	NULL	NULL	NULL	pxlence	SeqRxn4
NULL	NULL	NULL	NULL	20152005	630	admin	Imported	PXL-A0285435antiref	26951039	NULL	Y	NULL	37	NULL	NULL	NULL	pxlence	SeqRxn4
NULL	NULL	NULL	NULL	20152005	630	admin	Imported	PXL-A0285441ref	26980056	NULL	Y	NULL	37	NULL	NULL	NULL	pxlence	SeqRxn4
NULL	NULL	NULL	NULL	20152005	630	admin	Imported	PXL-A0285441antiref	26980301	NULL	Y	NULL	37	NULL	NULL	NULL	pxlence	SeqRxn4
NULL	NULL	NULL	NULL	20152005	630	admin	Imported	PXL-A0285472ref	27190068	NULL	Y	NULL	37	NULL	NULL	NULL	pxlence	SeqRxn4
NULL	NULL	NULL	NULL	20152005	630	admin	Imported	PXL-A0285472antiref	27190236	NULL	Y	NULL	37	NULL	NULL	NULL	pxlence	SeqRxn4
NULL	NULL	NULL	NULL	20141009	630	admin	Imported	ADAMTS10E10for	8661383	8661400	19	CGCCTATGAAGGCAGTGG	37	20130823	20130903	20160901	GC rich region - no M13 primers	SeqRxn2
NULL	NULL	NULL	NULL	20141009	630	admin	Imported	ADAMTS10E10rev	8661119	8661101	19	AATCTGGGGAAAGGGGTGT	37	20130823	20130903	20160901	GC rich region - no M13 primers	SeqRxn2
NULL	NULL	NULL	NULL	20141009	630	admin	Imported	ADAMTS10E11for	8661258	8661276	19	ATGTGTGAGCGCGAGAGAA	37	20131007	20131007	20161001	GC rich region - no M13 primers	SeqRxn2
NULL	NULL	NULL	NULL	20141009	630	admin	Imported	ADAMTS10E11rev	8660932	8660914	19	ATGAGTGTGACCCGCTCTG	37	20131007	20131007	20161001	GC rich region - no M13 primers	SeqRxn2

The undesired output is also attached from:

 awk '{ h[$0] = ! h[$0] } END { for (k in h) if (h[k]) print k }' import.txt all.txt > missing.txt

mjf · May 27, 2015, 4:01pm

Try sometrhing like this:

$nawk 'FNR==NR{map[$1]=$0;next} !map[$18]' import.txt all.txt       
1       1       20152005        630     admin   NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    20152005        630     admin   Imported        PXL-A0285435ref 26950850        NULL    Y       NULL    37      NULL    NULL    NULL    pxlence SeqRxn4
1       1       20152005        630     admin   NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    20152005        630     admin   Imported        PXL-A0285435antiref     26951039        NULL    Y       NULL    37      NULL    NULL    NULL    pxlence SeqRxn4
1       1       20152005        630     admin   NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    20152005        630     admin   Imported        PXL-A0285441ref 26980056        NULL    Y       NULL    37      NULL    NULL    NULL    pxlence SeqRxn4
1       1       20152005        630     admin   NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    20152005        630     admin   Imported        PXL-A0285441antiref     26980301        NULL    Y       NULL    37      NULL    NULL    NULL    pxlence SeqRxn4
1       1       20152005        630     admin   NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    20152005        630     admin   Imported        PXL-A0285472ref 27190068        NULL    Y       NULL    37      NULL    NULL    NULL    pxlence SeqRxn4
1       1       20152005        630     admin   NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    20152005        630     admin   Imported        PXL-A0285472antiref     27190236        NULL    Y       NULL    37      NULL    NULL    NULL    pxlence SeqRxn4

cmccabe · May 27, 2015, 4:04pm

Works perfect.... thank you :).

mjf · May 27, 2015, 4:13pm

Good, you're welcome. But have you looked at identifying these records (or the next ones that you might load) that failed at insert time? Or perhaps this was a one time load and not something you will repeat again?

cmccabe · May 27, 2015, 4:16pm

This was a one time load, that, hopefully, will not be repeated again. I think the error "timed-out" and only completed half of the files. But I had noo idea which ones until now.... thanks again :).

MadeInGermany · May 27, 2015, 6:43pm

Actually you don't need to store the long $0 in memory,
map[$1]=1 is enough.
Most awk can store even nothing, and can lookup elements without defining them if not found

awk 'FNR==NR{map[$1];next} ($18 in map)' import.txt all.txt

mjf · May 27, 2015, 9:10pm

MadeInGermany,
Thanks for the tip, however, based on the requirement, the check to see if the variable exists in the array needs to be negated:

 
nawk 'FNR==NR{map[$1];next} !($18 in map)' import.txt all.txt