Finding records NOT on another file

wbport · November 2, 2018, 1:00pm

I have three files named ALL, MATCH, and DIFF. MATCH and DIFF contain completely different records, all of which are also in ALL, but ALL also has records that are in neither MATCH nor DIFF.

I know I can sort all three files together, once with the unique option and once without, and run diff on the two results to show which records appear in two files, but how can I find the records that are only in the ALL file?

TIA

Corona688 · November 2, 2018, 4:10pm

If ALL is small enough to fit in memory:

awk 'NR==FNR { A[$0] ; next } ; $0 in A { delete A[$0] } END { for(X in A) { print X } }' ALL MATCH DIFF

RudiC · November 3, 2018, 5:38am

Try also

sort ALL MATCH DIFF | uniq -c | grep "^ *1 "
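
A minimal sketch of the same counting idea with uniq -u, which prints only the lines that occur exactly once in its sorted input. It assumes every record appears at most once per file and that everything in MATCH and DIFF is also in ALL, as described in the question. The sample records and the scratch directory are made up for illustration.

```shell
# Hypothetical sample data in a scratch directory (not the poster's real files).
tmp=$(mktemp -d)
printf '%s\n' apple banana cherry date > "$tmp/ALL"
printf '%s\n' banana > "$tmp/MATCH"
printf '%s\n' date > "$tmp/DIFF"

# A record present in ALL and in MATCH or DIFF shows up twice after sorting;
# a record present only in ALL shows up once, so uniq -u keeps just those.
only_in_all=$(sort "$tmp/ALL" "$tmp/MATCH" "$tmp/DIFF" | uniq -u)

printf '%s\n' "$only_in_all"   # prints apple, then cherry
rm -rf "$tmp"
```

If ALL itself can contain duplicate records, those records are dropped from the output too, so this only fits data without repeats.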

apmcd47 · November 6, 2018, 4:28am

Sorted (untested):

comm -23 <(sort ALL) <(sort MATCH DIFF)

Unsorted (untested):

fgrep -x -f <(comm -23 <(sort ALL) <(sort MATCH DIFF)) ALL

You may wish to use the -u switch to sort to remove duplicate lines.
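
A rough sketch of the -u suggestion, written with temporary files instead of process substitution so it also runs in a plain POSIX sh. The sample records and the intermediate file names (all.sorted, seen.sorted) are invented for the example.

```shell
# Hypothetical sample data; ALL deliberately contains "apple" twice.
tmp=$(mktemp -d)
printf '%s\n' cherry apple banana apple date > "$tmp/ALL"
printf '%s\n' banana > "$tmp/MATCH"
printf '%s\n' date > "$tmp/DIFF"

# sort -u sorts and drops duplicate lines in one pass.
sort -u "$tmp/ALL" > "$tmp/all.sorted"
sort -u "$tmp/MATCH" "$tmp/DIFF" > "$tmp/seen.sorted"

# comm -23 suppresses columns 2 and 3, leaving lines unique to the first file.
only_in_all=$(comm -23 "$tmp/all.sorted" "$tmp/seen.sorted")

printf '%s\n' "$only_in_all"   # prints apple, then cherry (each once)
rm -rf "$tmp"
```

Without -u on the first sort, the duplicated "apple" would be reported twice.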

Andrew

Don_Cragun · November 6, 2018, 6:04am

One could also try:

awk 'FNR == 1 { fc++ } fc < 3 {d[$0]; next } !($0 in d)' DIFF MATCH ALL

which has been tested.

This needs enough memory to hold the unique records in DIFF and MATCH, but none for the unique records in ALL.
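
To illustrate on made-up data (the records below are hypothetical): because ALL is streamed rather than stored, the output keeps ALL's original order and any duplicates it contains.

```shell
# Hypothetical sample data in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' cherry apple banana apple date > "$tmp/ALL"
printf '%s\n' banana > "$tmp/MATCH"
printf '%s\n' date > "$tmp/DIFF"

# fc counts files as they start. While reading the first two files,
# remember each record in d; for the third file, print records not in d.
only_in_all=$(awk 'FNR == 1 { fc++ } fc < 3 { d[$0]; next } !($0 in d)' \
    "$tmp/DIFF" "$tmp/MATCH" "$tmp/ALL")

printf '%s\n' "$only_in_all"   # prints cherry, apple, apple
rm -rf "$tmp"
```

One thing to watch: the FNR == 1 test never fires for an empty file, so the file count is off if DIFF or MATCH happens to be empty.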

MadeInGermany · November 6, 2018, 7:01am

The following variant works with any number of "exclude" files:

awk 'BEGIN {nfiles=ARGC-1} FNR == 1 { fc++ } fc < nfiles {d[$0]; next } !($0 in d)' DIFF MATCH ALL

Another idea: make the last filename special

awk 'FILENAME!="-" { d[$0]; next } !($0 in d)' MATCH DIFF - < ALL
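
A sketch in the same spirit, wrapped in a shell function. Note that this is a variation rather than the "-" trick above: it compares FILENAME against the last command-line argument, ARGV[ARGC-1]. The function name only_in_last and the sample files are hypothetical, and it assumes the last file's name is not also given as one of the exclude files.

```shell
# only_in_last EXCLUDE... MASTER
# Prints the records of the last file that appear in none of the earlier files.
only_in_last() {
    awk 'FILENAME != ARGV[ARGC-1] { seen[$0]; next } !($0 in seen)' "$@"
}

# Hypothetical sample data, including an empty exclude file.
tmp=$(mktemp -d)
printf '%s\n' cherry apple banana apple date > "$tmp/ALL"
printf '%s\n' banana > "$tmp/MATCH"
printf '%s\n' date > "$tmp/DIFF"
: > "$tmp/EMPTY"

result=$(only_in_last "$tmp/MATCH" "$tmp/EMPTY" "$tmp/DIFF" "$tmp/ALL")

printf '%s\n' "$result"   # prints cherry, apple, apple
rm -rf "$tmp"
```

Since the test looks at the file name instead of counting files, an empty exclude file does not throw it off.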