Compare 2 text files with 1 column each and write the mismatched data to a 3rd file

Divya_Nochiyil · October 24, 2013, 10:33am

Hi,
I need to compare 2 text files, each with around 60000 rows and 1 column, and write the mismatched data to a 3rd file.
File1 - file2 = file3

wc -l file1.txt
58112
wc -l file2.txt
55260
 
head -5 file1.txt
101214200123
101214700300
101250030067
101214100500
109912312312
 
head -5 file2.txt
101250030067
101214200123
101214700333
109912312312
101214700300

I can sort the files.
What should I do after that, or is there another way?

Thanks,
Divya

pamu · October 24, 2013, 10:46am

with awk

$ awk 'NR==FNR{A[$1]++;next}{if(! A[$1]){print }else{A[$1]=0}}END{for(i in A){if(A[i]){print i}}}' file1 file2

101214700333
101214100500

Akshay_Hegde · October 24, 2013, 11:08am

You may try this also

$ awk 'FNR==NR{A[$1]++;next}{if(!($1 in  A))print;else delete A[$1]}END{for (i in A)print i}' file1 file2

101214700333
101214100500

paresh_n_doshi · October 28, 2013, 8:36am

Can you please explain what A[$1]++ does exactly? It creates an array but does not appear to assign anything to it, and I do not understand the increment sign. Please help me understand this.

Subbeh · October 28, 2013, 8:53am

You could use the comm command for this:

comm -23 file1.txt file2.txt

The files need to be sorted though
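A minimal sketch of the sort-then-comm approach, using the five sample lines from the first post. The temporary working directory and the names file1.sorted and file2.sorted are assumptions made for illustration only.

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d) && cd "$workdir"

# Sample data: the first five lines of each file from the first post.
printf '%s\n' 101214200123 101214700300 101250030067 101214100500 109912312312 > file1.txt
printf '%s\n' 101250030067 101214200123 101214700333 109912312312 101214700300 > file2.txt

# comm expects both inputs to be sorted in the same collation order.
sort file1.txt > file1.sorted
sort file2.txt > file2.sorted

# -2 drops lines found only in file2, -3 drops lines common to both,
# so only the lines found in file1 but not in file2 remain.
comm -23 file1.sorted file2.sorted > file3.txt
cat file3.txt
```

With this sample, file3.txt should contain only 101214100500. Swapping -23 for -13 would list the lines found only in file2 instead.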

pamu · October 28, 2013, 9:11am

Please check the example below; it may clear up your doubts.

$ cat file

101250030067
101214200123
101214700333
109912312312
101214700300
101214700333
101214700333
109912312312

$ awk '{A[$1]++}END{for(i in A) print i,A[i]}' file

101214700333 3 # It occurs 3 times in the file, so A[i]=3
101214200123 1 # It occurs 1 time in the file, so A[i]=1
109912312312 2 # It occurs 2 times in the file, so A[i]=2
101214700300 1
101250030067 1
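On the question of why nothing seems to be assigned: in awk, an array element that has never been set behaves as 0 in arithmetic, so A[$1]++ has the same effect as A[$1] = A[$1] + 1. The sketch below shows the value before and after the increment for three sample lines; the variable names before and out are made up for illustration.

```shell
# Feed three sample values (two of them identical) through awk.
out=$(printf '%s\n' 101214700333 109912312312 101214700333 |
  awk '{
    before = A[$1] + 0   # an unset element counts as 0 in arithmetic
    A[$1]++              # same effect as A[$1] = A[$1] + 1
    print $1, before, A[$1]
  }')
echo "$out"
```

The first time a value is seen, the counter goes from 0 to 1; the second occurrence of 101214700333 takes it from 1 to 2.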

MadeInGermany · October 28, 2013, 11:02am

Using the existence test (x in A), it is indeed possible to create an array element without assigning it a value:

awk 'FNR==NR {A[$1]; next}
{if ($1 in  A) delete A[$1]; else print $1} END {for (i in A) print i}' file1 file2

Divya_Nochiyil · November 4, 2013, 8:02am

I need only the data that is in file 1 but missing from file 2. How can we do that?

cat file1
 
101250030067
101214200123
101214700333
109912312312


cat file2
101250030067
101214200123
101214700333
101214700300

File3 should contain 109912312312 alone.
101214700300 is not needed.
i.e. only the missing data

greet_sed · November 4, 2013, 8:08am

How about grep?

grep -v -f file2 file1 > file3

It works for the given sample input.
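One possible refinement, offered as a sketch rather than a guaranteed fix: by default grep treats every line of file2 as a regular expression and also matches substrings. Adding -F (fixed strings) and -x (whole-line match) avoids both, and fixed-string matching is usually much faster with a large pattern file. The scratch directory below is an assumption for illustration.

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d) && cd "$workdir"

# Sample data from the post above.
printf '%s\n' 101250030067 101214200123 101214700333 109912312312 > file1
printf '%s\n' 101250030067 101214200123 101214700333 101214700300 > file2

# -f file2 : read the patterns from file2
# -F       : treat each pattern as a fixed string, not a regex
# -x       : a pattern must match the whole line
# -v       : print the lines of file1 that match none of the patterns
grep -v -x -F -f file2 file1 > file3
cat file3
```

For this sample, file3 should contain only 109912312312.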

Divya_Nochiyil · November 4, 2013, 8:24am

As the 1st file contains 50000 lines, grep is taking too much time.
Is there a better way?

greet_sed · November 4, 2013, 8:45am

Try this:

awk 'NR==FNR{a[$1]=$1;next} { if (!a[$1]) { print $1 } } ' file2 file1

output:

109912312312

cat file2
101250030067
101214200123
101214700333
101214700300

cat file1
101250030067
101214200123
101214700333
109912312312
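A variant of the awk approach above, sketched with the (x in array) membership test and with the result redirected to file3 as originally requested. The array name seen and the scratch directory are assumptions for illustration.

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d) && cd "$workdir"

# Sample data from the post above.
printf '%s\n' 101250030067 101214200123 101214700333 109912312312 > file1
printf '%s\n' 101250030067 101214200123 101214700333 101214700300 > file2

# First pass (file2, where NR == FNR): record every value as an array key.
# Second pass (file1): print the lines whose value was never recorded.
# Note: if file2 is empty, NR == FNR also holds for file1 and nothing prints.
awk 'NR == FNR { seen[$1]; next } !($1 in seen)' file2 file1 > file3
cat file3
```

Testing membership with in, rather than the stored value, keeps the check correct even for a value such as 0 that would otherwise evaluate as false. For this sample, file3 should contain only 109912312312.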