Compare two big files for differences using Linux

shanul_karim · August 15, 2017, 12:08pm

Hello everybody

I am looking for help comparing two files on Linux (the files are big, about 800 MB each).

Example:

File1 has the data below

$ cat file1
5,6,3
2.1.4
1,1,1
8,9,1

File2 has the data below

$ cat file2
5,6,3
8,9,8
1,2,1
2,1,4

I need the output below

8,9,8
1,2,1
1,1,1
8,9,1

I tried the awk command below, but its output is not correct:

$ awk 'NR==FNR{a[$0]++;next} !a[$0]' file2 file1
2.1.4
1,1,1
8,9,1

$ cat vlookup.awk
FNR==NR{
a[$1]=$2
next
}
{ if ($1 in a) {print $1, a[$1]} else {print $1, "NA"} }

$ awk -f vlookup.awk file2 file1 | column -t
5,6,3
2.1.4 NA
1,1,1 NA
8,9,1 NA

I tried the while loop with grep below, but it takes a lot of time.

$ cat scp.sh
rm -f newfile.txt
# Collect every line of file2 that grep cannot find in file1
while IFS= read -r line
do
    if ! grep -qi -e "$line" file1 ; then
        echo "$line" >> newfile.txt
    fi
done < file2

$ ./scp.sh; cat newfile.txt
8,9,8
1,2,1

This is correct, but it takes a very long time for big files.

Please suggest a better, faster way.
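
The awk one-liner above reports only one direction: lines of file1 that are missing from file2. Below is a minimal sketch that covers both directions in a single pass. It rests on several assumptions: the first file fits in memory as an awk array (not a given for 800 MB inputs), lines are compared as exact strings, and 2.1.4 is meant to be 2,1,4. The function name diff_both is made up for illustration.

```shell
# Hypothetical helper: print lines found only in "$2", then lines found only in "$1".
diff_both() {
    awk '
        { gsub(/\./, ",") }              # assumption: treat 2.1.4 like 2,1,4
        NR == FNR {                      # first file: remember each line and its order
            if (!($0 in seen)) { seen[$0] = 0; order[++n] = $0 }
            next
        }
        $0 in seen { seen[$0] = 1; next }   # line exists in both files
        { print }                           # line exists only in the second file
        END {
            for (i = 1; i <= n; i++)        # lines of the first file never matched
                if (seen[order[i]] == 0) print order[i]
        }
    ' "$1" "$2"
}

tmp=$(mktemp -d)
printf '%s\n' 5,6,3 2.1.4 1,1,1 8,9,1 > "$tmp/file1"
printf '%s\n' 5,6,3 8,9,8 1,2,1 2,1,4 > "$tmp/file2"
result=$(diff_both "$tmp/file1" "$tmp/file2")
printf '%s\n' "$result"
rm -rf "$tmp"
```

On the sample data this prints 8,9,8 and 1,2,1 followed by 1,1,1 and 8,9,1, the same order as the requested output.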

RudiC · August 15, 2017, 12:30pm

How about

sort file[12] | tr '.' ',' | uniq -c | grep "^ *1"
      1 1,1,1
      1 1,2,1
      1 8,9,1
      1 8,9,8

EDIT: or even

sort file[12] | tr '.' ',' | uniq -u
1,1,1
1,2,1
8,9,1
8,9,8
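
Why this works: once both files are sorted together, a line present in both shows up as two adjacent copies, and uniq -u keeps only lines that occur exactly once. Below is a small self-contained sketch of the same idea, assuming no line is repeated within a single file. Here tr runs before sort so that uniq sees the data in the order sort produced, and LC_ALL=C is an optional assumption that selects plain byte ordering. The name sym_diff is hypothetical.

```shell
# Hypothetical helper: lines that occur in exactly one of the two files.
sym_diff() {
    cat "$1" "$2" | tr '.' ',' | LC_ALL=C sort | uniq -u
}

tmp=$(mktemp -d)
printf '%s\n' 5,6,3 2.1.4 1,1,1 8,9,1 > "$tmp/file1"
printf '%s\n' 5,6,3 8,9,8 1,2,1 2,1,4 > "$tmp/file2"
result=$(sym_diff "$tmp/file1" "$tmp/file2")
printf '%s\n' "$result"
rm -rf "$tmp"
```

If a line can repeat inside one file, it would be dropped even when the other file lacks it, so this is only safe for files without internal duplicates.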

shanul_karim · August 15, 2017, 12:55pm

Thanks RudiC

And how about getting the common lines out of these files?

RudiC · August 15, 2017, 1:12pm

How about man uniq ? Look for the -d option...
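
A sketch of what the -d hint could look like in practice: uniq -d prints one copy of each line that occurs more than once in sorted input. Each file is de-duplicated first with sort -u, on the assumption that a line repeated inside one file should not count as common. The name common_lines is made up for this example.

```shell
# Hypothetical helper: lines that occur in both files.
common_lines() {
    {
        # sort -u removes repeats inside each file before the two are merged
        tr '.' ',' < "$1" | LC_ALL=C sort -u
        tr '.' ',' < "$2" | LC_ALL=C sort -u
    } | LC_ALL=C sort | uniq -d
}

tmp=$(mktemp -d)
printf '%s\n' 5,6,3 2.1.4 1,1,1 8,9,1 > "$tmp/file1"
printf '%s\n' 5,6,3 8,9,8 1,2,1 2,1,4 > "$tmp/file2"
result=$(common_lines "$tmp/file1" "$tmp/file2")
printf '%s\n' "$result"
rm -rf "$tmp"
```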

shanul_karim · August 15, 2017, 1:14pm

Hi RudiC

The sort is good for listing the differences, but my requirement is to read the content of file1, check it against file2, and print it if it is not present.
That is exactly what the while loop with grep below does, but I need something faster, since the code below takes so much time.

$ cat scp.sh
rm -f newfile.txt
# Collect every line of file2 that grep cannot find in file1
while IFS= read -r line
do
    if ! grep -qi -e "$line" file1 ; then
        echo "$line" >> newfile.txt
    fi
done < file2

$ ./scp.sh; cat newfile.txt
8,9,8
1,2,1

RudiC · August 15, 2017, 1:23pm

That's NOT what you requested:

Try - given you have a recent bash for your shell which you failed to mention -

comm <(sort file1 | tr '.' ',') <(sort file2 | tr '.' ',')
1,1,1
	1,2,1
		2,1,4
		5,6,3
8,9,1
	8,9,8

shanul_karim · August 15, 2017, 1:35pm

Thanks RudiC for your valuable feedback and resolution.

Yes, true, I need what you shared in your earlier reply. The only issue is that in the output I cannot tell which file each differing entry belongs to, file1 or file2.

like

8,9,8 >> from file1
1,2,1 >> from file1
1,1,1 >> from file2
8,9,1 >> from file2

My files are very big, around 800 MB each, which is why I need this.

If possible, I would like two separate files: one listing the differences from file1 to file2 and the other listing the differences from file2 to file1.
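
One possible way to tag each differing line with its origin is to build on the comm output shown earlier. With -3 the common column is suppressed, and lines found only in the second file are indented with a single tab, which a short awk filter can turn into a label. This is a sketch under assumptions: no data line starts with a tab of its own, and the name label_diff and the ">> from" wording are illustrative only.

```shell
# Hypothetical helper: print each differing line followed by the file it came from.
label_diff() {
    work=$(mktemp -d) || return 1
    # comm needs sorted input; tr runs first so the sort order matches what comm reads
    tr '.' ',' < "$1" | LC_ALL=C sort > "$work/a"
    tr '.' ',' < "$2" | LC_ALL=C sort > "$work/b"
    # a leading tab marks a line that exists only in the second file
    LC_ALL=C comm -3 "$work/a" "$work/b" |
        awk '/^\t/ { print substr($0, 2) " >> from file2"; next }
                   { print $0 " >> from file1" }'
    rm -rf "$work"
}

tmp=$(mktemp -d)
printf '%s\n' 5,6,3 2.1.4 1,1,1 8,9,1 > "$tmp/file1"
printf '%s\n' 5,6,3 8,9,8 1,2,1 2,1,4 > "$tmp/file2"
result=$(label_diff "$tmp/file1" "$tmp/file2")
printf '%s\n' "$result"
rm -rf "$tmp"
```

With the sample data from post #1, 1,1,1 and 8,9,1 are labelled as coming from file1, and 1,2,1 and 8,9,8 as coming from file2.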

RudiC · August 15, 2017, 1:40pm

I'm afraid you mixed up the files (compared to post #1). Ploughing through huge files multiple times may be very time consuming... Did you consider reading man comm, in particular the -1 and -2 options?
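
A sketch of how those options might be combined to get two separate result files: comm -23 keeps only column 1 (lines unique to the first file) and comm -13 keeps only column 2 (lines unique to the second). Each input is sorted once and the sorted copies are reused, so the expensive step is not repeated. The function name split_diff and the output file names are assumptions made for this example.

```shell
# Hypothetical helper: write the one-sided differences into directory "$3".
split_diff() {
    tr '.' ',' < "$1" | LC_ALL=C sort > "$3/file1.sorted"
    tr '.' ',' < "$2" | LC_ALL=C sort > "$3/file2.sorted"
    # -23 suppresses columns 2 and 3: lines only in the first file remain
    LC_ALL=C comm -23 "$3/file1.sorted" "$3/file2.sorted" > "$3/only_in_file1"
    # -13 suppresses columns 1 and 3: lines only in the second file remain
    LC_ALL=C comm -13 "$3/file1.sorted" "$3/file2.sorted" > "$3/only_in_file2"
}

tmp=$(mktemp -d)
printf '%s\n' 5,6,3 2.1.4 1,1,1 8,9,1 > "$tmp/file1"
printf '%s\n' 5,6,3 8,9,8 1,2,1 2,1,4 > "$tmp/file2"
split_diff "$tmp/file1" "$tmp/file2" "$tmp"
only1=$(cat "$tmp/only_in_file1")
only2=$(cat "$tmp/only_in_file2")
printf 'only in file1:\n%s\nonly in file2:\n%s\n' "$only1" "$only2"
rm -rf "$tmp"
```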

shanul_karim · August 15, 2017, 1:58pm

Hi RudiC

I checked this option as well.

It would be really helpful if you could change the sort command below to get the differences in file2 as compared to file1.

sort file[12] | tr '.' ',' | uniq -u
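
One common variation on this pipeline gives a one-sided result: feed the first file in twice. Every line of the first file then occurs at least twice, so uniq -u can only print lines that exist solely in the second file. This is a sketch that assumes no line is repeated within the second file. It also sorts the first file's data twice over, which costs extra time on 800 MB inputs. The name only_in_second is made up.

```shell
# Hypothetical helper: lines that occur in "$2" but not in "$1".
only_in_second() {
    # listing "$1" twice guarantees none of its lines can be unique
    cat "$1" "$1" "$2" | tr '.' ',' | LC_ALL=C sort | uniq -u
}

tmp=$(mktemp -d)
printf '%s\n' 5,6,3 2.1.4 1,1,1 8,9,1 > "$tmp/file1"
printf '%s\n' 5,6,3 8,9,8 1,2,1 2,1,4 > "$tmp/file2"
in2=$(only_in_second "$tmp/file1" "$tmp/file2")
in1=$(only_in_second "$tmp/file2" "$tmp/file1")   # swap the arguments for the other direction
printf 'only in file2:\n%s\nonly in file1:\n%s\n' "$in2" "$in1"
rm -rf "$tmp"
```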