Plz Help. Compare 2 files field by field and get the output in another file.

i150371485 · July 25, 2012, 2:21am

Hi Freinds,
I have 2 files . one is source.txt and second one is target.txt. I want to keep source.txt as baseline and compare target.txt. please find the data in 2 files and Expected output.

Source.txt

1|HYD|NAG|TRA|34.5|1234
2|CHE|ESW|DES|36.5|134
3|BAN|MEH|TRA|33.5|234
4|PUN|ABHI|TA|38.5|123
5|KIN|NAV|PRA|31.5|135

target.txt

1|HYD|NAG|TRA|34.5|1234
2|CHE|EW|DES|33.5|134
5|KIN|NV|PRA|31.5|136

Expected output should be :

Number of Extra records in Source file : 2

3|BAN|MEH|TRA|33.5|234
4|PUN|ABHI|TA|38.5|123

Number of mismatches:

KEYFIELD|COLUMN_NUMBER|SOURCE_VALUE|TARGET_VALUE
2 |3 |ESW |EW
2 |5 |36.5 |33.5
5 |3 |NAV |NV

Please help. I am Very new to Unix shell scripting .

PikK45 · July 25, 2012, 2:29am

I guess this is already discussed!! I saw almost the same examples too!!

i150371485 · July 25, 2012, 3:26am

Hi Pikka45 , yes we have done this with the follwing code

paste -d '|' File1.txt File2.txt | awk -F '|' '{c=NF/2;for(i=1;i<=c;i++)if($i!=$(i+c))printf "line %-5s field %s\n",NR,i}'

The above code is used only for the below condition :

1)when we have equal number of records where all the keyfields (let assume first field) present in both the source.txt and target.txt

2) When we have source.txt.count < target.txt.count and all the keyfields (let assume 1st filed) are present in both the source.txt and target.txt.

3) it is not working when source.txt.count > target.txt.count and if there are mismatches .

I hope you understand the above cases. Kindly help if there is any chance to cover the 3rd senario.

---------- Post updated at 12:56 PM ---------- Previous update was at 12:10 PM ----------

Hi Friends, Can anyone look into the Thread .. Plz help..

Chirel · July 25, 2012, 4:20am

Hi, check this

#!/bin/bash

>extra.txt
>mismatch.txt
while read sLine; do
    OFS="$IFS"
    IFS="|"
    sTab=( $sLine );
    tLine="$(egrep "^"${sTab[0]} target.txt)"
    if [ -z "$tLine" ]; then echo "$sLine" >>extra.txt; IFS="$OFS"; continue; fi
    tTab=( $tLine );
    for (( i = 1 ; i < ${#sTab[@]} ; i++ )); do
        [ "${sTab[$i]}" = "${tTab[$i]}" ] || echo "${sTab[0]}|$i|${sTab[$i]}|${tTab[$i]}" >>mismatch.txt
    done
    IFS="$OFS"
done <source.txt

echo "Number of Extra records in Source file : $(cat extra.txt|wc -l)"
cat extra.txt

echo "Number of mismatches : $(cat mismatch.txt|wc -l)"
cat mismatch.txt

i150371485 · July 25, 2012, 8:09am

@Chirel : Thank you so much it worked as expected. I have one doubt. if i use for some 50,000 records with 50 columns it is taking more time. is there any way we can reduce the timing and increase the performance ? Plz help ..

summer_cherry · July 26, 2012, 12:32am

open my $fh,"<a.txt";
while(<$fh>){
	my @tmp = split("[|]",$_);
	my @t = @tmp[1..$#tmp];
	$hash{$tmp[0]} = \@t;
}
close $fh;
while(<DATA>){
	my @tmp = split("[|]",$_);
	if(not exists $hash{$tmp[0]}){
		print;
	}
	else{
		my @t = @{$hash{$tmp[0]}};
		my @diff;
		for(my $i=0;$i<=$#t;$i++){
			if($t[$i] ne $tmp[$i+1]){
				push @diff, ($i+2,$t[$i],$tmp[$i+1]);
			}
		}
		print join "|", ($tmp[0],@diff) if $#diff>=0;
		print "\n";
	}
}
__DATA__
1|HYD|NAG|TRA|34.5|1234
2|CHE|ESW|DES|36.5|134
3|BAN|MEH|TRA|33.5|234
4|PUN|ABHI|TA|38.5|123
5|KIN|NAV|PRA|31.5|135

a.txt

1|HYD|NAG|TRA|34.5|1234
2|CHE|EW|DES|33.5|134
5|KIN|NV|PRA|31.5|136