compare 2 files and return unique lines in each file (based on condition)

anurupa777 · August 25, 2012, 8:46am

Hi,
my problem is a little complicated. I have 2 files that look like this:

file1
abbsss:aa:22:34:as akl abc 1234
mkilll:as:ss:23:qs asc abc 0987
mlopii:cd:wq:24:as asd abc  7866

file2
lkoaa:as:24:32:sa alk abc 3245
lkmo:as:34:43:qs qsa abc 0987
kloia:ds:45:56:sa acq abc 7805

I have to find the unique lines on the basis of the 4th field (a numeric field), which always comes after abc in both files. A line is unique if its value does not match the value of any line in the other file (the lines may be in a different order) within a +/- 100 range. For example,

mlopii:cd:wq:24:as asd abc  7866
kloia:ds:45:56:sa acq abc 7805

are not considered unique because their values fall within the +/- 100 range of each other (7866 - 7805 = 61). So my output should be as follows when checking for unique lines in file1:

abbsss:aa:22:34:as akl abc 1234

and when checking for unique lines in file2:

lkoaa:as:24:32:sa alk abc 3245

I hope that is clear.

RudiC · August 25, 2012, 10:19am

Are the input files sorted on col 4, or can they be?

msabhi · August 25, 2012, 12:48pm

 awk 'NR==FNR{a[FNR]=$0;} NR!=FNR{b[FNR]=$0;} END{for(x in a) { split(a[x],c_a," ");split(b[x],c_b," "); if(c_b[4]!= c_a[4] && (c_b[4]-c_a[4]>=100 || c_b[4]-c_a[4] <= -100)) {printf("%s\n%s\n",a[x],b[x]);}}}' file1 file2

Forget the above. I got the requirement wrong, I guess.

Don_Cragun · August 25, 2012, 12:51pm

Try this:

awk 'FNR == NR { # Accumulate records from 1st file.
        f1[++n1] = $0
        low1[n1] = $4 - 100
        mid1[n1] = $4
        high1[n1] = $4 + 100
        next
}
        { # Accumulate ranges from 2nd file
        low2[++n2] = $4 - 100
        high2[n2] = $4 + 100
        # Look for a line in 1st file whose range contains $4 of this line
        for(i = 1; i <= n1; i++)
                if(($4 > low1[i]) && ($4 < high1[i]))
                        next # match found
        # This line is unique.
        print $0 > "UniqueIn2ndFile"
}
END     { # Look for lines in 1st file that are unique versus 2nd file
        for(j = 1; j <= n1; j++) {
                for(i = 1; i <= n2; i++)
                        if((mid1[j] > low2[i]) && (mid1[j] < high2[i]))
                                break # match found
                # The inner loop ran to completion, so no match was found.
                if(i > n2) print f1[j] > "UniqueIn1stFile"
        }
}' file1 file2

pravin27 · August 26, 2012, 1:09am

Could this help you?

awk 'NR==FNR{a[NR]=$0;b[NR]=$4;next}
{if($4-b[FNR] > 100 || $4-b[FNR] < -100){ print a[FNR]; print $0}}' file1 file2

anurupa777 · August 26, 2012, 3:21am

Thank you very much for all the posts. The files are not ordered, and I hope the code provided will work even then.
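
Because the files are not sorted, each line has to be checked against every line of the other file rather than against the line at the same position. A minimal sketch of that idea follows, run against the sample data from the first post. The function name `unique_lines`, the `range` variable, the temporary file paths, and the choice to treat a difference of exactly 100 as a match are assumptions for illustration, not something stated in the thread.

```shell
#!/bin/sh
# Sample data from the first post, written to a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
abbsss:aa:22:34:as akl abc 1234
mkilll:as:ss:23:qs asc abc 0987
mlopii:cd:wq:24:as asd abc  7866
EOF
cat > "$tmp/file2" <<'EOF'
lkoaa:as:24:32:sa alk abc 3245
lkmo:as:34:43:qs qsa abc 0987
kloia:ds:45:56:sa acq abc 7805
EOF

# unique_lines A B (hypothetical helper): print the lines of A whose 4th
# field is more than `range` away from the 4th field of every line in B.
unique_lines() {
    awk -v range=100 '
    NF < 4 { next }                 # skip blank or short lines
    FILENAME == ARGV[1] {           # first operand: the file to compare against
        ref[++n] = $4 + 0
        next
    }
    {
        for (i = 1; i <= n; i++) {
            d = $4 - ref[i]
            if (d < 0) d = -d       # absolute difference
            if (d <= range) next    # a partner exists, so this line is not unique
        }
        print                       # no partner found in the other file
    }' "$2" "$1"
}

unique_lines "$tmp/file1" "$tmp/file2"   # prints: abbsss:aa:22:34:as akl abc 1234
unique_lines "$tmp/file2" "$tmp/file1"   # prints: lkoaa:as:24:32:sa alk abc 3245
```

Calling the function twice with the arguments swapped gives the unique lines for each file. Every input line is compared against all stored values from the other file, so the order of the lines does not matter; for very large files, sorting on the 4th field first would be worth considering.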