Comapring columns in 2 files and printing the values that differ.

I have a file (file-1) that looks like this,

DIP-10097N|refseq:NP_416170|uniprotkb:P30015
DIP-10117N|refseq:NP_414973|uniprotkb:P08177
DIP-10168N|refseq:NP_418766|uniprotkb:P15005
DIP-10199N|refseq:NP_415632|uniprotkb:P30958
DIP-10358N|refseq:NP_418659|uniprotkb:P28903
DIP-10440N|refseq:NP_289596|uniprotkb:P20082
DIP-10441N|refseq:NP_417502|uniprotkb:P20083
DIP-10441N|refseq:NP_417502|uniprotkb:P20083
DIP-10467N|refseq:NP_415423|uniprotkb:P09373
DIP-10469N|refseq:NP_418386|uniprotkb:P32674
DIP-10562N|refseq:NP_418370|uniprotkb:P17888
DIP-10582N|refseq:NP_414864|uniprotkb:P77743
DIP-10592N|refseq:NP_415819|uniprotkb:P37344

and another (file-2) which looks like this,

DIP-10331N|refseq:NP_311078|uniprotkb:P12638    DIP-10117N|refseq:NP_414973|uniprotkb:P08177
DIP-10331N|refseq:NP_311078|uniprotkb:P12638    DIP-10840N|refseq:NP_414640|uniprotkb:P10408
DIP-1025N|refseq:NP_414574|uniprotkb:P00968     DIP-10097N|refseq:NP_416170|uniprotkb:P30015
DIP-10467N|refseq:NP_415423|uniprotkb:P09373    DIP-10097N|refseq:NP_416170|uniprotkb:P30015
DIP-10117N|refseq:NP_414973|uniprotkb:P08177    DIP-10117N|refseq:NP_414973|uniprotkb:P08177
DIP-10117N|refseq:NP_414973|uniprotkb:P08177    DIP-10750N|refseq:NP_289799|uniprotkb:P02410
DIP-10117N|refseq:NP_414973|uniprotkb:P08177    DIP-10757N|refseq:NP_288150|uniprotkb:P02421

In output I want to print the contents of file-1 plus the value in either column of file-2 that has the same value as that of file-1 in the other column. Like this,

DIP-10097N|refseq:NP_416170|uniprotkb:P30015 DIP-1025N|refseq:NP_414574|uniprotkb:P00968
DIP-10097N|refseq:NP_416170|uniprotkb:P30015 DIP-10467N|refseq:NP_415423|uniprotkb:P09373
DIP-10117N|refseq:NP_414973|uniprotkb:P08177 DIP-10117N|refseq:NP_414973|uniprotkb:P08177
DIP-10117N|refseq:NP_414973|uniprotkb:P08177 DIP-10750N|refseq:NP_289799|uniprotkb:P02410
DIP-10117N|refseq:NP_414973|uniprotkb:P08177    DIP-10757N|refseq:NP_288150|uniprotkb:P02421
DIP-10117N|refseq:NP_414973|uniprotkb:P08177 DIP-10331N|refseq:NP_311078|uniprotkb:P12638
DIP-10467N|refseq:NP_415423|uniprotkb:P09373 DIP-10097N|refseq:NP_416170|uniprotkb:P30015

Any help would be highly appreciated.

Hello Syeda,

Not clear to me like do you want to compare both the fields of the files, could you please try following and let me know if this helps you.

awk 'FNR==NR{A[$1];next} ($1 in A){print $0} ($2 in A){print $0}' file1 file2
 

Output will be as follows.

DIP-10331N|refseq:NP_311078|uniprotkb:P12638    DIP-10117N|refseq:NP_414973|uniprotkb:P08177
DIP-1025N|refseq:NP_414574|uniprotkb:P00968     DIP-10097N|refseq:NP_416170|uniprotkb:P30015
DIP-10467N|refseq:NP_415423|uniprotkb:P09373    DIP-10097N|refseq:NP_416170|uniprotkb:P30015
DIP-10467N|refseq:NP_415423|uniprotkb:P09373    DIP-10097N|refseq:NP_416170|uniprotkb:P30015
DIP-10117N|refseq:NP_414973|uniprotkb:P08177    DIP-10117N|refseq:NP_414973|uniprotkb:P08177
DIP-10117N|refseq:NP_414973|uniprotkb:P08177    DIP-10117N|refseq:NP_414973|uniprotkb:P08177
DIP-10117N|refseq:NP_414973|uniprotkb:P08177    DIP-10750N|refseq:NP_289799|uniprotkb:P02410
DIP-10117N|refseq:NP_414973|uniprotkb:P08177    DIP-10757N|refseq:NP_288150|uniprotkb:P02421
 

Here I am comparing $1 of file1 to both the fields of file2.
Let us know how It goes for you.

Thanks,
R. Singh

No, not really, this isn't exactly what I wanted.
There was a little mistake in the output I had shown, sorry! I've corrected it now.

I do want to compare $1 of file-1 with both columns of file-2, but in the output I want to print $1 first and then the value (same or different) that is next to it in file-2.

Try

awk '
FNR==NR         {C[$1]++
                 R[$1,C[$1]]=$2
                 C[$2]++
                 R[$2,C[$2]]=$1
                 next
                }
$1 in C         {for (i=1; i<=C[$1]; i++) print $1, R[$1,i]
                }
' file2 file1      
DIP-10097N|refseq:NP_416170|uniprotkb:P30015 DIP-1025N|refseq:NP_414574|uniprotkb:P00968
DIP-10097N|refseq:NP_416170|uniprotkb:P30015 DIP-10467N|refseq:NP_415423|uniprotkb:P09373
DIP-10117N|refseq:NP_414973|uniprotkb:P08177 DIP-10331N|refseq:NP_311078|uniprotkb:P12638
DIP-10117N|refseq:NP_414973|uniprotkb:P08177 DIP-10117N|refseq:NP_414973|uniprotkb:P08177
DIP-10117N|refseq:NP_414973|uniprotkb:P08177 DIP-10117N|refseq:NP_414973|uniprotkb:P08177
DIP-10117N|refseq:NP_414973|uniprotkb:P08177 DIP-10750N|refseq:NP_289799|uniprotkb:P02410
DIP-10117N|refseq:NP_414973|uniprotkb:P08177 DIP-10757N|refseq:NP_288150|uniprotkb:P02421
DIP-10467N|refseq:NP_415423|uniprotkb:P09373 DIP-10097N|refseq:NP_416170|uniprotkb:P30015
1 Like

Or Ravinder's solution slightly changed

awk 'FNR==NR{A[$1];next} ($1 in A){print $1,$2} ($2 in A){print $2,$1}' file1 file2

The order of the output is the one of file2.

1 Like