Print the 1st column and the value in 2nd or 3rd column if that is different from the values in 1st

I have a file that looks like this:

DIP-17571N|refseq:NP_651151   DIP-17460N|refseq:NP_511165|uniprotkb:P45890      DIP-17571N|refseq:NP_651151
DIP-19241N|refseq:NP_524261    DIP-19241N|refseq:NP_524261       DIP-17151N|refseq:NP_524316|uniprotkb:O16797
DIP-19588N|refseq:NP_731165     DIP-19588N|refseq:NP_731165       DIP-19589N|refseq:NP_647684
DIP-20632N|refseq:NP_476602     DIP-492N|refseq:NP_477499|uniprotkb:P23647        DIP-20632N|refseq:NP_476602
DIP-23436N|refseq:NP_536784     DIP-23436N|refseq:NP_536784       DIP-23130N|refseq:NP_652017
DIP-18269N|refseq:NP_523724     DIP-20786N|refseq:NP_649297       DIP-18269N|refseq:NP_523724
DIP-20861N|refseq:NP_647634    DIP-20861N|refseq:NP_647634       DIP-19344N|refseq:NP_572751
DIP-23837N|refseq:NP_573057   DIP-23837N|refseq:NP_573057       DIP-5N|refseq:NP_476859|uniprotkb:P07207
DIP-59926N|refseq:NP_228099     DIP-59926N|refseq:NP_228099       DIP-59927N|refseq:NP_228100
DIP-23655N|refseq:NP_648922    DIP-17971N|refseq:NP_648929       DIP-23655N|refseq:NP_648922
DIP-22713N|refseq:NP_524108    DIP-21138N|refseq:NP_722721       DIP-22713N|refseq:NP_524108
DIP-21320N|refseq:NP_730973     DIP-17533N|refseq:NP_611700       DIP-21320N|refseq:NP_730973
DIP-22051N|refseq:NP_573109     DIP-28047N        DIP-22051N|refseq:NP_573109

I want to print the 1st column and, next to it, the value from the 2nd or 3rd column, whichever differs from the 1st column.

This is how I want the output to look:

DIP-17571N|refseq:NP_651151   DIP-17460N|refseq:NP_511165|uniprotkb:P45890
DIP-19241N|refseq:NP_524261   DIP-17151N|refseq:NP_524316|uniprotkb:O16797
DIP-19588N|refseq:NP_731165    DIP-19589N|refseq:NP_647684
DIP-20632N|refseq:NP_476602    DIP-492N|refseq:NP_477499|uniprotkb:P23647 
DIP-23436N|refseq:NP_536784    DIP-23130N|refseq:NP_652017
DIP-18269N|refseq:NP_523724    DIP-20786N|refseq:NP_649297   

and so on...

Any help would be highly appreciated.

Hello Syeda,

Could you please try the following and let me know if it helps.

awk '{A=$1}($1 != $2){A=$1 OFS $2} ($1 != $3){A=A?A OFS $3:$3} {print A;A=""}'  Input_file
 

Output will be as follows.

DIP-17571N|refseq:NP_651151 DIP-17460N|refseq:NP_511165|uniprotkb:P45890
DIP-19241N|refseq:NP_524261 DIP-17151N|refseq:NP_524316|uniprotkb:O16797
DIP-19588N|refseq:NP_731165 DIP-19589N|refseq:NP_647684
DIP-20632N|refseq:NP_476602 DIP-492N|refseq:NP_477499|uniprotkb:P23647
DIP-23436N|refseq:NP_536784 DIP-23130N|refseq:NP_652017
DIP-18269N|refseq:NP_523724 DIP-20786N|refseq:NP_649297
DIP-20861N|refseq:NP_647634 DIP-19344N|refseq:NP_572751
DIP-23837N|refseq:NP_573057 DIP-5N|refseq:NP_476859|uniprotkb:P07207
DIP-59926N|refseq:NP_228099 DIP-59927N|refseq:NP_228100
DIP-23655N|refseq:NP_648922 DIP-17971N|refseq:NP_648929
DIP-22713N|refseq:NP_524108 DIP-21138N|refseq:NP_722721
DIP-21320N|refseq:NP_730973 DIP-17533N|refseq:NP_611700
DIP-22051N|refseq:NP_573109 DIP-28047N
 

Note that if both columns $2 and $3 differ from $1, the complete line will be printed (this case does not occur in the provided Input_file).
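
To make that last case easier to see, here is a rough restatement of the same logic with explicit if-statements. It is only a sketch: the function name first_and_diff and the two sample lines are made up for illustration and are not from the original file.

```shell
# Sketch: same idea as the one-liner, written out with if-statements.
# first_and_diff is a hypothetical helper name.
first_and_diff() {
  awk '{
    out = $1                            # always keep column 1
    if ($2 != $1) out = out OFS $2      # append column 2 when it differs
    if ($3 != $1) out = out OFS $3      # append column 3 when it differs
    print out
  }' "$@"
}

# Made-up input: in the second line both columns differ from column 1.
printf '%s\n' 'A|1 A|1 B|2' 'A|1 B|2 C|3' | first_and_diff
```

This should print A|1 B|2 for the first line and the whole line A|1 B|2 C|3 for the second. The function can also be called with a file name, e.g. first_and_diff Input_file.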

Thanks,
R. Singh


Unfortunately, the | character has a special meaning in regexes, so it has to be escaped first; otherwise this would be much simpler:

 awk '{T="  *" $1; gsub (/\|/, "\\\|", T); sub (T, " ")} 1' file
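
If counting backslashes feels fragile, one possible variant is to wrap each | in a bracket expression, [|], which is literal in a regex. The sketch below assumes that approach; the name strip_repeat and the sample lines are invented for illustration, and unlike the command above it deletes the match instead of replacing it with a space. Other regex metacharacters in column 1 (a dot, for example) would still need the same treatment.

```shell
# Sketch: remove the repeated copy of column 1 using a dynamic regex.
# strip_repeat is a hypothetical helper name.
strip_repeat() {
  awk '{
    T = " +" $1               # one or more blanks, then the text of column 1
    gsub(/[|]/, "[|]", T)     # make every | in the pattern literal
    sub(T, "")                # delete the first later copy of column 1
  } 1' "$@"
}

# Made-up input lines, single-space separated.
printf '%s\n' 'A|1 B|2 A|1' 'A|1 A|1 C|3' | strip_repeat
```

This should print A|1 B|2 and A|1 C|3. Because the pattern requires a leading blank, the copy of column 1 at the very start of the line is never the one that gets removed.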