Lookup name from another file

gina.lizar · January 28, 2014, 1:23pm

Hi All,

I want to look up the name for each id in column 2 of Input 1 in another file (Input 2) and append that name to each line.

Input 1

 comp100001_c0_seq1      At1g31340       30.40   569     384     11      3       1673    313     834     7e-62    237
 comp100003_c0_seq1      At1g35370_2     35.00   80      50      2       597     364     678     753     1e-09   42.7

Input 2

       [R] KOG0017 FOG: Transposon-encoded proteins with TYA, reverse transcriptase, integrase domains in various combinations
         ath:  At1g10260
         ath:  At1g11265
         ath:  At1g35050
         ath:  At1g35370_2
           ath:  At1g35647
         
         [OR] KOG0001 Ubiquitin and ubiquitin-like proteins
         ath:  At1g31340
         ath:  At1g53930
         ath:  At1g53950
         ath:  At1g53980
             ath:  At1g64470

Expected output

comp100001_c0_seq1      At1g31340       30.40   569     384     11      3       1673    313     834     7e-62    237    [OR] KOG0001 Ubiquitin and ubiquitin-like proteins
 comp100003_c0_seq1      At1g35370_2     35.00   80      50      2       597     364     678     753     1e-09   42.7    [R] KOG0017 FOG: Transposon-encoded proteins with TYA, reverse transcriptase, integrase domains in various combinations

RavinderSingh13 · January 28, 2014, 1:34pm

Hello,

The following may help.

awk 'NR==FNR{a[$2];next} ($2 in a) {print $0}' file2  file1

Output will be as follows.

comp100001_c0_seq1      At1g31340       30.40   569     384     11      3       1673    313     834     7e-62    237
 comp100003_c0_seq1      At1g35370_2     35.00   80      50      2       597     364     678     753     1e-09   42.7

EDIT: What is the logic for getting the last column in your expected output? Sorry, I only just noticed the last column.

Thanks,
R. Singh

gina.lizar · January 28, 2014, 1:40pm

The last column is the name corresponding to col2 of Input 1. In Input 2 it is the nearest header line above the id, that is, the line that starts with a letter code in square brackets followed by a description.

So for At1g31340 it is [OR] KOG0001 Ubiquitin and ubiquitin-like proteins, and for At1g35370_2 it is [R] KOG0017 FOG: Transposon-encoded proteins with TYA, reverse transcriptase, integrase domains in various combinations.

Please note that the names for some ids may not be found. Those lines should be left as they are, that is, no name is added as the last column.

Yoda · January 28, 2014, 2:38pm

Here is an awk program based on some assumptions:

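awk '
        # First file (input2): remember the latest header, map each id to it
        NR == FNR {
                if ( $0 ~ /\[[A-Z]*\]/ )
                        D = $0
                else
                        A[$NF] = D
                next
        }
        # Second file (input1): append the description when column 2 is a known id
        $2 in A {
                $0 = $0 FS A[$2]
        }
        # Print every record, changed or not
        1
' input2 input1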

RavinderSingh13 · January 28, 2014, 2:44pm

Thanks a lot, Yoda, you are too good, boss. Could you please explain the code?

Thanks,
R. Singh

gina.lizar · January 28, 2014, 2:44pm

Thank you, this looks perfect.

Yoda · January 28, 2014, 2:49pm

The code reads input2 first and checks each line against the pattern /\[[A-Z]*\]/, which matches the header / description lines described by the OP. The most recent header is stored in the variable D.

For non-header records, the current value of D is assigned to the associative array A, indexed by the last field (the id).
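
To see what that first pass collects, here is a rough sketch that prints the id-to-description table on its own. The function name build_map and the variable desc are made up for illustration. Unlike the program above, it trims the header's indentation and skips blank lines, which is an assumption about what is wanted.

```shell
# Hypothetical helper: print "id<TAB>description" for every id in a KOG-style file.
build_map() {
    awk '
        /\[[A-Z]+\]/ {                          # header line, e.g. "[R] KOG0017 ..."
            gsub(/^[ \t]+|[ \t]+$/, "")         # trim leading and trailing blanks
            desc = $0                           # remember the current header
            next
        }
        NF {                                    # non-empty member line, e.g. "ath:  At1g10260"
            print $NF "\t" desc                 # last field is the id
        }
    ' "$1"
}
```

Calling build_map input2 should list each id next to the header it sits under, in file order.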

The code then reads input1 and, if the id in column 2 is a key in the array, appends the description before printing; otherwise it prints the record as it is.
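
If the output should stay tab-delimited, a variant along these lines might be worth trying. It is only a sketch: the wrapper name add_names and the array name are hypothetical, and it assumes input1 is tab-separated, which the sample does not make certain. It joins the description with a tab instead of FS, trims the header's indentation, and ignores blank lines in input2.

```shell
# Hypothetical wrapper: annotate FILE1 (ids in column 2) with headers from FILE2.
# Usage: add_names FILE1 FILE2
add_names() {
    awk '
        NR == FNR {                                  # first file read: the KOG-style listing
            if ($0 ~ /\[[A-Z]+\]/) {
                gsub(/^[ \t]+|[ \t]+$/, "")          # trim blanks around the header
                desc = $0
            } else if (NF) {                         # skip blank lines
                name[$NF] = desc                     # id -> current header
            }
            next
        }
        {                                            # second file read: the tabular hits
            if ($2 in name)
                print $0 "\t" name[$2]               # assumption: tab-separated output
            else
                print                                # unknown id: line left untouched
        }
    ' "$2" "$1"
}
```

For the files in this thread the call would be add_names input1 input2. Lines whose id has no header come through unchanged, as requested.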