Compare two files and extract info

nans · February 18, 2015, 7:08am

Hello,
I have two files which look like this
File 1

Name    test1    status    P
Gene1    0.00236753    1    1.00E-01
Gene2    0.134187    2    2.00E-01
Gene3    0.000608716    2    3.00E-01
Gene4    0.0016234    1    4.00E-01
Gene5    0.000665868    2    5.00E-01

and file 2

No    Pos    rsid    a1    a2    geneid    categ    wgt    P
1    100    SNP1    a1    a2    Gene1    HIGH    -0.67249    6.91E-01
2    200    SNP2    a1    a2    Gene1    HIGH    -0.719    8.49E-01
3    300    SNP3    a1    a2    Gene1    MEDIUM    2.09    1.70E-01
4    400    SNP4    a1    a2    Gene1    HIGH    -0.122172    6.91E-01
5    500    SNP5    a1    a2    Gene1    HIGH    -0.906466    8.49E-01
6    600    SNP6    a1    a2    Gene1    HIGH    -0.02618    9.88E-01
7    700    SNP7    a1    a2    Gene2    HIGH    -0.999206    6.34E-01
8    800    SNP8    a1    a2    Gene2    HIGH    -0.998448    8.67E-01
9    900    SNP9    a1    a2    Gene3    HIGH    -0.059699    2.94E-01
10    1000    SNP10    a1    a2    Gene4    MEDIUM    2.19    4.79E-01
11    2000    SNP11    a1    a2    Gene4    VERY HIGH    2.3    7.19E-02
12    3000    SNP12    a1    a2    Gene4    HIGH    -0.992672    1.55E-01
13    4000    SNP13    a1    a2    Gene4    HIGH    -0.791565    3.50E-01
14    5000    SNP14    a1    a2    Gene5    LOW    0.860334608    6.67E-02
15    6000    SNP15    a1    a2    Gene5    LOW    0.805402062    2.09E-02
16    7000    SNP16    a1    a2    Gene5    VERY HIGH    0.430167304    6.67E-02
17    8000    SNP17    a1    a2    Gene5    VERY HIGH    0.727742605    7.53E-01
18    9000    SNP18    a1    a2    Gene5    HIGH    -0.999286    5.41E-01

I would like to count the "SNPs" under column "rsid" from file 2 for each corresponding "Name" in file 1 and would like to output the lowest value "P" with the corresponding categ and rs ID from file 2. So from the example above, I require an output that looks like this

Name    test1    status    P    no of SNPs    Top rs ID    Top categ    Top P
Gene1    0.00236753    1    1.00E-01    6    SNP3    MEDIUM    1.70E-01
Gene2    0.134187      2    2.00E-01    2    SNP7    HIGH    6.34E-01
Gene3    0.000608716   2    3.00E-01    1    SNP9    HIGH    2.94E-01
Gene4    0.0016234     1    4.00E-01    4    SNP11  VERY HIGH    7.19E-02
Gene5    0.000665868   2    5.00E-01    5    SNP15   LOW    2.09E-02

Is it possible to do this with shell script ? Any help would be appreciated.

Many thanks

RudiC · February 18, 2015, 8:01am

Any attempts from your side?

---------- Post updated at 13:52 ---------- Previous update was at 13:30 ----------

Anyhow, try

awk     '!MIN[$6]       {MIN[$6] = 1E100}
         FNR==NR        {CNT[$6]++
                         if ($9 < MIN[$6]) {MIN[$6]=$9; F3[$6]=$3; F7[$6]=$7}
                         next
                        }
         FNR==1         {print $0, " no of SNPs    Top rs ID    Top categ    Top P"
                         next
                        }
                        {$1=$1}
                        {print $0, CNT[$1], F3[$1], F7[$1], MIN[$1]}
        ' file2 OFS="\t" file1
Name    test1    status    P     no of SNPs    Top rs ID    Top categ    Top P
Gene1   0.00236753      1       1.00E-01        6       SNP3    MEDIUM  1.70E-01
Gene2   0.134187        2       2.00E-01        2       SNP7    HIGH    6.34E-01
Gene3   0.000608716     2       3.00E-01        1       SNP9    HIGH    2.94E-01
Gene4   0.0016234       1       4.00E-01        4       SNP12   HIGH    1.55E-01
Gene5   0.000665868     2       5.00E-01        5       SNP15   LOW     2.09E-02

---------- Post updated at 14:01 ---------- Previous update was at 13:52 ----------

This is running into problems as the space in "VERY HIGH" shifts the field count... so the field separator needs to be <TAB>, and both files should comply...replace
' file2 OFS="\t" file1
with
' FS="\t" OFS="\t" file2 file1 .

nans · February 18, 2015, 8:09am

Thank you. Well, all this while I was doing it with grep and a bit of manual work....