Hello,
I have two files which look like this
File 1
Name test1 status P
Gene1 0.00236753 1 1.00E-01
Gene2 0.134187 2 2.00E-01
Gene3 0.000608716 2 3.00E-01
Gene4 0.0016234 1 4.00E-01
Gene5 0.000665868 2 5.00E-01
and file 2
No Pos rsid a1 a2 geneid categ wgt P
1 100 SNP1 a1 a2 Gene1 HIGH -0.67249 6.91E-01
2 200 SNP2 a1 a2 Gene1 HIGH -0.719 8.49E-01
3 300 SNP3 a1 a2 Gene1 MEDIUM 2.09 1.70E-01
4 400 SNP4 a1 a2 Gene1 HIGH -0.122172 6.91E-01
5 500 SNP5 a1 a2 Gene1 HIGH -0.906466 8.49E-01
6 600 SNP6 a1 a2 Gene1 HIGH -0.02618 9.88E-01
7 700 SNP7 a1 a2 Gene2 HIGH -0.999206 6.34E-01
8 800 SNP8 a1 a2 Gene2 HIGH -0.998448 8.67E-01
9 900 SNP9 a1 a2 Gene3 HIGH -0.059699 2.94E-01
10 1000 SNP10 a1 a2 Gene4 MEDIUM 2.19 4.79E-01
11 2000 SNP11 a1 a2 Gene4 VERY HIGH 2.3 7.19E-02
12 3000 SNP12 a1 a2 Gene4 HIGH -0.992672 1.55E-01
13 4000 SNP13 a1 a2 Gene4 HIGH -0.791565 3.50E-01
14 5000 SNP14 a1 a2 Gene5 LOW 0.860334608 6.67E-02
15 6000 SNP15 a1 a2 Gene5 LOW 0.805402062 2.09E-02
16 7000 SNP16 a1 a2 Gene5 VERY HIGH 0.430167304 6.67E-02
17 8000 SNP17 a1 a2 Gene5 VERY HIGH 0.727742605 7.53E-01
18 9000 SNP18 a1 a2 Gene5 HIGH -0.999286 5.41E-01
I would like to count the "SNPs" under column "rsid" from file 2 for each corresponding "Name" in file 1 and would like to output the lowest value "P" with the corresponding categ and rs ID from file 2. So from the example above, I require an output that looks like this
Name test1 status P no of SNPs Top rs ID Top categ Top P
Gene1 0.00236753 1 1.00E-01 6 SNP3 MEDIUM 1.70E-01
Gene2 0.134187 2 2.00E-01 2 SNP7 HIGH 6.34E-01
Gene3 0.000608716 2 3.00E-01 1 SNP9 HIGH 2.94E-01
Gene4 0.0016234 1 4.00E-01 4 SNP11 VERY HIGH 7.19E-02
Gene5 0.000665868 2 5.00E-01 5 SNP15 LOW 2.09E-02
Is it possible to do this with shell script ? Any help would be appreciated.
Many thanks