print specific strings only

genehunter · September 29, 2009, 11:17pm

Hello,
I have a file like this:

2    168611167    STK39    STK39    ---    27347    "serine threonine kinase 39 (STE20/SPS1 homolog, yeast)"    YES    SNP_A-2086192    rs16854601    0.001558882
6    13670256    SIRT5 /// RPS4X    SIRT5    ---    23408 /// 6191    "sirtuin (silent mating type information regulation 2 homolog) 5 (S. cerevisiae) /// ribosomal protein S4, X-linked"    YES    SNP_A-8405097    rs16874223    0.00156082
2    105439878    NCK2 /// FHL2    FHL2 /// NCK2    ---    8440 /// 2274    NCK adaptor protein 2 /// four and a half LIM domains 2    ---    SNP_A-2034891    rs41322544    0.001562043
12    80373503    PPFIA2    PPFIA2    ---    8499    "protein tyrosine phosphatase, receptor type, f polypeptide (PTPRF), interacting protein (liprin), alpha 2"    YES    SNP_A-8542673    rs17008588    0.001565901
15    41547066    TP53BP1 /// TP53BP1 /// TP53BP1    TP53BP1    ---    7158 /// 7158 /// 7158    tumor protein p53 binding protein 1 /// tumor protein p53 binding protein 1 /// tumor protein p53 binding protein 1    YES    SNP_A-1782700    rs1814538    0.001573326

I need to sort this file in ascending order on the last column.
Then I need an output with two columns:
the first column with only the words that start with SNP_A, and the next column with the word found to the right of the SNP_A word.

Example output:
SNP_A-2086192 rs16854601
SNP_A-8405097 rs16874223
Can you show me how to do this with awk?

Thanks for reading

daptal · September 29, 2009, 11:39pm

Does the file always have a fixed number of fields?

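If yes, get the number of fields, sort on the last one, cut the required fields and filter them, like this:

sort -t "$(printf '\t')" -k 11,11g file | cut -f 9,10 | awk '$1 ~ /^SNP_A/'

This assumes tab-separated fields, with the value in field 11 and the SNP_A word in field 9 as in your sample. Change the field numbers accordingly.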
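
Before picking field numbers for sort and cut, it can help to check how many fields each line really has. A minimal sketch, assuming a tab-delimited file; the sample rows and the temporary file are made up for illustration:

```shell
# Hypothetical tab-delimited sample: two lines with 4 fields, one with 3.
file=$(mktemp)
printf '1\tx\tSNP_A-1\trs1\n2\ty\tSNP_A-2\trs2\n3\tSNP_A-3\trs3\n' > "$file"

# Print the field count of every line, then tally how often each count occurs.
# The last awk only strips the padding that uniq -c puts in front of the tally.
field_counts=$(awk -F'\t' '{ print NF }' "$file" | sort -n | uniq -c | awk '{ print $1, $2 }')
echo "$field_counts"

rm -f "$file"
```

Each output line is "number of lines, field count". If more than one field count shows up, fixed field numbers in sort and cut will pick the wrong columns on some lines.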

Cheers

genehunter · September 29, 2009, 11:46pm

The number of fields is not the same on every line, and the field delimiters are also not the same as in the example above.

daptal · September 30, 2009, 1:12am

It would do you a world of good to generate the file with the same number of fields on every line, as that would make it easier to process and to understand.
If a field does not exist for some record, replace it with an empty placeholder and use a standard delimiter, e.g. a tab.
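
As one possible way to get to a single delimiter, the sketch below turns every run of two or more spaces into one tab. It assumes that fields are separated by at least two spaces and never contain two consecutive spaces themselves, which may not hold for the real file; the sample row is a shortened, hypothetical version of the data above.

```shell
# Hypothetical row: fields separated by runs of spaces, single spaces inside a field.
raw='2    168611167    STK39    serine threonine kinase 39    SNP_A-2086192    rs16854601    0.001558882'

# Replace each run of two or more spaces with a single tab.
normalized=$(printf '%s\n' "$raw" | awk '{ gsub(/  +/, "\t"); print }')

# Count the tab-separated fields in the result.
nfields=$(printf '%s\n' "$normalized" | awk -F'\t' '{ print NF }')
echo "$nfields"
```

This cannot recover a field that is missing altogether, so the placeholder for absent fields still has to be written when the file is generated.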

Cheers

ripat · September 30, 2009, 2:12am

For the sample given above, this will work with spaces as the separator and with a variable number of fields:

parse.awk

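# Collect the SNP_A word, its right-hand neighbour and the last field of each line.
{
    for (i = 1; i < NF; i++) {
        if ($i ~ /^SNP_A/) {
            a_str[NR] = sprintf("%s %s", $i, $(i+1))
            a_val[NR] = $NF
            break
        }
    }
}
END {
    # Print "SNP_A-word neighbour value"; the loop order is unspecified, sort does the ordering.
    for (i in a_str) print a_str[i], a_val[i]
}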

That was for the parsing. Now the sort.

$ awk -f parse.awk yourFile | sort -k3,3g | cut -d' ' -f1,2
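
The parse and sort steps can also be combined into one pipeline that emits only the two requested columns. A minimal sketch, assuming whitespace-separated input and a numeric last field; the rows are shortened versions of the sample with made-up values in the last column, and `sort -g` (general numeric sort) is a GNU/BSD option rather than POSIX:

```shell
# Shortened, hypothetical rows; the SNP_A field sits at a different position in each.
data='2 168611167 STK39 YES SNP_A-2086192 rs16854601 0.001573326
6 13670256 SIRT5 /// RPS4X YES SNP_A-8405097 rs16874223 0.001558882
12 80373503 PPFIA2 --- --- SNP_A-8542673 rs17008588 2e-05'

# 1. Put the last field in front as a sort key, followed by the SNP_A word
#    and its right-hand neighbour.
# 2. Sort on that key as a general number (handles 2e-05 as well as 0.0015).
# 3. Drop the key again.
result=$(printf '%s\n' "$data" |
    awk '{ for (i = 1; i < NF; i++) if ($i ~ /^SNP_A/) { print $NF, $i, $(i+1); break } }' |
    LC_ALL=C sort -k1,1g |
    cut -d' ' -f2,3)
echo "$result"
```

Putting the key first means no arrays are needed in awk, and lines that share the same value in the last column are all kept.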

If you have GNU awk, you can do the sort in awk, but it is a little bit tricky (read: I hate gawk sort functions):

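{
    for (i = 1; i < NF; i++) {
        if ($i ~ /^SNP_A/) {
            # Index by the last field; lines with an identical value overwrite each other.
            a_str[$NF] = sprintf("%s %s", $i, $(i+1))
            break
        }
    }
}
END {
    # asorti() compares the indices as strings; that matches numeric order for
    # plain decimals such as 0.0015, but not for values like 2e-05.
    n = asorti(a_str, a_copy)
    for (i = 1; i <= n; i++) print a_str[a_copy[i]]
}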

genehunter · October 14, 2009, 1:13am

Ripat,
Worked great!
Thank you so much