Loop over awk or passing parameter

fat · December 2, 2014, 2:19am

I wrote this script, which works well when I manually enter 55518622 and 1,
but I need it to be generic and loop over the following table.

awk '$4>(55518622-500000) && $4<(55518622+500000)' chr1_GEN2bim | awk 'BEGIN {min=1000000000; max=0;}; {\
 if($4<min && $4 != "") min = $4; if($4>max && $4 != "") max = $4; } END {print "chr"$1"\t"min"\t"max}'

Given

1       55518622        rs45613943      1.983654e-08
1       109815252       rs611917        3.439871e-11
2       98323821        kgp8891906      3.619327e-19
10      15912165        kgp3074751      4.737534e-15
15      58723426        rs1077835       8.918459e-11
16      56686970        rs148916841     2.375065e-12
19      11187358        rs144826254     3.641029e-11
19      45373276        rs11879589      1.034183e-09

I want a one off output like

chr1 minValue maxValue
chr2 minValue maxValue
chr10 minValue maxValue
chr15 minValue maxValue
chr16 minValue maxValue
chr19 minValue maxValue

my chr1_GEN2bim looks like this

1       rs7514195       0       198000135       
1       rs6667378       0       198000253       
1       rs114753897     0       198000439

RudiC · December 2, 2014, 4:52am

This is quite an incomplete specification. Is your "Given" in a file? Do you want to parse multiple input files named "chrn_GEN2bim", where n comes from your "Given" file? Is the first column of those input files always identical to n?

fat · December 2, 2014, 10:15am

Hi RudiC,

Thanks for your reply. Sorry I wasn't very clear.
Yes, the "Given" is in a file. I want to parse multiple input files named "chr{n}_GEN2bim", where n comes from the first column of my "Given" file.

Thanks

SA

Corona688 · December 2, 2014, 10:22am

This could probably be optimized a lot if I knew what you were doing. Whenever you do awk | awk | kitchen | sink, you should replace it with one awk and be done with it.

while read A B C D E
do
        awk '$4>(B-500000) && $4<(B+500000)' B="$B" chr${A}_GEN2bim |
        awk 'BEGIN {min=1000000000; max=0;}; $4 == "" { next }; $4<min { min = $4; }; $4>max { max=$4; } END {print "chr"$1"\t"min"\t"max}'
done < given > output
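
As a rough sketch of the "one awk" advice above, the filter and the min/max step can be folded into a single awk call per row of the given file. The sample data and the temporary directory are made up so the example runs on its own, and the seen and chr variables are names chosen here, not from the thread.

```shell
# Sketch only: the sample files below are invented for illustration.
cd "$(mktemp -d)" || exit 1
printf '1 rsA 0 55100000\n1 rsB 0 55600000\n1 rsC 0 57000000\n' > chr1_GEN2bim
printf '1 55518622 rs45613943 1.983654e-08\n' > given

while read -r A B C
do
        # One awk does both jobs: keep rows inside B +/- 500000,
        # and track the smallest and largest $4 among them.
        awk -v B="$B" -v OFS='\t' '
                $4 != "" && $4 > B-500000 && $4 < B+500000 {
                        if (!seen || $4+0 < min) min = $4+0
                        if (!seen || $4+0 > max) max = $4+0
                        seen = 1; chr = $1
                }
                # Print nothing when no row fell inside the window.
                END { if (seen) print "chr" chr, min, max }
        ' "chr${A}_GEN2bim"
done < given > output
cat output
```

Tracking a seen flag avoids the 1000000000 sentinel, and saving $1 in chr avoids relying on $1 still being set inside END.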

RudiC · December 2, 2014, 10:30am

Do you want a min/max value for each chrn? Then you need to reset them for every new n.
Also, you run your awk script twice for, e.g., 1 or 19. So there might not be one single min/max for those, as the, say, "tolerances" are different for the two calls.
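
To illustrate the point about chromosomes that occur twice: a per-row loop emits one line per row, so chr1 and chr19 would each appear twice. If one line per chromosome is wanted (an assumption, the thread does not settle this), the per-row lines could be collapsed afterwards. The per_row file and its numbers are hypothetical.

```shell
# Hypothetical per-row results; chr1 appears twice because it has two rows.
printf 'chr1\t55100000\t55600000\nchr1\t109400000\t110200000\nchr2\t98000253\t98000253\n' > per_row

awk -v OFS='\t' '
        # First time a chromosome is seen: remember its order and its range.
        !($1 in lo) { order[++n] = $1; lo[$1] = $2; hi[$1] = $3; next }
        # Later rows for the same chromosome widen the range.
        { if ($2+0 < lo[$1]+0) lo[$1] = $2
          if ($3+0 > hi[$1]+0) hi[$1] = $3 }
        END { for (i = 1; i <= n; i++) print order[i], lo[order[i]], hi[order[i]] }
' per_row > merged
cat merged
```

Note that the merged range then spans both windows, which may or may not be meaningful for the data.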

Corona688 · December 2, 2014, 10:46am

I think this would do it with a single awk, run a handful of times, instead of two awks per line:

while read A B C D E
do
        echo "B=$B"
        echo "chr${A}_GEN2bim"
done < given | xargs -n 16 awk -v OFS="\t" 'F != FILENAME {
        if(F) print "chr" A, min, max;
        min=1000000000; max=0; F=FILENAME;
}

END { if(F) print "chr" A, min, max; }

$1 != "" { A=$1; }

($4 != "") && ($4>(B-500000)) && ($4<(B+500000)) {
        if($4 < min) min=$4;
        if($4 > max) max=$4;
}' > output

fat · December 2, 2014, 12:02pm

Sorry, I didn't mention that there are two input files, and I don't know where to fit the second one.

The first file is chr1_GEN2bim. It has this format. Note that every value in the first column is 1 because it is chr1_GEN2bim.

1       rs7514195       0       198000135       
1       rs6667378       0       198000253       
1       rs114753897     0       198000439

So for chr2_GEN2bim it would look like the following; note that every value in the first column is 2.

2       rs7514198       0       378000135       
2       rs66673789       0       98000253       
2       rs11475389     0       18000439

And so on. All the files chr1_GEN2bim, chr2_GEN2bim, chr3_GEN2bim, ... chr22_GEN2bim are in the working directory.

The second file (the following table) is the one I want to extract columns 1 and 2 from.

1       55518622        rs45613943      1.983654e-08
1       109815252       rs611917        3.439871e-11
2       98323821        kgp8891906      3.619327e-19
10      15912165        kgp3074751      4.737534e-15
15      58723426        rs1077835       8.918459e-11
16      56686970        rs148916841     2.375065e-12
19      11187358        rs144826254     3.641029e-11
19      45373276        rs11879589      1.034183e-09

What I actually want to do is read A and B together from each line of the second file.
A is the value in the first column and B is the value in the second column, both from the second file.

I think it should be like this

while read A B   # intended: first A=1, B=55518622; then A=1, B=109815252; then A=2, B=98323821; and so on
do
        awk '$4>(B-500000) && $4<(B+500000)' B="$B" chr${A}_GEN2bim |
        awk 'BEGIN {min=1000000000; max=0;}; $4 == "" { next }; $4<min { min = $4; }; $4>max { max=$4; } END {print "chr"$1"\t"min"\t"max}'
done < second_file > output

Corona688 · December 2, 2014, 12:42pm

You want while read A B C, otherwise B will be "55518622 rs45613943 1.983654e-08".

Otherwise, try it and see.
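
A small sketch of why the third name matters: read gives the last variable everything left on the line, so with only two names B swallows columns 2 to 4. The variables two and three are just names picked for this demonstration.

```shell
line='1       55518622        rs45613943      1.983654e-08'

# Two names: B receives everything after the first field.
read -r A B <<EOF
$line
EOF
two="$B"

# Three names: B is only the second field; C absorbs the rest.
read -r A B C <<EOF
$line
EOF
three="$B"

printf 'two names:   B=[%s]\nthree names: B=[%s]\n' "$two" "$three"
```

With two names, B-500000 inside awk would be computed from a string that is more than just the position, so the extra throwaway variable is the simplest fix.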

RudiC · December 3, 2014, 10:40am

There are some strange conditions involved, so I'm not too confident in the solution. Anyhow, with your sample data, try

awk     '               {min=1E100; max=-1E100; P=0
                         FN="chr"$1"_GEN2bim"
                         while (1 == getline X < FN)
                                {split (X, T)
                                 if ((T[4] > $2 - 500000) && (T[4] < $2 + 500000))
                                        {if (T[4] < min) min = T[4]
                                         if (T[4] > max) max = T[4]
                                         P=1
                                        }
                                }
                         if (P) print "chr"$1, min, max
                         close (FN)
                        }
        ' given

chr2 98000253 98000253