Complex awk problem

hello,

I have a complex awk problem...

I have two tables, one with a value (0 to 1) and its corresponding p-value, like this:

  1. table:
    ______________________________
    value p-value
    ... ...
    0.254 0.003
    0.245 0.005
    0.233 0.006
    ... ...
    ______________________________

and a second with millions of values (0 to 1), like this:
2. table
______________________________
...
0.252...
0.234...
0.256...
...
______________________________

now I have to map the second list against the first table so that I get for each value the corresponding p-value (i.e. the p-value of the next LOWER value in the 1. table).

expected output:
______________________________
... ...
0.252... 0.005
0.234... 0.006
0.256... 0.003
... ...
______________________________

one possibility would be to create an indexed array with 1000 instances in this way:
...
a[0.233]=0.006
a[0.234]=0.006
...
a[0.245]=0.005
...

then create a substring of each value from the second list, like x.xxx, and use this as the index into the array.

is there an easier way?

thank you,

dietmar

PS: I have to use AWK, because I already read 7 files of several GB each and combine these into one table (in minutes), and I have to convert one column on the fly to the corresponding p-values. I think every other programming language would be slower (except perhaps C++, which I can't program).

I don't see any easier way than what you suggested. Below some (untested) code:

awk 'NR==FNR{p[$1]=$2; next} {v=substr($1,1,5); print (v in p) ? p[v] : "Some warning"}' table1 table2
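I can't test against your real data, but a quick self-contained check of that idea (made-up file names and values) would look like this; note that with this direct lookup, any truncated value that is not literally a key in table1 falls into the warning branch, so the 1000-entry prefill you described is still needed for full coverage:

```shell
# Hypothetical miniature versions of the two tables from the post.
printf '0.254 0.003\n0.245 0.005\n0.233 0.006\n' > table1
printf '0.25412\n0.23317\n' > table2

# Build the lookup array from table1, then truncate each table2 value
# to 5 characters ("0.254") and use that as the array index.
out=$(awk 'NR==FNR{p[$1]=$2; next}
           {v=substr($1,1,5); print $1, (v in p) ? p[v] : "no p-value"}' table1 table2)
echo "$out"
```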

I guess that using associative arrays will be much slower than using integer-indexed arrays, so this one might be a bit faster than the other approach. (On the other hand, man awk says: […], so my thoughts were wrong.) However, try

awk     '               {tmp = int ($1 * 1000)}
         FNR==NR        {while (++s < tmp)  Ar[s] = p; p = Ar[tmp] = $2; next}
         FNR==1 && NR>1 {while (++s < 1000) Ar[s] = p}
                        {print $1, Ar[tmp]}
#        END            {for (i in Ar) print i, Ar[i]}
        ' file1 file2
0.25234 0.005
0.23487 0.006
0.256001 0.003

Make sure file1 is sorted in ascending order!

thank you both, but now I run into some other problems:

firstly one question, user8:

'NR==FNR{p[$1]=$2;next}

consumes the first file completely and only then proceeds with the second one?
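Exactly that: FNR restarts at 1 for every input file while NR keeps counting across files, so NR==FNR holds only while the first file is being read, and "next" keeps those lines away from the later rules. A throwaway demonstration with dummy files:

```shell
printf 'a\nb\n' > f1
printf 'c\nd\n' > f2

# NR==FNR matches only the lines of f1; "next" prevents those lines
# from reaching the second rule, which therefore sees only f2.
out=$(awk 'NR==FNR{print "first:", $1; next} {print "second:", $1}' f1 f2)
echo "$out"
```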

my problem: I saved the p-value file with R and got values like 1e-3.
Okay, I managed this by multiplying all values by 10000 (I now use four digits instead of 3), so I have values ranging from 0 through 5 up to 10000, which correspond to values from 0.0000 through 0.0005 up to 1.0000 (also respecting the hint of RudiC ;)).

How do I get the integer from a value in awk?

 
int(value*10000)

I hope so - I will try...
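For what it's worth (my own note, not from the thread): int() simply truncates toward zero, and awk converts strings like "1e-3" numerically when they appear in arithmetic, so R's scientific notation is handled; when the product should land exactly on an integer, adding 0.5 before int() guards against floating-point round-down:

```shell
# int() truncates toward zero; a numeric string in scientific notation
# is converted automatically when used in arithmetic.
a=$(awk 'BEGIN{print int(0.25234 * 10000)}')        # plain truncation
b=$(awk 'BEGIN{print int("1e-3" * 10000 + 0.5)}')   # string from R, with rounding guard
echo "$a $b"
```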

Not sure I understand your new problems. Why don't you add a "0" to both occurrences of "1000", making them "10000", in the above script and give it a shot?

I do not completely understand your approach, but the while loop runs for every new value from file2 (?), and as I have hundreds of millions of values, this approach must be slower than the array approach, where the array is loaded once and merely looked up for every value...

the script now looks like this and works perfectly (the additional p-value translation increases the run time by only 1 minute, from 15 to 16 minutes):

dir='/home/ws/R_workspace/OVCAD'
fn='miR.MIC'
pv='pvalues_220.txt'
cd $dir
fname=${fn%.*}
echo $fname
echo -e "N1\tN2\tMIC\tMAS\tMEV\tMCN\tMICR2\tpearson\tnon-linearity\tp-value" > ${fname}.mine
 
gawk 'BEGIN { FS = "\t"; OFS="\t" } ; NR==FNR{p[$1]=$2;next} { NR==1;{ for (i = 1; i <= NF; i++) name[i]=$i } }; \
 { split(FILENAME,fname,"."); fn=fname[1]; getline < (fn".MAS"); getline < (fn".MEV"); getline < (fn".MCN"); getline < (fn".MICR2"); getline < (fn".cor"); getline < (fn".nl") } \
 { for(k=1; k <= NF; k++) { getline; split($0, MIC, "\t") ; \
 getline mas < (fn".MAS"); split(mas, MAS, "\t") ; \
 getline mev < (fn".MEV"); split(mev, MEV, "\t") ; \
 getline mcn < (fn".MCN"); split(mcn, MCN, "\t") ; \
 getline micr2 < (fn".MICR2"); split(micr2, MICR2, "\t") ; \
 getline cor < (fn".cor"); split(cor, COR, "\t") ; \
 getline nl < (fn".nl"); split(nl, NL, "\t") ; \
 for(j=k+1; j <= NF; j++)  \
 print name[k],name[j],MIC[j],MAS[j],MEV[j],MCN[j],MICR2[j],COR[j],NL[j],p[int(MIC[j]*10000)] } }' $pv ${fname}.MIC >> ${fname}.mine

You are right, you do not seem to understand the script. Did you try it at all?
file1 is read exactly once, populating the array from 0 up to (the highest value in file1) * 1000. On the first line of file2, before any action on file2, the rest of the array (previous highest + 1 up to 999) is populated. Then, for every line (of the millions) in file2, field 1 is multiplied by 1000, truncated, and the match in the array is retrieved and printed. One single action per one single line.
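To make that prefill mechanism concrete, here is a self-contained run with tiny made-up files (note the Ar[s] subscripts, which the forum formatting appears to have swallowed in the script as posted):

```shell
printf '0.233 0.006\n0.245 0.005\n0.254 0.003\n' > file1   # ascending order!
printf '0.25234\n0.23487\n0.256001\n' > file2

# While reading file1, every index between the previous key and the
# current one inherits the previous p-value; on the first line of
# file2 the tail of the array is filled, after which each value of
# file2 costs exactly one array lookup.
out=$(awk '               {tmp = int($1 * 1000)}
           FNR==NR        {while (++s < tmp)  Ar[s] = p; p = Ar[tmp] = $2; next}
           FNR==1 && NR>1 {while (++s < 1000) Ar[s] = p}
                          {print $1, Ar[tmp]}' file1 file2)
echo "$out"
```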

I'd appreciate it if you could attach (in advanced edit) a representative sample of each of your files, not necessarily millions of values, and anonymized if need be, so I can test and time the scripts above myself. Thank you.