and a second with millions of values (0 to 1), like this:
2nd table
______________________________
...
0.252...
0.234...
0.256...
...
______________________________
Now I have to map the second list onto the first table so that I get, for each value, the corresponding p-value (i.e. the p-value of the next LOWER value in the 1st table).
one possibility would be to create an indexed array with 1000 instances in this way:
...
a[0.233]=0.006
a[0.234]=0.006
...
a[0.245]=0.005
...
then truncate each value of the second list to the form x.xxx and use that as the index into the array.
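For illustration, that lookup could be sketched like this (hypothetical small sample files; the real files are far larger):

```shell
# Build tiny sample inputs (hypothetical names and values).
printf '0.125 0.006\n0.250 0.005\n' > table      # 1st table: value  p-value
printf '0.1259\n0.2501\n' > values               # 2nd table: raw values

# Index the table by the first three decimals; int() floors, so each raw
# value maps to the next LOWER table entry.
awk 'NR==FNR { a[int($1 * 1000)] = $2; next }    # first file: load table
     { print $1, a[int($1 * 1000)] }' table values
# prints:
# 0.1259 0.006
# 0.2501 0.005
```

This only covers table entries that actually exist; gaps in the table would need to be filled beforehand.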
is there an easier way?
thank you,
dietmar
PS: I have to use AWK, because I already read 7 files of several GB each and combine them into one table (in minutes), and I have to convert one column on the fly to the corresponding p-values. I think every other programming language would be slower (except perhaps C++, which I can't program).
I guess that using associative arrays will be much slower than using integer-indexed arrays, so this one might be a bit faster than the other approach. (On the other hand, man awk says that array subscripts are strings, i.e. every awk array is associative, so my thoughts were wrong.) However, try
awk ' {tmp = int ($1 * 1000)}
FNR==NR {while (++s < tmp) Ar[s] = p; p = Ar[tmp] = $2; next}
FNR==1 && NR>1 {while (++s < 1000) Ar[s] = p}
{print $1, Ar[tmp]}
# END {for (i in Ar) print i, Ar[i]}
' file1 file2
0.25234 0.005
0.23487 0.006
0.256001 0.003
Thank you both, but now I run into some other problems.
Firstly one question, user8: does
'NR==FNR{p[$1]=$2;next}'
consume the first file and then proceed with the second one?
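To test this myself, a minimal two-file demo of the idiom (hypothetical sample files):

```shell
# FNR is the line number within the current file; NR counts across all files.
# NR==FNR is therefore true only while the FIRST file is being read.
printf 'a 1\nb 2\n' > lookup      # first file: key -> value pairs
printf 'b\na\nb\n' > keys         # second file: keys to resolve

awk 'NR==FNR { p[$1] = $2; next } # first file: fill the array, skip the rest
     { print $1, p[$1] }          # second file: only the lookup runs
    ' lookup keys
# prints:
# b 2
# a 1
# b 2
```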
My problem: I saved the p-value file with R and got values in scientific notation like 1e-3.
Okay, I managed this by multiplying all values by 10000 (I now use four digits instead of three), so I have values from 0 through 5 to 10000, which correspond to values from 0.0000 through 0.0005 to 1.0000 (also respecting the hint of RudiC ;)).
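For illustration, the scaling can be sketched as a one-liner (hypothetical input; awk converts fields with strtod-style semantics, so 1e-3 is read as the number 0.001 in arithmetic):

```shell
# Scientific notation needs no reformatting before deriving an integer
# key in 0..10000: awk understands 1e-3 numerically.
echo '1e-3' | awk '{ key = int($1 * 10000); printf "%s -> key %d (%.4f)\n", $1, key, $1 }'
# prints: 1e-3 -> key 10 (0.0010)
```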
I do not understand your approach completely, but the while loop runs for every new value from file2 (?), and as I have hundreds of millions of values, this approach must be slower than the array approach, where the array is loaded once and then only read for every value...
The script now looks like this and works perfectly (the additional p-value translation increases the runtime by only one minute, from 15 to 16 minutes):
You are right, you don't seem to understand the script. Did you try it at all?
file1 is read exactly once, populating the array from 0 to (the highest value in file1) * 1000. On the first line of file2, before any action on file2, the rest of the array (previous + 1 up to 999) is populated. Then, for every line (of the millions) in file2, field 1 is multiplied by 1000, truncated, and the match in the array is retrieved and printed. One single action per one single line.
I'd appreciate it if you could attach (in advanced edit) a representative sample of each of your files (not necessarily millions of values, anonymized if need be), so I can test and time the scripts above myself. Thank you.