parsing data from a big file using keys from another smaller file

Lucky_Ali · April 6, 2011, 5:25pm

Hi,
I have 2 files
format of file 1 is:

a1
b2
a2
c2
d1
f3

format of file 2 is (tab delimited):

a1 1.2 0.5 0.06 0.7 0.9 1 0.023
a3 0.91 0.007 0.12 0.34 0.45 1 0.7
a2 1.05 2.3 0.25 1 0.9 0.3 0.091
b1 1 5.4 0.3 9.2 0.3 0.2 0.1
b2 3 5 7 0.9 1 9 0 1
b3 0.001 1 2.3 4.6 8.9 10 0 1 0
c1 0.9 1 2.3 5.7 8.9 9 0 1
c2 1 2.4 5.7 0.13 1.9 2 5 8
c3 5.7 9 10 11 0.2 0.7 0.9
d1 9.0 5 8 4.5 9 0.99 1.3 1 0
d2 2 4.6 7 9 9 10 11 0 1 2.4 0.44
f1 7 8 4.5 6.8 9.21 0 1 8 4 9 10
f3 0 1 2.3 4.0 3.14 0 1 0.005

I want to use the data in file 1 as keys and pull the corresponding lines from file 2 into a third file.

such that file 3 is:

a1 1.2 0.5 0.06 0.7 0.9 1 0.023
b2 3 5 7 0.9 1 9 0 1
a2 1.05 2.3 0.25 1 0.9 0.3 0.091
c2 1 2.4 5.7 0.13 1.9 2 5 8
d1 9.0 5 8 4.5 9 0.99 1.3 1 0
f3 0 1 2.3 4.0 3.14 0 1 0.005

I need the keys in file 3 to be in the same order as in file 1.
Please let me know the best way to generate the third file using either awk or sed.
LA

Corona688 · April 6, 2011, 5:35pm

< datafile awk '
BEGIN { FS="\t"
        # get the very first key (empty if the key file can't be read).
        if ((getline key < "keyfile") <= 0) key = "" }
{
        # If the data's ahead in order, read keys until you catch up.
        # At EOF getline leaves key unchanged, so clear it to stop the loop.
        while(key != "" && (key < $1))
                if ((getline key < "keyfile") <= 0) key = ""

        if(key != "" && (key == $1))
                print;
}'

Lucky_Ali · April 6, 2011, 5:50pm

My real data contains 3000 keys. When I ran the code you sent, I got values for only 20 of them.
LA

Corona688 · April 6, 2011, 5:54pm

It can't be running out of room: it's not storing anything, so the quantity of data isn't the issue. I think either the keys or the data aren't in lexicographical order, or the key file contains blank lines, which would make it give up instantly.
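Both guesses can be checked from the shell. A rough sketch: `sort -c` reports whether a file is already sorted, and `grep -c` counts blank lines. The names `keyfile` and `datafile` are the placeholders used in the script, the sample contents are made up for illustration, and `LC_ALL=C` is an assumption so that the ordering is plain byte order.

```shell
#!/bin/sh
# Sketch: test the two suspected causes on small sample files.
# keyfile/datafile are placeholder names; the contents are illustrative.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

printf 'a1\nb2\na2\nc2\n' > keyfile                  # keys in an unsorted order
printf 'a1\t1.2\na2\t1.05\nb2\t3\nc2\t1\n' > datafile

# sort -c exits non-zero if the input is not already sorted.
if LC_ALL=C sort -c keyfile 2>/dev/null; then keys_sorted=yes; else keys_sorted=no; fi
if LC_ALL=C sort -c -t "$(printf '\t')" -k1,1 datafile 2>/dev/null; then
    data_sorted=yes
else
    data_sorted=no
fi

# Count empty lines in the key file (grep -c prints 0 when there are none).
blank_keys=$(grep -c '^$' keyfile)

echo "keys sorted: $keys_sorted, data sorted: $data_sorted, blank key lines: $blank_keys"
```

If either file comes back unsorted, the merge-style script will skip keys silently rather than fail.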

Could you post the smallest possible sample of data that shows the problem?

Lucky_Ali · April 6, 2011, 5:59pm

What do you mean by lexicographical order? I don't see any gaps in either file.

Corona688 · April 6, 2011, 6:03pm

Sorted.

Lucky_Ali · April 6, 2011, 6:07pm

I can't sort the key file, because file 3 has to come out in the same order as the keys. But I could sort the data file. The main catch is that the output file must keep the order of the key file.
LA

Corona688 · April 6, 2011, 6:27pm

[edit] Same order as the key file? Hm.... Will there be more than one matching line per key?
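Whether any key appears on more than one data line can be checked directly. A small sketch, assuming the key is the first tab-separated field and using a made-up `datafile`:

```shell
#!/bin/sh
# Sketch: list keys that occur on more than one line of the data file.
# datafile is a placeholder name; the contents are illustrative.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'a1\t1.2\nb2\t3\na1\t0.5\nc2\t1\n' > datafile   # a1 appears twice

# seen[$1]++ is 1 exactly on the second occurrence, so each
# duplicated key is printed once.
dups=$(awk -F'\t' 'seen[$1]++ == 1 { print $1 }' datafile)
echo "duplicated keys: ${dups:-none}"
```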

sk1418 · April 6, 2011, 6:34pm

awk -F'\t' 'FNR==NR{k[$1]=NR; next} ($1 in k){print k[$1] "|" $0}' file1.txt file2.txt | sort -t'|' -k1,1n | cut -d'|' -f2- > file3.txt

will do your job.

Corona688 · April 6, 2011, 6:36pm

Use nawk or gawk.

< data nawk '
BEGIN { FS="\t"
        N=0;
        while((getline key < "keyfile") > 0)   # > 0 so an unreadable file doesn't loop forever
        {
                keys[key]=1;
                order[N++]=key;
        }
 }

{       if(keys[$1]) out[$1]=out[$1] $0 "\n";   }

END {   for(M=0; M<N; M++)
        if(out[order[M]])       printf("%s", out[order[M]]);
}'

yinyuemi · April 6, 2011, 7:46pm

awk 'NR==FNR{a[$1]=$0;next}NF==1{print a[$1]}' file2 file1
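For reference, here is the lookup idea written out with comments and run on a cut-down version of the sample data. It is a sketch, not a drop-in replacement: the file names `file1`, `file2` and `file3` follow the original post, the data is abbreviated, the key `zz` is invented to show a miss, and the `($1 in a)` guard skips keys with no matching line rather than printing an empty line.

```shell
#!/bin/sh
# Sketch: load the data file into an array, then walk the key file in order.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

printf 'a1\nb2\na2\nzz\n' > file1      # keys, in the order wanted (zz has no data)
printf 'a1\t1.2\t0.5\na3\t0.91\t0.007\na2\t1.05\t2.3\nb2\t3\t5\n' > file2

awk -F'\t' '
    NR == FNR { a[$1] = $0; next }   # first file (file2): remember each line by key
    ($1 in a) { print a[$1] }        # second file (file1): emit lines in key order
' file2 file1 > file3

cat file3
```

The output order follows file1 (a1, b2, a2), and no sorting of either file is needed. If a key can match several data lines, this keeps only the last one; the array-based nawk version earlier in the thread keeps them all.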