Match files based on either of the two columns awk

smitra · June 28, 2013, 4:09am

Dear Shell experts,
I have 2 files with structure:
File 1: ID and count

head test_GI_count1.txt 
1000094 2
10039307 1
10039641 1
10047177 11
10047359 1
1008555 2
10120302 1
10120672 13
10121776 1
10121865 32

And 2nd file:

head Protein_gi_GeneID_symbol.txt
protein_gi GeneID Symbol
10954455 1246500 repA1
10954457 1246501 repA2
10954458 1246502 leuA
10954459 1246503 leuB
10954460 1246504 leuC
10954461 1246505 leuD
31982990 1246509 ibp
31982991 1246510 repA1
10954456 3722426 pLeuDn_02

I need a final file with structure:e.g. (only consisting IDs present in file 1)

10954458 1246502 leuA 3
10954459 1246503 leuB 24
10954460 1246504 leuC 13

etc

That means I want to subset from the 2nd file which IDs are unique and copy all the cols of 2nd file and add $2 (count) from file 1

To do this first I did:

smitra:File_editing smitra$ awk 'NR == FNR {
>   k[$1]
>   next
>   }
> ($1) in k
>   ' test_GI_count1.txt  Protein_gi_GeneID_symbol_1.txt > merged_file.txt

And added $2 of 1st file separately later.
This workes perfectly unless I have some problem with new file, where first file IDs ($1) is often protein_gi, but sometimes GeneID.

e.g. in 1st file

1000094 2

where as in 2nd file

77747945 1000094 treA

That means I need to search them from $1 OR $2 of 2nd file.
Can anybody please suggest me how can I do that?
Thanks a lot,
Mitra

---------- Post updated at 09:09 AM ---------- Previous update was at 09:05 AM ----------

And also I want to add $2 (count) from file 1, together in same script, so that I will not have mismatch in col length when some IDs are absent from 2nd file.

Any suggestion will be really great.
Thanks a lot,
Mitra

zaxxon · June 28, 2013, 4:45am

I do not see a corresponding field/key in file1 and file2?

smitra · June 28, 2013, 4:59am

Sorry Zaxxon,
I don't understand what you mean. My file one do not have a heading.
But fist col is ID and second col is count

Now I want to match this ID from either 1st or 2nd col of File 2

Then want to print the corresponding whole row of file 2 added with Col2 of first file.

zaxxon · June 28, 2013, 5:18am

In file1, you say the 1st column is the ID. Where is that ID to be found in the 2nd file so that it can be counted?

smitra · June 28, 2013, 5:23am

Dear zaxxon,
that is I tried to explain for example in first file:

1000094 2
10039307 1
10039641 1
10047177 11
10047359 1

format is ist col:ID, 2nd col:count

Now I need to search each number (e.g. 1000094) from 2nd file...wither from first col or from 2nd col

I have only given the head of both the files.

And I

zaxxon · June 28, 2013, 7:43am

There we go. It makes it much easier to help when relevant examples are given
So I made an example of your given input and arranged some matched in file1 and file2.

Try:

$ cat file1
1000094 2
10039307 1
10039641 1
10954458 11
10047359 1
10954460 23
10954459 2
10120302 1
10954460 13
10121776 1
10121865 32
$ cat file2
protein_gi GeneID Symbol
10954455 1246500 repA1
10954457 1246501 repA2
10954458 1246502 leuA
10954459 1246503 leuB
10954460 1246504 leuC
10954461 1246505 leuD
31982990 1246509 ibp
31982991 1246510 repA1
10954456 3722426 pLeuDn_02
$ awk 'NR == FNR {k[$1]+=$2; next} $1 in k {print $0,k[$1]}' file1 file2
10954458 1246502 leuA 11
10954459 1246503 leuB 2
10954460 1246504 leuC 36

smitra · June 28, 2013, 10:50am

Dear zaxxon,
Thank you very much. But it still producec the similar output which I already got.
But I have the file bit more complicated. I am editing your example. Thanks for creating the example.

$ cat file1
1000094 2
10039307 1
10039641 1
10954458 11
10047359 1
10954459 2
10120302 1
10954460 13
10121776 1
10121865 32
$ cat file2
protein_gi GeneID Symbol
10954455 1246500 repA1
10954457 1246501 repA2
10954458 1246502 leuA
10954459 1246503 leuB
10954460 1246504 leuC
10954461 1246505 leuD
31982990 10121776 ibp
31982991 1246510 repA1
10954456 1000094 pLeuDn_02

Now I want this output

10954458 1246502 leuA 11
10954459 1246503 leuB 2
10954460 1246504 leuC 13
31982990 10121776 ibp 1
10954456 1000094 pLeuDn_02 2

Your advice will be really helpful..
Thanks in advance,
Best wishes,
Mitra

DGPickett · June 28, 2013, 2:37pm

Sort the files by the keys and use 'join'.

smitra · June 28, 2013, 9:06pm

sorry DGPickett, I tried such a way also but as sometimes I have to match with 2nd col (see the last example) also, it getts messed

MadeInGermany · June 29, 2013, 2:33am

awk 'NR==FNR {k[$1]+=$2; next} ($1 in k) {print $0,k[$1]; next} ($2 in k) {print $0,k[$2]}' file1 file2

Three implicit if clauses, followed by {action} where each action ends with a jump to the next cycle.

smitra · June 30, 2013, 6:51am

Thank you very much MadeInGermany. thanks for figuring out my error. its a great help

DGPickett · July 1, 2013, 12:32pm

Is that second column match always there? Join only takes one, so you need to preprocess them into one field, or postprocess out the second field mismatches. For that much trouble, you might want to store one in an associative array and then match it with the other, which can be done in bash or awk.