join two files based on one column

newpro · April 26, 2012, 1:39pm

Hi All,

I am trying to join to files based on one common column.

Cat File1

ID	HID
Ab_1	23
Cd	45
df     22
Vv	33

Cat File2

ID	pval
Ab_1	0.3
Cd	10
Vv	0.0444

(file1 has 18,000 rows and file 2 have between 4,000 to 8,000 rows)
Desired output:
Cat Fileout

HID	pval
23	0.3
10	45
33	0.0444

By searching the forum, I came up with this script:

awk ' FNR == NR { ab[$1] = $2 } FNR != NR { cd[$1] = $2 } END { for (a in ab)  if (a in cd) print ab[a],cd[a] }' FS='\t' OFS='\t' File1 File2 >Fileout

It seems to work fine in the beginning. But when I used it in large file with 15,000 rows it is giving errors (eg. missing some rows).

I am a beginner in scripting. I am not sure if there is error in above script or if there is a better way to do this. Any suggestion will be very helpful.

Thank you for your time,

NP

Corona688 · April 26, 2012, 2:19pm

I don't see any particular reason that'd throw up on large amounts of data, so I'd be suspect of inconsistencies(are ALL your records tab-separated?), etc. in your data.

Scrutinizer · April 26, 2012, 3:50pm

The script could be shortened to this:

awk 'NR==FNR{A[$1]=$2;next}$1 in A{print A[$1],$2}' FS='\t' OFS='\t' file1 file2

I agree there is probably an extra space somewhere before or after field 1. If there are not supposed to be any spaces in the fields themselves, You could just leave that out FS='\t' .

newpro · April 26, 2012, 5:07pm

Corona688, Scrutinizer Thanks for your time. The file is in tab de-limited format, but some IDs have long names with hyphens.

Scrutinizer, Thanks for the script. When I used your script it gave the correct merge/joining of files. Appreciate all of your help.