Match data based on two fields, and append to a line

jamessmith01 · September 3, 2010, 10:56am

I need to write a program to do something like a 'vlookup' in excel. I want to match data from file2 based on two fields (where both match) in file1, and for matching lines, add the data from two of the fields from file2 to file1.

If anyone knows something in perl or awk that can do this, I'd be very happy

e.g.:
(match on field 1 and 2, and append column 4 and 5 to column 5 and 6)

file1

1 A X a
2 B Y b
3 C Z c
4 D X d
5 E Y e

file2

1 A Y b
2 C Z c
3 D X d
4 E Y e
5 E Z c

output

1 A X a Y b
2 B Y b
3 C Z c
4 D X d
5 E Y e Z c

vish_indian · September 3, 2010, 1:32pm

#!/bin/sh

IRS=" "
num=1
while read var1 var2 var3 var4; do
 awk -v var1=${var1} -v var2=${var2} -v var3=${var3} -v var4=${var4} -v num=${num} 'NR>=num{if($1==var1 && $2==var2){ print $0" "var3" "var4; exit} else { print $0; exit}}' file1
 num=`expr $num + 1`
done < file2

This will only look for the entry in the same position, if entry exists but at a different position, this would fail.

radoulov · September 3, 2010, 3:21pm

Use gawk, nawk or /usr/xpg4/bin/awk on Solaris.

awk 'NR == FNR {
  _[$1, $2] = FS $3 FS $4
  next
  }
$NF = $NF _[$1, $2]
  ' file2 file1

jamessmith01 · September 6, 2010, 5:35am

Thanks very much for your replies.

I'm using Fedora, will this run the Gawk program okay?

Also what would I need to add to the code to say, "if a line from file2 doesn't match a line from file1 on those two columns, add the whole line to the bottom of file1". How could I do that, please?

Thanks again!

pravin27 · September 6, 2010, 5:51am

Hi, Try this,

awk 'NR==FNR{a[$1$2]=$3FS$4;next} { if ($1$2 in a) { print $0,a[$1$2]} else {print $0}}' file2 file1

jamessmith01 · September 6, 2010, 6:57am

Hi pravin27,

I tried your code with the following inputs:

file1
1 A X a
2 B Y b
3 C Z c
4 D X d
5 E Y e
1 A Y b
2 C Z c
3 D X d

file2
1 A Y b
2 C Z c
3 D X d
4 E Y e
5 E Z c
4 D X d
5 E Y e
6 A X b

Using "./test1.sh > file3"
where test1.sh =

!/bin/sh

awk 'NR==FNR{a[$1$2]=$3FS$4;next} { if ($1$2 in a) { print $0,a[$1$2]} else {print $0}}' file2 file1

However the output was:

file3
1 A X a Y b
2 B Y b
3 C Z c
4 D X d X d
5 E Y e Y e
1 A Y b Y b
2 C Z c Z c
3 D X d X d

The last line of file2 (6 A X b) was not added to the output file.

Any ideas?
Thanks

pravin27 · September 6, 2010, 7:06am

You want add two columns from file 1 to file 2 OR from file 2 to file1 ?

My code code will check columns from file2 with file 1 if match then print current record of file1 and add two columns from file2 else print the current line of file1

jamessmith01 · September 6, 2010, 7:13am

I want to add two columns from file2 to matching lines from file1.

If there is no match for a line in file2, I need that line to be added to the bottom of file1.

So it would be something like:

match columns from file2 to file1
if match found, add columns to line in file1
if no match, leave the line in file1 as it was, but add the line from file2 to the bottom of file1

radoulov · September 6, 2010, 7:20am

Untested:

awk 'END {
  for (r in _) print _[r]
  }
NR == FNR {
  _[$1, $2] = $0; next
  }
($1, $2) in _ {
  split(_[$1, $2], t); delete _[$1, $2]
  print $0, t[3], t[4]
  }' file2 file1

pravin27 · September 6, 2010, 7:34am

Try this,

awk 'NR==FNR{a[$1FS$2]=$3FS$4;next} { if ($1FS$2 in a) { print $0,a[$1FS$2];delete a[$1FS$2]} else {print $0}} END{for (i in a) { print i,a}}' file2 file1

jamessmith01 · September 6, 2010, 7:52am

Output from radoulov's code:

1 A X a Y b
4 D X d X d
5 E Y e Y e
2 C Z c Z c
3 D X d X d
6 A X b
4 E Y e

It seems to have missed out lines from file1:
2 B Y b
3 C Z c
1 A Y b

Output from pravin27's code:

1 A X a Y b
2 B Y b
3 C Z c
4 D X d X d
5 E Y e Y e
1 A Y b
2 C Z c Z c
3 D X d X d
6 A X b
4 E Y e
(correct)

That's great, thanks. If I wanted to specify a ; as a delimiter (as well as a tab), what code would I add?

Also, let's say that if a line from file2 does not match file1, I want to add it to the bottom, BUT I want to shift the 3rd and 4th fields into the 5th and 6th field positions, so I know the line is from file2.

Expected output:

1 A X a Y b
2 B Y b
3 C Z c
4 D X d X d
5 E Y e Y e
1 A     Y b
2 C Z c Z c
3 D X d X d
6 A     X b
4 E     Y e

Is that possible?

radoulov · September 6, 2010, 8:02am

I don't understand why:

2 B Y b
3 C Z c

and not:

2 B    Y b
3 C    Z c

jamessmith01 · September 6, 2010, 8:05am

I want to keep the data from file1 in a separate field to data from file2, so I know where the data in those fields came from.

E.g. in context:

file1:
Chromosome Gene Level
Chr1 Abbb 13
Chr4 Fxxx 34
Chr6 Usbs 100

file2:
Chromosome Gene Level
Chr2 Odjd 28
Chr6 Usbs 65
Chr8 Wuws 22

output:
Chromosome Gene Level1 Level2
Chr1 Abbb 13
Chr4 Fxxx 34
Chr6 Usbs 100 65
Chr2 Odjd  28
Chr8 Wuws  22

pravin27 · September 6, 2010, 8:10am

Hi , Try this,

awk 'NR==FNR{a[$1FS$2]=$3FS$4;next} { if ($1FS$2 in a) { print $0,a[$1FS$2];delete a[$1FS$2]} else {print $0}} END{for (i in a) { print i,FS,FS,a}}' file2 file1

jamessmith01 · September 6, 2010, 8:30am

Hi pravin27,

Thanks for the code.

edit: sorry I think I was wrong, hold on while I check this...

---------- Post updated at 01:30 PM ---------- Previous update was at 01:18 PM ----------

Okay sorry about the confusion. I think the code works, but I have some duplicates in my input files, so it is behaving strangely. I will remove the duplicates and test again.

Thanks.

pravin27 · September 6, 2010, 8:36am

Hi, I think this will solve your problem,

awk 'NR==FNR{a[$1FS$2]=$3FS$4;next} { if ($1FS$2 in a) { print $0,a[$1FS$2];b[$1FS$2]=++i} else {print $0}} END{for (i in b) {delete a;} for (j in a){print j,FS,FS,a[j]}}' file2 file1

jamessmith01 · September 6, 2010, 8:40am

Tested and it works perfectly, great.

Please can you tell me how I would apply this code to my specific dataset?

Here is a sample from my file2. I would like to match on fields 1 and 4 ("chr1" and "5473") and copy data from fields 6 ("0") and 14 ("RPKM "0.00"").

chr1	protein_coding	exon	5473	5485	0	+	.	 gene_id "ENSRNOG00000021490"; transcript_id "ENSRNOT00000044270"; exon_number "1"; gene_name "Vom2r-ps1"; transcript_name "Vom2r-ps1"; RPKM "0.00";

The first 9 fields are separated by a tab, but the other fields are separated by a ";".

pravin27 · September 6, 2010, 9:05am

Post your actual input files and expected output accordingly.

jamessmith01 · September 6, 2010, 9:47am

The files are very large so I have posted some lines from them.

file 1:

chr1	protein_coding	exon	2256218	2256307	1	+	.	 gene_id "ENSRNOG00000014303"; transcript_id "ENSRNOT00000019181"; exon_number "1"; gene_name "Lrp11"; transcript_name "Lrp11"; RPKM "0.50";
chr1	protein_coding	exon	2256312	2256615	1	+	.	 gene_id "ENSRNOG00000014303"; transcript_id "ENSRNOT00000019181"; exon_number "2"; gene_name "Lrp11"; transcript_name "Lrp11"; RPKM "0.15";
chr1	protein_coding	exon	2262345	2262505	19	+	.	 gene_id "ENSRNOG00000014303"; transcript_id "ENSRNOT00000019181"; exon_number "3"; gene_name "Lrp11"; transcript_name "Lrp11"; RPKM "5.31";
chr1	protein_coding	exon	2265566	2265707	22	+	.	 gene_id "ENSRNOG00000014303"; transcript_id "ENSRNOT00000019181"; exon_number "4"; gene_name "Lrp11"; transcript_name "Lrp11"; RPKM "6.81";
chr1	protein_coding	exon	2270553	2270678	12	+	.	 gene_id "ENSRNOG00000014303"; transcript_id "ENSRNOT00000019181"; exon_number "5"; gene_name "Lrp11"; transcript_name "Lrp11"; RPKM "4.28";
chr1	protein_coding	exon	2273530	2273739	37	+	.	 gene_id "ENSRNOG00000014303"; transcript_id "ENSRNOT00000019181"; exon_number "6"; gene_name "Lrp11"; transcript_name "Lrp11"; RPKM "7.82";

file2:

chr1	protein_coding	exon	2256218	2256307	0	+	.	 gene_id "ENSRNOG00000014303"; transcript_id "ENSRNOT00000019181"; exon_number "1"; gene_name "Lrp11"; transcript_name "Lrp11"; RPKM "0.00";
chr1	protein_coding	exon	2256312	2256615	1	+	.	 gene_id "ENSRNOG00000014303"; transcript_id "ENSRNOT00000019181"; exon_number "2"; gene_name "Lrp11"; transcript_name "Lrp11"; RPKM "0.14";
chr1	protein_coding	exon	2262345	2262505	26	+	.	 gene_id "ENSRNOG00000014303"; transcript_id "ENSRNOT00000019181"; exon_number "3"; gene_name "Lrp11"; transcript_name "Lrp11"; RPKM "7.02";
chr1	protein_coding	exon	2265566	2265707	26	+	.	 gene_id "ENSRNOG00000014303"; transcript_id "ENSRNOT00000019181"; exon_number "4"; gene_name "Lrp11"; transcript_name "Lrp11"; RPKM "7.96";
chr1	protein_coding	exon	2270553	2270678	16	+	.	 gene_id "ENSRNOG00000014303"; transcript_id "ENSRNOT00000019181"; exon_number "5"; gene_name "Lrp11"; transcript_name "Lrp11"; RPKM "5.52";
chr1	protein_coding	exon	2273530	2273739	30	+	.	 gene_id "ENSRNOG00000014303"; transcript_id "ENSRNOT00000019181"; exon_number "6"; gene_name "Lrp11"; transcript_name "Lrp11"; RPKM "6.11";

output:

chr1	protein_coding	exon	2256218	2256307	1	+	.	 gene_id "ENSRNOG00000014303"; transcript_id "ENSRNOT00000019181"; exon_number "1"; gene_name "Lrp11"; transcript_name "Lrp11"; RPKM "0.50";	0	RPKM "0.00"
chr1	protein_coding	exon	2256312	2256615	1	+	.	 gene_id "ENSRNOG00000014303"; transcript_id "ENSRNOT00000019181"; exon_number "2"; gene_name "Lrp11"; transcript_name "Lrp11"; RPKM "0.15";	1	RPKM "0.14"
chr1	protein_coding	exon	2262345	2262505	19	+	.	 gene_id "ENSRNOG00000014303"; transcript_id "ENSRNOT00000019181"; exon_number "3"; gene_name "Lrp11"; transcript_name "Lrp11"; RPKM "5.31";	26	RPKM "7.02"
chr1	protein_coding	exon	2265566	2265707	22	+	.	 gene_id "ENSRNOG00000014303"; transcript_id "ENSRNOT00000019181"; exon_number "4"; gene_name "Lrp11"; transcript_name "Lrp11"; RPKM "6.81";	26	RPKM "7.96"
chr1	protein_coding	exon	2270553	2270678	12	+	.	 gene_id "ENSRNOG00000014303"; transcript_id "ENSRNOT00000019181"; exon_number "5"; gene_name "Lrp11"; transcript_name "Lrp11"; RPKM "4.28";	16	RPKM "5.52"
chr1	protein_coding	exon	2273530	2273739	37	+	.	 gene_id "ENSRNOG00000014303"; transcript_id "ENSRNOT00000019181"; exon_number "6"; gene_name "Lrp11"; transcript_name "Lrp11"; RPKM "7.82";	30	RPKM "6.11"

pravin27 · September 6, 2010, 11:57am

Try this,

awk 'NR==FNR{a[$1FS$4]=$6FS$19FS$20;next} { if ($1FS$4 in a) { print $0,a[$1FS$4];delete a[$1FS$4]} else {print $0}} END{for (i in a) { print i,FS,FS,a}}' file2 file1