Join lines from two files based on match

I have two files.
File1

>gi|11320906|gb|AF197889.1|_Buchnera_aphidicola
ATGAAATTTAAGATAAAAAATAGTATTTT
>gi|11320898|gb|AF197885.1|_Buchnera_aphidicola
ATGAAATTTAATATAAACAATAAAA
>gi|11320894|gb|AF197883.1|_Buchnera_aphidicola
ATGAAATTTAATATAAACAATAAAATTTTT

File2

AF197885	Uroleucon aeneum
AF197886	Uroleucon jaceae
AF197889	Uroleucon obscurum
AF197883	Uroleucon astronomus
AF197893	Uroleucon erigeronense

For all lines in file1, I want to match the term bracketed by "gb|" and "." (i.e. AF197889 in the first line) to a line in file2. In this example of file1, all terms of interest start with "AF", but this isn't always the case.

If there's a match, I'd like to append the species name in file2, preceded by "_host_" to the matching line in file1, using underscores and no spaces. Desired output:

>gi|11320906|gb|AF197889.1|_Buchnera_aphidicola_host_Uroleucon_obscurum
ATGAAATTTAAGATAAAAAATAGTATTTT
>gi|11320898|gb|AF197885.1|_Buchnera_aphidicola_host_Uroleucon_aeneum
ATGAAATTTAATATAAACAATAAAA
>gi|11320894|gb|AF197883.1|_Buchnera_aphidicola_host_Uroleucon_astronomus
ATGAAATTTAATATAAACAATAAAATTTTT

With the meager skills I have, I could use "|" as a field separator for file1 and use awk to fill an array to find matches. But I'm not sure how to append the file2 data, or how to accomplish it in one step. Can anyone help?

You could try something like:

awk '
FNR == NR {
        x[$1] = "_host"
        for(i = 2; i <= NF; i++)
                x[$1]=x[$1] "_" $i
        next
}
{       print $0 x[$4]
}' File2 FS='[|.]' File1

If you are using a Solaris/SunOS system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of awk.
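
For anyone who wants to try this without touching real data, here is a minimal sketch that rebuilds a shortened copy of the sample files from the question in a temporary directory and runs the same idea on them. The temporary paths and the variable names tmp and out are only for illustration. The two extra rules (the /^>/ test and the "in" check) are an assumption on my part rather than part of the command above: they restrict the append to header lines whose accession actually appears in File2, and pass everything else through unchanged.

```shell
# Sample data copied from the question (File2 is assumed to be tab-separated).
tmp=$(mktemp -d)
printf '%s\n' \
  '>gi|11320906|gb|AF197889.1|_Buchnera_aphidicola' \
  'ATGAAATTTAAGATAAAAAATAGTATTTT' \
  '>gi|11320898|gb|AF197885.1|_Buchnera_aphidicola' \
  'ATGAAATTTAATATAAACAATAAAA' > "$tmp/File1"
printf 'AF197885\tUroleucon aeneum\nAF197889\tUroleucon obscurum\n' > "$tmp/File2"

out=$(awk '
FNR == NR {                 # first file (File2): build accession -> suffix map
        x[$1] = "_host"
        for (i = 2; i <= NF; i++)
                x[$1] = x[$1] "_" $i
        next
}
/^>/ && ($4 in x) {         # header line whose accession has a match
        print $0 x[$4]
        next
}
{ print }                   # sequence lines and unmatched headers pass through
' "$tmp/File2" FS='[|.]' "$tmp/File1")

printf '%s\n' "$out"
rm -rf "$tmp"
```

The FS='[|.]' assignment sits between the two file names, so File2 is still split on whitespace while File1 is split on "|" and ".", which puts the accession in $4 of each header line.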

Small variation in the first part:

awk 'NR==FNR{i=$1; $1="_host"; A[i]=$0; next} {print $0 A[$4]}' OFS=_ file2 FS='[|.]' file1
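
The variation relies on awk rebuilding $0 with OFS whenever a field is assigned. A small sketch of just that step, run on one line of file2 from the question (the variable name rebuilt is only for illustration, and -v OFS=_ is used here instead of the operand form because the input comes from a pipe):

```shell
# Assigning to $1 makes awk rejoin all fields with OFS, turning the
# whitespace between the fields into underscores.
rebuilt=$(printf 'AF197885\tUroleucon aeneum\n' | awk -v OFS=_ '{ $1 = "_host"; print }')
printf '%s\n' "$rebuilt"    # prints: _host_Uroleucon_aeneum
```

This is why the accession has to be saved in i before $1 is overwritten: after the assignment, the original key is gone from the record.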