Match single line in file1 to groups of lines in file2

pathunkathunk · March 5, 2014, 10:44pm

I have two files.

File 1 is a two-column index file, e.g.

comp11084_c0_seq6:130-468(-) comp12746_c0_seq3:140-478(+)
comp11084_c0_seq3:201-539(-) comp12746_c0_seq2:191-529(+)

File 2 is a sequence file with headers named with the same terms that populate file 1.

>comp11084_c0_seq6:130-468(-)
MRYVAAYLLASLSGKEPSSDEVEKILSSVGIESDSSKLSLVIKELKGKNVDEVIESGRSKLAS
>comp12746_c0_seq3:140-478(+)
MRYVAAYLLASLSGKEPSSDEVEKILSSVGIESDSSKLSLVIKELKGKNVDEVIESGRSKLAS

>comp11084_c0_seq3:201-539(-)
MRYVAAYLLASFSGKEPTSDEIEKILSSVGIESDSDKVSLVVKELKGKNVDEVIESGRSKLAS
>comp12746_c0_seq2:191-529(+)
MRYVAAYLLASFSGKEPTSDEIEKILSSVGIESDSDKVSLVVKELKGKNVDEVIESGRSKLAS

>comp11084_c0_seq3:201-539(-)
MSDTSNVNRLEELGKMKVNDLKKELKARGLSTVGNKQELIDRMINHSESSVLDIEDTVLDE
>comp12601_c0_seq4:132-965(-)
MSDTSNVNRLEELGKMKVNDLKKELKARGLSTVGNKQELIDRMINHSESSVLDIEDTVLDE

Every pair of terms in file 1 "heads" a pair of sequences in file 2. These are the pairs of sequences I want to extract. File 2 also has sequence pairs whose headers are not found as a pair in file 1 (e.g. the third pair in this example), which I want to exclude.

Output:

>comp11084_c0_seq6:130-468(-)
MRYVAAYLLASLSGKEPSSDEVEKILSSVGIESDSSKLSLVIKELKGKNVDEVIESGRSKLAS
>comp12746_c0_seq3:140-478(+)
MRYVAAYLLASLSGKEPSSDEVEKILSSVGIESDSSKLSLVIKELKGKNVDEVIESGRSKLAS

>comp11084_c0_seq3:201-539(-)
MRYVAAYLLASFSGKEPTSDEIEKILSSVGIESDSDKVSLVVKELKGKNVDEVIESGRSKLAS
>comp12746_c0_seq2:191-529(+)
MRYVAAYLLASFSGKEPTSDEIEKILSSVGIESDSDKVSLVVKELKGKNVDEVIESGRSKLAS

I can print lines that match between the two files:

awk ' NR == FNR { arr[$1$2]=1; next } arr[$2$1] {print $1, $2} '

But I don't know how to match one line in file 1 to multiple lines in file 2.
Can anyone help me out?

Scrutinizer · March 6, 2014, 12:59am

You could try something like this:

awk 'NR==FNR{A[">" $1,">" $2]; next} ($1,$3) in A' file1 RS= ORS='\n\n' file2

or

awk 'NR==FNR{A[$1,$2]; next} ($2,$5) in A' file1 FS=\> RS= ORS='\n\n' file2

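Both make use of the empty lines between records in the second file: setting RS to an empty string puts awk in paragraph mode, so each blank-line-separated block is read as one record. In that mode a newline also separates fields, which is why the two headers end up in $1 and $3 in the first command, and in $2 and $5 in the second (where FS is set to ">").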
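
To see the first command in action, here is a small self-contained sketch. It assumes the layout shown in the question (blank lines between pairs in file 2). The temporary directory, the variable names (tmp, result, pairs) and the shortened placeholder sequences (AAAA, BBBB, CCCC) are only for illustration and are not part of the original data.

```shell
# Build throwaway copies of the two files (headers from the question,
# sequences replaced by short placeholders).
tmp=$(mktemp -d)

cat > "$tmp/file1" <<'EOF'
comp11084_c0_seq6:130-468(-) comp12746_c0_seq3:140-478(+)
comp11084_c0_seq3:201-539(-) comp12746_c0_seq2:191-529(+)
EOF

cat > "$tmp/file2" <<'EOF'
>comp11084_c0_seq6:130-468(-)
AAAA
>comp12746_c0_seq3:140-478(+)
AAAA

>comp11084_c0_seq3:201-539(-)
BBBB
>comp12746_c0_seq2:191-529(+)
BBBB

>comp11084_c0_seq3:201-539(-)
CCCC
>comp12601_c0_seq4:132-965(-)
CCCC
EOF

# Pass 1 (file1, default line records): remember each header pair,
# with ">" prepended so the keys match the headers in file2.
# Pass 2 (file2, paragraph mode via RS=): $1 and $3 are the two
# headers of a block; print the block only if that pair was stored.
result=$(awk '
  NR == FNR { pairs[">" $1, ">" $2]; next }
  ($1, $3) in pairs
' "$tmp/file1" RS= ORS='\n\n' "$tmp/file2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

With this sample input, the first two blocks should be printed and the third (the comp12601 pair) dropped. Note that the match is order-sensitive: a block is kept only if its first and second headers appear in the same order as the two columns of file 1.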