Matching two files with special field separator

Hello,

I have a file with the following structure:

>ENSGALG00000000011|ENSGALT00000000012|57|1123|1125
AACTGTGTGTTTTTT
>ENSGALG00000000012|ENSGALT00000000013|57|1145|1155
AAAAAAGGTCCTGTGTGC
>ENSGALG00000000015|ENSGALT00000000014|57|1144|1155
AAAATGTGTGTGTGTGTGTGTG

I want to use a second file to extract the records that have a specific ID in the first field. The second file looks like this:

ENSGALG00000000011
ENSGALG00000000015

The final output should look like this:

>ENSGALG00000000011|ENSGALT00000000012|57|1123|1125
AACTGTGTGTTTTTT
>ENSGALG00000000015|ENSGALT00000000014|57|1144|1155
AAAATGTGTGTGTGTGTGTGTG

I know this code:

awk 'FNR == NR {_[$1]++} FNR < NR {if ( $1 in _ ) print $1, $0}' filetwo fileone

It compares the first fields of two files and prints the matching lines, but because of the special field separator here, I don't know how to apply it to this example.

Thanks a lot in advance for your help.
Cheers,

I had a similar situation once before.
1) I appended the '|' character to each line of the 2nd file
2) I then used grep with the -f file option

Is this a possible solution for you?
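
The two steps above might look like the sketch below. It assumes the sample data from the question, written to scratch files named file1 and file2, and uses a hypothetical helper file called patterns. A ">" is also prepended to each ID so the pattern lines up with the start of the header. The -A 1 option (available in GNU and BSD grep) prints the line after each match, so this only covers sequences that fit on a single line.

```shell
# Work in a scratch directory so no existing files are touched.
cd "$(mktemp -d)" || exit 1

cat > file1 <<'EOF'
>ENSGALG00000000011|ENSGALT00000000012|57|1123|1125
AACTGTGTGTTTTTT
>ENSGALG00000000012|ENSGALT00000000013|57|1145|1155
AAAAAAGGTCCTGTGTGC
>ENSGALG00000000015|ENSGALT00000000014|57|1144|1155
AAAATGTGTGTGTGTGTGTGTG
EOF

cat > file2 <<'EOF'
ENSGALG00000000011
ENSGALG00000000015
EOF

# Step 1: turn each ID into ">ID|" so it can only match a complete first field.
sed 's/^/>/; s/$/|/' file2 > patterns

# Step 2: -F treats the patterns as fixed strings, -f reads them from a file,
# -A 1 adds the line following each match. The second grep drops the "--"
# separator that grep prints between groups of context lines.
result=$(grep -F -A 1 -f patterns file1 | grep -v '^--$')
printf '%s\n' "$result"
```

For sequences that span several lines, -A 1 is not enough and an awk approach is probably the better fit.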

I think you can give it a try with the

awk -F

option to specify the field delimiter of your choice.
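
As a small illustration of what -F does with one of the header lines from the question (the variable names header, first and count are only for this sketch): with "|" as the separator the header splits into five fields, and the first field still carries the leading ">".

```shell
header='>ENSGALG00000000011|ENSGALT00000000012|57|1123|1125'

# -F'|' makes awk split each line on the literal pipe character.
first=$(printf '%s\n' "$header" | awk -F'|' '{ print $1 }')
count=$(printf '%s\n' "$header" | awk -F'|' '{ print NF }')

echo "$first"   # first field, including the ">" prefix
echo "$count"   # number of fields on the header line
```

Because $1 keeps the ">", the IDs from the second file need the same prefix (or the ">" has to be stripped) before the two can be compared.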


OK, I added the -F option:

awk -F "|" 'FNR == NR {a[$1]++} FNR < NR {if ( $1 in a ) print $0}' filetwo fileone

It works, but it only prints the headers and not the content, that is, the sequence of letters below each header. Sorry for asking, but how can I get around this problem?

Thanks!

$ cat file1
>ENSGALG00000000011|ENSGALT00000000012|57|1123|1125
AACTGTGTGTTTTTT
>ENSGALG00000000012|ENSGALT00000000013|57|1145|1155
AAAAAAGGTCCTGTGTGC
>ENSGALG00000000015|ENSGALT00000000014|57|1144|1155
AAAATGTGTGTGTGTGTGTGTG
$ cat file2
ENSGALG00000000011
ENSGALG00000000015
$ awk -F"|" 'FNR==NR{A[">"$1];next}($1 in A){print;getline;print}' file2 file1
>ENSGALG00000000011|ENSGALT00000000012|57|1123|1125
AACTGTGTGTTTTTT
>ENSGALG00000000015|ENSGALT00000000014|57|1144|1155
AAAATGTGTGTGTGTGTGTGTG

A solution has already been provided by akshay.

Oh, thanks, but now there is another problem: in my actual file, the content under each header is longer than one line. For example:

>ENSGALG00000014675|ENSGALT00000023647|1|1603|1605
cttttccactttgctctcatcCTGCTATTGGATTTgagatgcatgtcTGTTAATATTGTA
GCCTTTGGAAATGAAAGAGATGGATTTTCTGAAGACAATCAGCAGTCAAGTCTGATCTGG
AGCTATCTAGGGAGAAGTGCTCTCATTTCAGAGACTGAAAGTGGTCTGTTGCTGAATTCT
GCCAATCACATTAGAAATCCTGTTTTTACTGAATATCAAGCCTGCGTGTTTGGAAATGTC
AGATTGGTGGTACATGACTGTCCTCTTTGGGATATATTTGACAGTGACTGGTATACTTCT
CGCAGTCTCATTGGAGGAGCTGATATTATTGTGATTaaatactctgtcaatGACAAGACT
TCATTTCAAGAATTAAAGGACAGTTATGTCCCAATGATAAAAAAAGCGTTAAACCACTGT
TCAGTTCCAGTAATAATTTCTGCTATTGGTGCAAGAAAAAATGTGCCTTGTACCTGCCCA
CTGTGCACTTCAGACAGAAGGAGCTGTGTTACTTCTTCTGAAGGAGttcagcttgctaaa
gaactaggagctacgtatcttgaattgcnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnggaatattttatgatccaaagTTTGAATCGGAAGTCATCTGAAAAA
ATGAAGAAAAGAAGAAAGACCCAGAAGTACCATCGAGTTAAACCCCCTCAGCTTGAACAA
CCAGAAAAAATGCCAATCTTAAGAGGTGAAGCCTCACATTATGACTCTGATTTACACAAG
TTGCTGTCCTGCTGCCAGTGTGTGGATGTGATATTTTACTCAGAAGACTTAAAGAAAGTA
GTAGAAGCTCACAAGATCATTTTGTGCTCTGTAAGCCATGTCTTCATGTTACTTTTCAAA
GTGAAGAGTCCAGCTGATATTCATGATTCTGCTATCATACGGACTGCGCAAAGTCTCTTT
GCAGTGAACAGTGAAGCTGTGTTTCCGTTTCCTAGCAGTGGCTCATCATGCGACCCACCA
GTAAGAGTCATTGTTAAAGACTCCATCTTCTGTTCTTGTTTGTCAGACATTCTACACTTC
ATTTATTCAGGTGCTTTCCAGTGGGAACGGTTAGAAGAAGATATAAAGAAGAAGCTAA

Using this script:

awk -F"|" 'FNR==NR{A[">"$1];next}($1 in A){print;getline;print}'

prints only the first line of each sequence. Is there a way to solve this? Thanks!

Try something like this [not tested]

$ awk -F"|" 'FNR==NR{A[">"$1];next} /^>/{f=($1 in A)} f' file2 file1
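
For multi-line sequences, one way is to keep a flag that is re-evaluated on every header line; if it is only ever switched on, the sequence lines of unwanted records that follow a match get printed too. Here is a minimal sketch of that idea. It assumes the sample records from the question with their sequences wrapped over two lines purely for illustration, and the names ids and keep are made up for this example.

```shell
# Work in a scratch directory so no existing files are touched.
cd "$(mktemp -d)" || exit 1

# Sample headers from the question; sequences wrapped for illustration only.
cat > file1 <<'EOF'
>ENSGALG00000000011|ENSGALT00000000012|57|1123|1125
AACTGTGTG
TTTTTT
>ENSGALG00000000012|ENSGALT00000000013|57|1145|1155
AAAAAAGGT
CCTGTGTGC
>ENSGALG00000000015|ENSGALT00000000014|57|1144|1155
AAAATGTGTGTG
TGTGTGTGTG
EOF

cat > file2 <<'EOF'
ENSGALG00000000011
ENSGALG00000000015
EOF

result=$(awk -F'|' '
    FNR == NR { ids[">" $1]; next }   # first file: store each ID with a ">" prefix
    /^>/      { keep = ($1 in ids) }  # header line: is this record wanted?
    keep                              # print header and sequence lines while keep is set
' file2 file1)
printf '%s\n' "$result"
```

The header line itself is printed by the last rule as well, since keep is already set by the time awk reaches it.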

Try also

awk 'NR==FNR{T[$1]; next} {for (i in T) {if ($0 ~ "^"i) printf "%s", RS $0}}' file2 RS=">" file1