Homa
December 9, 2013, 7:35am
1
Hello,
I have a file with such structure:
>ENSGALG00000000011|ENSGALT00000000012|57|1123|1125
AACTGTGTGTTTTTT
>ENSGALG00000000012|ENSGALT00000000013|57|1145|1155
AAAAAAGGTCCTGTGTGC
>ENSGALG00000000015|ENSGALT00000000014|57|1144|1155
AAAATGTGTGTGTGTGTGTGTG
I want to use a second file to extract the records whose first header field matches a specific ID. The second file looks like this:
ENSGALG00000000011
ENSGALG00000000015
To get the final output like this:
>ENSGALG00000000011|ENSGALT00000000012|57|1123|1125
AACTGTGTGTTTTTT
>ENSGALG00000000015|ENSGALT00000000014|57|1144|1155
AAAATGTGTGTGTGTGTGTGTG
I know this code:
awk 'FNR == NR {_[$1]++} FNR < NR {if ( $1 in _ ) print $1, $0}' filetwo fileone
which compares the first fields of two files and prints the matching ones, but because of the special field separators in my headers, I don't know how to apply it to this example.
Thanks a lot in advance for your help.
Cheers,
joeyg
December 9, 2013, 7:41am
2
Once before, I had a similar situation.
1) I appended the '|' character to each line of the 2nd file
2) I then used grep with the -f file option
Is this a possible solution for you?
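A minimal sketch of what those two steps might look like, assuming the file names file1 (sequences) and file2 (IDs) used later in the thread. The sed command and the names patterns.txt and headers.txt are hypothetical, chosen only for illustration:

```shell
# Sample data from the first post, written to the current directory.
cat > file1 <<'EOF'
>ENSGALG00000000011|ENSGALT00000000012|57|1123|1125
AACTGTGTGTTTTTT
>ENSGALG00000000012|ENSGALT00000000013|57|1145|1155
AAAAAAGGTCCTGTGTGC
>ENSGALG00000000015|ENSGALT00000000014|57|1144|1155
AAAATGTGTGTGTGTGTGTGTG
EOF
printf '%s\n' ENSGALG00000000011 ENSGALG00000000015 > file2

# 1) Append '|' to every ID so a shorter ID cannot match
#    the beginning of a longer one.
sed 's/$/|/' file2 > patterns.txt

# 2) -F treats each pattern as a fixed string (so '|' is literal),
#    -f reads the patterns from a file.
grep -F -f patterns.txt file1 > headers.txt
cat headers.txt
```

On its own this returns only the matching header lines. For single-line sequences, GNU and BSD grep accept -A 1 to print the line after each match as well, but they then also emit "--" separator lines between groups, which would have to be filtered out.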
zozoo
December 9, 2013, 8:02am
3
I think you could try the
awk -F
option to specify the field delimiter of your choice.
Homa
December 9, 2013, 9:24am
4
OK, I added the
-F
option:
awk -F "|" 'FNR == NR {a[$1]++} FNR < NR {if ( $1 in a ) print $0}' filetwo fileone
It works, but it only prints the headers and not the content, that is, the sequence of letters below each header. Sorry for the question, but how can I get around this problem?
Thanks!
$ cat file1
>ENSGALG00000000011|ENSGALT00000000012|57|1123|1125
AACTGTGTGTTTTTT
>ENSGALG00000000012|ENSGALT00000000013|57|1145|1155
AAAAAAGGTCCTGTGTGC
>ENSGALG00000000015|ENSGALT00000000014|57|1144|1155
AAAATGTGTGTGTGTGTGTGTG
$ cat file2
ENSGALG00000000011
ENSGALG00000000015
$ awk -F"|" 'FNR==NR{A[">"$1];next}($1 in A){print;getline;print}' file2 file1
>ENSGALG00000000011|ENSGALT00000000012|57|1123|1125
AACTGTGTGTTTTTT
>ENSGALG00000000015|ENSGALT00000000014|57|1144|1155
AAAATGTGTGTGTGTGTGTGTG
zozoo
December 9, 2013, 9:30am
6
Akshay has already provided a solution.
Homa
December 9, 2013, 9:38am
7
Oh, thanks! But now there is another problem: in my actual file, the sequence under each header spans more than one line, for example:
>ENSGALG00000014675|ENSGALT00000023647|1|1603|1605
cttttccactttgctctcatcCTGCTATTGGATTTgagatgcatgtcTGTTAATATTGTA
GCCTTTGGAAATGAAAGAGATGGATTTTCTGAAGACAATCAGCAGTCAAGTCTGATCTGG
AGCTATCTAGGGAGAAGTGCTCTCATTTCAGAGACTGAAAGTGGTCTGTTGCTGAATTCT
GCCAATCACATTAGAAATCCTGTTTTTACTGAATATCAAGCCTGCGTGTTTGGAAATGTC
AGATTGGTGGTACATGACTGTCCTCTTTGGGATATATTTGACAGTGACTGGTATACTTCT
CGCAGTCTCATTGGAGGAGCTGATATTATTGTGATTaaatactctgtcaatGACAAGACT
TCATTTCAAGAATTAAAGGACAGTTATGTCCCAATGATAAAAAAAGCGTTAAACCACTGT
TCAGTTCCAGTAATAATTTCTGCTATTGGTGCAAGAAAAAATGTGCCTTGTACCTGCCCA
CTGTGCACTTCAGACAGAAGGAGCTGTGTTACTTCTTCTGAAGGAGttcagcttgctaaa
gaactaggagctacgtatcttgaattgcnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnggaatattttatgatccaaagTTTGAATCGGAAGTCATCTGAAAAA
ATGAAGAAAAGAAGAAAGACCCAGAAGTACCATCGAGTTAAACCCCCTCAGCTTGAACAA
CCAGAAAAAATGCCAATCTTAAGAGGTGAAGCCTCACATTATGACTCTGATTTACACAAG
TTGCTGTCCTGCTGCCAGTGTGTGGATGTGATATTTTACTCAGAAGACTTAAAGAAAGTA
GTAGAAGCTCACAAGATCATTTTGTGCTCTGTAAGCCATGTCTTCATGTTACTTTTCAAA
GTGAAGAGTCCAGCTGATATTCATGATTCTGCTATCATACGGACTGCGCAAAGTCTCTTT
GCAGTGAACAGTGAAGCTGTGTTTCCGTTTCCTAGCAGTGGCTCATCATGCGACCCACCA
GTAAGAGTCATTGTTAAAGACTCCATCTTCTGTTCTTGTTTGTCAGACATTCTACACTTC
ATTTATTCAGGTGCTTTCCAGTGGGAACGGTTAGAAGAAGATATAAAGAAGAAGCTAA
Using this script:
awk -F"|" 'FNR==NR{A[">"$1];next}($1 in A){print;getline;print}'
prints only the first line of each sequence. Is there a way to solve this? Thanks!
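Try something like this [not tested]. The flag f is recomputed on every header line, so sequence lines are printed only while the most recent header matched:
$ awk -F"|" 'FNR==NR{A[">"$1];next} /^>/{f=($1 in A)} f' file2 file1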
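One way to handle multi-line sequences is to recompute a flag on every header line and print while it is set. Below is a self-contained sketch of that idea. The sequences are made-up short ones that keep the header layout from the first post, and the names want, keep and selected.fa are hypothetical:

```shell
# Small multi-line test case (shortened, made-up sequences).
cat > file1 <<'EOF'
>ENSGALG00000000011|ENSGALT00000000012|57|1123|1125
AACTGTGTGT
TTTTT
>ENSGALG00000000012|ENSGALT00000000013|57|1145|1155
AAAAAAGGTC
CTGTGTGC
>ENSGALG00000000015|ENSGALT00000000014|57|1144|1155
AAAATGTGTG
TGTGTGTGTGTG
EOF
printf '%s\n' ENSGALG00000000011 ENSGALG00000000015 > file2

awk -F'|' '
    FNR == NR { want[">" $1]; next }   # first file: remember each ID with a leading >
    /^>/      { keep = ($1 in want) }  # header line: switch printing on or off
    keep                               # print header and sequence lines while on
' file2 file1 > selected.fa
cat selected.fa
```

Because keep is reassigned on every header, a non-matching header switches printing off again, so the lines of ENSGALG00000000012 are skipped.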
RudiC
December 9, 2013, 2:11pm
9
Try also
awk 'NR==FNR{T[$1]; next} {for (i in T) {if ($0 ~ "^"i) print RS $0}}' file2 RS=">" file1
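A possible variant of the same record-based idea, offered as a sketch rather than a tested replacement: with "|" as the field separator and ">" as the record separator, $1 of each record is the bare gene ID, so it can be looked up directly in the array instead of looping over every ID with a regex. The names want and selected_rs.fa are hypothetical, and the sketch assumes ">" never occurs anywhere except at the start of a header:

```shell
# Same made-up multi-line test case as a quick check.
cat > file1 <<'EOF'
>ENSGALG00000000011|ENSGALT00000000012|57|1123|1125
AACTGTGTGT
TTTTT
>ENSGALG00000000012|ENSGALT00000000013|57|1145|1155
AAAAAAGGTC
CTGTGTGC
>ENSGALG00000000015|ENSGALT00000000014|57|1144|1155
AAAATGTGTG
TGTGTGTGTGTG
EOF
printf '%s\n' ENSGALG00000000011 ENSGALG00000000015 > file2

awk -F'|' '
    NR == FNR { want[$1]; next }   # file2: default RS, one ID per line
    $1 in want {                   # file1: RS is >, so $1 is the bare gene ID
        sub(/\n$/, "")             # drop the trailing newline of the record
        print ">" $0               # restore the > that RS consumed
    }
' file2 RS='>' file1 > selected_rs.fa
cat selected_rs.fa
```

Stripping the trailing newline before print avoids the blank line that print would otherwise add after each record, since every record already ends in a newline.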