I have the following file, in which every record starts with @M at the beginning of a line, so I use @M as the record separator (RS):
@M04961:22:000000000-B5VGJ:1:1101:9280:7106 1:N:0:86
GGCATGAAAACATACAAACCGTCTTTCCAGAAATTGTTCCAAGTATCGGCAACAGCTTTATCAATACCATGAAAAATATCAACCACACCAGAAGCAGCAT
+
GGGGGGGGGGGGGGGGGCCGGGGGF,EDFFGEDFG,@DGGCGGEGGG7DCGGGF68CGFFFGGGG@CGDGFFDFEFEFF:30CGAFFDFEFF8CAF;;8F
@M04961:22:000000000-B5VGJ:1:1101:14258:7136 1:N:0:86
GGCATGAAAACATACAACAGCGGCTTTAACCGGACGCTCGACGCCATTAATAATGTTTTCCGTAAATTCAGCGCCTTCCATGATGAGACAGGCCGTTTGA
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDEGGEGEGEGGG
@M04961:22:000000000-B5VGJ:1:1101:15671:7305 1:N:0:86
GGCATGAAAACATACAAAGTAAGGGGCCGAAGCCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTAC
+
CCCC@CCFFGFGEGGGGGFGGGGGGGGFGGGGGGEFGGGGGGGGGCGGGGGGGGCFFG@GFFGGGGGCCGCGFGGGGGGGGGGGFFBEGG:CFF9>CGEG
@M04961:22:000000000-B5VGJ:1:1101:10817:7690 1:N:0:86
ACGAGCATCATCTTGATTAAGCTCATTAGGGTTAGCCTCGGTACGGTCAGGCATCCACGGCGCTTTAAAATAGTTGTTATAGATATTCAAATAACCCTGA
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEEFFFGGGGGG
@M04961:22:000000000-B5VGJ:1:1101:10091:7763 1:N:0:86
GAGCACATTGTAGCATTGTGCCAATTCATCCATTAACTTCTCAGTAACAGATACAAACTCATCACGAACGTCAGAAGCAGCCTTATGGCCGTCAACATAC
+
:=@FGEFFFGGGGGGGFBB@BEFGG?F,EFCCF@FGGGGGGECFGFG9,><3>FC@DFFGG9:383@FC9,>;,>78FC=FCDECFFDGFFCFFGGC?FF
@M04961:22:000000000-B5VGJ:1:1101:14783:7784 1:N:0:86
TCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGACTCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCT
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDGGGGGGGCGGGGFGG
@M04961:22:000000000-B5VGJ:1:1101:26069:7790 1:N:0:86
CAGAACGTGAAAAAGCGTCCTGCGTGTAGCGAACTGCGATGGGCATACTGTAACCATAAGGCCACGTATTTTGCAAGCTGGCATGAAAACATACAT
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
I am using the following script to extract the records whose sequence contains a specific string (GGCATGAAAACATACA):
awk -vRS="@M" '/GGCATGAAAACATACA/ { print "@M"$0 }' infile
The problem is that the string should be at the beginning of the second line of each record. Thus, the desired output file should include only three records:
@M04961:22:000000000-B5VGJ:1:1101:9280:7106 1:N:0:86
GGCATGAAAACATACAAACCGTCTTTCCAGAAATTGTTCCAAGTATCGGCAACAGCTTTATCAATACCATGAAAAATATCAACCACACCAGAAGCAGCAT
+
GGGGGGGGGGGGGGGGGCCGGGGGF,EDFFGEDFG,@DGGCGGEGGG7DCGGGF68CGFFFGGGG@CGDGFFDFEFEFF:30CGAFFDFEFF8CAF;;8F
@M04961:22:000000000-B5VGJ:1:1101:14258:7136 1:N:0:86
GGCATGAAAACATACAACAGCGGCTTTAACCGGACGCTCGACGCCATTAATAATGTTTTCCGTAAATTCAGCGCCTTCCATGATGAGACAGGCCGTTTGA
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDEGGEGEGEGGG
@M04961:22:000000000-B5VGJ:1:1101:15671:7305 1:N:0:86
GGCATGAAAACATACAAAGTAAGGGGCCGAAGCCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTAC
+
CCCC@CCFFGFGEGGGGGFGGGGGGGGFGGGGGGEFGGGGGGGGGCGGGGGGGGCFFG@GFFGGGGGCCGCGFGGGGGGGGGGGFFBEGG:CFF9>CGEG
My script, however, outputs an extra record that contains the string later in the second line rather than at its start, and it prints a blank line between records:
@M04961:22:000000000-B5VGJ:1:1101:9280:7106 1:N:0:86
GGCATGAAAACATACAAACCGTCTTTCCAGAAATTGTTCCAAGTATCGGCAACAGCTTTATCAATACCATGAAAAATATCAACCACACCAGAAGCAGCAT
+
GGGGGGGGGGGGGGGGGCCGGGGGF,EDFFGEDFG,@DGGCGGEGGG7DCGGGF68CGFFFGGGG@CGDGFFDFEFEFF:30CGAFFDFEFF8CAF;;8F
@M04961:22:000000000-B5VGJ:1:1101:14258:7136 1:N:0:86
GGCATGAAAACATACAACAGCGGCTTTAACCGGACGCTCGACGCCATTAATAATGTTTTCCGTAAATTCAGCGCCTTCCATGATGAGACAGGCCGTTTGA
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDEGGEGEGEGGG
@M04961:22:000000000-B5VGJ:1:1101:15671:7305 1:N:0:86
GGCATGAAAACATACAAAGTAAGGGGCCGAAGCCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTAC
+
CCCC@CCFFGFGEGGGGGFGGGGGGGGFGGGGGGEFGGGGGGGGGCGGGGGGGGCFFG@GFFGGGGGCCGCGFGGGGGGGGGGGFFBEGG:CFF9>CGEG
@M04961:22:000000000-B5VGJ:1:1101:26069:7790 1:N:0:86
CAGAACGTGAAAAAGCGTCCTGCGTGTAGCGAACTGCGATGGGCATACTGTAACCATAAGGCCACGTATTTTGCAAGCTGGCATGAAAACATACAT
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
I tried adding ^ to the pattern (/^GGCATGAAAACATACA/), but that does not work, because ^ anchors to the start of the whole record (the header line), not to the start of the second line.
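To make the "start of the second line" requirement concrete, here is a minimal sketch that does not rely on RS at all. It assumes every record is exactly four lines (header, sequence, +, quality), as in the sample above. The function name filter_by_prefix is hypothetical and is used only for illustration.

```shell
# Hypothetical helper: print the FASTQ records whose sequence line
# starts with the given prefix.
# Assumption: every record is exactly four lines
# (header, sequence, "+", quality), as in the sample data.
# Usage: filter_by_prefix GGCATGAAAACATACA infile
filter_by_prefix() {
    awk -v prefix="$1" '
        NR % 4 == 1 { header = $0 }   # line 1 of a record: header
        NR % 4 == 2 { seq = $0 }      # line 2 of a record: sequence
        NR % 4 == 3 { plus = $0 }     # line 3 of a record: "+" separator
        # Line 4 (quality): emit the record only if the sequence
        # begins with the prefix (literal match, not a regex).
        NR % 4 == 0 && index(seq, prefix) == 1 {
            print header
            print seq
            print plus
            print $0
        }
    ' "$2"
}
```

Because this reads the file line by line, a quality string that happens to contain @M cannot split a record, and no blank lines are produced.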
Any help will be greatly appreciated!
PS. Ideally I would like to use | sed '/^$/d' to eliminate the blank lines, if at all possible.
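For reference, the blank lines appear because each record already ends with a newline and print appends another one. A possible sketch that keeps the RS="@M" approach, tests only the second line, and needs no sed pass is shown below. It assumes an awk that accepts a multi-character RS (gawk and mawk do; a strictly POSIX awk uses only the first character) and that @M never occurs inside a quality string. The function name filter_by_prefix_rs is hypothetical.

```shell
# Hypothetical helper, same idea as the original one-liner.
# Assumptions: awk supports a multi-character RS (gawk, mawk),
# and "@M" never occurs inside a quality string.
# Usage: filter_by_prefix_rs GGCATGAAAACATACA infile
filter_by_prefix_rs() {
    awk -v RS='@M' -v FS='\n' -v prefix="$1" '
        # With FS set to newline, $2 is the sequence line of the record.
        # printf adds no extra newline, so the newline that already ends
        # each record is the only one printed and no blank lines appear.
        index($2, prefix) == 1 { printf "@M%s", $0 }
    ' "$2"
}
```

The text before the first @M forms an empty first record; it has no second field, so it is never printed.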