I want to extract the data under BA and B8_, or any such pairwise combination, and write it to a new file (below) in the same format, keeping the first line in the new file exactly as it is (same spacing and all).
I have not tested Franklin52's solution, but there is a subtle difference between your first post and your last one: in the first, each header begins with '>', but in the last it does not.
I will give it a try, but when parsing your file, how can I know where each header begins and ends? I am assuming each header is shorter than 20 characters while normal lines are longer than that, but I may be wrong.
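The length assumption can be checked on a few lines before relying on it. This is only a minimal sketch: the 20-character threshold is the guess stated above, and the helper name is_header and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Assumed threshold: a guess about this data, not a rule of the file format.
use constant HEADER_LINE_LENGTH => 20;

# Hypothetical helper: any line shorter than the threshold counts as a header.
sub is_header {
    my ($line) = @_;
    return length($line) < HEADER_LINE_LENGTH;
}

# Made-up sample lines for illustration only.
my @lines = (
    'BA',
    'GTATACATTATTGATGAAGTCCACATGCTTTC',
    'BC23_',
    'GGTCGTGATATGTTCCGTATGTTAAGTGAA',
);

for my $line (@lines) {
    printf "%-8s %s\n", ( is_header($line) ? 'header' : 'sequence' ), $line;
}
```

If the last line of a sequence ever has fewer than 20 characters, this test would wrongly treat it as a header, so the threshold should be checked against the real file.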
$ cat infile
(data of your last post)
$ cat script.pl
use warnings;
use strict;
use constant HEADER_LINE_LENGTH => 20;
die "Usage: perl $0 <input-file> <output-file> <headers>\n" unless @ARGV > 2;
my $infile = shift;
my $outfile = shift;
my %header = map { $_ => 1 } @ARGV;
open my $fh, "<", $infile or die "Cannot open file $infile: $!\n";
open my $ofh, ">", $outfile or die "Cannot open file $outfile: $!\n";
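while ( my $line = <$fh> ) {
    chomp $line;
    # Flip-flop: switches on at a requested header and off at the next short (header) line.
    if ( my $flip = ( exists $header{ $line } ... length( $line ) < HEADER_LINE_LENGTH ) ) {
        if ( $flip =~ /E/ ) {
            # "E0" marks the line that closed the range, i.e. the next header.
            # Reprocess it so it can open a new range if it was requested too.
            redo;
        } else {
            printf $ofh "%s\n", $line;
        }
    }
}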
close $fh or warn "Cannot close $infile: $!\n";
close $ofh or warn "Cannot close $outfile: $!\n";
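The script relies on the three-dot range operator used as a flip-flop in scalar context. While the range is open it returns a sequence number, and the number returned for the closing line has "E0" appended, which is what the /E/ test looks for before calling redo. A small sketch of that behaviour, using made-up lines and 'start'/'stop' markers that are not part of the real data:

```perl
use strict;
use warnings;

# Made-up lines: the range opens on 'start' and closes on 'stop'.
my @lines = qw(skip start a b stop skip);
my @seen;

for my $line (@lines) {
    # In scalar context '...' is a flip-flop: it returns an empty string while
    # the range is closed and a sequence number while it is open. The number
    # for the closing line carries an "E0" suffix.
    my $flip = ( $line eq 'start' ... $line eq 'stop' );
    push @seen, "$line=" . ( $flip || '-' );
}

print join( ' ', @seen ), "\n";
# prints: skip=- start=1 a=2 b=3 stop=4E0 skip=-
```

In the script the closing line is itself a header, so redo sends it through the loop body again, where it can open a new range if it is one of the requested headers.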
$ perl script.pl
Usage: perl script.pl <input-file> <output-file> <headers>
$ perl script.pl infile outfile BA BC BC23_
$ cat outfile
BA
GTATACATTATTGATGAAGTCCACATGCTTTCTATGGGTGCCTTCAATGCGCTTTTAAAA
ACGTTAGAAGAGCCGCCAGGACATGTTATCTTTATTTTGGCGACAACAGAACCGCATAAG
ATACCGCCTACAATCATTTCGCGTTGCCAACGTTTCGAATTTCGAAAAATATCAGTAAAT
GATATTGTTGAGAGATTGTCCACGGTTGTGACTAATGAAGGTACGCAAGTAGAAGATGAG
GCTTTACAAATTGTTGCGCGTGCCGCTGAAGGTGGTATGCGTGATGCGCTTAGTCTTATT
GATCAAGCGATATCTTATAGTGATGAGAGGGTTACGACAGAAGATGTATTAGCTGTAACG
GGTCGTGATATGTTCCGTATGTTAAGTGAA
BC23_
GTATACATTATTGATGAAGTTCACATGCTTTCTATGGGTGCATTCAATGCGCTTTTAAAA
ACCTTAGAAGAGCCGCCAGGACATGTTATCTTTATTTTGGCGACAACAGAACCTCATAAG
ATCCCACCTACAATCATTTCACGTTGTCAGCGCTTTGAATTCCGAAAAATATCAGTGAAT
GATATTGTTGAGAGATTATCAACGGTCGTGACAAATGAAGGTACGCAAGTGGAAGGTGAA
GCATTACAAATTGTTGCGCGTGCTGCCGAAGGTGGTATGCGTGATGCGCTTAGTCTTATT
GATCAGGCTATATCTTATAGTGATGAGATTGTTACGACAGAAGATGTATTGGCCGTAACA
GGACGTGATATGTTCCGTAAGTTGAGTGAA
BC
GTATACATTATTGATGAAGTTCACATGCTTTCTATGGGTGCCTTCAATGCGCTTTTAAAA
ACGTTAGAAGAACCGCCAGGACATGTCATCTTTATTTTGGCGACAACAGAACCGCATAAG
ATACCGCCTACAATTATTTCGCGTTGCCAACGTTTCGAATTTCGAAAGATATCAGTAAAT
GATATTGTTGAGAGATTATCGACAGTTGTAAACAATGAAGGTACGCAAGTAGAAGATGAA
GCGTTACAAATCGTTGCACGTGCCGCTGAAGGTGGTATGCGTGATGCGCTTAGTCTTATT
GATCAGGCAATATCTTATAGTGATGAGACTGTTACGACAGAAGATGTATTAGCTGTAACA
GGGCGTGATATGTTCCGAATGTTAAGTGAA
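The original request also asked for the first line of the input to be copied unchanged, and the script as posted writes only the selected records. One possible variant is sketched below. It uses a plain state flag instead of the flip-flop, keeps the same assumption that headers are shorter than 20 characters, and assumes the first line is not itself a record header. The helper name extract_records and the sample data, including the first line, are made up for illustration.

```perl
use strict;
use warnings;

# Same assumed threshold as in the script above.
use constant HEADER_LINE_LENGTH => 20;

# Hypothetical helper: returns the first line verbatim, followed by every
# record whose header is a key of %$wanted.
sub extract_records {
    my ( $lines, $wanted ) = @_;
    my @in = @$lines;
    return () unless @in;

    my @out  = ( shift @in );    # first line copied as it is
    my $keep = 0;
    for my $line (@in) {
        # A short line is taken to be a header; it switches copying on or off.
        $keep = exists $wanted->{$line} ? 1 : 0
            if length($line) < HEADER_LINE_LENGTH;
        push @out, $line if $keep;
    }
    return @out;
}

# Made-up input: an invented first line followed by three tiny records.
my @input = ( ' 3 30', 'BA', 'A' x 30, 'BB', 'C' x 30, 'BC', 'G' x 30 );
my @result = extract_records( \@input, { BA => 1, BC => 1 } );
print "$_\n" for @result;
```

Reading the real file into an array first (for example with chomp applied to each line) would let the same helper be used in place of the loop in script.pl.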