Perl to identify specific runs in input and print only lines identified

In the perl one-liner below I am identifying the runs of 6a or 6A in each line starting with > . The code seems close but it prints each > line no matter if it has 6a or 6A in it. Only the line with the 6a or 6A needs to be printed.

So using the input file, only the >hg19_refGene_NM_001918_3 line would be printed because it had either 6a or 6A in it. The other lines are just skipped (not printed). Thank you :).

input

>hg19_refGene_NM_001918_2 range=chr1:100700982-100701077 5'pad=10 3'pad=10 strand=- repeatMasking=none
gtctttgaagCTCTCCGTGGACAGGTTGTTCAGTTCAAGCTCTCAGACAT
TGGAGAAGGGATTAGAGAAGTAACTGTTAAAGAATGgtaagtgaat
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcttttagGTATGTAAAAGAAGGAGATACAGTGTCTCAGTTTGATAGC
ATCTGTGAAGTTCAAAGTGATAAAGCTTCTGTTACCATCACTAGTCGTTA
TGATGGAGTCATTAAAAAACTCTATTATAATCTAGACGATATTGCCTATG
TGGGGAAGCCATTAGTAGACATAGAAACGGAAGCTTTAAAAGgtattgta
ag
>hg19_refGene_NM_001918_4 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
ttgttaccagATTCAGAAGAAGATGTTGTTGAAACTCCTGCAGTGTCTCA
TGATGAACATACACACCAAGAGATAAAGGGCCGAAAAACACTGGCAACTC
CTGCAGTTCGCCGTCTGGCAATGGAAAACAATgtaagttctc
>hg19_refGene_NM_001918_5 range=chr1:100681529-100681765 5'pad=10 3'pad=10 strand=- repeatMasking=none
cattttttagATTAAGCTGAGTGAAGTTGTTGGCTCAGGAAAAGATGGCA
GAATACTTAAAGAAGATATCCTCAACTATTTGGAAAAGCAGACAGGAGCT
ATATTGCCTCCTTCACCCAAAGTTGAAATTATGCCACCTCCACCAAAGCC
AAAAGACATGACTGTTCCTATACTAGTATCAAAACCTCCGGTATTCACAG
GCAAAGACAAAACAGAACCCATAAAAGgtaatgataa

current output

>hg19_refGene_NM_001918_2 range=chr1:100700982-100701077 5'pad=10 3'pad=10 strand=- repeatMasking=none
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA
>hg19_refGene_NM_001918_4 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
>hg19_refGene_NM_001918_5 range=chr1:100681529-100681765 5'pad=10 3'pad=10 strand=- repeatMasking=none

desired output

>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA

perl

perl -076 -nE 'chomp; s/(.+)// && say qq{>$1}; s/\s//g; say $1 while /(a{6})/gi' input

Hello cmccabe,

Not a perl solution, in case you require you could try with following awk once and could let me know how it goes then.

awk '(gsub(/A|a/,"&")==6 && $0 ~ /^>/)'  Input_file

Output will be as follows.

>hg19_refGene_NM_001918_2 range=chr1:100700982-100701077 5'pad=10 3'pad=10 strand=- repeatMasking=none
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
>hg19_refGene_NM_001918_4 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
>hg19_refGene_NM_001918_5 range=chr1:100681529-100681765 5'pad=10 3'pad=10 strand=- repeatMasking=none

Thanks,
R. Singh

1 Like

Try:

awk '{h=$1; $1=x} toupper($0)~/A{6}/{print RS h}' RS=\> FS='\n' OFS= file

Produces:

>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none

This approach does the following:

  • It concatenates all the lines in the payload into one single line without gaps
  • Then it matches the regular expression against uppercase version of the payload
  • It prints the header if there is at least one match

--
Note: some older version of awk have trouble with the {} part. In that case:

awk '{h=$1; $1=x} toupper($0)~/AAAAAA/{print RS h}' RS=\> FS='\n' OFS= file

--
Or would you like the matches printed as well?

--

I would suggest to modify your perl approach along these lines. See if that helps:

perl -076 -nE 's/(.+)//; $h=$1; s/\s//g; if(/a{6}/i){say qq(>$h); say $1 while /(a{6,})/gi}'

This one prints matches of 6 or more

The original perl code:

  • does not test if there was a match before printing the line...
  • only prints with exact matches, but for instance if there is a sequence of 12 A's then it would find two matches.
1 Like

In the instance of:

input

>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcttttagGTATGTAAAAGAAGGAGATACAGTGTCTCAGTTTGATAGC
ATCTGTGAAGTTCAAAGTGATAAAGCTTCTGTTACCATCACTAGTCGTTA
TGATGGAGTCATTAAAAAAACTCTATTATAATCTAGACGATATTGCCTATG
TGGGGAAGCCATTAGTAGACATAGAAACGGAAGCTTTAAAAGgtattgta

there is a run of 7A in bold, so if the if I am looking for 6A , there are 6 in that sequence. However,

perl -076 -nE 's/(.+)//; $h=$1; s/\s//g; if(/a{6}/i){say qq(>$h); say $1 while /(a{6,})/gi}' input

output

>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAAA

desired output

>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA

Is there a way to print the 6A from this run instead of the entire run of 7A if only 6A is being searched for? Thank you :).

Quick adaptation, try:

perl -076 -nE 's/(.+)//; $h=$1; s/\s//g; if(/(^|[^a])a{6}($|[^a])/i){say qq(>$h); say $2 while /(^|[^a])(a{6})($|[^a])/gi}' input

or:

perl -076 -nE '
  s/(.+)//;
  $h=$1;
  s/\s//g;
  if(/(^|[^a])a{6}($|[^a])/i) {
    say qq(>$h);
    say $2 while /(^|[^a])(a{6})($|[^a])/gi
  }
' input
1 Like
perl -076 -naF'\n' -e '@c=/(A{6}|a{6})/g and print ">$F[0]\n@c\n"' input
1 Like

Hi Aia, that would also match 6 [Aa]'s or more. But OP in #4 is looking for matches of exactly six [Aa]'s. This approach would also match in the header and not just in the payload unlike the earlier suggestions and it does not find consecutive [Aa]'s that are on either side of a line wrap in the payload, nor would it find mixed case matches..

1 Like

Hi Scrutinizer,
Post #4 shows that the OP still would like to print matches with more than 6 but only show the match amount, this case 6. Please, see highlighted parts of that post.
Also, in fasta you are not going to find six A or a in the header.
My suggestion does not match mixed cases.

I used the following exaggerated examples to test:

cat fasta02.file

>hg19_refGene_NM_001918_2 range=chr1:100700982-100701077 5'pad=10 3'pad=10 strand=- repeatMasking=none
gtctttgaagCTCTCCGTGGACAGGTTGTTCAGTTCAAGCTCTCAGACAT
TGGAGAAGGGATTAGAGAAGTAACTGTTAAAGAATGgtaagtgaat
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcttttagGTATGTAAAAGAAGGAGATACAGTGTCTCAGTTTGATAGC
ATCTGTGAAGTTCAAAGTGATAAAGCTTCTGTTACCATCACTAGTCGTTA
TGATGGAGTCATTAAAAAACTCTATTATAATCTAGACGATATTGCCTATG
TGGGGAAGCCATTAGTAGACATAGAAACGGAAGCTTTAAAAGgtattgta
ag
>hg19_refGene_NM_001918_10 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcttttagGTATGTAAAAGAAGGAGATACAGTGTCTCAGTTTGATAGC
ATCTGTGAAGTTCAAAGTGATAAAGCTTCTGTTACCATCACTAGTCGTTA
TGATGGAGTCATTAAAAAACTCTATTATAATCTAGACGATATTGCCTATG
TGGGGAAGCCATTAGTAGACATAGAAAAAAaaaaaaCGGAAGCTTTAAAAGgtattgta
ag
>hg19_refGene_NM_001918_4 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
ttgttaccagATTCAGAAGAAGATGTTGTTGAAACTCCTGCAGTGTCTCA
TGATGAACATACACACCAAGAGATAAAGGGCCGAAAAACACTGGCAACTC
CTGCAGTTCGCCGTCTGGCAATGGAAAACAATgtaagttctc
>hg19_refGene_NM_001918_7 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
ttgttaccagATTCAGAAGAAGATGTTGTTGAAACTCCTGCAGTGTCTCA
TGATGAACATACACACCAAGAGATAAAAAAAGGGCCGAAAAACACTGGCAACTCaaaaaa
CTGCAGTTCGCCGTCTGGCAATGGAAAACAATgtaagttctc
>hg19_refGene_NM_001918_5 range=chr1:100681529-100681765 5'pad=10 3'pad=10 strand=- repeatMasking=none
cattttttagATTAAGCTGAGTGAAGTTGTTGGCTCAGGAAAAGATGGCA
GAATACTTAAAGAAGATATCCTCAACTATTTGGAAAAGCAGACAGGAGCT
ATATTGCCTCCTTCACCCAAAGTTGAAATTATGCCACCTCCACCAAAGCC
AAAAGACATGACTGTTCCTATACTAGTATCAAAACCTCCGGTATTCACAG
GCAAAGACAAAACAGAACCCATAAAAGgtaatgataa
>hg19_refGene_NM_001918_8 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
ttgttaccagATTCAGAAGAAGATGTTGTTGAAACTCCTGCAGTGTCTCA
TGATGAACATACACACCAAGAGATAAaaaAAGGGCCGAAAAACACTGGCAACTC
CTGCAGTTCGCCGTCTGGCAATGGAAAACAATgtaagttctc

Here's the result:

perl -076 -naF'\n' -e '@c=/(A{6}|a{6})/g and print ">$F[0]\n@c\n"' fasta02.file
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA
>hg19_refGene_NM_001918_10 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA AAAAAA aaaaaa
>hg19_refGene_NM_001918_7 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA aaaaaa

Let's fix the problem of missing possible matches if found in multiple lines:

perl -076 -naF'\n' -e 's/\n+//g; @c=/(A{6}|a{6})/g and print ">$F[0]\n@c\n"' 
1 Like

Hi Aia, OK I think you may be right about your interpretation of only printing 6 A's when there would be 7 consecutive A's. That would leave an interesting question of what to do when there are a mutiple of 6 [Aa]'s or more. Should that then also only be printed once?

I am not sure about multiple [Aa]'s under no circumstances present in the FASTA header. Unlikely sure, but impossible? The OP's original approach excludes that from matching..

The mixed case I think could be mitigated by using:

@c=/a{6}/ig

But this also leaves the line wrap case; this approach does not find consecutive [Aa]'s that are on either side of a line wrap in the payload. So the newlines would need to be removed in the payload before the matching.

>hg19_refGene_NM_001918_9 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
ttgttaccagATTCAGAAGAAGATGTTGTTGAAACTCCTGCAGTGTCTCA
ATGATGAACATACACACCAAGAGATAAAGGGCCGAAAAACACTGGCAACTCAAAA
AACTGCAGTTCGCCGTCTGGCAATGGAAAACAATgtaagttctc
1 Like

Hi, Scrutinizer,
I did post a fix for the case of multiple lines, previously. Please, see post #8.

On purpose, I did want not this to be that case, by design. If I error in this case, that adaptation might suffices.

1 Like

Thank you all for your help and explanations, I really appreciate it :).