Perl to identify specific runs in input and print only lines identified

cmccabe · December 17, 2016, 10:32pm

In the perl one-liner below I am identifying the runs of 6a or 6A in each line starting with > . The code seems close but it prints each > line no matter if it has 6a or 6A in it. Only the line with the 6a or 6A needs to be printed.

So using the input file, only the >hg19_refGene_NM_001918_3 line would be printed because it had either 6a or 6A in it. The other lines are just skipped (not printed). Thank you :).

input

>hg19_refGene_NM_001918_2 range=chr1:100700982-100701077 5'pad=10 3'pad=10 strand=- repeatMasking=none
gtctttgaagCTCTCCGTGGACAGGTTGTTCAGTTCAAGCTCTCAGACAT
TGGAGAAGGGATTAGAGAAGTAACTGTTAAAGAATGgtaagtgaat
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcttttagGTATGTAAAAGAAGGAGATACAGTGTCTCAGTTTGATAGC
ATCTGTGAAGTTCAAAGTGATAAAGCTTCTGTTACCATCACTAGTCGTTA
TGATGGAGTCATTAAAAAACTCTATTATAATCTAGACGATATTGCCTATG
TGGGGAAGCCATTAGTAGACATAGAAACGGAAGCTTTAAAAGgtattgta
ag
>hg19_refGene_NM_001918_4 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
ttgttaccagATTCAGAAGAAGATGTTGTTGAAACTCCTGCAGTGTCTCA
TGATGAACATACACACCAAGAGATAAAGGGCCGAAAAACACTGGCAACTC
CTGCAGTTCGCCGTCTGGCAATGGAAAACAATgtaagttctc
>hg19_refGene_NM_001918_5 range=chr1:100681529-100681765 5'pad=10 3'pad=10 strand=- repeatMasking=none
cattttttagATTAAGCTGAGTGAAGTTGTTGGCTCAGGAAAAGATGGCA
GAATACTTAAAGAAGATATCCTCAACTATTTGGAAAAGCAGACAGGAGCT
ATATTGCCTCCTTCACCCAAAGTTGAAATTATGCCACCTCCACCAAAGCC
AAAAGACATGACTGTTCCTATACTAGTATCAAAACCTCCGGTATTCACAG
GCAAAGACAAAACAGAACCCATAAAAGgtaatgataa

current output

>hg19_refGene_NM_001918_2 range=chr1:100700982-100701077 5'pad=10 3'pad=10 strand=- repeatMasking=none
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA
>hg19_refGene_NM_001918_4 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
>hg19_refGene_NM_001918_5 range=chr1:100681529-100681765 5'pad=10 3'pad=10 strand=- repeatMasking=none

desired output

>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA

perl

perl -076 -nE 'chomp; s/(.+)// && say qq{>$1}; s/\s//g; say $1 while /(a{6})/gi' input

RavinderSingh13 · December 18, 2016, 12:54am

Hello cmccabe,

Not a perl solution, in case you require you could try with following awk once and could let me know how it goes then.

awk '(gsub(/A|a/,"&")==6 && $0 ~ /^>/)'  Input_file

Output will be as follows.

>hg19_refGene_NM_001918_2 range=chr1:100700982-100701077 5'pad=10 3'pad=10 strand=- repeatMasking=none
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
>hg19_refGene_NM_001918_4 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
>hg19_refGene_NM_001918_5 range=chr1:100681529-100681765 5'pad=10 3'pad=10 strand=- repeatMasking=none

Thanks,
R. Singh

Scrutinizer · December 18, 2016, 3:40am

Try:

awk '{h=$1; $1=x} toupper($0)~/A{6}/{print RS h}' RS=\> FS='\n' OFS= file

Produces:

>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none

This approach does the following:

It concatenates all the lines in the payload into one single line without gaps
Then it matches the regular expression against uppercase version of the payload
It prints the header if there is at least one match

--
Note: some older version of awk have trouble with the {} part. In that case:

awk '{h=$1; $1=x} toupper($0)~/AAAAAA/{print RS h}' RS=\> FS='\n' OFS= file

--
Or would you like the matches printed as well?

--

I would suggest to modify your perl approach along these lines. See if that helps:

perl -076 -nE 's/(.+)//; $h=$1; s/\s//g; if(/a{6}/i){say qq(>$h); say $1 while /(a{6,})/gi}'

This one prints matches of 6 or more

The original perl code:

does not test if there was a match before printing the line...
only prints with exact matches, but for instance if there is a sequence of 12 A's then it would find two matches.

cmccabe · December 18, 2016, 8:38am

In the instance of:

input

>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcttttagGTATGTAAAAGAAGGAGATACAGTGTCTCAGTTTGATAGC
ATCTGTGAAGTTCAAAGTGATAAAGCTTCTGTTACCATCACTAGTCGTTA
TGATGGAGTCATTAAAAAAACTCTATTATAATCTAGACGATATTGCCTATG
TGGGGAAGCCATTAGTAGACATAGAAACGGAAGCTTTAAAAGgtattgta

there is a run of 7A in bold, so if the if I am looking for 6A , there are 6 in that sequence. However,

perl -076 -nE 's/(.+)//; $h=$1; s/\s//g; if(/a{6}/i){say qq(>$h); say $1 while /(a{6,})/gi}' input

output

>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAAA

desired output

>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA

Is there a way to print the 6A from this run instead of the entire run of 7A if only 6A is being searched for? Thank you :).

Scrutinizer · December 18, 2016, 9:19am

Quick adaptation, try:

perl -076 -nE 's/(.+)//; $h=$1; s/\s//g; if(/(^|[^a])a{6}($|[^a])/i){say qq(>$h); say $2 while /(^|[^a])(a{6})($|[^a])/gi}' input

or:

perl -076 -nE '
  s/(.+)//;
  $h=$1;
  s/\s//g;
  if(/(^|[^a])a{6}($|[^a])/i) {
    say qq(>$h);
    say $2 while /(^|[^a])(a{6})($|[^a])/gi
  }
' input

Aia · December 18, 2016, 9:02pm

perl -076 -naF'\n' -e '@c=/(A{6}|a{6})/g and print ">$F[0]\n@c\n"' input

Scrutinizer · December 18, 2016, 9:17pm

Hi Aia, that would also match 6 [Aa]'s or more. But OP in #4 is looking for matches of exactly six [Aa]'s. This approach would also match in the header and not just in the payload unlike the earlier suggestions and it does not find consecutive [Aa]'s that are on either side of a line wrap in the payload, nor would it find mixed case matches..

Aia · December 18, 2016, 9:31pm

cmccabe:

In the instance of:

input

>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcttttagGTATGTAAAAGAAGGAGATACAGTGTCTCAGTTTGATAGC
ATCTGTGAAGTTCAAAGTGATAAAGCTTCTGTTACCATCACTAGTCGTTA
TGATGGAGTCATTAAAAAAACTCTATTATAATCTAGACGATATTGCCTATG
TGGGGAAGCCATTAGTAGACATAGAAACGGAAGCTTTAAAAGgtattgta

there is a run of 7A in bold, so if the if I am looking for 6A , there are 6 in that sequence. However,

perl -076 -nE 's/(.+)//; $h=$1; s/\s//g; if(/a{6}/i){say qq(>$h); say $1 while /(a{6,})/gi}' input

output

>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAAA

desired output

>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA

Is there a way to print the 6A from this run instead of the entire run of 7A if only 6A is being searched for? Thank you :).

Hi Scrutinizer,
Post #4 shows that the OP still would like to print matches with more than 6 but only show the match amount, this case 6. Please, see highlighted parts of that post.
Also, in fasta you are not going to find six A or a in the header.
My suggestion does not match mixed cases.

I used the following exaggerated examples to test:

cat fasta02.file

>hg19_refGene_NM_001918_2 range=chr1:100700982-100701077 5'pad=10 3'pad=10 strand=- repeatMasking=none
gtctttgaagCTCTCCGTGGACAGGTTGTTCAGTTCAAGCTCTCAGACAT
TGGAGAAGGGATTAGAGAAGTAACTGTTAAAGAATGgtaagtgaat
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcttttagGTATGTAAAAGAAGGAGATACAGTGTCTCAGTTTGATAGC
ATCTGTGAAGTTCAAAGTGATAAAGCTTCTGTTACCATCACTAGTCGTTA
TGATGGAGTCATTAAAAAACTCTATTATAATCTAGACGATATTGCCTATG
TGGGGAAGCCATTAGTAGACATAGAAACGGAAGCTTTAAAAGgtattgta
ag
>hg19_refGene_NM_001918_10 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcttttagGTATGTAAAAGAAGGAGATACAGTGTCTCAGTTTGATAGC
ATCTGTGAAGTTCAAAGTGATAAAGCTTCTGTTACCATCACTAGTCGTTA
TGATGGAGTCATTAAAAAACTCTATTATAATCTAGACGATATTGCCTATG
TGGGGAAGCCATTAGTAGACATAGAAAAAAaaaaaaCGGAAGCTTTAAAAGgtattgta
ag
>hg19_refGene_NM_001918_4 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
ttgttaccagATTCAGAAGAAGATGTTGTTGAAACTCCTGCAGTGTCTCA
TGATGAACATACACACCAAGAGATAAAGGGCCGAAAAACACTGGCAACTC
CTGCAGTTCGCCGTCTGGCAATGGAAAACAATgtaagttctc
>hg19_refGene_NM_001918_7 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
ttgttaccagATTCAGAAGAAGATGTTGTTGAAACTCCTGCAGTGTCTCA
TGATGAACATACACACCAAGAGATAAAAAAAGGGCCGAAAAACACTGGCAACTCaaaaaa
CTGCAGTTCGCCGTCTGGCAATGGAAAACAATgtaagttctc
>hg19_refGene_NM_001918_5 range=chr1:100681529-100681765 5'pad=10 3'pad=10 strand=- repeatMasking=none
cattttttagATTAAGCTGAGTGAAGTTGTTGGCTCAGGAAAAGATGGCA
GAATACTTAAAGAAGATATCCTCAACTATTTGGAAAAGCAGACAGGAGCT
ATATTGCCTCCTTCACCCAAAGTTGAAATTATGCCACCTCCACCAAAGCC
AAAAGACATGACTGTTCCTATACTAGTATCAAAACCTCCGGTATTCACAG
GCAAAGACAAAACAGAACCCATAAAAGgtaatgataa
>hg19_refGene_NM_001918_8 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
ttgttaccagATTCAGAAGAAGATGTTGTTGAAACTCCTGCAGTGTCTCA
TGATGAACATACACACCAAGAGATAAaaaAAGGGCCGAAAAACACTGGCAACTC
CTGCAGTTCGCCGTCTGGCAATGGAAAACAATgtaagttctc

Here's the result:

perl -076 -naF'\n' -e '@c=/(A{6}|a{6})/g and print ">$F[0]\n@c\n"' fasta02.file
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA
>hg19_refGene_NM_001918_10 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA AAAAAA aaaaaa
>hg19_refGene_NM_001918_7 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA aaaaaa

Let's fix the problem of missing possible matches if found in multiple lines:

perl -076 -naF'\n' -e 's/\n+//g; @c=/(A{6}|a{6})/g and print ">$F[0]\n@c\n"'

Scrutinizer · December 18, 2016, 10:10pm

Hi Aia, OK I think you may be right about your interpretation of only printing 6 A's when there would be 7 consecutive A's. That would leave an interesting question of what to do when there are a mutiple of 6 [Aa]'s or more. Should that then also only be printed once?

I am not sure about multiple [Aa]'s under no circumstances present in the FASTA header. Unlikely sure, but impossible? The OP's original approach excludes that from matching..

The mixed case I think could be mitigated by using:

@c=/a{6}/ig

But this also leaves the line wrap case; this approach does not find consecutive [Aa]'s that are on either side of a line wrap in the payload. So the newlines would need to be removed in the payload before the matching.

>hg19_refGene_NM_001918_9 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
ttgttaccagATTCAGAAGAAGATGTTGTTGAAACTCCTGCAGTGTCTCA
ATGATGAACATACACACCAAGAGATAAAGGGCCGAAAAACACTGGCAACTCAAAA
AACTGCAGTTCGCCGTCTGGCAATGGAAAACAATgtaagttctc

Aia · December 18, 2016, 10:25pm

aia:

Hi Scrutinizer,
Post #4 shows that the OP still would like to print matches with more than 6 but only show the match amount, this case 6. Please, see highlighted parts of that post.
Also, in fasta you are not going to find six A or a in the header.
My suggestion does not match mixed cases.

[...]

Let's fix the problem of missing possible matches if found in multiple lines:
perl -076 -naF'\n' -e 's/\n+//g; @c=/(A{6}|a{6})/g and print ">$F[0]\n@c\n"' 

scrutinizer:

Hi Aia, OK I think you may be right about your interpretation of only printing 6 A's when there would be 7 consecutive A's. That would leave an interesting question of what to do when there are a mutiple of 6 [Aa]'s or more. Should that then also only be printed once?

I am not sure about multiple [Aa]'s under no circumstances present in the FASTA header. Unlikely sure, but impossible? The OP's original approach excludes that from matching..

The mixed case I think could be mitigated by using:
@c=/a{6}/ig
But this also leaves the line wrap case; this approach does not find consecutive [Aa]'s that are on either side of a line wrap in the payload. So the newlines would need to be removed in the payload before the matching.
>hg19_refGene_NM_001918_9 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
ttgttaccagATTCAGAAGAAGATGTTGTTGAAACTCCTGCAGTGTCTCA
ATGATGAACATACACACCAAGAGATAAAGGGCCGAAAAACACTGGCAACTCAAAA
AACTGCAGTTCGCCGTCTGGCAATGGAAAACAATgtaagttctc

Hi, Scrutinizer,
I did post a fix for the case of multiple lines, previously. Please, see post #8.

On purpose, I did want not this to be that case, by design. If I error in this case, that adaptation might suffices.

cmccabe · December 19, 2016, 8:37am

Thank you all for your help and explanations, I really appreciate it :).