cmccabe
December 17, 2016, 10:32pm
1
In the perl
one-liner below I am identifying the runs of 6a or 6A
in each line starting with >
. The code seems close but it prints each >
line no matter if it has 6a or 6A
in it. Only the line with the 6a or 6A
needs to be printed.
So using the input
file, only the >hg19_refGene_NM_001918_3
line would be printed because it had either 6a or 6A
in it. The other lines are just skipped (not printed). Thank you :).
input
>hg19_refGene_NM_001918_2 range=chr1:100700982-100701077 5'pad=10 3'pad=10 strand=- repeatMasking=none
gtctttgaagCTCTCCGTGGACAGGTTGTTCAGTTCAAGCTCTCAGACAT
TGGAGAAGGGATTAGAGAAGTAACTGTTAAAGAATGgtaagtgaat
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcttttagGTATGTAAAAGAAGGAGATACAGTGTCTCAGTTTGATAGC
ATCTGTGAAGTTCAAAGTGATAAAGCTTCTGTTACCATCACTAGTCGTTA
TGATGGAGTCATTAAAAAACTCTATTATAATCTAGACGATATTGCCTATG
TGGGGAAGCCATTAGTAGACATAGAAACGGAAGCTTTAAAAGgtattgta
ag
>hg19_refGene_NM_001918_4 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
ttgttaccagATTCAGAAGAAGATGTTGTTGAAACTCCTGCAGTGTCTCA
TGATGAACATACACACCAAGAGATAAAGGGCCGAAAAACACTGGCAACTC
CTGCAGTTCGCCGTCTGGCAATGGAAAACAATgtaagttctc
>hg19_refGene_NM_001918_5 range=chr1:100681529-100681765 5'pad=10 3'pad=10 strand=- repeatMasking=none
cattttttagATTAAGCTGAGTGAAGTTGTTGGCTCAGGAAAAGATGGCA
GAATACTTAAAGAAGATATCCTCAACTATTTGGAAAAGCAGACAGGAGCT
ATATTGCCTCCTTCACCCAAAGTTGAAATTATGCCACCTCCACCAAAGCC
AAAAGACATGACTGTTCCTATACTAGTATCAAAACCTCCGGTATTCACAG
GCAAAGACAAAACAGAACCCATAAAAGgtaatgataa
current output
>hg19_refGene_NM_001918_2 range=chr1:100700982-100701077 5'pad=10 3'pad=10 strand=- repeatMasking=none
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA
>hg19_refGene_NM_001918_4 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
>hg19_refGene_NM_001918_5 range=chr1:100681529-100681765 5'pad=10 3'pad=10 strand=- repeatMasking=none
desired output
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA
perl
perl -076 -nE 'chomp; s/(.+)// && say qq{>$1}; s/\s//g; say $1 while /(a{6})/gi' input
Hello cmccabe,
Not a perl solution, in case you require you could try with following awk
once and could let me know how it goes then.
awk '(gsub(/A|a/,"&")==6 && $0 ~ /^>/)' Input_file
Output will be as follows.
>hg19_refGene_NM_001918_2 range=chr1:100700982-100701077 5'pad=10 3'pad=10 strand=- repeatMasking=none
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
>hg19_refGene_NM_001918_4 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
>hg19_refGene_NM_001918_5 range=chr1:100681529-100681765 5'pad=10 3'pad=10 strand=- repeatMasking=none
Thanks,
R. Singh
1 Like
Try:
awk '{h=$1; $1=x} toupper($0)~/A{6}/{print RS h}' RS=\> FS='\n' OFS= file
Produces:
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
This approach does the following:
It concatenates all the lines in the payload into one single line without gaps
Then it matches the regular expression against uppercase version of the payload
It prints the header if there is at least one match
--
Note: some older version of awk have trouble with the {}
part. In that case:
awk '{h=$1; $1=x} toupper($0)~/AAAAAA/{print RS h}' RS=\> FS='\n' OFS= file
--
Or would you like the matches printed as well?
--
I would suggest to modify your perl approach along these lines. See if that helps:
perl -076 -nE 's/(.+)//; $h=$1; s/\s//g; if(/a{6}/i){say qq(>$h); say $1 while /(a{6,})/gi}'
This one prints matches of 6 or more
The original perl code:
does not test if there was a match before printing the line...
only prints with exact matches, but for instance if there is a sequence of 12 A's then it would find two matches.
1 Like
cmccabe
December 18, 2016, 8:38am
4
In the instance of:
input
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcttttagGTATGTAAAAGAAGGAGATACAGTGTCTCAGTTTGATAGC
ATCTGTGAAGTTCAAAGTGATAAAGCTTCTGTTACCATCACTAGTCGTTA
TGATGGAGTCATTAAAAAAACTCTATTATAATCTAGACGATATTGCCTATG
TGGGGAAGCCATTAGTAGACATAGAAACGGAAGCTTTAAAAGgtattgta
there is a run of 7A
in bold, so if the if I am looking for 6A
, there are 6
in that sequence. However,
perl -076 -nE 's/(.+)//; $h=$1; s/\s//g; if(/a{6}/i){say qq(>$h); say $1 while /(a{6,})/gi}' input
output
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAAA
desired output
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA
Is there a way to print the 6A
from this run instead of the entire run of 7A
if only 6A
is being searched for? Thank you :).
Quick adaptation, try:
perl -076 -nE 's/(.+)//; $h=$1; s/\s//g; if(/(^|[^a])a{6}($|[^a])/i){say qq(>$h); say $2 while /(^|[^a])(a{6})($|[^a])/gi}' input
or:
perl -076 -nE '
s/(.+)//;
$h=$1;
s/\s//g;
if(/(^|[^a])a{6}($|[^a])/i) {
say qq(>$h);
say $2 while /(^|[^a])(a{6})($|[^a])/gi
}
' input
1 Like
Aia
December 18, 2016, 9:02pm
6
perl -076 -naF'\n' -e '@c=/(A{6}|a{6})/g and print ">$F[0]\n@c\n"' input
1 Like
Hi Aia, that would also match 6 [Aa]'s or more. But OP in #4 is looking for matches of exactly six [Aa]'s. This approach would also match in the header and not just in the payload unlike the earlier suggestions and it does not find consecutive [Aa]'s that are on either side of a line wrap in the payload, nor would it find mixed case matches..
1 Like
Aia
December 18, 2016, 9:31pm
8
cmccabe:
In the instance of:
input
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcttttagGTATGTAAAAGAAGGAGATACAGTGTCTCAGTTTGATAGC
ATCTGTGAAGTTCAAAGTGATAAAGCTTCTGTTACCATCACTAGTCGTTA
TGATGGAGTCATTAAAAAAACTCTATTATAATCTAGACGATATTGCCTATG
TGGGGAAGCCATTAGTAGACATAGAAACGGAAGCTTTAAAAGgtattgta
there is a run of 7A
in bold, so if the if I am looking for 6A
, there are 6
in that sequence. However,
perl -076 -nE 's/(.+)//; $h=$1; s/\s//g; if(/a{6}/i){say qq(>$h); say $1 while /(a{6,})/gi}' input
output
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAAA
desired output
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA
Is there a way to print the 6A
from this run instead of the entire run of 7A
if only 6A
is being searched for? Thank you :).
scrutinizer:
Hi Aia, that would also match 6 [Aa]'s or more. But OP in #4 is looking for matches of exactly six [Aa]'s. This approach would also match in the header and not just in the payload unlike the earlier suggestions and it does not find consecutive [Aa]'s that are on either side of a line wrap in the payload, nor would it find mixed case matches..
Hi Scrutinizer,
Post #4 shows that the OP still would like to print matches with more than 6 but only show the match amount, this case 6. Please, see highlighted parts of that post.
Also, in fasta you are not going to find six A or a in the header.
My suggestion does not match mixed cases.
I used the following exaggerated examples to test:
cat fasta02.file
>hg19_refGene_NM_001918_2 range=chr1:100700982-100701077 5'pad=10 3'pad=10 strand=- repeatMasking=none
gtctttgaagCTCTCCGTGGACAGGTTGTTCAGTTCAAGCTCTCAGACAT
TGGAGAAGGGATTAGAGAAGTAACTGTTAAAGAATGgtaagtgaat
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcttttagGTATGTAAAAGAAGGAGATACAGTGTCTCAGTTTGATAGC
ATCTGTGAAGTTCAAAGTGATAAAGCTTCTGTTACCATCACTAGTCGTTA
TGATGGAGTCATTAAAAAACTCTATTATAATCTAGACGATATTGCCTATG
TGGGGAAGCCATTAGTAGACATAGAAACGGAAGCTTTAAAAGgtattgta
ag
>hg19_refGene_NM_001918_10 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcttttagGTATGTAAAAGAAGGAGATACAGTGTCTCAGTTTGATAGC
ATCTGTGAAGTTCAAAGTGATAAAGCTTCTGTTACCATCACTAGTCGTTA
TGATGGAGTCATTAAAAAACTCTATTATAATCTAGACGATATTGCCTATG
TGGGGAAGCCATTAGTAGACATAGAAAAAAaaaaaaCGGAAGCTTTAAAAGgtattgta
ag
>hg19_refGene_NM_001918_4 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
ttgttaccagATTCAGAAGAAGATGTTGTTGAAACTCCTGCAGTGTCTCA
TGATGAACATACACACCAAGAGATAAAGGGCCGAAAAACACTGGCAACTC
CTGCAGTTCGCCGTCTGGCAATGGAAAACAATgtaagttctc
>hg19_refGene_NM_001918_7 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
ttgttaccagATTCAGAAGAAGATGTTGTTGAAACTCCTGCAGTGTCTCA
TGATGAACATACACACCAAGAGATAAAAAAAGGGCCGAAAAACACTGGCAACTCaaaaaa
CTGCAGTTCGCCGTCTGGCAATGGAAAACAATgtaagttctc
>hg19_refGene_NM_001918_5 range=chr1:100681529-100681765 5'pad=10 3'pad=10 strand=- repeatMasking=none
cattttttagATTAAGCTGAGTGAAGTTGTTGGCTCAGGAAAAGATGGCA
GAATACTTAAAGAAGATATCCTCAACTATTTGGAAAAGCAGACAGGAGCT
ATATTGCCTCCTTCACCCAAAGTTGAAATTATGCCACCTCCACCAAAGCC
AAAAGACATGACTGTTCCTATACTAGTATCAAAACCTCCGGTATTCACAG
GCAAAGACAAAACAGAACCCATAAAAGgtaatgataa
>hg19_refGene_NM_001918_8 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
ttgttaccagATTCAGAAGAAGATGTTGTTGAAACTCCTGCAGTGTCTCA
TGATGAACATACACACCAAGAGATAAaaaAAGGGCCGAAAAACACTGGCAACTC
CTGCAGTTCGCCGTCTGGCAATGGAAAACAATgtaagttctc
Here's the result:
perl -076 -naF'\n' -e '@c=/(A{6}|a{6})/g and print ">$F[0]\n@c\n"' fasta02.file
>hg19_refGene_NM_001918_3 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA
>hg19_refGene_NM_001918_10 range=chr1:100696279-100696480 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA AAAAAA aaaaaa
>hg19_refGene_NM_001918_7 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
AAAAAA aaaaaa
Let's fix the problem of missing possible matches if found in multiple lines:
perl -076 -naF'\n' -e 's/\n+//g; @c=/(A{6}|a{6})/g and print ">$F[0]\n@c\n"'
1 Like
Hi Aia, OK I think you may be right about your interpretation of only printing 6 A's when there would be 7 consecutive A's. That would leave an interesting question of what to do when there are a mutiple of 6 [Aa]'s or more. Should that then also only be printed once?
I am not sure about multiple [Aa]'s under no circumstances present in the FASTA header. Unlikely sure, but impossible? The OP's original approach excludes that from matching..
The mixed case I think could be mitigated by using:
@c=/a{6}/ig
But this also leaves the line wrap case; this approach does not find consecutive [Aa]'s that are on either side of a line wrap in the payload. So the newlines would need to be removed in the payload before the matching.
>hg19_refGene_NM_001918_9 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
ttgttaccagATTCAGAAGAAGATGTTGTTGAAACTCCTGCAGTGTCTCA
ATGATGAACATACACACCAAGAGATAAAGGGCCGAAAAACACTGGCAACTCAAAA
AACTGCAGTTCGCCGTCTGGCAATGGAAAACAATgtaagttctc
1 Like
Aia
December 18, 2016, 10:25pm
10
aia:
Hi Scrutinizer,
Post #4 shows that the OP still would like to print matches with more than 6 but only show the match amount, this case 6. Please, see highlighted parts of that post.
Also, in fasta you are not going to find six A or a in the header.
My suggestion does not match mixed cases.
[...]
Let's fix the problem of missing possible matches if found in multiple lines:
perl -076 -naF'\n' -e 's/\n+//g; @c=/(A{6}|a{6})/g and print ">$F[0]\n@c\n"'
scrutinizer:
Hi Aia, OK I think you may be right about your interpretation of only printing 6 A's when there would be 7 consecutive A's. That would leave an interesting question of what to do when there are a mutiple of 6 [Aa]'s or more. Should that then also only be printed once?
I am not sure about multiple [Aa]'s under no circumstances present in the FASTA header. Unlikely sure, but impossible? The OP's original approach excludes that from matching..
The mixed case I think could be mitigated by using:
@c=/a{6}/ig
But this also leaves the line wrap case; this approach does not find consecutive [Aa]'s that are on either side of a line wrap in the payload. So the newlines would need to be removed in the payload before the matching.
>hg19_refGene_NM_001918_9 range=chr1:100684172-100684313 5'pad=10 3'pad=10 strand=- repeatMasking=none
ttgttaccagATTCAGAAGAAGATGTTGTTGAAACTCCTGCAGTGTCTCA
ATGATGAACATACACACCAAGAGATAAAGGGCCGAAAAACACTGGCAACTCAAAA
AACTGCAGTTCGCCGTCTGGCAATGGAAAACAATgtaagttctc
Hi, Scrutinizer,
I did post a fix for the case of multiple lines, previously. Please, see post #8 .
On purpose, I did want not this to be that case, by design. If I error in this case, that adaptation might suffices.
1 Like
cmccabe
December 19, 2016, 8:37am
11
Thank you all for your help and explanations, I really appreciate it :).