Xterra
January 13, 2019, 5:31pm
1
I have this fastq file:
@M04961:22:000000000-B5VGJ:1:1101:9280:7106 1:N:0:86
GGGGGGGGGGGGCATGAAAACATACAAACCGTCTTTCCAGAAATTGTTCCAAGTATCGGCAACAGCTTTATCAATACCATGAAAAATATCAACCACACCA
+test-1
GGGGGGGGGGGGGGGGGCCGGGGGFF,EDFFGEDFG,@DGGCGGEGGG7DCGGGF68CGFFFGGGG@CGDGFFDFEFEFF:30CGAFFDFEFF8CAF;;8
@M04961:22:000000000-B5VGJ:1:1121:9280:7106 1:N:0:86
GGCATGAAAACATACAAACCGTCTTTCCAGAAATTGTTCCAAGTATCGGCAACAGCTTTATCAATACCATGAAAAATATCAACCACACCAGAAGCAGCAT
+test 2
GGGGGGGGGGGGGGGGGCCGGGGGF,EDFFGEDFG,@DGGCGGEGGG7DCGGGF68CGFFFGGGG@CGDGFFDFEFEFF:30CGAFFDFEFF8CAF;@8F
@M04961:22:000000000-B5VGJ:1:1151:9280:7106 1:N:0:86
GGCATGAAAACATACAAACCGTCTTTCCAGAAATTGTTCCAAGTATCGGCAACAGCTTTATCAATACCATGAAAAATATCAACCACACCAGAAGCAGCAT
+more tests
GGGGGGGGGGGGGGGGGCCGGGGGF,EDFFGEDFG,@DGGCGGEGGG7DCGGGF68CGFFFGGGG@CGDGFFDFEFEFF:30CGAFFDFEFF8CAF;+8F
@M04961:22:000000000-B5VGJ:1:1101:26069:7790 1:N:0:86
CAGAACGTGAAAAAGCGTCCTGCGTGTAGCGAACTGCGATGGGCATACTGTAACCATAAGGCCACGTATTTTGCAAGCTGGCATGAAAACATACATTTTT
+few more
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG+GGGG
And I am using this script to output the three characters after this sting GCATGAAAACATACA
:
sed -n '/^@/{n;s/.*GCATGAAAACATACA\(...\).*/Codon:\t\1\tQuality Score: /p}'
This is my current output:
Codon: AAC Quality Score:
Codon: AAC Quality Score:
Codon: AAC Quality Score:
Codon: TTT Quality Score:
However, I would like to output the three characters in the same position two rows below from each sequence, something like this:
Codon: AAC Quality Score: ,ED
Codon: AAC Quality Score: GCC
Codon: AAC Quality Score: GCC
Codon: TTT Quality Score: +GG
Is there a way to accomplish this with sed
?
operative system W10; shell cygwin
There might be a way to do it with sed
, but it would be a lot easier with awk
. Don't you have access to awk
in cygwin?
P.S. Note that cygwin is not a shell; it is a collection of tools that mimic several common tools found on many BSD-, Linux-, and UNIX-systems. The default shell used when running cygwin is usually bash
.
Xterra
January 13, 2019, 8:23pm
3
I do have access to awk
. I would be interested on seeing a solution with awk
too.
There are more efficient ways to do this, but this seems to do what you want:
awk '
BEGIN { String = "GCATGAAAACATACA"
StringLen = length(String)
}
/^@/ { matchline = NR + 1
qualityline = NR + 3
next
}
NR == matchline {
if(spot = index($0, String))
printf("Codon:\t%s\tQuality Score:\t",
substr($0, spot + StringLen, 3))
else qualityline = 0
next
}
NR == qualityline {
printf("%s\n", substr($0, spot + StringLen, 3))
}' file
with the sample data you provided contained in a file named file
, this produces the output:
Codon: AAC Quality Score: ,ED
Codon: AAC Quality Score: GCC
Codon: AAC Quality Score: GCC
Codon: TTT Quality Score: +GG
2 Likes
This is probably easier to read and produces the same output:
awk '
BEGIN { String = "GCATGAAAACATACA"
StringLen = length(String)
}
/^@/ { getline CodonLine
getline
getline QualityLine
if(spot = index(CodonLine, String))
printf("Codon:\t%s\tQuality Score:\t%s\n",
substr(CodonLine, spot + StringLen, 3),
substr(QualityLine, spot + StringLen, 3))
}' file
2 Likes
Xterra
January 13, 2019, 10:29pm
6
Don
Thanks!
PS. If I would like to search for more than one string (GCATGAAAACATACA and TTTCCAGAAATTGT)
and report different number characters (3 and 6
. I should be able to do it passing the strings and number of charters as variables, right?
awk -vlen="3 6" -vstr="GCATGAAAACATACA TTTCCAGAAATTGT" '
BEGIN { for (MX=n=split (str, TMP); n>0; n--) SRCH[TMP[n]] = n
String = n
StringLen = length(String)
}
/^@/ { matchline = NR + 1
qualityline = NR + 3
next
}
NR == matchline {
if(spot = index($0, String))
printf("Codon:\t%s\tQuality Score:\t",
substr($0, spot + StringLen, len))
else qualityline = 0
next
}
NR == qualityline {
printf("%s\n", substr($0, spot + StringLen, len))
}' test.txt
1 Like
xterra:
Don
Thanks!
PS. If I would like to search for more than one string (GCATGAAAACATACA and TTTCCAGAAATTGT)
and report different number characters (3 and 6
. I should be able to do it passing the strings and number of charters as variables, right?
awk -vlen="3 6" -vstr="GCATGAAAACATACA TTTCCAGAAATTGT" '
BEGIN { for (MX=n=split (str, TMP); n>0; n--) SRCH[TMP[n]] = n
String = n
StringLen = length(String)
}
/^@/ { matchline = NR + 1
qualityline = NR + 3
next
}
NR == matchline {
if(spot = index($0, String))
printf("Codon:\t%s\tQuality Score:\t",
substr($0, spot + StringLen, len))
else qualityline = 0
next
}
NR == qualityline {
printf("%s\n", substr($0, spot + StringLen, len))
}' test.txt
I'm pretty sure that the code you have shown us above won't do what you want to do, but from your new description I'm not sure what it is that you're trying to do.
The String
variable in my code needs to be a string; not a number representing an index into the TMP[]
and/or SRC[]
arrays. If you're looking for multiple strings on the 2nd line of each group of input lines you're processing, you need to perform more than one index()
operation to search for those strings.
If you could give us a clearer description and show us the output you hope to produce from your new requirements, maybe we could help.
1 Like
Making a few wild guesses... If the output you're trying to produce is:
Codon: AAC Quality Score: ,ED
Codon: TCCAAG Quality Score: G7DCGG
Codon: AAC Quality Score: GCC
Codon: TCCAAG Quality Score: DGGCGG
Codon: AAC Quality Score: GCC
Codon: TCCAAG Quality Score: DGGCGG
Codon: TTT Quality Score: +GG
you could try something like:
awk -v lengths="3 6" -v strings="GCATGAAAACATACA TTTCCAGAAATTGT" '
BEGIN { nString = split(strings, String)
split(lengths, OutLen)
for(i = 1; i <= nString; i++)
StringLen = length(String)
}
/^@/ { getline CodonLine
getline
getline QualityLine
for(i = 1; i <= nString; i++)
if(spot = index(CodonLine, String))
printf("Codon:\t%s\tQuality Score:\t%s\n",
substr(CodonLine, spot + StringLen, OutLen),
substr(QualityLine, spot + StringLen, OutLen))
}' file
2 Likes
It was interesting to try to implement this algorithm on the sed
#!/bin/sed -nrf
2~4 h
4~4 {
H;x
s/(.*GCATGAAAACATACA.{3})(.*)/\1\r\2/
}
/\r/ {
:1
s/^.(.{2}[^\r].*)/\1/
T2
s/(\n).(.*)/\1\2/
t1
:2
s/^(.{3}).*/\1/mg
s/(.*)\n(.*)/Codon:\t\1\tQuality Score:\t \2/p
}
2 Likes
Xterra
January 14, 2019, 2:55pm
11
Don
I modified a bit your script to output the total count and give some format:
awk -v gene="gene-a gene-b" -v lengths="3 6" -v strings="GCATGAAAACATACA TTTCCAGAAATTGT" '
BEGIN { nString = split(strings, String)
split(lengths, OutLen)
split(gene, Id)
for(i = 1; i <= nString; i++)
StringLen = length(String)
}
/^@/ { getline CodonLine
getline
getline QualityLine
for(i = 1; i <= nString; i++)
if(spot = index(CodonLine, String))
printf("Gene:\t"Id"\tCodon:\t%s\t\tQuality Score:\t%s\t\n",
substr(CodonLine, spot + StringLen, OutLen),
substr(QualityLine, spot + StringLen, OutLen))
}' test.txt | awk '{ count[$0]++ } END {{ print "\n\t\t\t\tSummary\n#############################################################################\nCount\t\tGene\t\tCodon\t\t\tQuality Score\n" } {for (gene in count ) print count[gene] "\t" gene | "sort -k 3"}}'
With the above script I am getting the desired output:
Summary
#############################################################################
Count Gene Codon Quality Score
1 Gene: gene-a Codon: AAC Quality Score: ,ED
2 Gene: gene-a Codon: AAC Quality Score: GCC
1 Gene: gene-a Codon: TTT Quality Score: +GG
2 Gene: gene-b Codon: TCCAAG Quality Score: DGGCGG
1 Gene: gene-b Codon: TCCAAG Quality Score: G7DCGG
However, I tried to include the END step in your awk
script fail miserably. How can I modify the script so I don't have to "stitch" together the two scripts as shown above?
Thanks!
1 Like
Hi Xterra,
Maybe something like:
awk -v genes="gene-a gene-b" \
-v lengths="3 6" \
-v strings="GCATGAAAACATACA TTTCCAGAAATTGT" '
BEGIN { nString = split(strings, String)
split(lengths, SLen)
split(genes, Id)
for(i = 1; i <= nString; i++)
StringLen = length(String)
sort_cmd = "sort -k3,3 -k5,5 -k8,8"
print "\n\t\t\t\tSummary"
print "#############################################################" \
"################"
print "Count\t\tGene\t\tCodon\t\t\tQuality Score\n"
}
/^@/ { getline CodonLine
getline
getline QualityLine
for(i = 1; i <= nString; i++)
if(spot = index(CodonLine, String))
count[sprintf( \
"Gene:\t%s\tCodon:\t%s\tQuality Score:\t%s",
Id,
substr(CodonLine, spot + StringLen, SLen),
substr(QualityLine, spot + StringLen, SLen)) \
]++
}
END { for(i in count)
printf("%d\t%s\n", count, i) | sort_cmd
}' test.txt
which produces the output:
Summary
#############################################################################
Count Gene Codon Quality Score
1 Gene: gene-a Codon: AAC Quality Score: ,ED
2 Gene: gene-a Codon: AAC Quality Score: GCC
1 Gene: gene-a Codon: TTT Quality Score: +GG
2 Gene: gene-b Codon: TCCAAG Quality Score: DGGCGG
1 Gene: gene-b Codon: TCCAAG Quality Score: G7DCGG
I'm sure you could write this as a 1-liner, but I much prefer something I can see on a screen (and debug).
If there's anything here you can't figure out, ask questions about what you don't understand.
Hope this helps,
Don
2 Likes