Outputting characters after a given string and reporting the characters in the row below --sed

Xterra · January 13, 2019, 5:31pm

I have this fastq file:

@M04961:22:000000000-B5VGJ:1:1101:9280:7106 1:N:0:86
GGGGGGGGGGGGCATGAAAACATACAAACCGTCTTTCCAGAAATTGTTCCAAGTATCGGCAACAGCTTTATCAATACCATGAAAAATATCAACCACACCA
+test-1
GGGGGGGGGGGGGGGGGCCGGGGGFF,EDFFGEDFG,@DGGCGGEGGG7DCGGGF68CGFFFGGGG@CGDGFFDFEFEFF:30CGAFFDFEFF8CAF;;8
@M04961:22:000000000-B5VGJ:1:1121:9280:7106 1:N:0:86
GGCATGAAAACATACAAACCGTCTTTCCAGAAATTGTTCCAAGTATCGGCAACAGCTTTATCAATACCATGAAAAATATCAACCACACCAGAAGCAGCAT
+test 2
GGGGGGGGGGGGGGGGGCCGGGGGF,EDFFGEDFG,@DGGCGGEGGG7DCGGGF68CGFFFGGGG@CGDGFFDFEFEFF:30CGAFFDFEFF8CAF;@8F
@M04961:22:000000000-B5VGJ:1:1151:9280:7106 1:N:0:86
GGCATGAAAACATACAAACCGTCTTTCCAGAAATTGTTCCAAGTATCGGCAACAGCTTTATCAATACCATGAAAAATATCAACCACACCAGAAGCAGCAT
+more tests
GGGGGGGGGGGGGGGGGCCGGGGGF,EDFFGEDFG,@DGGCGGEGGG7DCGGGF68CGFFFGGGG@CGDGFFDFEFEFF:30CGAFFDFEFF8CAF;+8F
@M04961:22:000000000-B5VGJ:1:1101:26069:7790 1:N:0:86
CAGAACGTGAAAAAGCGTCCTGCGTGTAGCGAACTGCGATGGGCATACTGTAACCATAAGGCCACGTATTTTGCAAGCTGGCATGAAAACATACATTTTT
+few more
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG+GGGG

And I am using this script to output the three characters after this sting GCATGAAAACATACA :

sed -n '/^@/{n;s/.*GCATGAAAACATACA\(...\).*/Codon:\t\1\tQuality Score: /p}'

This is my current output:

Codon:  AAC     Quality Score:
Codon:  AAC     Quality Score:
Codon:  AAC     Quality Score:
Codon:  TTT     Quality Score:

However, I would like to output the three characters in the same position two rows below from each sequence, something like this:

Codon:  AAC     Quality Score: ,ED
Codon:  AAC     Quality Score: GCC
Codon:  AAC     Quality Score: GCC
Codon:  TTT     Quality Score: +GG

Is there a way to accomplish this with sed ?
operative system W10; shell cygwin

Don_Cragun · January 13, 2019, 8:13pm

There might be a way to do it with sed , but it would be a lot easier with awk . Don't you have access to awk in cygwin?

P.S. Note that cygwin is not a shell; it is a collection of tools that mimic several common tools found on many BSD-, Linux-, and UNIX-systems. The default shell used when running cygwin is usually bash .

Xterra · January 13, 2019, 8:23pm

I do have access to awk . I would be interested on seeing a solution with awk too.

Don_Cragun · January 13, 2019, 9:29pm

There are more efficient ways to do this, but this seems to do what you want:

awk '
BEGIN {	String = "GCATGAAAACATACA"
	StringLen = length(String)
}
/^@/ {	matchline = NR + 1
	qualityline = NR + 3
	next
}
NR == matchline {
	if(spot = index($0, String))
		printf("Codon:\t%s\tQuality Score:\t",
		    substr($0, spot + StringLen, 3))
	else	qualityline = 0
	next
}
NR == qualityline {
	printf("%s\n", substr($0, spot + StringLen, 3))
}' file

with the sample data you provided contained in a file named file , this produces the output:

Codon:	AAC	Quality Score:	,ED
Codon:	AAC	Quality Score:	GCC
Codon:	AAC	Quality Score:	GCC
Codon:	TTT	Quality Score:	+GG

Don_Cragun · January 13, 2019, 10:23pm

This is probably easier to read and produces the same output:

awk '
BEGIN {	String = "GCATGAAAACATACA"
	StringLen = length(String)
}
/^@/ {	getline CodonLine
	getline
	getline QualityLine
	if(spot = index(CodonLine, String))
		printf("Codon:\t%s\tQuality Score:\t%s\n",
		    substr(CodonLine, spot + StringLen, 3),
		    substr(QualityLine, spot + StringLen, 3))
}' file

Xterra · January 13, 2019, 10:29pm

Don
Thanks!
PS. If I would like to search for more than one string (GCATGAAAACATACA and TTTCCAGAAATTGT) and report different number characters (3 and 6 . I should be able to do it passing the strings and number of charters as variables, right?

awk -vlen="3 6" -vstr="GCATGAAAACATACA TTTCCAGAAATTGT" '
BEGIN {	for (MX=n=split (str, TMP); n>0; n--) SRCH[TMP[n]] = n
	String = n
	StringLen = length(String)
}
/^@/ {	matchline = NR + 1
	qualityline = NR + 3
	next
}
NR == matchline {
	if(spot = index($0, String))
		printf("Codon:\t%s\tQuality Score:\t",
		    substr($0, spot + StringLen, len))
	else	qualityline = 0
	next
}
NR == qualityline {
	printf("%s\n", substr($0, spot + StringLen, len))
}' test.txt

Don_Cragun · January 14, 2019, 2:20am

xterra:

Don
Thanks!
PS. If I would like to search for more than one string (GCATGAAAACATACA and TTTCCAGAAATTGT) and report different number characters (3 and 6 . I should be able to do it passing the strings and number of charters as variables, right?
awk -vlen="3 6" -vstr="GCATGAAAACATACA TTTCCAGAAATTGT" '
BEGIN {	for (MX=n=split (str, TMP); n>0; n--) SRCH[TMP[n]] = n
	String = n
	StringLen = length(String)
}
/^@/ {	matchline = NR + 1
	qualityline = NR + 3
	next
}
NR == matchline {
	if(spot = index($0, String))
		printf("Codon:\t%s\tQuality Score:\t",
		    substr($0, spot + StringLen, len))
	else	qualityline = 0
	next
}
NR == qualityline {
	printf("%s\n", substr($0, spot + StringLen, len))
}' test.txt

I'm pretty sure that the code you have shown us above won't do what you want to do, but from your new description I'm not sure what it is that you're trying to do.

The String variable in my code needs to be a string; not a number representing an index into the TMP[] and/or SRC[] arrays. If you're looking for multiple strings on the 2nd line of each group of input lines you're processing, you need to perform more than one index() operation to search for those strings.

If you could give us a clearer description and show us the output you hope to produce from your new requirements, maybe we could help.

Don_Cragun · January 14, 2019, 3:28am

Making a few wild guesses... If the output you're trying to produce is:

Codon:	AAC	Quality Score:	,ED
Codon:	TCCAAG	Quality Score:	G7DCGG
Codon:	AAC	Quality Score:	GCC
Codon:	TCCAAG	Quality Score:	DGGCGG
Codon:	AAC	Quality Score:	GCC
Codon:	TCCAAG	Quality Score:	DGGCGG
Codon:	TTT	Quality Score:	+GG

you could try something like:

awk -v lengths="3 6" -v strings="GCATGAAAACATACA TTTCCAGAAATTGT" '
BEGIN {	nString = split(strings, String)
	split(lengths, OutLen)
	for(i = 1; i <= nString; i++)
		StringLen = length(String)
}
/^@/ {	getline CodonLine
	getline
	getline QualityLine
	for(i = 1; i <= nString; i++)
		if(spot = index(CodonLine, String))
			printf("Codon:\t%s\tQuality Score:\t%s\n",
			    substr(CodonLine, spot + StringLen, OutLen),
			    substr(QualityLine, spot + StringLen, OutLen))
}' file

nezabudka · January 14, 2019, 4:23am

It was interesting to try to implement this algorithm on the sed

#!/bin/sed -nrf
2~4 h
4~4 {
H;x
s/(.*GCATGAAAACATACA.{3})(.*)/\1\r\2/
}
/\r/ {
:1
s/^.(.{2}[^\r].*)/\1/
T2
s/(\n).(.*)/\1\2/
t1
:2
s/^(.{3}).*/\1/mg
s/(.*)\n(.*)/Codon:\t\1\tQuality Score:\t \2/p
}

Xterra · January 14, 2019, 2:55pm

Don
I modified a bit your script to output the total count and give some format:

awk -v gene="gene-a gene-b" -v lengths="3 6" -v strings="GCATGAAAACATACA TTTCCAGAAATTGT" '
BEGIN {	nString = split(strings, String)
	split(lengths, OutLen)
	split(gene, Id)
	for(i = 1; i <= nString; i++)
		StringLen = length(String)
}
/^@/ {	getline CodonLine
	getline
	getline QualityLine
	for(i = 1; i <= nString; i++)
		if(spot = index(CodonLine, String))
			printf("Gene:\t"Id"\tCodon:\t%s\t\tQuality Score:\t%s\t\n",
			    substr(CodonLine, spot + StringLen, OutLen),
			    substr(QualityLine, spot + StringLen, OutLen))
}' test.txt | awk '{ count[$0]++ } END {{ print "\n\t\t\t\tSummary\n#############################################################################\nCount\t\tGene\t\tCodon\t\t\tQuality Score\n" } {for (gene in count ) print count[gene] "\t" gene | "sort -k 3"}}'

With the above script I am getting the desired output:


                                Summary
#############################################################################
Count           Gene            Codon                   Quality Score

1       Gene:   gene-a  Codon:  AAC             Quality Score:  ,ED
2       Gene:   gene-a  Codon:  AAC             Quality Score:  GCC
1       Gene:   gene-a  Codon:  TTT             Quality Score:  +GG
2       Gene:   gene-b  Codon:  TCCAAG          Quality Score:  DGGCGG
1       Gene:   gene-b  Codon:  TCCAAG          Quality Score:  G7DCGG

However, I tried to include the END step in your awk script fail miserably. How can I modify the script so I don't have to "stitch" together the two scripts as shown above?
Thanks!

Don_Cragun · January 14, 2019, 10:51pm

Hi Xterra,
Maybe something like:

awk -v genes="gene-a gene-b" \
    -v lengths="3 6" \
    -v strings="GCATGAAAACATACA TTTCCAGAAATTGT" '
BEGIN {	nString = split(strings, String)
	split(lengths, SLen)
	split(genes, Id)
	for(i = 1; i <= nString; i++)
		StringLen = length(String)
	sort_cmd = "sort -k3,3 -k5,5 -k8,8"
	print "\n\t\t\t\tSummary"
	print "#############################################################" \
	    "################"
	print "Count\t\tGene\t\tCodon\t\t\tQuality Score\n"
}
/^@/ {	getline CodonLine
	getline
	getline QualityLine
	for(i = 1; i <= nString; i++)
		if(spot = index(CodonLine, String))
			count[sprintf( \
			    "Gene:\t%s\tCodon:\t%s\tQuality Score:\t%s",
			    Id,
			    substr(CodonLine, spot + StringLen, SLen),
			    substr(QualityLine, spot + StringLen, SLen)) \
			]++
}
END {	for(i in count)
		printf("%d\t%s\n", count, i) | sort_cmd
}' test.txt

which produces the output:

				Summary
#############################################################################
Count		Gene		Codon			Quality Score

1	Gene:	gene-a	Codon:	AAC	Quality Score:	,ED
2	Gene:	gene-a	Codon:	AAC	Quality Score:	GCC
1	Gene:	gene-a	Codon:	TTT	Quality Score:	+GG
2	Gene:	gene-b	Codon:	TCCAAG	Quality Score:	DGGCGG
1	Gene:	gene-b	Codon:	TCCAAG	Quality Score:	G7DCGG

I'm sure you could write this as a 1-liner, but I much prefer something I can see on a screen (and debug).

If there's anything here you can't figure out, ask questions about what you don't understand.

Hope this helps,
Don