awk RS/ORS error

yifangt · March 8, 2016, 5:17pm

Hello,
I am trying to filter fastq file (in short, every 4 lines to be a record) based on the GC counts (GC-contents) in sequence (i.e. field 2), which is the count % of the G/C chars in the string. The example script is to pick up records with GC contents > 0.6 in the sequence (second field).
One thing special is the "@" symbol is always the first char of the first row in each record, but it may appear in the third field of anywhere except the first position.
A sample input.file is:

@HWI-ST1410:193:C7847ANXX:3:1101:3144:2591
CCGCTTGGAGCGGATCAGGTAGTCGACCTGCTTAAGGAGGGC
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@HWI-ST1410:193:C7847ANXX:3:1101:3050:2607
CAAAAAAAATTTTCTATTTTACATATACAATGAAGAACGTCACTG
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFHHH
@HWI-ST1410:193:C7847ANXX:3:1101:3075:2609
CACTGTACTAAGCTTTGGCGCTGATTCCATAATTTCTTTCTC
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@HWI-ST1410:193:C7847ANXX:3:1101:3098:2622
GGTACGTACACATAATCCGTTGACTAGCTCGATACGATTACG
+
BBBBBFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFF
@HWI-ST1410:193:C7847ANXX:3:1101:3097:2667
CCCGGCGGGAGAGGGACGGCAGGCTCGTCGGCGCCACAATCG
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

So far, my script is:

awk 'BEGIN{RS="\n@", FS="\n"; OFS="\n"} {s=$2; if (gsub(/[GC]/, "x", $2)/length($2)>0.6) print "@"$1, s, $3, $4}' input.file

My script seems not doing what I want, as the first record always has double "@@" for its record/sequence name.
How to deal with the first record without the "\n@" as the RS? Sometimes the "@" symbol was NOT put back in front of $1 to have the original string.
I am using GNU Awk 4.0.1 under Linux 3.19.0-32-generic ~14.04.1 Ubuntu.
Thanks a lot for any clue!

rdrtx1 · March 8, 2016, 8:25pm

try:

awk '
/^@/ && m && c > .6 {printf m;}
/^@/ {r=NR+1; m=""; c=0;}
{m=m $0 "\n";}
r==NR {c=gsub(/[GC]/, "x")/length($0);}
END { if (m && c > .6) printf m;}
' input.file

Scrutinizer · March 8, 2016, 10:01pm

FWIW, the reason the first record gets a @@ , is because it still starts with @ , because the first record does not have a newline ( \n ) before it as is specified by RS="\n@"

So this could be corrected, by cutting the leading @ for the first line, using gawk4:

awk 'BEGIN{RS="\n@"; FS="\n"; OFS="\n"} FNR==1{sub(/^@/,x)} { s=$2; if (gsub(/[GC]/, "x", $2)/length($2)>0.6) print "@"$1, s, $3, $4}' file

Are you sure that what is now $2 in the sample is never more than one line?

yifangt · March 9, 2016, 10:10am

Thanks Scrutinizer!
Are you sure that what is now $2 in the sample is never more than one line?
If I catch your question correctly, I would say Yes, $2 is never more than one line because every record is "4 line", positive.

One more thing to confirm for sub(/^@/, x) , I was expecting the 'x' char at the beginning of the first record, but it is not there. When I read GNU manual about sub() function, it states:

.....Some versions of awk allow the third argument to be an expression that is not an lvalue.  
In such a case, sub() still searches for the pattern and returns zero or one, 
but the result of the substitution (if any) is thrown away because there is no place to put it.

so, in your script, the substitution 'x' was thrown away, but the original "^@" is put back by print "@"$1. Am I right?

Scrutinizer · March 9, 2016, 2:35pm

Hi yifangt, x is an unused variable (it is not a string, since there are no quotes around it), so it is equivalent to ""
so it can also be written as sub(/^@/, "") , which means delete the first @ of the record.

Using "" is probably clearer, so I would use that...

yifangt · March 9, 2016, 5:25pm

Hi Scrutinizer, thanks a lot for your explanation!
rdrtx1, can you, or anybody else, please elaborate your script? I have hard time to understand it.
Thanks again!

rdrtx1 · March 9, 2016, 6:00pm

awk '
/^@/ && m && c > .6 {printf m;}          # if line starts with @ and string m exists and criteria is > .6 print string m
/^@/ {r=NR+1; m=""; c=0;}                # if line starts with @ store the next record # in variable r; clear string m; clear criteria
{m=m $0 "\n";}                           # concatenate line to string m
r==NR {c=gsub(/[GC]/, "x")/length($0);}  # if record number matches stored record number evaluate criteria and store result in variable c
END { if (m && c > .6) printf m;}        # at end of file print string m if it exists and criteria is > .6
' input.file

yifangt · March 9, 2016, 7:09pm

Hmm, it seems I can understand individual line of your script, but not the whole logic. Can you comments my catch:

1) The first line /^@/ && m && c > .6 {printf m;} is used for previous record that has been scanned in the loop. For example, the first record (if its c > 0.6) of the input.file is not printed until the second record has been scanned. Am I right?
2) The END {if (m && c > .6) printf m;} block is only for the last record in the case the sequence has c>=0.6, right?

At first I thought m is for all the records and stored in the memory, which would slow down the process quite bit I guess. With a file of 10GB of size, your script still works as fast like the other one without any RAM issue, which brought me to confirm my understanding.

rdrtx1 · March 9, 2016, 7:52pm

1) yes. the previous record is stored in variable m
2) yes.

yifangt · March 9, 2016, 7:56pm

It was hard for me at first.
Thanks a lot!