Summing Quality Scores in FASTQ File

Hello, I am currently working through an assignment in which we are tasked with creating a Python program that iterates through a FASTQ file, converts each Phred quality character into a numeric quality score, and accumulates, for each nucleotide position, the sum of the quality scores at that position into an array. For instance, position 0 of the array will contain the sum of the quality scores of the first nucleotide of every sequence in the file. We first had to create an array filled with 102 placeholder values of 0.0 for the sums (each sequence is 102 nucleotides long). So far I have been able to iterate through the file and convert the Phred characters into quality scores, but I do not know how to iterate through the scores any further.

array = [0.0] * 102

def convert_phred(c):
    return ord(c) - 33

i = 1
for lines in fh:
    i += 1
    if i % 4 == 1:
        for char in lines:
            print(convert_phred(char))

Hi,

some additional clarification is required -
how are the bytes in a given data row correlated with the array elements?
(FASTQ sequences can be longer or shorter than 102 bytes.)

If you can show some sample data you are using, and what the expected output in array[...] would be for that data, that would be helpful.

tks

Hi, so for example:
If I had two reads, 5 nucleotides long, in the file. Once I specify the quality score indicator lines with:

i = 1
for lines in fh:
    i+=1
    if i % 4 == 1:
        print(lines)

I get an output of:
CFHIJ
JIHFC

So the phred scores would be:
34 37 39 40 41
41 40 39 37 34

When I convert the characters to Phred scores with the loop I wrote

for char in lines:
    print(convert_phred(char))

I get an output of all the phred scores together:
34
37
39
40
41
41
40
39
37
34

I need each element of the array to be the average of that nucleotide position's scores (array[0] would equal (34 + 41) / 2, and so forth). So with this set my array would change from:
array = [0.0, 0.0, 0.0, 0.0, 0.0]
to:
array = [37.5, 38.5, 39.0, 38.5, 37.5]

In the real file each read is 101 nucleotides long, and there are 16 million Illumina reads.

using most of your code I put the following together;
let me know if this is roughly what you are after ...

# tested on linux and macos - python 3.x

cat agg.py
import sys

hndl = open(sys.argv[1], "r")    # open the file for read-only access

asize = int(sys.argv[2]) if len(sys.argv) == 3 else 102    # set the array size (or 102 if none supplied), useful for testing
cellsArray = [0] * asize    # start every positional sum at zero

rows = 0    # used as the divisor for averages at the end of the run

while True:
    arrelem = 0
    nucleo = hndl.readline()
    if nucleo:
        for byte in nucleo:
            cellsArray[arrelem] += (ord(byte) - 33)
            arrelem += 1
            if arrelem >= asize:    # don't run past the array (also skips the trailing newline here)
                break

        rows += 1
    else:
        print("accumulative:", cellsArray)
        print("")
        for cell in range(len(cellsArray)):    # calculate averages
            cellsArray[cell] /= rows
        break

print("    averages:", cellsArray)
print("\nprocessed ", rows, " rows")

sample data (just the quality lines)

cat abit
CGGTAGCCAGCTGCGTTCAGTATGGAAGATTTGATTTGTTTAGCGATCGCCATACTACCGTGACAAGAAAGTTGTCAGTCTTTGTGACTTGCCTGTCGCTCT
GATGCATACTTCGTTCGATTTCGTTTCAACTGGACAACCTACCGTGACAAAGAAAGTTGTCGATGCTTTGTGACTTGCTGTCCTCTATCTTCAGACTCCTTG
CGTATGCTTTGAGATTCATTCAGGAGGCGGGTATTTGCTCGATCATACCATACGTGGCAAGAAAGTTGTCAGTGTCTTTGTGTTTCTCTGTGGTGCGCGATA
CGATTATTTGGTTCGTTCATGTGGAACAGGTAACATGGATAAATACCCAGCCTATTTATTGACAGAAGTTATCAGTGTCTTTGTATTGCCTGTCGCTCTGTC
GTTGTACTTTACGTTTCAATGGAACAGTATTGTTTAACACTGATTCAGTGACTTAAAGAAATTGTCGGTATCTTTGGCAACTTACTGTCCCGCTCTATCTTC

test run, just for a few elements (12) of the data

python3 agg.py abit 12
accumulative: [178, 197, 223, 210, 219, 172, 204, 219, 219, 229, 193, 202]

    averages: [35.6, 39.4, 44.6, 42.0, 43.8, 34.4, 40.8, 43.8, 43.8, 45.8, 38.6, 40.4]

processed  5  rows
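For reference, the same per-position accumulation on the two 5-character example reads from earlier, as a minimal self-contained sketch (the quality strings are the ones from the example above):

```python
def convert_phred(c):
    """ASCII character -> Phred+33 quality score."""
    return ord(c) - 33

# the two example quality lines from above
reads = ["CFHIJ", "JIHFC"]

sums = [0.0] * len(reads[0])            # one running sum per nucleotide position
for qual in reads:
    for pos, char in enumerate(qual):   # enumerate yields the position index
        sums[pos] += convert_phred(char)

means = [s / len(reads) for s in sums]
print(means)  # [37.5, 38.5, 39.0, 38.5, 37.5]
```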

Hello, when I run this code I get the sum of all phred scores as all my values in the array. I cannot use sys since that is something that has not been taught to us yet. To open files we use:

in_file = "path to file"
fh = open(in_file, "r")

Here is my code using your commands:

array = [0.0] * 101
def convert_phred(c):
    return ord(c) - 33
i=1
rows=0
a=0
if fh:
    for lines in fh:
        i+=1
        if i % 4 == 1:
            for char in lines:
                for a in range(len(array)):
                    array[a] += convert_phred(char)
                    a += 1
                    if a >= len(array):
                        break

            rows += 1
else:
    for cell in range(len(array)):
        array[cell] /= rows
        break

print(array)

For example, my output array is [1788.0, 1788.0, 1788.0, 1788.0....] Which is the sum of all the phred scores in each line.

@Lafilleseule , ok, so are you 'happy' with what you have got now?

(having run your code, I am sure there are issues),
use print statements to help you debug intermediate processing and end results.

updated your version, minimal changes - run this against the data in the attached file 'abit' (20 records) and see if the numbers match those you expect, then repeat for your test data (post that if you can, along with the expected results - I can test it here)

users.py (1.3 KB)

abit (2.0 KB)

NB: it's vital to include 'debug details' (print statements showing intermediate results etc.) when developing - especially when beginning your coding journey.
Additionally, learning how to use a debugger will bring massive rewards in a short time - even if it feels awkward/alien to begin with.


Thank you! I was mixing up "rows" and "i."


can you confirm we can close this thread ?

(obviously, start a new one for further help on new challenges)

Hi @Lafilleseule,

some notes about your program:

  • The array has a fixed size, so you can store it in a variable and don't need to repeat len().
  • The convert_phred() function is very simple and only called in one place, so you can replace the call (which costs extra time) with the function body. In addition, the subtraction of the constant 33 can be moved to the very end, so it happens once per position instead of once per character.
  • The if fh condition is useless: if the file can't be opened, the program already terminates with an exception at fh = ..., and looping over an array that still contains only 0s (a loop your break would abort after one iteration anyway) makes no sense.
  • For file I/O, use with open(), as it guarantees that the file handle is closed correctly.
  • Instead of counting the processed lines via rows += 1, you can just loop over every 4th line in combination with enumerate.
  • The lines var should be named line because it contains exactly one line.
for char in lines:
    for a in range(len(array)):
        a += 1
        if a >= len(array):
            break

is way too complicated and confusing: a) it squares the effort per line, b) a is already the loop index and therefore does not need to be incremented explicitly, and c) the if a ... check is redundant, since a already only counts up to len(array) - 1. It also adds every character to every array element, which is why every element ends up holding the same total.
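A minimal fix that keeps the rest of your structure is to let enumerate produce the index instead of the inner range loop, sketched here on one of the example quality lines from earlier:

```python
def convert_phred(c):
    return ord(c) - 33

array = [0.0] * 5
lines = "CFHIJ\n"   # one quality line, including its trailing newline

# enumerate yields the position index, so no manual a += 1 is needed;
# slicing to len(array) stops at the array size and also skips the newline
for a, char in enumerate(lines[:len(array)]):
    array[a] += convert_phred(char)

print(array)  # [34.0, 37.0, 39.0, 40.0, 41.0]
```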

So, here's a shortened and much faster, but not optimal version:

LEN = 101
FILE = "data.txt"

arr = [0] * LEN
with open(FILE) as fin: # open() without opening mode defaults to "r"
    # keep only the quality lines (the 4th line of each FASTQ record),
    # counting them from 1 via enumerate
    # [...] = list comprehension, see below
    for (n, line) in enumerate([ln for ln in fin][3::4], 1):
        # list comprehension, more or less the same as
        # for i in range(LEN):
        #     arr[i] += ord(line[i])
        arr = [arr[i] + ord(line[i]) for i in range(LEN)]
# n now contains the no. of processed lines
arr = [arr[i] / n - 33 for i in range(LEN)]
print(arr)
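The list-comprehension version above first pulls every line of the file into memory; for 16 million reads, a streaming loop that keeps only the running sums is much kinder to RAM. A runnable sketch on a tiny generated file (the file name and reads are made up, and the quality line is assumed to be the 4th line of each record):

```python
# build a tiny 2-record FASTQ so the sketch is runnable end-to-end
records = (
    "@read1\nAACGT\n+\nCFHIJ\n"
    "@read2\nTGCAA\n+\nJIHFC\n"
)
with open("tiny.fastq", "w") as fout:
    fout.write(records)

LEN = 5          # would be 101 for the real data
arr = [0] * LEN
n = 0
with open("tiny.fastq") as fin:
    for lineno, line in enumerate(fin):
        if lineno % 4 == 3:              # quality line = 4th line of each record
            n += 1
            for i in range(LEN):
                arr[i] += ord(line[i])   # sum raw ASCII codes

arr = [a / n - 33 for a in arr]          # Phred offset subtracted once, at the end
print(arr)  # [37.5, 38.5, 39.0, 38.5, 37.5]
```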

One last note: for a task with this amount and type of data (1.6 GB of ASCII text), Python is rather slow; a stream-processing tool such as awk would handle it more efficiently.