Selecting sequences based on scores

I have two files with thousands of sequences of different lengths. infile1 contains the actual sequences and infile2 the scores for each A, T, G and C in infile1. Something like this:
infile1:

>HZVJKYI01ECH5R
TTGATGTGCCAGCTGCCGTTGGTGTGCCAA
>HZVJKYI01AQWJ8
GGATATGATGATGAACTGGTTTGGCACACC
>HZVJKYI01C8OAV
GGATATGATGATGAACTGGTTTGGCACACC
>HZVJKYI01AXR15
TTGATGTGCCAGCTGCCGTTGGTGT
>HZVJKYI01EDZM4
TGATGTGCCAGCTGCCGTTGGTGTACCAGT

infile2:

>HZVJKYI01ECH5R
18 20 30 30 38 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 38 30 30 20 20 
>HZVJKYI01AQWJ8
26 26 26 38 40 40 40 40 40 40 39 35 32 31 31 32 32 17 17 16 16 16 27 27 34 34 40 34 23 26 
>HZVJKYI01C8OAV
29 29 29 34 39 39 40 38 38 40 40 40 36 36 36 36 34 33 33 36 33 33 37 30 32 32 30 30 30 30 
>HZVJKYI01AXR15
21 21 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
>HZVJKYI01EDZM4
26 30 36 40 40 40 40 40 40 40 40 40 40 40 40 30 21 21 21 34 34 40 40 40 40 40 40 40 40 40

I need to output the sequences where the average score is >35, the length of the sequence >28 and a minimal score >20. From this example, the first sequence will not be included in the outfile because the minimal score is 18. The second sequence will be also excluded because the minimal score is 16 and the average is 30.5. The third sequence is eliminated because the average score is <35. The forth sequence will also not be deleted because the length is less than 28. Thus, the outfile will contain only one sequence:

outfile:

>HZVJKYI01EDZM4
TGATGTGCCAGCTGCCGTTGGTGTACCAGT

I would like to use awk script so I can include it in a bash file. However, I do not know how to match the infiles. Any help will be greatly appreciated.

Build an array with only elements of those records in file 2 that have those criteria. Then use that array to print the corresponding records in file 1,

I will be very thankful if you can be more specific. I am really trying to understand the whole think but I just cannot get it to work.

Try this

awk 'NR==FNR{q=$1;getline;s=d=0;for(i=1;i<=NF;i++){if($i<20&&!d){d=1}s+=$i}a=s/NF;if(NF>=28&&!d&&a>=35){z[q]++}}
z[$1]{print;getline;print}' infile2 infile1

--ahamed

1 Like

Thanks! That works great!