Select lines where at least x columns above threshold value

pathunkathunk · March 14, 2013, 1:30pm

I have a file with 20 columns. I'd like to retain only the lines for which the values in at least x columns, looking only at columns 6-20, are above a threshold.

For example, I'd like to retain only the lines in the file below that have at least 8 columns (again, looking only at columns 6-20) with a value of at least 0.75. I would also like to be able to easily change both the minimum number of columns (8 in this case) and the threshold (0.75).

File:

s_20331    822    1    1.000    5.0    0.00000000    0.14395044    0.00000000    0.00000000    0.00000000    0.20102041    0.00000000    0.00000000    0.00000000    0.28091837    0.11224490    0.03571429    0.00000000    0.00000000    0.00000000
s_20416    154    1    1.000    5.0    0.00000000    1.00000000    0.66666667    0.40000000    0.30216165    1.00000000    0.66666667    0.45142857    0.35714286    0.11111111    0.32659933    0.55245256    0.17424242    0.32832080    0.10345717
s_20476    114    1    1.000    5.0    0.00000000    1.00000000    0.42857143    0.85100619    1.00000000    1.00000000    0.42857143    0.86996904    1.00000000    0.25000000    0.13039843    0.00000000    0.19697069    0.25000000    0.10607391
s_20477    162    1    1.000    6.0    0.20987654    0.79423868    0.81481481    0.78395062    0.77777778    1.00000000    1.00000000    1.00000000    1.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000

Output:

s_20477    162    1    1.000    6.0    0.20987654    0.79423868    0.81481481    0.78395062    0.77777778    1.00000000    1.00000000    1.00000000    1.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000

I'm a novice, and all I have so far is an awk command that applies a threshold to a single column, piped to another awk command that screens the next column, and so on. This is obviously inelegant, and it also can't allow some columns to remain below the threshold.

awk '{if($6>=0.75)print;}' | awk '{if($7>=0.9)print;}' | awk '{if($8>=0.9)print;}'  | awk '{if($9>=0.9)print;}' [...etc]

Don_Cragun · March 14, 2013, 2:39pm

You could try something like:

#!/bin/ksh
# SYNOPSIS:
# colcheck [file [first_column [last_column [threshold [pass_count]]]]]
# DESCRIPTION:
# Print all lines in the file named by "file" (default file is input) in which
# at least "pass_count" (default value 8) values in columns "first_column"
# (default value 6) through "last_column" (default value 20) are greater than or
# equal to "threshold" (default value 0.75).
file=${1:-input}
fc=${2:-6}
lc=${3:-20}
threshold=${4:-0.75}
pass_count=${5:-8}
awk -v f="$fc" -v l="$lc" -v t="$threshold" -v p="$pass_count" '
{       c = p
        for(i = f; i <= l && c; i++) if($i >= t) c--
        if(c == 0) print
}' "$file"

If you are using a Solaris/SunOS system, use /usr/xpg4/bin/awk or nawk instead of awk.

I use the Korn shell, but this should also work with any other shell that accepts Bourne shell syntax (such as bash).
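
A quick way to sanity-check the countdown logic is to run it on a small made-up input. The sketch below is only an illustration: the at_least helper name, the three sample lines, the column range 2-5, and the pass count of 2 are all assumptions chosen to keep the example short, not values from the original file.

```shell
# Hypothetical helper: same countdown logic as above, but reading stdin.
# Usage: at_least FIRST_COLUMN LAST_COLUMN THRESHOLD PASS_COUNT
at_least() {
    awk -v f="$1" -v l="$2" -v t="$3" -v p="$4" '
    {   c = p                           # passes still needed on this line
        for (i = f; i <= l && c; i++)   # stops early once c reaches 0
            if ($i >= t) c--
        if (c == 0) print
    }'
}

# Made-up 5-column sample; test columns 2-5, need at least 2 values >= 0.75.
printf '%s\n' 'a 0.9 0.8 0.1 0.2' 'b 0.1 0.2 0.3 0.9' 'c 0.75 0.1 0.75 0.1' |
    at_least 2 5 0.75 2
```

This should print the "a" and "c" lines: "b" has only one passing column, and a value exactly equal to the threshold counts as a pass because the test is >=.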

rdrtx1 · March 14, 2013, 2:52pm

try also:

awk '{count=0; for (col=6; col<=20; col++) if ($col >= .75) count++; if (count>=8) print}' infile

RudiC · March 14, 2013, 5:41pm

In your sample code, you don't have identical thresholds for the columns, but in your spec, you do. I'll assume the latter, as it's easier for a start.
For playing around, it might be best to have all parameters as variables:

$ awk '{cnt=0; for (i=FST; i<=LST; i++) cnt+=($i>=THR)} cnt>=MIN' FST=6 LST=20 THR=0.75 MIN=8 file
s_20477    162    1    1.000    6.0    0.20987654    0.79423868    0.81481481 etc . . .

or, shamelessly stealing Don Cragun's ideas, this should do as well:

$ awk '{cnt=MIN; for (i=FST; i<=LST && cnt; i++) cnt-=($i>=THR)} !cnt' FST=6 LST=20 THR=0.75 MIN=8 file
s_20477    162    1    1.000    6.0    0.20987654    0.79423868    0.81481481    0.78395062 etc . . .

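If you want exactly MIN columns to pass the threshold, remove the && cnt from the for (...) condition. Without the early exit, cnt goes negative when more than MIN columns pass, so !cnt is true only when exactly MIN columns do.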
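
The difference between the two forms is easy to see on a small made-up input. This is just a sketch: the sample function, its three lines, and the parameter values (columns 2-5, THR=0.75, MIN=2) are assumptions for illustration, and it uses >= so that a value equal to the threshold counts, as the original question asks for "at least" 0.75.

```shell
# Made-up 5-column sample: "a" has 3 passing columns, "b" has 1, "c" has exactly 2.
sample() {
    printf '%s\n' 'a 0.9 0.8 0.9 0.2' 'b 0.1 0.2 0.3 0.9' 'c 0.8 0.1 0.9 0.1'
}

# At least MIN columns: the "&& cnt" stops the loop once cnt reaches 0.
sample | awk '{cnt=MIN; for (i=FST; i<=LST && cnt; i++) cnt-=($i>=THR)} !cnt' FST=2 LST=5 THR=0.75 MIN=2

# Exactly MIN columns: without "&& cnt", extra passes push cnt below 0, so !cnt fails.
sample | awk '{cnt=MIN; for (i=FST; i<=LST; i++) cnt-=($i>=THR)} !cnt' FST=2 LST=5 THR=0.75 MIN=2
```

The first command should keep lines "a" and "c"; the second should keep only "c", since "a" has three passing columns rather than two.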