Help: Counting values less than a number

Alexandryne · May 25, 2016, 1:34pm

So I have several files (35000, to be exact) in the format rmsd_protein_*.dat each with 2 columns and 35000 rows.

I would like to count how many values in the second column are less than 3 for each file, and output it into a new file so that it ultimately appears as:

1           14057
2           20598

.....

where the left column is the * value from the data file's name, and the number on the right is the number of values from that file less than three.

Unfortunately, I'm a n00b as far as shell scripts go, and even though this would be the easiest thing in the world in Java, I just can't figure it out in a shell script. Thoughts?

rdrtx1 · May 25, 2016, 2:01pm

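#!/bin/sh

# Loop over the data files directly; a glob avoids parsing ls output.
for file in rmsd_protein_*.dat
do
   f=${file##*_}     # strip everything up to the last "_"
   f=${f%%[.]*}      # strip the extension
   # c+0 prints 0 instead of an empty field when no value is below 3
   awk -v f="$f" '$2 < 3 {c++} END {print f, c+0}' "$file"
done > new_file.dat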
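
For readers new to sh, the two parameter expansions in the loop above are what turn a name like rmsd_protein_12.dat into 12. A minimal sketch of just that step; the function name file_index is made up for illustration and is not part of the posted script:

```shell
# file_index is a hypothetical helper name used only for this illustration.
file_index() {
    f=${1##*_}      # drop the longest prefix ending in "_"   -> "12.dat"
    f=${f%%[.]*}    # drop the longest suffix starting at "." -> "12"
    printf '%s\n' "$f"
}

file_index rmsd_protein_12.dat
```

Both expansions are done by the shell itself, so no extra process is started per file for this step.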

Alexandryne · May 25, 2016, 2:15pm

Unfortunately I get the error "while: Expression Syntax." when I try to run that...?

rdrtx1 · May 25, 2016, 2:18pm

Try running it with sh, bash, or ksh; the script uses Bourne shell syntax.

Alexandryne · May 25, 2016, 2:25pm

I'm using C shell (on Red Hat Enterprise Linux Workstation release 6.5 (Santiago), if that matters in any way).

rdrtx1 · May 25, 2016, 2:31pm

For csh:

#!/bin/csh

foreach file ( "`ls -1 | grep 'rmsd_protein_.*[.]dat'`" )
   awk '$2 < 3 {c++} END {sub(".*_", "", f); sub("[.].*", "", f); print f, c+0}' f="$file" "$file"
end

Scrutinizer · May 25, 2016, 2:35pm

I am unsure what you mean exactly, but I take it you want one line for each file with the number of lines where $2 was < 3, plus the total number of lines for that file. If so, this should be fairly quick:

find . -name 'rmsd_protein_*.dat' -exec awk 'FNR==1{if(NR>1) print c,t; c=0} $2<3{c++}{t=FNR} END{print c,t}' OFS='\t' {} + > newfile.dat
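
The one-liner above relies on two awk counters: NR counts lines across all input files, while FNR restarts at 1 for each file, so FNR==1 marks the start of a new file. A sketch of the same awk program spread over several lines with comments; the wrapper name count_with_totals is made up for illustration:

```shell
# count_with_totals is a hypothetical wrapper name; the awk program is the
# one-liner from the post above, spread out and commented.
count_with_totals() {
    awk '
        FNR == 1 {                  # first line of each input file
            if (NR > 1) print c, t  # not the very first file: report the previous one
            c = 0                   # reset the counter for the new file
        }
        $2 < 3 { c++ }              # count rows whose second column is below 3
        { t = FNR }                 # remember how many lines this file has so far
        END { print c, t }          # report the last file
    ' OFS='\t' "$@"
}
```

It could be called as count_with_totals rmsd_protein_1.dat rmsd_protein_2.dat, which prints one tab-separated line per file.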

Alexandryne · May 25, 2016, 2:49pm

Ah ha, this runs and reports the results as desired -- however, I would like it in a dat file rather than just printed to the screen, if possible. The only way I've done an awk to a file before is

awk 'insertawkcommandshere' > newfile.dat

but that enters an infinite loop with this; how do I fix that?

rdrtx1 · May 25, 2016, 2:51pm

The loop is not infinite. It ends when the last file is processed.

#!/bin/csh

rm -f new_file.dat

foreach file ( "`ls -1 | grep 'rmsd_protein_.*[.]dat'`" )
   awk '$2 < 3 {c++} END {sub(".*_", "", f); sub("[.].*", "", f); print f, c+0}' f="$file" "$file" >> new_file.dat
end

Alexandryne · May 25, 2016, 3:02pm

It certainly looks like it should work to me, but this creates an empty file and still prints the results to the screen instead..?

---------- Post updated at 02:02 PM ---------- Previous update was at 01:59 PM ----------

(additional note: I tried putting the "> data.dat" after the "f= $file $file" and then it at least filled in the data document rather than printing to the screen, but every new line overwrites the last line...)
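
The overwriting described in the note is what ">" does when it sits inside a loop: the file is truncated on every iteration. A small sketch contrasting it with ">>" and with redirecting the whole loop once. It is written in sh rather than csh and uses a temporary file and made-up values:

```shell
tmp=$(mktemp)

# ">" inside the loop truncates the file each time, so only the last line survives.
for i in 1 2 3; do echo "$i" > "$tmp"; done
overwritten=$(wc -l < "$tmp")

# ">>" appends, so every iteration adds a line (the file is emptied first).
: > "$tmp"
for i in 1 2 3; do echo "$i" >> "$tmp"; done
appended=$(wc -l < "$tmp")

# Redirecting the loop as a whole opens the file once and also keeps every line.
for i in 1 2 3; do echo "$i"; done > "$tmp"
redirected=$(wc -l < "$tmp")

echo "$overwritten $appended $redirected"
rm -f "$tmp"
```

The three line counts come out as 1, 3 and 3. When appending, the output file has to be removed or emptied before the loop starts, otherwise results from an earlier run stay in it.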

rdrtx1 · May 25, 2016, 3:03pm

See the corrected script above.

Alexandryne · May 25, 2016, 3:06pm

Eyyyy, perfect! Thanks!

Scrutinizer · May 25, 2016, 4:26pm

OK, with these requirements this is a version without a shell loop, which should run fairly quickly:

find . -name 'rmsd_protein_*.dat' -exec awk 'FNR==1{if(NR>1) print F[2],c; split(FILENAME,F,/.*_|[.]/); c=0} $2<3{c++}END{print F[2],c}' OFS='\t' {} + > newfile.dat
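
find is not guaranteed to return the files in numeric order, so the lines in newfile.dat may not run 1, 2, 3, ... as in the desired output. If that order matters, one possible follow-up step, assuming the two-column output produced by the command above, is a numeric sort on the first column. The function name and the sorted file name below are made up for illustration:

```shell
# sort_counts is a hypothetical helper: numeric sort on the first column.
sort_counts() {
    sort -n -k1,1 "$1"
}
```

For example: sort_counts newfile.dat > newfile_sorted.dat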