So I have several files (35000, to be exact) in the format rmsd_protein_*.dat each with 2 columns and 35000 rows.
I would like to count how many values in the second column are less than 3 for each file, and output it into a new file so that it ultimately appears as:
1 14057
2 20598
.....
where the left column is the * value from the data file's name, and the number on the right is the number of values in that file that are less than three.
Unfortunately, I'm a n00b as far as shell scripts go, and even though this would be the easiest thing in the world in Java, I just can't figure it out in a script. Thoughts?
#!/bin/sh
ls -1 | grep '^rmsd_protein_.*\.dat$' | while read -r file
do
f=${file##*_}
f=${f%%[.]*}
awk '$2 < 3 {c++} END {print f, c+0}' f="$f" "$file"
done > new_file.dat
Unfortunately I get the error "while: Expression Syntax." when I try to run that...?
C shell (on Red Hat Enterprise Linux Workstation release 6.5 (Santiago), if that matters in any way)
For csh:
#!/bin/csh
foreach file ( "`ls -1 | grep 'rmsd_protein_.*.dat' `" )
awk '$2 < 3 {c++} END {sub(".*_", "", f); sub("[.].*", "", f); print f, c}' f=$file $file
end
I am unsure what you mean exactly, but I take it you want one line per file with the number of lines where $2 was < 3, plus the total number of lines in that file. If so, this should be fairly quick:
find . -name 'rmsd_protein_*.dat' -exec awk 'FNR==1{if(NR>1) print c,t; c=0} $2<3{c++}{t=FNR} END{print c,t}' OFS='\t' {} + > newfile.dat
Ah ha, this runs and reports the results as desired -- however, I would like it in a dat file rather than just printed to the screen, if possible. The only way I've done an awk to a file before is
awk 'insertawkcommandshere' > newfile.dat
but that enters an infinite loop with this; how do I fix that?
The loop is not infinite. It ends when the last file is processed.
#!/bin/csh
rm -f new_file.dat
foreach file ( "`ls -1 | grep 'rmsd_protein_.*.dat' `" )
awk '$2 < 3 {c++} END {sub(".*_", "", f); sub("[.].*", "", f); print f, c}' f=$file $file >> new_file.dat
end
It certainly looks like it should work to me, but this creates an empty file and still prints the results to the screen instead..?
(additional note: I tried putting the "> data.dat" after the "f= $file $file" and then it at least filled in the data document rather than printing to the screen, but every new line overwrites the last line...)
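The overwriting described here is what `>` does when it sits on a command inside a loop: the file is reopened and truncated on every pass, whereas `>>` appends. A minimal sketch in POSIX sh (not csh) of the three placements. The function name `demo_redirects` and the `demo_*.dat` file names are hypothetical and exist only for this illustration:

```shell
#!/bin/sh
# Illustrative only: demo_redirects and the demo_*.dat names are hypothetical.
demo_redirects() {
    out=$1    # directory to write the demo files into
    rm -f "$out/demo_append.dat"

    for n in 1 2 3; do
        echo "$n" >  "$out/demo_trunc.dat"    # reopened and truncated each pass
        echo "$n" >> "$out/demo_append.dat"   # appended each pass
    done

    # Redirecting the whole loop opens the file once, so nothing is lost.
    for n in 1 2 3; do
        echo "$n"
    done > "$out/demo_once.dat"
}
```

After running it, demo_trunc.dat holds only the last value, while the other two files hold all three lines.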
OK, with these requirements this is a version without a shell loop, which should run fairly quickly:
find . -name 'rmsd_protein_*.dat' -exec awk 'FNR==1{if(NR>1) print F[2],c; split(FILENAME,F,/.*_|[.]/); c=0} $2<3{c++}END{print F[2],c}' OFS='\t' {} + > newfile.dat
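For readability, here is the same idea spelled out as a small POSIX sh function. Treat it as a sketch rather than a drop-in replacement: the function name `count_below` and its optional threshold argument are made up for illustration, `c + 0` is added so that a file with no matching rows prints 0, and a `sort -n` is appended on the assumption that numeric order is wanted, since find returns files in no particular order.

```shell
#!/bin/sh
# count_below DIR [MAX]: for each rmsd_protein_N.dat under DIR, print
# "N<TAB>count", where count is the number of rows whose 2nd column is < MAX.
# (count_below and MAX are illustrative names, not from the thread.)
count_below() {
    dir=$1
    max=${2:-3}
    find "$dir" -name 'rmsd_protein_*.dat' -exec awk -v max="$max" '
        # First line of each file: flush the previous count, then reset.
        FNR == 1 {
            if (seen) print id, c + 0
            id = FILENAME
            sub(/.*_/, "", id)      # strip everything up to the last "_"
            sub(/[.].*/, "", id)    # strip the ".dat" suffix
            c = 0
            seen = 1
        }
        $2 < max { c++ }
        END { if (seen) print id, c + 0 }
    ' OFS='\t' {} + | sort -n
}
```

It would be called as, for example, count_below . > newfile.dat. Note that empty input files never trigger FNR == 1 and are therefore skipped in the output.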