Sorting operations on several files

I have over 250 files (named grad.1000, grad.1001, grad.1002, ...) - see attachment - that have this format:

# 0.004 0.692758
# 6.23025467936 6.23025467936 6.23025467936 44.9620453206 44.9620453206 44.9620453206
# 0.8 1.19989 0.99914606306 1.0117015948 1.03761854021 1.07717125288 1.13095455063
0.008 0 887 7.27461421247 11.2755138444 6.64899792979 0.000974619185608 4.33760040716e-06 97.4619182195 0.0867520800533 0.984433771469 -0.170586579294 -0.0423127469799 0.145837674011 0.658470875326 0.738341031764 
0.008 887 0 7.27461421247 11.2755138444 6.64899792979 0.000974619185608 4.33760040716e-06 97.4619182195 0.0867520800533 -0.984433771469 0.170586579294 0.0423127469799 -0.145837674011 -0.658470875326 -0.738341031764 
0.008 0 8845 7.90284821362 11.4658264202 7.26529902842 0.110530350443 0.00536257173099 11053.0350414 107.251434471 0.213274962743 -0.438677902466 -0.872969351212 -0.411755829948 -0.850655035361 0.326868700427 
0.008 8845 0 7.90284821362 11.4658264202 7.26529902842 0.110530350443 0.00536257173099 11053.0350414 107.251434471 -0.213274962743 0.438677902466 0.872969351212 0.411755829948 0.850655035361 -0.326868700427 
0.008 0 521 7.6646990021 10.5040106009 6.88903752062 0.00504147840528 0.000128931622848 504.147840346 2.57863215449 0.497796648814 0.79640579983 -0.343418547004 0.0167640041807 0.387056165816 0.921903732863 
0.008 521 0 7.6646990021 10.5040106009 6.88903752062 0.00504147840528 0.000128931622848 504.147840346 2.57863215449 -0.497796648814 -0.79640579983 0.343418547004 -0.0167640041807 -0.387056165816 -0.921903732863 
0.008 0 8462 8.51563871619 10.5252189655 6.56377074841 0.0695062123246 0.00333349601709 6950.62123031 66.6699200957 -0.593216665384 0.802236178796 0.0671647328714 -0.475994807994 -0.416810585143 0.774401626338

The first three lines (the ones starting with #) should be ignored, and left out of the average calculation as well.

What I want to do is to generate two sub-files from each file according to this condition:

if (col[9] / average(col[9])) > 1

print the whole line to the first sub-file, named grad_high.1000, grad_high.1001, etc.

else

if (col[9] / average(col[9])) < 1

print the whole line to the second sub-file, named grad_low.1000, grad_low.1001, etc.

so that from the 250 original grad files I get 500 new sub-files. I have attached two files as examples.

Can someone please help me with this using awk or any other scripting method?

Ok. Is the column 9 average based on all files - the average of column 9 from grad.1000, grad.1001, ..., grad.nnnn? Or is it based on just the file you are currently working on?

BTW, the comparison

((col[9] / average(col[9])) < 1)

is the same as

(col[9] < average(col[9]))

without using division.
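(That holds because column 9 is non-negative in your data, so the average is positive and dividing both sides by it preserves the inequality. For example, with an average of 4: 6/4 = 1.5 > 1 and 6 > 4, while 2/4 = 0.5 < 1 and 2 < 4.)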

The average is based on the file I am currently working on, so the average will be different for different files.

And yes,

((col[9] / average(col[9])) < 1)

is the same as

(col[9] < average(col[9]))

In any event, this is either going to be a memory hog or require two passes through a file. Neither is a great choice. I'm opting for two passes.

#!/bin/bash
cd /path/to/directory
ls grad.* >/tmp/files.lis     # grad.* matches only the originals, not the grad_high/grad_low sub-files
while read fname
do
    ext=${fname##*.}
    hi=grad_high.${ext}
    lo=grad_low.${ext}
    # pass 1: compute the average of column 9
    avg=$(awk '{sum+=$9; rows++} END {printf("%f", sum/rows)}' "$fname")
    # pass 2: rows above the average go to the high file, the rest to the low file
    awk -v avg="$avg" -v hi="$hi" -v lo="$lo" '$9 > avg {print $0 > hi; next} {print $0 > lo}' "$fname"
done </tmp/files.lis
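For comparison, the single-pass, memory-hog alternative mentioned above would buffer every line of a file until its average is known. A rough sketch, not something I'd recommend for huge files:

#!/bin/bash
cd /path/to/directory
for fname in grad.*
do
    ext=${fname##*.}
    # hold every line of the file in memory, then split once the average is known
    awk -v hi="grad_high.${ext}" -v lo="grad_low.${ext}" '
        { line[NR] = $0; val[NR] = $9; sum += $9 }
        END {
            avg = sum / NR
            for (i = 1; i <= NR; i++)
                print line[i] > (val[i] > avg ? hi : lo)
        }' "$fname"
done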

Thanks Jim, this should work well. I will run the script on a cluster, so memory is not a problem. Would it be possible to omit the first three lines (the ones with #) in each file from the average calculation? It seems to make a tiny difference in the final result.
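If it helps, a /^#/ guard pattern in each awk call should do it. A sketch of the two awk lines from the script above with the # lines skipped (this assumes the header lines should also be left out of the sub-files; drop the second guard if they should be copied through):

    # pass 1: average of column 9, skipping the three '#' header lines
    avg=$(awk '/^#/ {next} {sum+=$9; rows++} END {printf("%f", sum/rows)}' "$fname")
    # pass 2: split only the data lines; '#' lines go to neither sub-file
    awk -v avg="$avg" -v hi="$hi" -v lo="$lo" '
        /^#/     {next}
        $9 > avg {print $0 > hi; next}
                 {print $0 > lo}' "$fname"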