Bash/awk script problem

dsp80 · March 28, 2015, 4:46pm

Hi,

I have 100 files, each containing different values in a single column.
I want to split those files into two separate files (file2 and file3) based on the average value of the first column of each file.

For those files I am working on the following script:

#bin/bash 
for memb in $(seq 1 100)
do
 awk '{s+=($1); avg = s/NR} END{print avg}' file1_$memb.dat
   
 if [ avg > 0.28 ]; then
      paste file1_$memb > file2.dat
   else
      paste file1_$memb.dat > file3.dat
    fi                             
 done

Basically, I want to divide the 100 files based on their average value and paste the column of each file into the corresponding output file.

Thanks,

sea · March 28, 2015, 7:04pm

Since bash cannot handle floating-point numbers, I'd let awk do the comparison with 0.28 and print a 0 if the average is greater and a 1 if not, capturing that output in a variable.
Then compare that variable with 0 or 1 in your if block.
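
A rough sketch of that idea, in case it helps. The temporary directory and the sample values are made up here purely so the snippet runs on its own; in the real script the awk call would read file1_$memb.dat inside the loop.

```shell
#!/bin/bash
# Hypothetical sample data (not from the original post).
tmpdir=$(mktemp -d)
printf '0.10\n0.50\n0.60\n' > "$tmpdir/file1_1.dat"   # average is 0.40

# awk does the floating-point comparison and prints 0 if the
# average is above 0.28, and 1 otherwise (also for an empty file).
flag=$(awk '{ s += $1 }
            END { if (NR > 0 && s / NR > 0.28) print 0; else print 1 }' \
       "$tmpdir/file1_1.dat")

# bash now only has to compare integers.
if [ "$flag" -eq 0 ]; then
    result="above"
else
    result="not above"
fi
echo "$result"
rm -rf "$tmpdir"
```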

hth

mjf · March 28, 2015, 7:33pm

Couple of comments:

Output to file2.dat and file3.dat should be appended (use redirection symbol >>).
File being pasted if condition is true is missing the .dat file extension.
Variable avg in awk statement is not available from outside in IF condition. You can do something like this:

var_avg=$(awk ........)

If you want to compare percentage .28 then you should divide by 100 in awk.
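
To illustrate the var_avg suggestion, here is a minimal sketch. The sample file and the target variable are assumptions added for the example; the comparison is handed back to awk because the shell's [ ] test does not compare decimal numbers.

```shell
#!/bin/bash
# Hypothetical sample data (not from the original post).
tmpdir=$(mktemp -d)
printf '0.10\n0.20\n0.30\n' > "$tmpdir/file1_1.dat"   # average is about 0.20

# Command substitution captures whatever awk prints.
var_avg=$(awk '{ s += $1 } END { if (NR) print s / NR }' "$tmpdir/file1_1.dat")

# Pass the value back to awk with -v; exit status 0 means
# "average is greater than 0.28", so it can drive the if directly.
if awk -v avg="$var_avg" 'BEGIN { exit !(avg > 0.28) }'; then
    target=file2.dat
else
    target=file3.dat
fi
echo "$var_avg -> $target"
rm -rf "$tmpdir"
```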

dsp80 · March 28, 2015, 9:02pm

Hi Sea,

Can you please elaborate?

Don_Cragun · March 28, 2015, 9:14pm

In addition to what sea and mjf have already noted, even if you do append the output of paste (rather than overwriting it), you still won't get what you want. The paste utility doesn't paste the input file onto the ends of lines it reads from the output file; it pastes lines from one or more input files to create lines in the output file.

You could run paste once for each input file in turn to paste it at the end of the appropriate output file into a temp file and move the temp file back to the appropriate output file. But running awk 100 times and running paste 100 times seems highly inefficient to me. I'd suggest something more like:
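
A small demonstration of the difference, using two made-up two-line files (a.dat and b.dat are hypothetical names, created in a temporary directory):

```shell
#!/bin/bash
# Hypothetical sample files, only for this demonstration.
tmpdir=$(mktemp -d)
printf '1\n2\n' > "$tmpdir/a.dat"
printf '3\n4\n' > "$tmpdir/b.dat"

# Appending with >> stacks the second file underneath the first:
# still one column, now four lines.
paste "$tmpdir/a.dat"  > "$tmpdir/stacked.dat"
paste "$tmpdir/b.dat" >> "$tmpdir/stacked.dat"

# Naming both inputs in a single paste call puts them side by side:
# two tab-separated columns, two lines.
paste "$tmpdir/a.dat" "$tmpdir/b.dat" > "$tmpdir/columns.dat"

stacked=$(cat "$tmpdir/stacked.dat")
columns=$(cat "$tmpdir/columns.dat")
printf 'appended:\n%s\n\nside by side:\n%s\n' "$stacked" "$columns"
rm -rf "$tmpdir"
```

So to get one column per input file, every input file for a given output has to be named in the same paste invocation.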

#!/bin/bash
rm -f  file2.dat file3.dat
awk '
BEGIN {	outf[1] = "file2"
	outf[0] = "file3"
}
function choose(infile) {
	if(ofc[of = (s / c) > .28]++ == 0)
		printf("paste \"%s\"", infile) > (outf[of] ".sh")
	else	printf(" \"%s\"", infile) >> (outf[of] ".sh")
	c = s = 0
}
FNR == 1{
	if(NR > 1)
		choose(fn)
	fn = FILENAME
}
{	c++
	s += $1
}
END {	choose(fn)
	for(i = 0; i <= 1; i++)
		if(ofc[i]) {
			printf(" > %s.dat\n", outf[i]) >> (outf[i] ".sh")
			printf("chmod +x %s.sh;./%s.sh;echo rm %s.sh\n",
				outf[i], outf[i], outf[i])
		}
}' file1_[1-9].dat file1_[1-9][0-9].dat file1_100.dat | bash

which runs awk once, paste once or twice, and chmod once or twice, and bash one extra time to run the commands produced by this awk script.

This code creates one or two temporary shell scripts to paste the appropriate input files together into your final output files, runs those scripts, and uses echo to remind you to remove those temporary shell scripts. If it does what you want, remove the echo from the "echo rm %s.sh" part of the printf format in the awk script to actually remove the temporary scripts after they have been run.

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk.