Help extracting from multiple files and computing an average at the same time

Hi,

I have 20 files, each with 50 lines of different values.

For each of the 50 lines, I would like to take the average of the 3rd field ($3) across the 20 files, one line at a time, and write the results to an output file.

Instead of using join to generate a whole bunch of redundant intermediate files and then computing the average, I'm looking for a better way to do this directly.

E.g.:

apple.txt
tool1 2.00 4 30.20
tool2 3.00 5 40.22
tool3 2.00 6 45.32
....
tool50 ...........

orange.txt
tool1 1.00 2 30.20
tool2 4.00 3 40.22
tool3 6.00 4 45.32
...
tool50 ...

bar.txt
tool1 2.10 1 30.20
tool2 3.04 4 40.22
tool3 2.02 5 45.32
...
tool50 .....

and likewise for the remaining 17 files, each with a different name.

The output would be:
tool1 (4+2+1+....)/20
tool2 (5+3+4+...)/20
tool3 (6+4+5+...)/20
....
tool50....

Please advise. Thanks.

Please show proof of your effort.

I see you have been here for a while. How about showing us the code you tried that doesn't work?

Hi,

My initial idea was:

#!/bin/awk

filename={apple.txt,orange.txt,bar.txt................}

for file in filename;do

while(getline<"file"); do

val+=$3;
count++;

done

done
tada=val/count;

print $1"," tada>output.txt

That was my first intuition for the problem above, using awk.

Is that just pseudocode, or are you definitely writing an awk script? You seem to be mixing shell and awk syntax all over.

awk offers arrays for precisely this type of problem. Make each item a key of the array, accumulate its sum and count, then print the results at END. If all keys occur in all files, you can simply divide by the number of input files; otherwise you will need to collect both the sum and the count (the divisor) for each key.
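
For example, here is a minimal sketch of that array approach (assuming, as in the later replies, that all 20 files match *.txt; sum and cnt are illustrative names):

awk '{ sum[$1] += $3; cnt[$1]++ }                           # accumulate field 3 per tool name
     END { for (k in sum) print k, sum[k] / cnt[k] }' *.txt > output.txt

One caveat: the iteration order of for (k in sum) is unspecified, so sort the output afterwards if tool1..tool50 order matters.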

The code below should be OK. It assumes all files follow the same .txt naming convention;
no matter how many files there are, it can process them.

# count the input files (one per line of ls output)
sum=`ls -l *.txt | wc -l`
# paste joins line N of every file into one long line;
# with four columns per file, field 3 of file i sits at position 3+(i*4)
paste -d" " *.txt | nawk -v s="$sum" '{
	for(i=0;i<=s-1;i++)
	{
		t=3+(i*4)
		temp=temp+$t
	}
	print $1"---->"temp/s
	temp=0
}'

This might help you:

rm -f output.txt
count=1
while [ "$count" -le 50 ]
do
    grep_var="tool$count"
    # anchor the pattern so tool1 does not also match tool10..tool19;
    # the "file:" prefix grep adds sticks to field 1, so $3 is unaffected
    f_avg=`grep "^$grep_var " *.txt | awk '{ sum += $3 } END { print sum/NR }'`
    echo "$grep_var : $f_avg" >> output.txt
    count=`expr $count + 1`
done

Note:
The output will be written to the output.txt file.

The approach proposed by summer_cherry is rather clever, although you should note its assumptions: all the files must contain the same tools, in the same order, and they must all have exactly four columns.

Also, if there are many files, the expanded *.txt argument list can exceed the command-line length limit. Moreover, it's rather excessive to use ls -l when you don't care about the long format, only the number of files. And since you are pasting the files, the number of columns already tells you how many files there are, so you can calculate the file count in awk from that.

paste -d " " *.txt | nawk '{
  for (i=0; i < $NF; ++i) sum+=$(3+(i*4))
  print $1 " " sum/($NF*4)
}'