Match first two columns and average third from multiple files

ncwxpanther · January 30, 2015, 3:56pm

I have the following format of input from multiple files

File 1

 24.01  -81.01    1.0
 24.02  -81.02    5.0
 24.03  -81.03    0.0

File 2

 24.01  -81.01    2.0
 24.02  -81.02   -5.0
 24.03  -81.03   10.0

I need to scan through the files and when the first 2 columns match I need to print out the first two columns and the average of the third.

Expected output:

 24.01  -81.01    1.5
 24.02  -81.02    0.0
 24.03  -81.03    5.0

I was using this code but I am not getting the expected output (data are wrong).

awk -F, '{array[$1 $2]+=$3} END { for (i in array) {print i array}}' input.file

I think I need to remove the -F, (since there are no commas) and divide the array by number of records (NR).

Any help is appreciated

jim_mcnamara · January 30, 2015, 4:22pm

awk '{array[$1 $2]+=$3; a[$1 $2]++} END { for (i in array) {print i, array/(a)}}'  file1 file2 > newfile

Try that for a start.

senhia83 · January 30, 2015, 4:27pm

Try

 awk '{arr[$1,$2]   += $3
        count[$1,$2] += 1
    }
    END{
        for (a in arr) {
            print  a  arr[a] / count[a]
        }
    }
' file*

ncwxpanther · February 1, 2015, 8:37am

The results are as expected, but the format is not. What is the command to preserve the original format.

Scrutinizer · February 1, 2015, 9:03am

Try this modification:

Any awk:

awk '{v=$NF; sub(/ +[^ ]*$/,x); A[$0]+=v; C[$0]++} END{for(i in A) printf "%s%7.1f\n",i,A/C}' file1 file2

GNU awk:

gawk '{A[$1$2]+=$3; C[$1$2]++} END{for(i in A) printf "%s%7.1f\n",i,A/C}' FIELDWIDTHS="6 8 9" file1 file2

ncwxpanther · February 1, 2015, 9:19am

[quote=scrutinizer;302933668]
Try this modification:

Any awk:

awk '{v=$NF; sub(/ +[^ ]*$/,x); A[$0]+=v; C[$0]++} END{for(i in A) printf "%s%7.1f\n",i,A/C}' file1 file2

Great! The first awk statement worked.

Now Im looking for the correct way to sort the first column (ascending).

Scrutinizer · February 1, 2015, 9:27am

Try running it through sort :

awk ... | sort -n

ncwxpanther · February 2, 2015, 8:13am

I was hoping to modify your script to subtract a field (3rd column) in File A from another field (3rd column) in File B, but its not working as I had thought. I am not familiar enough with Awk to make it work.

Any help is appreciated!

File A

24.01  -81.01    1.0
 24.02  -81.02    5.0
 24.03  -81.03    0.0

File B

24.01  -81.01    2.0
 24.02  -81.02    3.0
 24.03  -81.03    0.0

Result Output

24.01  -81.01   -1.0
 24.02  -81.02   2.0
 24.03  -81.03   0.0

RudiC · February 2, 2015, 9:50am

If you have just two files, and if you want to subtract file2's values from file1's,and if you accept some format change, this might help you:

awk 'FNR==NR {A[$1,$2]=$NF; next} {A[$1,$2]-=$NF} END{for(i in A) printf "%s%7.1f\n",i,A}' SUBSEP=" " file1 file2
24.01 -81.01   -1.0
24.02 -81.02    2.0
24.03 -81.03    0.0

ncwxpanther · February 4, 2015, 10:32am

Thanks to all for the help.

I have 1 final question.

I need to perform the following calculation on column 3.

awk '{$3 = ($3*9)/5; print}'

Input

46.0625 -99.1042    4.9
35.4792 -79.6042   -0.6
42.5208 -71.4792   -1.7

Expected Output

46.0625 -99.1042  8.8
35.4792 -79.6042 -1.1
42.5208 -71.4792 -3.1

I need to preserve the 4 sig digs in columns 1 and 2, but round column 3 to 1 sig dig.

I am close with this code but it wraps the 3rd column to the next line.

awk '{print $1, $2; printf("%.1f\n"), $3 = ($3 * 9)/5}'

RudiC · February 4, 2015, 10:45am

Take advantage of awk 's string/number dualism, and use the %s format specifier:

awk '{printf "%s %s %.1f\n", $1, $2,  ($3 * 9)/5}' file
46.0625 -99.1042 8.8
35.4792 -79.6042 -1.1
42.5208 -71.4792 -3.1

ncwxpanther · February 12, 2015, 3:38pm

I now have the need to match the first two columns and when they match, calculate the percentage of average for the third columns. I cant seem to borrow from the same awk script and output the expected result.

percent of average = diff/avg * 100

Ill need to ignore/remove the minus/negative symbols if their are any in the output file.

Diff File

46.0625 -99.1042 8.8
35.4792 -79.6042 -2.1
42.5208 -71.4792 -3.1

Avg File

46.0625 -99.1042    4.9
35.4792 -79.6042   3.3
42.5208 -71.4792   10.0

expected output

46.0625 -99.1042    180
35.4792 -79.6042     63
42.5208 -71.4792     31

Any suggestions?

RudiC · February 12, 2015, 4:24pm

Please open a new thread for a new request!

Try

awk 'NR==FNR {T[$1,$2]=$3; next} $1,$2 in T {P=T[$1,$2]/$3*100; printf "%s %s %.0f\n", $1, $2, (P>=0)?P:-P}' file1 file2
46.0625 -99.1042 180
35.4792 -79.6042 64
42.5208 -71.4792 31

ncwxpanther · February 13, 2015, 10:25am

This has been resolved. Thanks for the help RudiC!

ncwxpanther · March 5, 2015, 11:17am

[quote=scrutinizer;302933668]
Try this modification:

Any awk:

awk '{v=$NF; sub(/ +[^ ]*$/,x); A[$0]+=v; C[$0]++} END{for(i in A) printf "%s%7.1f\n",i,A/C}' file1 file2

Turns out that the above code did not work for me, at least in the way that I am using it.

I have the following script set up:

foreach numb (`seq 1 100`)
awk '{array[$1 $2]+=$3; a[$1 $2]++} END { for (i in array) {print i, array/(a)}}' /dir/of/data/${numb}02.pnt > output.txt
end

I expect the code to search through each of the 100 files and when the first 2 columns match, average the 3rd.

Input File 102:

43.1042	-78.6042	-6.07
25.2708	-81.1458	19.27

Input File 202:

43.1042	-78.6042	-4.88
25.2708	-81.1458	17.76

Output:

43.1042	-78.6042   -5.5
25.2708	-81.1458	18.52

Any suggestions are appreciated!

Scrutinizer · March 5, 2015, 11:31am

Hi, try this modification. This should work better:

awk '{v=$NF; sub(/[ \t]+[^ \t]*$/,x); A[$0]+=v; C[$0]++} END{for(i in A) printf "%s%7.1f\n",i,A/C}' file1 file2

---

ncwxpanther:

[..]
Turns out that the above code did not work for me, at least in the way that I am using it.

I have the following script set up:
foreach numb (`seq 1 100`)
awk '{array[$1 $2]+=$3; a[$1 $2]++} END { for (i in array) {print i, array/(a)}}' /dir/of/data/${numb}02.pnt > output.txt
end

Looking at you method, i have the following remarks:

why use csh?
like this awk will be called 100 times and the file output will get overwritten 100 times.
the result will be wrong 100 times, because the script needs to read all 100 files for it to work properly.

With bash you could try:

cd /dir/of/data
awk '.....' "${numb}"{1..100} > output.txt

or, if that gives line line length problems, try:

cd /dir/of/data
for i in "${numb}"{1..100}
do
  cat "$i"
done |
awk '.....' > output.txt

ncwxpanther · March 5, 2015, 11:49am

Thanks for replying Scrutinizer, but im not getting the expected results.

For instance this is the data for 1 match:

-8.44
-7.86
-3.28
-9.16
-8.71
-4.76
-7.75
-6.83
-2.45
-6.42
-4.17
-7.73
-6.32
-9.33
-3.07
-7.76
-8.43
-6.08
-2.55
-7.19
-2.09
-2.9
-7.4
-7.18
-2.24
-6.29
-2.02
-4.44
-5.88
-2.36
-3.06
-1.64
-3
-12.19
-5.57
-8.51
-2.61
-2.32
-3.99
-5.02
-5.36
-6.93
-3.64
-4.52
-4
-4.92
-6.34
-4.88
-1.27
-4.88
-3.33
-2.31
-1.44
-0.48
-3.76
-2.84
-2.01
-7.14
-5.36
-3.62
-3.01
-6.29
-8.16
-5.08
-4.38
-4.44
-6.72
-7.19
-3.94
-4.85
-3.72
-6.18
-6.05
-6.09
-2.58
-0.59
-4.65
-9.03
-10.09
-6.33
-0.75
-5.53
-2.15
0.21
-4.23
-4.79
-4.61
-4.57
-5.4
-1.98
-1.18
-2.83
-7.05
-6.07
-6.02
-4.88
-1.35
0.5
-0.86
-1.55

This is the expected result:

-4.72

This is what your code produced

-1.6

Any ideas?

Scrutinizer · March 5, 2015, 11:50am

I had made an edit to my post, please have a look

ncwxpanther · March 5, 2015, 12:04pm

This worked.

for numb in "${elem}"/{1..100}02.pnt 
do
cat "$numb"
done |
awk '{v=$NF; sub(/[ \t]+[^ \t]*$/,x); A[$0]+=v; C[$0]++} END{for(i in A) printf "%s%7.1f\n",i,A/C}' > output