Match first two columns and average third from multiple files

I have the following format of input from multiple files

File 1

 24.01  -81.01    1.0
 24.02  -81.02    5.0
 24.03  -81.03    0.0

File 2

 24.01  -81.01    2.0
 24.02  -81.02   -5.0
 24.03  -81.03   10.0

I need to scan through the files and when the first 2 columns match I need to print out the first two columns and the average of the third.

Expected output:

 24.01  -81.01    1.5
 24.02  -81.02    0.0
 24.03  -81.03    5.0

I was using this code but I am not getting the expected output (data are wrong).

awk -F, '{array[$1 $2]+=$3} END { for (i in array) {print i array}}' input.file

I think I need to remove the -F, (since there are no commas) and divide the array by number of records (NR).

Any help is appreciated

awk '{array[$1 $2]+=$3; a[$1 $2]++} END { for (i in array) {print i, array/(a)}}'  file1 file2 > newfile

Try that for a start.

1 Like

Try

 awk '{arr[$1,$2]   += $3
        count[$1,$2] += 1
    }
    END{
        for (a in arr) {
            print  a  arr[a] / count[a]
        }
    }
' file*

The results are as expected, but the format is not. What is the command to preserve the original format.

Try this modification:

Any awk:

awk '{v=$NF; sub(/ +[^ ]*$/,x); A[$0]+=v; C[$0]++} END{for(i in A) printf "%s%7.1f\n",i,A/C}' file1 file2

GNU awk:

gawk '{A[$1$2]+=$3; C[$1$2]++} END{for(i in A) printf "%s%7.1f\n",i,A/C}' FIELDWIDTHS="6 8 9" file1 file2
1 Like

[quote=scrutinizer;302933668]
Try this modification:

Any awk:

awk '{v=$NF; sub(/ +[^ ]*$/,x); A[$0]+=v; C[$0]++} END{for(i in A) printf "%s%7.1f\n",i,A/C}' file1 file2

Great! The first awk statement worked.

Now Im looking for the correct way to sort the first column (ascending).

Try running it through sort :

awk ... | sort -n
1 Like

I was hoping to modify your script to subtract a field (3rd column) in File A from another field (3rd column) in File B, but its not working as I had thought. I am not familiar enough with Awk to make it work.

Any help is appreciated!

File A

24.01  -81.01    1.0
 24.02  -81.02    5.0
 24.03  -81.03    0.0

File B

24.01  -81.01    2.0
 24.02  -81.02    3.0
 24.03  -81.03    0.0

Result Output

24.01  -81.01   -1.0
 24.02  -81.02   2.0
 24.03  -81.03   0.0

If you have just two files, and if you want to subtract file2's values from file1's,and if you accept some format change, this might help you:

awk 'FNR==NR {A[$1,$2]=$NF; next} {A[$1,$2]-=$NF} END{for(i in A) printf "%s%7.1f\n",i,A}' SUBSEP=" " file1 file2
24.01 -81.01   -1.0
24.02 -81.02    2.0
24.03 -81.03    0.0
1 Like

Thanks to all for the help.

I have 1 final question.

I need to perform the following calculation on column 3.

awk '{$3 = ($3*9)/5; print}'

Input

46.0625 -99.1042    4.9
35.4792 -79.6042   -0.6
42.5208 -71.4792   -1.7

Expected Output

46.0625 -99.1042  8.8
35.4792 -79.6042 -1.1
42.5208 -71.4792 -3.1

I need to preserve the 4 sig digs in columns 1 and 2, but round column 3 to 1 sig dig.

I am close with this code but it wraps the 3rd column to the next line.

awk '{print $1, $2; printf("%.1f\n"), $3 = ($3 * 9)/5}'

Take advantage of awk 's string/number dualism, and use the %s format specifier:

awk '{printf "%s %s %.1f\n", $1, $2,  ($3 * 9)/5}' file
46.0625 -99.1042 8.8
35.4792 -79.6042 -1.1
42.5208 -71.4792 -3.1
1 Like

I now have the need to match the first two columns and when they match, calculate the percentage of average for the third columns. I cant seem to borrow from the same awk script and output the expected result.

percent of average = diff/avg * 100

Ill need to ignore/remove the minus/negative symbols if their are any in the output file.

Diff File

46.0625 -99.1042 8.8
35.4792 -79.6042 -2.1
42.5208 -71.4792 -3.1

Avg File

46.0625 -99.1042    4.9
35.4792 -79.6042   3.3
42.5208 -71.4792   10.0

expected output

46.0625 -99.1042    180
35.4792 -79.6042     63
42.5208 -71.4792     31

Any suggestions?

Please open a new thread for a new request!

Try

awk 'NR==FNR {T[$1,$2]=$3; next} $1,$2 in T {P=T[$1,$2]/$3*100; printf "%s %s %.0f\n", $1, $2, (P>=0)?P:-P}' file1 file2
46.0625 -99.1042 180
35.4792 -79.6042 64
42.5208 -71.4792 31
1 Like

This has been resolved. Thanks for the help RudiC!

[quote=scrutinizer;302933668]
Try this modification:

Any awk:

awk '{v=$NF; sub(/ +[^ ]*$/,x); A[$0]+=v; C[$0]++} END{for(i in A) printf "%s%7.1f\n",i,A/C}' file1 file2

Turns out that the above code did not work for me, at least in the way that I am using it.

I have the following script set up:

foreach numb (`seq 1 100`)
awk '{array[$1 $2]+=$3; a[$1 $2]++} END { for (i in array) {print i, array/(a)}}' /dir/of/data/${numb}02.pnt > output.txt
end

I expect the code to search through each of the 100 files and when the first 2 columns match, average the 3rd.

Input File 102:

43.1042	-78.6042	-6.07
25.2708	-81.1458	19.27

Input File 202:

43.1042	-78.6042	-4.88
25.2708	-81.1458	17.76

Output:

43.1042	-78.6042   -5.5
25.2708	-81.1458	18.52

Any suggestions are appreciated!

Hi, try this modification. This should work better:

awk '{v=$NF; sub(/[ \t]+[^ \t]*$/,x); A[$0]+=v; C[$0]++} END{for(i in A) printf "%s%7.1f\n",i,A/C}' file1 file2

---

Looking at you method, i have the following remarks:

  • why use csh?
  • like this awk will be called 100 times and the file output will get overwritten 100 times.
  • the result will be wrong 100 times, because the script needs to read all 100 files for it to work properly.

With bash you could try:

cd /dir/of/data
awk '.....' "${numb}"{1..100} > output.txt

or, if that gives line line length problems, try:

cd /dir/of/data
for i in "${numb}"{1..100}
do
  cat "$i"
done |
awk '.....' > output.txt
1 Like

Thanks for replying Scrutinizer, but im not getting the expected results.

For instance this is the data for 1 match:

-8.44
-7.86
-3.28
-9.16
-8.71
-4.76
-7.75
-6.83
-2.45
-6.42
-4.17
-7.73
-6.32
-9.33
-3.07
-7.76
-8.43
-6.08
-2.55
-7.19
-2.09
-2.9
-7.4
-7.18
-2.24
-6.29
-2.02
-4.44
-5.88
-2.36
-3.06
-1.64
-3
-12.19
-5.57
-8.51
-2.61
-2.32
-3.99
-5.02
-5.36
-6.93
-3.64
-4.52
-4
-4.92
-6.34
-4.88
-1.27
-4.88
-3.33
-2.31
-1.44
-0.48
-3.76
-2.84
-2.01
-7.14
-5.36
-3.62
-3.01
-6.29
-8.16
-5.08
-4.38
-4.44
-6.72
-7.19
-3.94
-4.85
-3.72
-6.18
-6.05
-6.09
-2.58
-0.59
-4.65
-9.03
-10.09
-6.33
-0.75
-5.53
-2.15
0.21
-4.23
-4.79
-4.61
-4.57
-5.4
-1.98
-1.18
-2.83
-7.05
-6.07
-6.02
-4.88
-1.35
0.5
-0.86
-1.55

This is the expected result:

-4.72

This is what your code produced

-1.6

Any ideas?

I had made an edit to my post, please have a look

This worked.

for numb in "${elem}"/{1..100}02.pnt 
do
cat "$numb"
done |
awk '{v=$NF; sub(/[ \t]+[^ \t]*$/,x); A[$0]+=v; C[$0]++} END{for(i in A) printf "%s%7.1f\n",i,A/C}' > output