awk to calculate fields only if match is found

Trying to combine the matching $5 values between file1 and file2 . If a match is found then the last $6 value in the match and the sum of $7 are outputted to a new file. The awk below I hope is a good start. Thank you :).

file1

chr12    9221325    9221448    chr12:9221325-9221448    A2M    1    125
chr12    9221325    9221448    chr12:9221325-9221448    A2M    2    125
chr12    9221800    9221999    chr12:9221800-9221999    A2M1    1    120
chr12    9221800    9221999    chr12:9221800-9221999    A2M1    2    120
chr12    9221800    9221999    chr12:9221800-9221999    A2M1    3    130

file2

chr12    9221325    9221448    chr12:9221325-9221448    A2M    1    120
chr12    9221325    9221448    chr12:9221325-9221448    A2M    2    120
chr12    9221800    9221999    chr12:9221800-9221999    A2M1    1    125
chr12    9221800    9221999    chr12:9221800-9221999    A2M1    2    125
chr12    9221800    9221999    chr12:9221800-9221999    A2M1    3    135

awk tried with error

awk '{sum[$4]+=$6; count[$7]++} 
>     END{for(k in sum) printf "%s %.1f\n",  k, sum[k]/count[k]}' file1 file2 > out
awk: cmd. line:2: (FILENAME=file2 FNR=5) fatal: division by zero attempted

desired output

chr12    9221325    9221448    chr12:9221325-9221448    A2M    2    250
chr12    9221800    9221999    chr12:9221800-9221999     A2M1    3    385 

Hello cmccabe,

Could you please try following and let me know if this helps.
1st: If you need to have exact count like file2 of 2nd last field(where I am assuming like it could be anything/any number, though your Input_file shows only in increment order or each occurrence of the field 4th). So always it will print the latest value of the 4th field's 2nd last field here, without taking care what it is.

awk 'FNR==NR{A[$4];next} ($4 in A){Q=$NF;P=$(NF-1);$(NF-1)=$NF="";E[$4]=$0 OFS P OFS (W[$4]+=Q)} END{for(i in E){print E}}' file1 file2

Output will be as follows.

chr12 9221325 9221448 chr12:9221325-9221448 A2M   2 240
chr12 9221800 9221999 chr12:9221800-9221999 A2M1   3 385

2nd: In case your 2nd last field shows the number of occurrences of 4th field, then following could help you.

awk 'FNR==NR{A[$4];next} ($4 in A){Q=$NF;$(NF-1)=$NF="";++S[$4];E[$4]=$0 OFS S[$4] OFS (W[$4]+=Q)} END{for(i in E){print E}}' file1 file2

Output will be as follows.

chr12 9221325 9221448 chr12:9221325-9221448 A2M   2 240
chr12 9221800 9221999 chr12:9221800-9221999 A2M1   3 385

Now coming on to the confusion which I have by reading your post, if you see carefully you have shown Output_file's last field(which is SUM of the last fields), first line it shows 250 so it looks you are taking sum from file1 BUT on same time second line shows 385 which shows SUM should be from file2. So above solutions are taking SUM of values in file2 not in file1. Please try above ones and let me know how it goes then.

Thanks,
R. Singh

1 Like

The first awk is always the case. The only thing that may change is the output (depending on the case). How can I print only $5, $6, $7 ? Thank you for your help, iworks great :).

A2M   2 250
A2M1   3 370

Maybe something like:

awk '
FNR == NR {
	a[$5]
	next
}
$5 in a {
	last6[$5] = $6
	sum7[$5] += $7
}
END {	for(i in last6)
		print i, last6, sum7
}' file[12]

will do what you want, producing the output:

A2M1 3 385
A2M 2 240

(Note that doing it this way the order of the output lines is random.)

And, as always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .

1 Like

Hello cmccabe,

A little change in my previous code, could you please try following and let me know if this helps.

awk 'FNR==NR{A[$4];next} ($4 in A){Q=$NF;P=$(NF-1);$(NF-1)=$NF="";E[$4]=$5 OFS P OFS (W[$4]+=Q)} END{for(i in E){print E}}' file1 file2

Output will be as follows.

A2M 2 240
A2M1 3 385

Thanks,
R. Singh

1 Like

Thank you both very much, works great :).