awk to print array that occurs the most with matching value in another field

cmccabe · June 15, 2017, 10:30am

In the awk below I split $7 on the `:` and count how often each NM_xxxx value occurs. For lines that share the same $1 value, I want to print the $7 value that occurs most often for that $1. The awk seems close, but I am not sure what is going on. I have also included a description of what I think each part does. Thank you :).

awk

awk -F'[\t:]' '{count[$1 "\t" $7]++} END {for (word in count) print word, count[word]}' file

description

awk -F'[\t:]'   ---- set FS to a regex, so fields are split on a tab or a `:`
'{count[$1 "\t" $7]++}   ---- for every line, count each `$1` and `$7` pair in the array count
END {for (word in count)   ---- after the last line, loop over every key (word) stored in count
print word, count[word]}'   ---- print the key (`$1` and `$7`) and its count (I only printed count[word] to confirm, it is not needed)

file

A2M 2   18171   33210   coding  na  NM_000014.5:c.2998A>G   c.2998A>G
A2M 2   18172   33211   coding  na  NM_000014.5:c.2915G>A   c.2915G>A
A2M 2   18173   33212   coding  na  NM_000014.4:c.2125+1_2126-1del  c.2125+1_2126-1del
A2M 2   18174   33213   coding  na  NM_000014.5:c.2111G>A   c.2111G>A
A2M 2   402328  390084  coding  na  NM_000014.5:c.2126-6_2126-2delCCATA
A4GALT  53947   2692    17731   coding  na  NM_017436.5:c.548T>A    c.548T>A
A4GALT  53947   2693    17732   coding  na  NM_017436.5:c.752C>T    c.752C>T
A4GALT  53947   2694    17733   coding  na  NM_017436.6:c.783G>A    c.783G>A
A4GALT  53947   2695    17734   coding  na  NM_017436.6:c.560G>A    c.560G>A
A4GALT  53947   2696    17735   coding  na  NM_017436.6:c.240_242delCTT
A4GALT  53947   2697    17736   coding  na  NM_017436.6:c.1029dupC  c.1029dupC
A4GALT  53947   39437   48036   coding  na  NM_017436.6:c.631C>G    c.631C>G

current output

A2M	NM_000014.4 1
A2M	NM_000014.5 4
	 3
A4GALT	NM_017436.5 2
A4GALT	NM_017436.6 5

desired output

A2M NM_000014.5
A4GALT NM_017436.6

rdrtx1 · June 15, 2017, 12:03pm

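awk '
# FS splits on tab, "." and ":", so $7 is the accession (NM_000014) and $8 its version (5)
# c[$7 ":" $8] counts each accession/version pair; c[$7] holds the highest count seen so far
{if (++c[$7 ":" $8] > c[$7]) {c[$7]=c[$7 ":" $8] ; o[$7]=$1 " " $7 "." $8}}
END {
   for (i in o) print o[i];
}
' FS="[\t.:]" infile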
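
For comparison, a minimal sketch of the same idea keyed on $1 instead of on the accession, which follows the wording of the question more literally. It assumes the input is tab-delimited, and `top_transcript` is only a hypothetical helper name used for illustration. The `NF >= 7` guard is also an assumption: it skips blank or short lines, which may be what produced the empty key with a count of 3 in the current output above.

```shell
#!/bin/sh
# Sketch: for each $1 (gene), print the accession.version from $7 seen most often.
# top_transcript is a hypothetical helper name; input is assumed to be tab-delimited.
top_transcript() {
    awk -F'\t' '
    NF >= 7 {                          # skip blank or short lines
        split($7, part, ":")           # part[1] = accession.version, e.g. NM_000014.5
        n = ++count[$1, part[1]]       # occurrences of this gene/transcript pair
        if (n > max[$1]) {             # strictly more than the current leader
            max[$1]  = n
            best[$1] = part[1]
        }
        if (!($1 in seen)) {           # remember the order genes first appear in
            seen[$1] = 1
            order[++genes] = $1
        }
    }
    END {
        for (g = 1; g <= genes; g++)
            print order[g], best[order[g]]
    }' "$@"
}
```

Called as `top_transcript file`, this prints one line per $1 value in the order the values first appear. On a tie, the transcript that reached the highest count first is kept, because the comparison is a strict `>`.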