In the `awk` below I am splitting `$7` on the `:` and then counting each line, or `NM_xxxx`. If the `$1` value is the same for each line, then I want to print the `$7` that occurs the most with the matching `$1` value. The `awk` seems close, but I am not sure what is going on. I have also included a description of what I think each part does. Thank you :).
awk
awk -F'[\t:]' '{count[$1 "\t" $7]++} END {for (word in count) print word, count[word]}' file
description
awk -F'[\t:]' ---- set FS to a regex so that fields are split on `\t` and on `:`
'{count[$1 "\t" $7]++} ---- count each `$1` and `$7` pair and store the count in the array count
END {for (word in count) ---- after all lines are read, loop over each key (word) in the array count
print word, count[word]} ---- print the key (`$1` and `$7`) and its count (I only printed count[word] to confirm, it is not needed)
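To check what `-F'[\t:]'` actually puts in each field, it can help to print the numbered fields of a single line. This is only a small sketch and assumes the input is tab-delimited; the variable names `line` and `fields` are made up for illustration.

```shell
# One row from the file, rebuilt with explicit tabs (assumption: the real file is tab-delimited).
line=$(printf 'A2M\t2\t18171\t33210\tcoding\tna\tNM_000014.5:c.2998A>G\tc.2998A>G')

# Split on tab or colon and print every field with its index.
fields=$(printf '%s\n' "$line" | awk -F'[\t:]' '{ for (i = 1; i <= NF; i++) print i ": " $i }')

printf '%s\n' "$fields"
```

On this row `$7` is `NM_000014.5` and the part after the colon lands in `$8`, so a key built from `$1` and `$7` counts transcripts per gene.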
file
A2M 2 18171 33210 coding na NM_000014.5:c.2998A>G c.2998A>G
A2M 2 18172 33211 coding na NM_000014.5:c.2915G>A c.2915G>A
A2M 2 18173 33212 coding na NM_000014.4:c.2125+1_2126-1del c.2125+1_2126-1del
A2M 2 18174 33213 coding na NM_000014.5:c.2111G>A c.2111G>A
A2M 2 402328 390084 coding na NM_000014.5:c.2126-6_2126-2delCCATA
A4GALT 53947 2692 17731 coding na NM_017436.5:c.548T>A c.548T>A
A4GALT 53947 2693 17732 coding na NM_017436.5:c.752C>T c.752C>T
A4GALT 53947 2694 17733 coding na NM_017436.6:c.783G>A c.783G>A
A4GALT 53947 2695 17734 coding na NM_017436.6:c.560G>A c.560G>A
A4GALT 53947 2696 17735 coding na NM_017436.6:c.240_242delCTT
A4GALT 53947 2697 17736 coding na NM_017436.6:c.1029dupC c.1029dupC
A4GALT 53947 39437 48036 coding na NM_017436.6:c.631C>G c.631C>G
current output
A2M NM_000014.4 1
A2M NM_000014.5 4
3
A4GALT NM_017436.5 2
A4GALT NM_017436.6 5
desired output
A2M NM_000014.5
A4GALT NM_017436.6
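One possible way to get from the current output to the desired output is to keep a running maximum per `$1` next to the counts. The following is a minimal sketch rather than a definitive answer. It assumes the file is tab-delimited, that blank or incomplete lines should be skipped (the lone `3` in the current output looks like empty lines being counted under an empty key, but that is a guess), and that a tie goes to whichever `$7` reached the top count first. The array names `max` and `best`, the variable `result`, and the file `sample.txt` are hypothetical.

```shell
# Small tab-delimited sample: a subset of the rows above plus one blank line.
{
  printf 'A2M\t2\t18171\t33210\tcoding\tna\tNM_000014.5:c.2998A>G\tc.2998A>G\n'
  printf 'A2M\t2\t18173\t33212\tcoding\tna\tNM_000014.4:c.2125+1_2126-1del\tc.2125+1_2126-1del\n'
  printf 'A2M\t2\t18174\t33213\tcoding\tna\tNM_000014.5:c.2111G>A\tc.2111G>A\n'
  printf '\n'
  printf 'A4GALT\t53947\t2692\t17731\tcoding\tna\tNM_017436.5:c.548T>A\tc.548T>A\n'
  printf 'A4GALT\t53947\t2694\t17733\tcoding\tna\tNM_017436.6:c.783G>A\tc.783G>A\n'
  printf 'A4GALT\t53947\t2695\t17734\tcoding\tna\tNM_017436.6:c.560G>A\tc.560G>A\n'
} > sample.txt

result=$(awk -F'[\t:]' '
    $1 == "" || $7 == "" { next }      # skip blank or incomplete lines
    {
        n = ++count[$1, $7]            # occurrences of this $1/$7 pair so far
        if (n > max[$1]) {             # this $7 is now the most frequent for $1
            max[$1]  = n
            best[$1] = $7
        }
    }
    END {
        for (gene in best) print gene "\t" best[gene]
    }
' sample.txt | sort)                   # for-in order is unspecified, so sort the result

printf '%s\n' "$result"
```

Tracking the maximum while reading avoids a second pass over the `count` array in `END`; the `sort` is only there because `awk` does not guarantee the order of a `for (key in array)` loop.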