awk unique count of partial match with semi-colon

cmccabe · June 2, 2016, 3:04pm

I am trying to get the unique count of $5 in the input below, but if the text at the beginning of $5 is a partial match to $5 on another line in the file, then it should not be counted as unique.

awk

awk '!seen[$5]++ {n++} END {print n}' input
7

input

chr1    159174749    159174770    chr1:159174749-159174770    ACKR1
chr1    159175223    159176240    chr1:159175223-159176240    ACKR1
chr2    149225899    149228040    chr2:149225899-149228040    AK025127;MBD5
chr2    200213413    200213906    chr2:200213413-200213906    AK025127;SATB2
chr3    196050574    196050878    chr3:196050574-196050878    AK124973;TM4SF19;TM4SF19-TCTEX1D2
chr10    5042568    5042687    chr10:5042568-5042687    AKR1C2
chr10    5043696    5043883    chr10:5043696-5043883    AKR1C2
chr10    5043695    5043883    chr10:5043695-5043883    AKR1C2;AKR1C3

The desired output (the correct count) is 4, since $5 in lines 1 and 2 is the same, $5 in lines 3 and 4 is the same, and $5 in lines 6, 7, and 8 is the same. I can only seem to count each line, and the ; is causing problems that I cannot seem to fix. Thank you :).

RudiC · June 2, 2016, 3:20pm

$5 in lines 3 and 4 is NOT the same, nor is it in lines 6, 7, and 8; those lines only share the text before the first ";". Add split($5, T, ";") and then use T[1] as the key.

Scrutinizer · June 2, 2016, 4:00pm

Perhaps like so? Modifying your post:

awk '{split($5,F,/;/)} !seen[F[1]]++ {n++} END {print n}' file
4

-- edit --
Oh, RudiC already gave the exact same answer...
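
For anyone who wants to reproduce the result, here is a minimal self-contained sketch of the split-based count suggested in this thread. It assumes the sample data is whitespace-separated as posted. The temporary file variable `input_file` and the `NF >= 5` guard (which skips blank or short lines) are illustrative additions, not part of the original answers.

```shell
# Recreate the sample data from the question in a temporary file
# (input_file is a hypothetical name used only for this sketch).
input_file=$(mktemp)
cat > "$input_file" <<'EOF'
chr1    159174749    159174770    chr1:159174749-159174770    ACKR1
chr1    159175223    159176240    chr1:159175223-159176240    ACKR1
chr2    149225899    149228040    chr2:149225899-149228040    AK025127;MBD5
chr2    200213413    200213906    chr2:200213413-200213906    AK025127;SATB2
chr3    196050574    196050878    chr3:196050574-196050878    AK124973;TM4SF19;TM4SF19-TCTEX1D2
chr10    5042568    5042687    chr10:5042568-5042687    AKR1C2
chr10    5043696    5043883    chr10:5043696-5043883    AKR1C2
chr10    5043695    5043883    chr10:5043695-5043883    AKR1C2;AKR1C3
EOF

# Split $5 on ";" and count each distinct first element once.
# NF >= 5 is an assumed guard so that blank or short lines are not counted.
# n + 0 forces a numeric 0 to be printed if no line qualifies.
count=$(awk 'NF >= 5 { split($5, F, ";"); if (!seen[F[1]]++) n++ }
             END { print n + 0 }' "$input_file")

echo "$count"   # prints 4 for the sample data above

rm -f "$input_file"
```

The four keys counted here are ACKR1, AK025127, AK124973 and AKR1C2. Note that this keys on the text before the first ";" only, so it treats AKR1C2 and AKR1C2;AKR1C3 as the same, but it would not merge two names that merely share a prefix without a ";" boundary.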