Awk: count unique elements in a field and sum their occurence across the entire file

beca123456 · April 18, 2018, 6:06am

Hi,

Sure it's an easy one, but it drives me insane.

input ("|" separated):

1|A,B,C,A
2|A,D,D
3|A,B,B

I would like to count the occurence of each capital letters in $2 across the entire file, knowing that duplicates in each record count as 1.
I am trying to get this output (tab-separated; does not matter if sorted or not by $1):

What I tried so far:

gawk '
BEGIN{FS="|";OFS="\t"}
{
   a=split($2,b,",")
   for(i=1;i<=a;i++){
      if(count[b]++==0){
         total[b]++
      }
   }
}
END{
   for(k in total){
      print k "\t" total[k]
   }
}

But I get:

There is a conflict between the array 'count' (to count the letter only once per field per record) and the array 'total' (to sum up the number of letter in the file).

RudiC · April 18, 2018, 6:23am

For my convenience, I rearranged / -idented your code, and added the delete count statement:

awk -F"|" -vOFS="\t" '

        {a = split ($2, b, ",")
         for (i=1; i<=a; i++)   if (!count[b]++)     total[b]++
         delete count
        }

END     {for (k in total)       print k "\t" total[k]
        }
' file
A    3
B    2
C    1
D    1

Looks promising!

beca123456 · April 18, 2018, 6:33am

I did not know the 'delete' function. It's great !

I did not know neither the form '!count[b[i]]++' instead of 'count[b[i]]++==0'

Thanks !

RudiC · April 18, 2018, 6:49am

You are using gawk so it was quite safe to assume it provides delete . Other awk versions don't necessarily, use split ("", ARR) then to clear array ARR .
awk and other (not all!) languages use integers to represent logical values: 0 and FALSE are interchangeable, as are 1 and TRUE (in fact, any non-null value will be treated as TRUE). Some people don't regard this "trick" justifiable...

rdrtx1 · April 18, 2018, 8:27am

awk '
{for (i=2; i<=NF; i++) if (! line_fields[$i, NR]++) field_count[$i]++ }
END { for (i in field_count) print i "\t" field_count }
' FS="[|,]" input

beca123456 · April 18, 2018, 9:34am

Very useful information. Thanks !