Awk: count unique elements in a field and sum their occurence across the entire file

Hi,

Sure it's an easy one, but it drives me insane.

input ("|" separated):

1|A,B,C,A
2|A,D,D
3|A,B,B

I would like to count the occurence of each capital letters in $2 across the entire file, knowing that duplicates in each record count as 1.
I am trying to get this output (tab-separated; does not matter if sorted or not by $1):

A   3
B   2
C   1
D   1

What I tried so far:

gawk '
BEGIN{FS="|";OFS="\t"}
{
   a=split($2,b,",")
   for(i=1;i<=a;i++){
      if(count[b]++==0){
         total[b]++
      }
   }
}
END{
   for(k in total){
      print k "\t" total[k]
   }
}

But I get:

A   1
B   1
C   1
D   1

There is a conflict between the array 'count' (to count the letter only once per field per record) and the array 'total' (to sum up the number of letter in the file).

For my convenience, I rearranged / -idented your code, and added the delete count statement:

awk -F"|" -vOFS="\t" '

        {a = split ($2, b, ",")
         for (i=1; i<=a; i++)   if (!count[b]++)     total[b]++
         delete count
        }

END     {for (k in total)       print k "\t" total[k]
        }
' file
A    3
B    2
C    1
D    1

Looks promising!

1 Like

I did not know the 'delete' function. It's great !

I did not know neither the form '!count[b[i]]++' instead of 'count[b[i]]++==0'

Thanks !

You are using gawk so it was quite safe to assume it provides delete . Other awk versions don't necessarily, use split ("", ARR) then to clear array ARR .
awk and other (not all!) languages use integers to represent logical values: 0 and FALSE are interchangeable, as are 1 and TRUE (in fact, any non-null value will be treated as TRUE). Some people don't regard this "trick" justifiable...

1 Like
awk '
{for (i=2; i<=NF; i++) if (! line_fields[$i, NR]++) field_count[$i]++ }
END { for (i in field_count) print i "\t" field_count }
' FS="[|,]" input

Very useful information. Thanks !