Awk: count unique elements of an array

Hi,

tab-separated input:

blabla_1 A,B,C,C
blabla_2 A,E,G
blabla_3 R,Q,A,B,C,R,Q

output:

blabla_1 3
blabla_2 3
blabla_3 5

After splitting $2 into an array, I am trying to store the number of unique elements in a variable, but I am having trouble resetting the count to 0 before processing a new record.

I tried several variants of the following code, but it only works for the first record (every later record also takes into account the elements seen on the previous lines).

gawk -F '\t' '
{
   VAR=0

   a=split($2,b,",")
   
   for(i=1; i<=a; i++){
      if(!c[b[i]]++){
         VAR+=1
      }
   }
   
   print $1 "\t" VAR
}' input.tab

Returns:

blabla_1 3
blabla_2 2
blabla_3 2

Maybe something more like:

gawk '
{	VAR=0
	split($2,b,",")
	for(i in b)
		if(!(b[i] in s)) {
			VAR++
			s[b[i]]
		}
	print $1 "\t" VAR
	for(i in s)
		delete s[i]
}' input.tab

I don't have gawk on my system, but it works with a standard awk (/usr/xpg4/bin/awk or nawk on a Solaris system; awk on most other systems). With gawk and some other versions of awk you should be able to replace:

	for(i in s)
		delete s[i]

with:

	delete s

but the standards don't yet require this to work in all conforming versions of awk.
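
If you want to empty the array in a single statement without relying on that extension, splitting an empty string into it should do the job, since split() removes all existing elements of the target array first. A minimal sketch, assuming the tab-separated input from the first post; the wrapper name count_unique and the array names seen and parts are made up for this illustration:

```shell
# count_unique is a hypothetical wrapper name, used only for this illustration
count_unique() {
    awk '
    {
        split("", seen)               # empty the array before each record
        n = split($2, parts, ",")
        uniq = 0
        for (i = 1; i <= n; i++)
            if (!(parts[i] in seen)) {
                seen[parts[i]]        # referencing the element creates it
                uniq++
            }
        print $1 "\t" uniq
    }' "$@"
}

printf 'blabla_1\tA,B,C,C\nblabla_2\tA,E,G\nblabla_3\tR,Q,A,B,C,R,Q\n' | count_unique
```

Called with a file name (count_unique input.tab) it reads that file; with no arguments it reads standard input, as in the last line.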

Slightly different approach:

awk -F"[ ,]" '
        {for (i=2; i<=NF; i++) {if (!T[$i]++) C++}
         print $1, C
         C = 0
         split (_,T)
        }
' file
blabla_1 3
blabla_2 3
blabla_3 5

Slightly different approach still:

awk '{split(x,C); n=split($2,F,/,/); for(i in F) if(C[F[i]]++) n--; print $1, n}' file

--
more concise:

awk '{split(x,C); n=split($2,F,/,/); for(i in F) n-=(C[F[i]]++>0); $2=n}1' file

@Scrutinizer: VEEERY interesting approach! Brilliant! At least the first one. The second will count wrongly if an element occurs more than twice: C[F[i]] will deduct 1 for the first duplicate, 2 for the third occurrence, etc. Might not be what was required?
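
To make the miscount concrete, here is a small sketch that runs the bare decrement and the compared decrement side by side on a record whose only element occurs three times. The test record, the wrapper name compare_decrements, and the array names C1 and C2 are made up for this illustration and are not from the posts above:

```shell
# compare_decrements is a hypothetical wrapper name, used only for this illustration
compare_decrements() {
    awk '{
        n = split($2, F, ",")
        raw = n; fixed = n
        for (i = 1; i <= n; i++) {
            raw   -= C1[F[i]]++          # deducts the previous count: 0, then 1, then 2
            fixed -= (C2[F[i]]++ > 0)    # deducts at most 1 per repeat: 0, then 1, then 1
        }
        print raw, fixed
    }'
}

printf 'x\tA,A,A\n' | compare_decrements    # prints "0 1"
```

With three occurrences of A the bare decrement ends at 0, while the compared form gives the expected 1 unique element.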

Hi RudiC, you were a bit too fast :) I had forgotten the parentheses and the comparison, which I have corrected in the meantime.
--edit--
Actually, the parentheses are not needed.
--edit--
Putting them back in to avoid ambiguity.

Thanks, guys!
Everything works great!