Awk: count unique elements of an array

beca123456 · January 19, 2017, 7:11pm

Hi,

tab-separated input:

blabla_1 A,B,C,C
blabla_2 A,E,G
blabla_3 R,Q,A,B,C,R,Q

output:

blabla_1 3
blabla_2 3
blabla_3 5

After splitting $2 into an array, I am trying to store the number of unique elements in a variable, but I am having difficulty resetting the variable to 0 before processing a new record.

I tried several variants of the following code, but it only works for the first record (every later record also takes into account the elements already seen on previous lines).

gawk -F '\t' '
{
   VAR=0

   a=split($2,b,",")

   for(i=1; i<=a; i++){
      if(!c[b[i]]++){
         VAR+=1
      }
   }

   print $1 "\t" VAR
}' input.tab

Returns:

blabla_1 3
blabla_2 2
blabla_3 2

Don_Cragun · January 19, 2017, 7:43pm

Maybe something more like:

gawk '
{	VAR=0
	split($2,b,",")
	for(i in b)
		if(!(b[i] in s)) {
			VAR++
			s[b[i]]
		}
	print $1 "\t" VAR
	for(i in s)
		delete s[i]
}' input.tab

I don't have gawk on my system, but it works with a standard awk (/usr/xpg4/bin/awk or nawk on a Solaris system; awk on most other systems). With gawk and some other versions of awk you should be able to replace:

	for(i in s)
		delete s[i]

with:

	delete s

but the standards don't yet require this to work in all conforming versions of awk.
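For reference, a minimal sketch of the per-record reset discussed above. It empties the array with split("", seen), an idiom that should work in any POSIX awk, instead of a delete loop. The function name count_unique and the array names seen and item are made up for this illustration, and the input is assumed to be tab-separated as in the original question.

```shell
# Hypothetical helper: prints "<key><TAB><number of distinct items in column 2>"
count_unique() {
    awk -F '\t' '
    {
        split("", seen)                # empty the array before each record
        n = split($2, item, ",")       # n = number of comma-separated items
        uniq = 0
        for (i = 1; i <= n; i++)
            if (!(item[i] in seen)) {  # first time this value appears on this line
                seen[item[i]]          # referencing the element creates it
                uniq++
            }
        print $1 "\t" uniq
    }' "$@"
}

printf 'blabla_1\tA,B,C,C\nblabla_2\tA,E,G\nblabla_3\tR,Q,A,B,C,R,Q\n' | count_unique
```

Called without arguments the function reads standard input; it can also be given file names, e.g. count_unique input.tab.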

RudiC · January 20, 2017, 4:41am

Slightly different approach:

awk -F'[ \t,]' '
        {for (i=2; i<=NF; i++) {if (!T[$i]++) C++}
         print $1, C
         C = 0
         split (_,T)            # empty T for the next record
        }
' file
blabla_1 3
blabla_2 3
blabla_3 5

Scrutinizer · January 20, 2017, 5:22am

Slightly different approach still:

awk '{split(x,C); n=split($2,F,/,/); for(i in F) if(C[F[i]]++) n--; print $1, n}' file

--
more concise:

awk '{split(x,C); n=split($2,F,/,/); for(i in F) n-=(C[F[i]]++>0); $2=n}1' file

RudiC · January 20, 2017, 5:29am

@Scrutinizer: VEEERY interesting approach! Brilliant! At least the first one. The second will count wrongly if an element occurs more than twice: C[F[i]]++ will deduct 1 for the first duplicate, 2 for the third occurrence, etc. Might not be what was required?
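To illustrate the point, here is a small sketch comparing the two ways of decrementing on a line where one element occurs three times. The function names dec_by_count and dec_by_flag are invented for this example; the first is only a guess at what the uncorrected one-liner looked like.

```shell
# Subtracts the running count itself: over-deducts from the third occurrence on
dec_by_count() {
    awk '{split("", C); n = split($2, F, /,/); for (i in F) n -= C[F[i]]++; print $1, n}'
}

# Subtracts 1 for every repeat, however often the element has been seen
dec_by_flag() {
    awk '{split("", C); n = split($2, F, /,/); for (i in F) n -= (C[F[i]]++ > 0); print $1, n}'
}

printf 'x\tA,A,A,B\n' | dec_by_count   # x 1  (4 items, minus 0+1+2 for the three As)
printf 'x\tA,A,A,B\n' | dec_by_flag    # x 2  (4 items, minus 0+1+1; A and B remain)
```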

Scrutinizer · January 20, 2017, 5:31am

Hi RudiC, you were a bit too fast. I forgot the parentheses and the comparison, which I have corrected in the meantime.
--edit--
actually the parentheses are not needed..
--edit--
putting them back in to avoid ambiguity...
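A quick way to check the remark about the parentheses: assignment operators bind more loosely than comparisons in awk, so both spellings should give the same result. The function names below are hypothetical and the values are arbitrary.

```shell
# With c = 2 the comparison (c++ > 0) is true, so 1 is subtracted from n.
# Had the statement been parsed as (n -= c++) > 0, n would end up as 2 instead.
with_parens()    { awk 'BEGIN { n = 4; c = 2; n -= (c++ > 0); print n }'; }
without_parens() { awk 'BEGIN { n = 4; c = 2; n -= c++ > 0; print n }'; }

with_parens      # 3
without_parens   # 3
```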

beca123456 · January 23, 2017, 1:03pm

Thanks, guys!
Everything works great!