After splitting $2 into an array, I am trying to store the number of unique elements in a variable, but I am having difficulty resetting the count to 0 before processing a new record.
I tried several variants of the following code, but it only works for the first record (every later record also counts the values seen on the previous lines).
gawk '
{ VAR=0
  split($2,b,",")
  for(i in b)
    if(!(b[i] in s)) {
      VAR++
      s[b[i]]}
  print $1 "\t" VAR
  for(i in s)
    delete s[i]
}' input.tab
I don't have gawk on my system, but it works with a standard awk (/usr/xpg4/bin/awk or nawk on a Solaris system; awk on most other systems). With gawk and some other versions of awk you should be able to replace:
for(i in s)
  delete s[i]
with:
delete s
but the standards don't yet require this to work in all conforming versions of awk.
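For awks where neither form is convenient, another commonly used idiom is to empty the array with split("", s), which is specified to remove all existing elements before splitting. The following is only a minimal sketch of that idea, not the code from this thread: the function name count_unique and the variable names n, seen and parts are made up for illustration, and it assumes the default field separator, as in the original command.

```shell
# Hypothetical helper: reads records on stdin and prints column 1 plus the
# number of distinct comma-separated values found in column 2.
count_unique() {
    awk '
    {
        n = 0
        split("", seen)               # empties the array before each record
        m = split($2, parts, ",")
        for (i = 1; i <= m; i++)
            if (!(parts[i] in seen)) {
                seen[parts[i]]        # referencing the element creates it
                n++
            }
        print $1 "\t" n
    }'
}

# Sample input invented for this example.
printf 'id1\ta,b,a,c\nid2\tb,b\n' | count_unique
```

With the sample input above this should print 3 for id1 and 1 for id2, since the array of seen values is cleared at the start of every record.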
@Scrutinizer: Very interesting approach! Brilliant! At least the first one. The second will miscount if a value occurs more than twice: C[F] will deduct 1 for the first duplicate, 2 for the third occurrence, etc. That might not be what was required?
Hi RudiC, you were a bit too fast. I forgot the parentheses and the comparison, which I have corrected in the meantime.
--edit--
Actually, the parentheses are not needed.
--edit--
Putting them back in to avoid ambiguity.