Figure out the number of different values a certain row takes

grossgermany · December 10, 2008, 1:00pm

Hi,

If we are concerned with each column, this is usually very easy using the wc command in the shell. However, my problem is:
I have a file with 8 rows but 100 columns.
1. I want to find the number of different values a certain "row" takes.
2. The values are in fact strings, for example "john", "david", "steve".
3. Sometimes the value is empty. When it is empty, I want to ignore it completely when counting the number of different values a certain row takes.

Seems awk is the most efficient tool.

Thanks,
john

radoulov · December 10, 2008, 2:32pm

Could you post a small sample (two rows for example) of your data?
The solution will depend on your definition of empty (,, or ,"",).

You may consider something like this with Perl:

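perl -F, -lane'
  %_ = ();                          # start each row with an empty set
  map { $_{$_} = 1 if /\w/ } @F;    # record every non-empty field
  print scalar keys %_              # number of distinct fields in this row
  ' infile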

grossgermany · December 10, 2008, 4:29pm

class1 john,alax,alex,wong, , ,paul,john
class2 mary,adel,yvone,mary, ,iona,wayne

so class1 has 4 different names
class2 has 5 different names

I have no experience with Perl.
Would you please do it using awk or Python?

joeyg · December 10, 2008, 5:06pm

> cat file105
class1 john,alax,alex,wong, , ,paul,john 
class2 mary,adel,yvone,mary, ,iona,wayne 

> cat manip105.sh
while read mydata
   do
   myclass=$(echo "$mydata" | cut -d" " -f1)
   mynames=$(echo "$mydata" | cut -d" " -f2-)
   uniqnames=$(echo "$mynames" | tr "," "\n" | sort -u | grep "^[A-Za-z]" | wc -l)
   echo "${myclass} has ${uniqnames} unique names"
done< file105

> manip105.sh 
class1 has 5 unique names
class2 has 5 unique names
>

Note that class1 truly has five unique names: john, alax, alex, wong and paul.

radoulov · December 11, 2008, 4:36am

AWK:

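awk -F, '{
  # class label: the first field up to the first space
  cl = $1; sub(/ .*/, "", cl)
  # drop the label from the record; awk re-splits the fields
  sub(/[^ ]* /,"")
  # count each non-blank field the first time it is seen
  while (++i <= NF) 
    $i ~ /^ *$/ || _[$i]++ || c++
  printf "%s has %d unique names\n",
    cl, c
  # reset the counters and the seen-array for the next line
  i = c = 0; split("",_)
  }' infile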

Attempt with Python:

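#!/usr/bin/env python3

import fileinput
import re

# Matches fields that are empty or contain only whitespace.
p = re.compile(r'^\s*$')

for l in fileinput.input():
    l = l.rstrip().split(',')
    # The first field looks like "class1 john": split the label from the name.
    cl, l[0] = l[0].split()[:2]
    # A set keeps each distinct field once; blank fields are skipped.
    u = set(r for r in l if not p.match(r))
    print(cl, "has", len(u), "unique names")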

Use scriptname infile to execute.
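
A whitespace-tolerant variant, as a minimal sketch only: the solutions above compare fields as raw strings, so a name followed by a space (as on the last field of each line in file105) could be counted separately from the same name without one. The helper name count_unique_names is made up for illustration, and the sample lines are copied from file105, trailing spaces included. It assumes the label and the names are separated by a single space.

```python
def count_unique_names(line):
    """Return (label, count) for a line like 'class1 john,alax, ,john'."""
    # Everything before the first space is the class label.
    label, _, rest = line.strip().partition(" ")
    # Strip each field so "john " and "john" count as the same name,
    # then discard fields that are empty after stripping.
    names = {field.strip() for field in rest.split(",")} - {""}
    return label, len(names)


if __name__ == "__main__":
    sample = [
        "class1 john,alax,alex,wong, , ,paul,john ",
        "class2 mary,adel,yvone,mary, ,iona,wayne ",
    ]
    for line in sample:
        label, count = count_unique_names(line)
        print(f"{label} has {count} unique names")
```

On the two sample lines this should report five unique names for each class, which agrees with the shell script's output above.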