Question about wc

yodadbl07 · May 8, 2006, 3:19am

Hey, my friend was asking me if I knew a way to count how many different words are in a file. I told him no, not offhand, but I was thinking about it and started to wonder too. I imagine this is probably pretty simple and I'm just missing something; I keep confusing myself over how you would compare words and filter out the ones that appear twice or more. If anyone knows of a way to do this, I'd like to know.

Thanks

gauravgoel · May 8, 2006, 5:43am

This may sound weird, but give it a try:

1) Get all the words into a single column. For example, say your file is

a b df sd ff d

make it
a
b
df
sd
ff
.
.

2) Sort the output of step 1.
3) Use the uniq command on the output of step 2.
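
The three steps above can be sketched as a single pipeline. This is only a minimal sketch in plain sh: the function name count_distinct_words and the sample input are made up for illustration, and it treats any run of whitespace as a word separator without touching punctuation or case.

```shell
# count_distinct_words is a hypothetical helper name, not a standard command.
# It reads text on standard input and prints the number of distinct words.
count_distinct_words() {
    # Step 1: turn every run of whitespace into one newline (one word per line);
    #         sed then drops the empty line that leading whitespace would leave.
    # Step 2: sort, so identical words end up on adjacent lines.
    # Step 3: uniq removes adjacent duplicates, and wc -l counts what remains.
    tr -s '[:space:]' '[\n*]' | sed '/^$/d' | sort | uniq | wc -l
}

# Hypothetical sample input: 9 words, 6 of them distinct (a b d df ff sd).
printf 'a b df sd ff d\nb a a\n' | count_distinct_words
```

The sort step matters because uniq only collapses duplicates that sit on adjacent lines.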

yodadbl07 · May 8, 2006, 5:58am

Oh I see, I didn't know about the uniq command. Thanks a lot!

gauravgoel · May 8, 2006, 6:02am

Post back if you get something; I would like to see if it really works out.
Theoretically it should work fine.

mrn · May 8, 2006, 6:09am

sort -u filename|wc

gauravgoel · May 8, 2006, 6:18am

I don't think this will work.

mrn · May 8, 2006, 6:30am

Works on AIX / Solaris / HP & Redhat

gauravgoel · May 8, 2006, 6:38am

I know about the sort command with the -u option, and that it works.

What I meant is that the command you gave is not a complete solution to the question posted above. But that is just what I think; you may be right as well.

This is what I got from your command

but as per my understanding the output should have been 10 and not 12

Gaurav
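
To illustrate the point: sort -u removes duplicate lines, not duplicate words, and a bare wc prints three numbers (lines, words, bytes). The sketch below uses a made-up three-line sample and two hypothetical helper names; it is not the file or output from this thread.

```shell
# Hypothetical sample: three lines, two of them identical.
sample='the cat sat
the dog sat
the cat sat'

# Both helper names below are made up for this illustration.
count_unique_lines() {
    # sort -u compares whole lines, so this counts distinct lines.
    sort -u | wc -l
}
count_distinct_words() {
    # Split into one word per line first, then let sort -u compare words.
    tr -s '[:space:]' '[\n*]' | sed '/^$/d' | sort -u | wc -l
}

printf '%s\n' "$sample" | count_unique_lines    # 2 unique lines
printf '%s\n' "$sample" | count_distinct_words  # 4 distinct words: cat dog sat the
```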

yodadbl07 · May 10, 2006, 3:43am

OK, I was working with the first idea. Here is what I have so far. For some reason it is not working; the command itself works, but I think there is a problem with the input.

#!/bin/csh

echo "Please enter a filename: "
set filename = $<

set dif = `tr -d '.:"$(),-' < $filename | tr '[A-Z]' '[a-z]' | tr ' ' '\n' | sort | uniq | wc -l`
set num = `wc -l`

echo "Thank you, your file has $num words and $dif different words."

echo " "

Maybe someone can catch it.

gauravgoel · May 10, 2006, 3:46am

One problem at first glance:

Where is the filename, mate?

yodadbl07 · May 10, 2006, 3:47am

I want the user to be able to input the filename, then run the command on that filename.

gauravgoel · May 10, 2006, 3:49am

Got that, but in the script line that sets num you have forgotten to give the name of the file:

set num = `wc -l $filename`
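
Two details are worth spelling out here: wc with no file argument waits for standard input, and wc given a file name prints that name after the count. A minimal sketch in plain sh (not csh) that sidesteps both by redirecting the file into wc; count_words and the sample text are hypothetical, and -w is used because the script reports words rather than lines.

```shell
# count_words is a hypothetical helper name; $1 is the file to read.
count_words() {
    # Redirecting with < means wc never sees a file name, so it prints only the count.
    wc -w < "$1"
}

sample=$(mktemp)     # assumes mktemp is available, as on most Linux and BSD systems
printf 'one two three\nfour five\n' > "$sample"
count_words "$sample"      # 5 words
rm -f "$sample"
```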

yodadbl07 · May 10, 2006, 3:50am

oh yah duh!! lol

yodadbl07 · May 10, 2006, 4:01am

OK, this is the output I'm getting. Why is the echo statement messed up?
Please enter a filename:
testfile.txt
words.ou, your file has 239 words and 159

Here's the code:

#!/bin/csh

echo "Please enter a filename: "

set filename = $<

set num = ` wc -w $filename | awk '{ print $1 } ' `
set dif = `tr -d '.:"(),-' < $filename | tr '[A-Z]' '[a-z]' | tr ' ' '\n' | sort | uniq | wc -l`

echo "Thank you, your file has $num words and $dif different words."

echo " "

vino · May 10, 2006, 4:33am

So many tr's, pipes, combination of sort+uniq.

See if this works.

tr -cs '[:alnum:]' '[\n*]' < test.c | sort -u | wc -l

yodadbl07 · May 10, 2006, 4:49am

 libra% tr -cs '[:alnum:]' '[\n*]' < testfile.txt | sort -u | wc -l
     165

vs.

libra% tr -d '.:"(),-' < testfile.txt | tr '[A-Z]' '[a-z]' | tr ' ' '\n' | sort | uniq | wc -l
     159

What does the '[:alnum:]' do?

vino · May 10, 2006, 4:58am

'[:alnum:]' doesn't do anything by itself. Rather, it is a character class that matches letters and digits.

See info -f coreutils --index-search='Character classes'
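
As a rough illustration of how the class is used in that pipeline: -c complements the set, so tr acts on every character that is not a letter or digit, and -s squeezes each run of replacements into one newline. The pipeline as posted does not fold case, so "The" and "the" count separately. The sketch below adds case folding; the function name and sample text are made up, and this is not claimed to explain the exact 165 vs. 159 difference above.

```shell
# count_distinct_alnum_words is a hypothetical helper name.
# It reads text on standard input and prints the number of distinct words,
# where a "word" is a run of letters and digits, compared case-insensitively.
count_distinct_alnum_words() {
    # 1. tr -cs: replace every run of non-alphanumeric characters with one newline.
    # 2. tr upper/lower: fold case so "The" and "the" are the same word.
    # 3. sed: drop the empty line left when the input starts with a separator.
    # 4. sort -u | wc -l: count the distinct lines that remain.
    tr -cs '[:alnum:]' '[\n*]' |
        tr '[:upper:]' '[:lower:]' |
        sed '/^$/d' |
        sort -u |
        wc -l
}

# Hypothetical sample: the distinct words are cat, dog, house, the.
printf 'The cat, the DOG.\n(cat) dog-house\n' | count_distinct_alnum_words
```

Splitting on every non-alphanumeric character also breaks hyphenated words and contractions apart, so the count depends on what you decide a word is.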