counts the number of distinct words

Net-Man · August 17, 2008, 8:29pm

I'm looking to write a sample shell script that counts the number of distinct words in a text file given as an argument.
Remark: White space characters are spaces, tabs, form feeds, and new lines.

Using ONLY these commands: tr, sort, grep, wc.

Thanks.

Annihilannic · August 17, 2008, 8:49pm

It sounds like you already know which commands you need to use. Where are you stuck? Have you read the man pages for those commands?

Net-Man · August 17, 2008, 9:00pm

Actually yes, but I don't know how to use them to get the distinct words.

Please, if you can, just guide me on how to do it. Thank you.

Annihilannic · August 17, 2008, 9:10pm

Difficult to "guide" you how to do it without just telling you how to do it.... and that way you won't learn anything!

Try using tr to strip out all punctuation (see the -d option), then using tr again to convert all spaces to newlines and all upper-case characters to lower-case. Then you can sort the output using the unique option (see the man page) so that you end up with only distinct words, and then count the number of lines produced using wc.
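For reference, the steps described above might be sketched roughly like this. The function name count_distinct is made up for illustration, and this is only one possible reading of the description, not necessarily what the poster had in mind.

```shell
# Hypothetical helper: count_distinct FILE
# Follows the described steps: strip punctuation, fold case,
# put one word per line, drop duplicates, count lines.
count_distinct() {
    tr -d '[:punct:]' < "$1" |       # delete punctuation characters
    tr '[:upper:]' '[:lower:]' |     # convert upper-case to lower-case
    tr -s '[:space:]' '\n' |         # whitespace runs become single newlines
    sort -u |                        # keep one copy of each word
    wc -l                            # count the remaining lines
}
```

It would be called as count_distinct test.txt. If the file starts with whitespace, the first line of the tr output is empty and is counted as a word, so treat the result as approximate for such input.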

summer_cherry · August 17, 2008, 9:37pm

perl:

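$file=shift;
# open the file named on the command line
open(FH,"<",$file) or die "Cannot open $file: $!\n";
while(<FH>){
	# split(" ",...) splits on runs of whitespace
	@arr=split(" ",$_);
	for($i=0;$i<=$#arr;$i++){
		$hash{$arr[$i]}++;
	}
}
close(FH);
# print each distinct word with the number of times it occurs
for $key (keys %hash){
	print $key,"--->",$hash{$key},"\n";
}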

vidyadhar85 · August 17, 2008, 10:15pm

Just pass those arguments to grep -c

scriptname abc efg
grep -c "$1" filename   # this gives the number of lines in filename that contain abc
Similarly, do the same for the remaining arguments...
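Note that grep -c counts matching lines, not individual occurrences. A small sketch of the difference, using made-up sample text and variable names:

```shell
# Made-up sample: "abc" occurs three times, spread over two lines.
sample=$(printf 'abc abc\nabc def\n')

# grep -c reports how many lines contain the pattern.
lines=$(printf '%s\n' "$sample" | grep -c 'abc')

# Putting one word per line first, then matching whole lines with -x,
# counts each occurrence of the word instead.
hits=$(printf '%s\n' "$sample" | tr -s '[:space:]' '\n' | grep -c -x 'abc')

echo "matching lines: $lines, occurrences: $hits"
```

Here lines is 2 and hits is 3.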

Annihilannic · August 17, 2008, 10:22pm

I don't think you understood the question vidyadhar85.

vidyadhar85 · August 17, 2008, 10:30pm

Can you explain, please?

Annihilannic · August 17, 2008, 10:42pm

He wants to count the number of "distinct" (i.e. unique) words... e.g. if the words were "Blah yak blah blah yak rhubarb" the answer would be 3 ("blah yak rhubarb").
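That example can be checked with a short pipeline. This is only a sketch, and the variable name distinct is chosen for illustration:

```shell
# Count distinct words in the example sentence, ignoring case.
distinct=$(printf 'Blah yak blah blah yak rhubarb\n' |
    tr '[:upper:]' '[:lower:]' |     # "Blah" becomes "blah"
    tr -s '[:space:]' '\n' |         # one word per line
    sort -u |                        # blah, rhubarb, yak
    wc -l)

echo "$distinct"
```

Without the case-folding step, "Blah" and "blah" would be treated as two different words.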

jim_mcnamara · August 18, 2008, 5:55am

tr -s '[:space:]' '\n' < infile | sort -u | wc -l

The way the question is posed it looks like homework. But I cannot be sure. Don't post homework questions, please.

redoubtable · August 18, 2008, 6:17am

With that limitation (grep, tr, sort, wc) it definitely looks like homework.

If you could use xargs:

xargs -a FILENAME -n 1|sort -u|wc -l

Net-Man · August 18, 2008, 1:16pm

My friends, this is not homework; it's just a challenge between my friends.
Anyway, I tried to solve it:

grep -c | sort -u test.txt | tr -d "\t \v \f [unct:] [:upper:] "

But actually I don't know how to order these commands, so can anybody correct it for me?

Thanks to everyone who replied to me.

era · August 18, 2008, 1:27pm

What's the grep for? It doesn't do anything without at least a regular expression to search for, and usually also a file name.

Jim's solution already solved your problem; did you really not try the solutions posted here?

# Change any whitespace into a newline
tr '\t\v\f ' '\n' <test.txt |
# sort, deleting any repeated occurrences of the same word
sort -u |
# count how many we have
wc -l
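One caveat, assuming GNU tr behaviour: the pipeline above turns every whitespace character into its own newline, so runs of whitespace or blank lines in the input leave empty lines, and sort -u keeps one of them as an extra "word". A possible variant that uses grep (from the allowed list) to drop empty lines is sketched here; the function name count_words_no_blank is hypothetical.

```shell
# Hypothetical helper: count_words_no_blank FILE
count_words_no_blank() {
    tr '[:space:]' '\n' < "$1" |     # every whitespace character becomes a newline
    grep -v '^$' |                   # discard the empty lines this creates
    sort -u |                        # keep one copy of each word
    wc -l                            # count how many are left
}
```

This gives grep a job in the pipeline: -v inverts the match and '^$' matches only empty lines.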

This is very much a staple of introductory Unix text books; I'd recommend that you finish the first chapter before you accept any more challenges.

amsct · August 18, 2008, 1:30pm

Something like this:

for word in $(cat "$filename")
do
echo "$word"|tr -d '[:punct:]'|tr '[:upper:]' '[:lower:]'
done|sort |uniq -c

Just like Annihilannic suggested.

sudhamacs · August 18, 2008, 2:26pm

echo `cat file` | tr '[:lower:]' '[:upper:]' | tr '\t\v\f ' '\n' | sort -u | wc -l