Count unique words

Dear all,

I would like to know how to list and count the unique words in several thousand text files.

Please help me out
thanks in advance

What have you tried?

Also, due to the vague nature of this request, it appears that this may be homework/classwork. If it is, there are specific rules relative to schoolwork.


Dear Joeyg

I have a list of thousands of text files like

3_March_2013_Front19.txt
10_May_2014_Page326.txt
5_October_2013_Sports36.txt
27_September_2010_Health314.txt
19_December_2012_Page316.txt
31_October_2012_Entertainment1094.txt
15_April_2013_Front14.txt
1_March_2013_Science&Technology33.txt
6_March_2012_MuslimWorld2.txt
19_October_2012_MuslimWorld4.txt
7_February_2012_International312.txt
23_August_2012_Front8.txt
24_July_2012_National22.txt
25_September_2012_Front20.txt
3_October_2014_Page35.txt

So, I would like to count the total number of words and the number of unique words across all files, grouped by the fourth field of the filename.

e.g.

if(filename==National)
count total and unique words

if(filename==International)
count total and unique words

if(filename==Health)
count total and unique words

and so on...

Please help me

How about something along these lines?

for FN in *.txt
  do    TMP=${FN##*_}
        TMP=${TMP%%[0-9]*}
        echo "if(filename==$TMP)"
        echo count total and unique words
        echo
  done

Output:

if(filename==Page)
count total and unique words

if(filename==Front)
count total and unique words

.
.
.

Please note that your pseudocode is nowhere near real code that does what you seem to describe.


Actually, Sir, I would like to count the words in all those files whose filenames contain National, Page, International, Health, Entertainment, etc.

So - how would you do that? And how would you handle the results?

(Apologies for any typos.)
RudiC has already given you a starter for this. Assume 'FN' holds the name of an Entertainment text file:-

FN='31_October_2012_Entertainment1094.txt'
TMP=${FN##*_}
TMP=${TMP%%[0-9]*}

This leaves the result, Entertainment, in the TMP variable.

So your logic would require a count for each file containing 'Entertainment'.
Similarly for the others.

So what would your logic be to obtain your count(s) per category?
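As a starting point only, here is a minimal sketch of one way to group the files by category before looking inside them. The function names category_of and files_per_category are made up for illustration, and the sketch assumes every filename follows the day_month_year_CategoryNNN.txt pattern shown earlier in the thread.

```shell
# Hypothetical helper: print the category part of a filename,
# e.g. 31_October_2012_Entertainment1094.txt -> Entertainment
category_of() {
    tmp=${1##*_}                      # drop everything up to the last underscore
    printf '%s\n' "${tmp%%[0-9]*}"    # drop everything from the first digit onwards
}

# Hypothetical helper: count how many .txt files in directory $1
# (default: current directory) fall into each category.
files_per_category() {
    for fn in "${1:-.}"/*.txt; do
        [ -e "$fn" ] || continue      # skip the unexpanded pattern if nothing matches
        category_of "${fn##*/}"       # strip the directory part first
    done | sort | uniq -c
}

# Usage (not run here):
#   files_per_category /path/to/textfiles
```

This only counts files per category; counting the words inside them is a separate step.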

You are here to learn how to do it for yourself and the best way is to attempt something no matter how bad your code looks. We are not here to ridicule your attempts but to correct your logic so that you understand what is going on and become capable of doing it again if need be.
If it is JUST the filenames you want then this will _perhaps_ help:-
ls *.txt > /your/path/to/filenames which will create a single text file with your thousands of filenames ONLY inside it.
grep is your friend here.

However if you intend to read EACH individual file to count these words also, then this is a totally different _animal_.
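If reading every file is what is wanted, the following is a rough sketch of one possible approach, not a definitive answer. It assumes a word is any run of ASCII letters compared case-insensitively (so "don't" counts as two words), that the category is derived from the filename as above, and that the list of files fits on one command line. The function name count_by_category is hypothetical.

```shell
# Hypothetical helper: for the files given as arguments, print one line per
# category in the form "category total_words unique_words".
count_by_category() {
    awk '
        FNR == 1 {                        # first line of each new file
            grp = FILENAME
            sub(/.*\//, "", grp)          # strip any directory part
            sub(/.*_/, "", grp)           # strip up to the last underscore
            sub(/[0-9].*/, "", grp)       # strip from the first digit onwards
        }
        {
            $0 = tolower($0)
            gsub(/[^a-z]+/, " ")          # anything that is not a letter separates words
            for (i = 1; i <= NF; i++) {
                total[grp]++
                if (!seen[grp, $i]++) unique[grp]++
            }
        }
        END { for (g in total) print g, total[g], unique[g] }
    ' "$@" | sort
}

# Usage (not run here):
#   count_by_category *.txt
```

Everything is held in awk arrays, so memory use grows with the number of distinct words per category.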

tr -cs A-Za-z\' '\n' | tr A-Z a-z | sort | uniq -c | sort -k1,1nr -k2 | sed ${1:-100}q

You might try this one. It is not my creation, but it worked for my purposes: it prints the hundred most frequent words in its input. You can change the value 100 to any other number.
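The pipeline above reads standard input and takes the limit from the script's first argument. As a sketch of how it might be used, it can be wrapped in a function; the name top_words and its argument order are my own choice here, not part of the original.

```shell
# Hypothetical wrapper: top_words N [FILE...]
# Prints the N most frequent words found in the files (or in stdin if no files).
top_words() {
    n=$1; shift
    cat -- "$@" |
        tr -cs "A-Za-z'" '\n' |    # one word per line; letters and apostrophes only
        tr 'A-Z' 'a-z' |           # fold to lower case
        sort | uniq -c |           # count identical words
        sort -k1,1nr -k2 |         # highest count first, ties in alphabetical order
        sed "${n}q"                # stop after N lines
}

# Usage (not run here):
#   top_words 100 *Health*.txt
```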

Not clear to me -

Do you want to read the files and count the total unique words in the text inside the files?
Do you want word frequencies like this?

1033 cow
999   the
998   family

If those files are large consider doing something else while this code runs.

awk ' {
       $0=tolower($0)
       for (i=1; i<=NF; i++) {arr[$(i)]++}
       }
       END {for (i in arr) {print arr[i], i  }}
     '  *National*  *International* *Health*  > wordcount.txt
     # The sort below gives the 500 most common words, most frequent last.
     # Work with the output file to get what you want.
     sort -k1n wordcount.txt | tail -500

Some of the replies answer a different question, it seems to me. So I am not sure if this is what you want.

This is a shell example that lists the files in my $HOME whose names contain 'Sc', counts how many of those filenames contain 'Scope', and then counts the lines containing 'Scope' inside each individual file...
OSX 10.12.3, default terminal calling 'sh'...

Last login: Mon Feb 27 20:01:48 on ttys000
AMIGA:barrywalker~> cd Desktop/Code/Shell
AMIGA:barrywalker~/Desktop/Code/Shell> cat search.sh
#!/bin/sh
# search.sh $1
ls "$HOME"/*Sc* > /tmp/listing
echo "Do the 'grep -c \"$1\" /tmp/listing' file for $1..."
grep -c "$1" /tmp/listing
echo "Now do the same for each individual file."
while read -r line
do
	echo "Inside file $line."
	grep -c "$1" "$line"
done < /tmp/listing
AMIGA:barrywalker~/Desktop/Code/Shell> 
AMIGA:barrywalker~/Desktop/Code/Shell> 
AMIGA:barrywalker~/Desktop/Code/Shell> ./search.sh Scope
Do the 'grep -c "Scope" /tmp/listing' file for Scope...
3
Now do the same for each individual file.
Inside file /Users/barrywalker/AudioScope.Manual.
62
Inside file /Users/barrywalker/AudioScope.config.
0
Inside file /Users/barrywalker/AudioScope.sh.
103
AMIGA:barrywalker~/Desktop/Code/Shell> _