How to find the number of occurrences of a particular word in a text file?

sheela · May 8, 2014, 2:43am

example:

i have the following text file...
i am very tired.
i am busy
i am hungry

I have to find the number of occurrences of a particular word, 'am', in the text file. Can anyone give me a shell script for it?

clx · May 8, 2014, 2:49am

If you want +1 for each line

grep -c am file

If you want +1 for each occurrence

grep -o am file | wc -l #with GNU grep

sheela · May 8, 2014, 2:53am

I can't understand you. Can you explain it to me?

clx · May 8, 2014, 3:01am

Did you try either or both of the commands?

sheela · May 8, 2014, 3:15am

I have one doubt.
I have 17000 lines in which I have to find the number of occurrences of a particular word, 'another'. Is it possible to find it with the commands you specified?

clx · May 8, 2014, 3:25am

You could try the commands. Even if it doesn't work, it won't crash your system at least.
Anyway, I meant

for sample file :

i am very tired. am
i am busy
i am hungry

grep -o am file | wc -l
4

grep -c am file
3

Hope it's clear.

sheela · May 8, 2014, 3:29am

Thank you. I will try it on those 17000 lines and tell you whether I get the output.

protocomm · May 8, 2014, 5:01am

awk '{n=split($0,tab)}{for(i=1;i<=n;i++){if(tab[i]=="am") count++}};END{print count}' file

sheela · May 8, 2014, 5:13am

Is this a shell script or a simple command to find the occurrences?

protocomm · May 8, 2014, 5:36am

Yes, it is an awk one-liner that counts occurrences. The difference from clx's grep -c command is that if a line contains "am" several times, each one is counted.

RudiC · May 8, 2014, 5:45am

With awk , try also

awk '{n+=gsub(/am/,"&")}END{print n}' file
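A quick way to see what this counts, sketched on a throwaway file (the file name and contents are made up for illustration). gsub() returns the number of substitutions it made, and replacing each match with "&" (the matched text itself) leaves the line unchanged, so n accumulates the match count. Note that /am/ is a substring match, so "spam" is counted too. The n+0 is an addition here so that a file with no matches prints 0 rather than an empty line.

```shell
# Hypothetical sample file, created only for this demonstration.
tmp=$(mktemp)
printf '%s\n' 'i am very tired. am' 'i am busy' 'no spam here' > "$tmp"

# gsub() returns how many replacements it made on the current line.
count=$(awk '{n+=gsub(/am/,"&")}END{print n+0}' "$tmp")
echo "$count"    # 4: three standalone "am" plus the "am" inside "spam"

rm -f "$tmp"
```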

sheela · May 8, 2014, 5:54am

Thanks, all. I will try these and reply if I have any doubts.

rbatte1 · May 8, 2014, 7:53am

tr " " "\n" < input_file | grep -c search_word

You could add a -i flag to the grep if you want it case insensitive.

This will split the words on to separate lines. The search_word should really be explicit so you don't get false matches, e.g. looking for am, you could catch spam. Consider using "^am$" to show the beginning and end of the line. Some versions of grep allow you to specify a whole word match only, and that may help. What OS & version are you running with?

As an example:-

$ cat infile
I am not spam

$ tr " " "\n" < infile | grep -i "am"
am
spam

$ tr " " "\n" < infile | grep -i "^am$"
am
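If your grep supports the -w (whole word) option, as GNU grep does, the same check can probably be done without tr. This is only a sketch: the sample text is invented, and neither -o nor -w is required by POSIX, so check your grep's man page first.

```shell
# Invented sample line. -o prints each match on its own line, -w restricts
# matches to whole words, -i ignores case. -o and -w are GNU/BSD extensions.
words=$(printf 'I am not spam. Am I?\n' | grep -o -w -i 'am' | wc -l)
echo "$words"    # 2 with GNU tools: "am" and "Am"; "spam" is not matched
```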

Robin

sheela · May 12, 2014, 12:01am

Actually, it's CentOS I am working with. I am new to it, but I need to find the occurrences of the word "another" in a log file of 16000 lines.

rbatte1 · May 12, 2014, 5:48am

Give it a try with:-

tr " " "\n" < your_log_file_name | grep -i "^another$"

Does this get you what you need?

I suppose you might have to consider another. , another, , another! , etc. too.

If this is a worry, try:-

tr "[:punct:]" " " < your_log_file_name | grep -i "^another$"

The plan with this one is to translate all punctuation to spaces, then translate all spaces to a new-line, then use grep to count the records (one word each by now) that contain the string from the first to the last character only.

It might be worth testing it on a small section first and thinking of as many variations as you can.

Robin

alister · May 12, 2014, 10:02am

Your use of split() is redundant. AWK already split the line. The value that split() returns is the current value of NF. You can iterate through the fields using i<=NF and checking each $i.

Also, if there aren't any matches, count will be undefined and the END print statement will output an empty line. I would change its argument to count+0.
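Put together, those two changes might look like the sketch below (the sample file is invented for the demonstration). It compares each whole field, so it will not count "spam", but it will also miss "am." or "Am" unless punctuation and case are handled separately.

```shell
# Invented sample file for the demonstration.
tmp=$(mktemp)
printf '%s\n' 'i am very tired. am' 'i am busy' 'no spam here' > "$tmp"

# awk has already split the line into $1..$NF, so no split() is needed.
# count+0 forces a numeric result, so zero matches prints 0, not a blank line.
hits=$(awk '{for(i=1;i<=NF;i++) if($i=="am") count++} END{print count+0}' "$tmp")
none=$(awk '{for(i=1;i<=NF;i++) if($i=="xyz") count++} END{print count+0}' "$tmp")
echo "$hits $none"    # 3 0

rm -f "$tmp"
```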

Beware of matching substrings triggering false positives.

rbatte1:

Give it a try with:-
tr " " "\n" < your_log_file_name | grep -i "^another$"
...<snip>...
tr "[:punct:]" " " < your_log_file_name | grep -i "^another$"
The plan with this one is to translate all punctuation to spaces, then translate all spaces to a new-line, then use grep to count the records (one word each by now) that contain the string from the first to the last character only.

That seems to me to be a reasonable approach, but neither of those pipelines actually implements it. As described, the approach would require a pipeline with two tr's. However, it can be done with one if you convert punctuation directly to newlines, which would be equivalent. In that case, you can modify your latter suggestion to:

tr '[:punct:] ' '[\n*]' < your_log_file_name | grep -i "^another$"

Note the space after the punctuation character class. If one wanted to include any blank characters, the [:blank:] class could have been used instead.

Often times it's easier and safer to define what to include than what to exclude. Based on your approach, if we define a word as a sequence of [:alpha:] characters, the following portable solution can be used:

tr -sc '[:alpha:]' '[\n*]' | grep -Fixc word
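As a rough illustration of that pipeline on invented input: -c complements [:alpha:], so every non-letter becomes a newline, and -s squeezes runs of newlines into one, leaving one word per line. grep -F treats the pattern as a fixed string, -i ignores case, -x requires the whole line to match, and -c prints the count.

```shell
# Invented sample text: "another" appears three times as a whole word
# (once capitalized, twice followed by punctuation) and once inside "anothers".
total=$(printf 'Another day, another log line.\nanother! anothers\n' |
  tr -sc '[:alpha:]' '[\n*]' | grep -Fixc another)
echo "$total"    # 3
```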

Regards,
Alister

rbatte1 · May 12, 2014, 11:17am

Thanks for pointing out my logical error. :o

Perhaps I should have gone with:-

tr "[:punct:]" " " < your_log_file_name | tr "[:blank:]" "\n" | grep -i "^another$"

.
.
I am a little confused by your suggestion to use the :alpha: class. Would this not act on the characters we want to preserve?

I got some odd output from a quick test:-

# echo "Hello world!" | tr "[:alpha:]" "\n*"
***** *****!
# echo "Hello world!" | tr "[:alpha:]" "\n"





 




!

Am I missing something?

I could bunch it up into a single tr too. A quick test shows this:-

# echo "This is what I am, I am not spam I hope." | tr "[:punct:][:blank:]" "\n"|grep -c "^am$"
2

That would consolidate my suggestion to:-

tr "[:punct:][:blank:]" "\n" < your_log_file_name | grep -i "^another$"

Robin

alister · May 12, 2014, 1:08pm

rbatte1:

I am a little confused by your suggestion to use the :alpha: class. Would this not act on the characters we want to preserve?

I got some odd output from a quick test:-
# echo "Hello world!" | tr "[:alpha:]" "\n*"
***** *****!
# echo "Hello world!" | tr "[:alpha:]" "\n"





 




!
Am I missing something?

Two things. First, the -c option, which complements [:alpha:], so what is matched is everything that is not a member of [:alpha:].

Second, \n* must be bracketed. With the brackets, it represents as many newlines as it takes to match the length of the class in the previous argument. Without the brackets, it means a single newline followed by as many asterisks as it takes. So, your erroneous version would replace the first character in [:alpha:] with a newline and every subsequent character with an asterisk.

In my current locale, A is the first member of [:alpha:]. Note how in the first example 'A' is converted to a newline while 'a' becomes an asterisk:

$ printf aAa | tr '[:alpha:]' '\n*' | od -c
0000000   *  \n   *
0000003
$ printf aAa | tr '[:alpha:]' '[\n*]' | od -c
0000000  \n  \n  \n
0000003

With modern tr implementations, you can probably get away with simply using \n:

$ printf aAa | tr '[:alpha:]' '\n' | od -c
0000000  \n  \n  \n
0000003

... but there is the possibility of a portability issue. From POSIX tr:

Regards,
Alister

sheela · May 13, 2014, 12:01am

Hi alister,
Can you please tell me the exact command that I can use to count the particular word 'another' in a 17000-line log file or text file?

clx · May 13, 2014, 8:49am

Again, is it really that hard to try at least one solution yourself?
If you cannot access your system now, please come back to us once you have tried.