How to count number of results found?

demmel · February 18, 2013, 5:23pm

Hi guys,

I'm struggling with this one, any help is appreciated.

I have File1 with hundreds of unique words, like this:

word1
word2
word3

I want to count each word from file1 in file2 and return how many times each word is found.

I tried something like this:

for i in $(cat file1); do echo -n "$i ";tr -s ' ' '\n' < file2| grep -c "$i ";done > temp03

but it's returning zero for every word, even when the word does exist in file2.

Can you please help?

Thanks a lot!

Yoda · February 18, 2013, 5:51pm

Try something like:

while read word
do
        c=$( tr ' ' '\n' < file2 | grep -c "${word}" )
        echo "${word} ${c}"
done < file1

gary_w · February 18, 2013, 5:54pm

#!/bin/ksh

FILE1=x1.dat
FILE2=x2.dat

while read word
do
  # -w makes grep match on word boundaries only.
  # -c counts matching lines, not individual occurrences.
  count=$( grep -wc "$word" "$FILE2" )
  printf "%s: %d\n" "$word" "$count"
done < "$FILE1"

exit 0

$ cat x1.dat
word1
word2
word3
$ cat x2.dat
word1
word2
word3
word1
word2
word3oword1
word2
word2
word2
word3
word3
word3
word3 word3word3
$ ./x
word1: 2
word2: 5
word3: 5
$

I would advise against using tr, since it re-translates the entire file once for every word in FILE1.
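
Along the same lines, the whole count can be done in a single pass over the second file. This is only a minimal sketch, assuming a POSIX awk is available (on some older systems that may be nawk rather than awk) and that a "word" means a whitespace-separated token; count_words is a hypothetical helper name, not something from the scripts above.

```shell
#!/bin/sh
# count_words WORDLIST TEXT   (hypothetical helper name)
# Reads TEXT once and counts whitespace-separated tokens that are
# exactly equal to a line of WORDLIST.
# Usage: count_words x1.dat x2.dat
count_words() {
    awk '
        NR == FNR { if ($1 != "") want[$1] = 0; next }   # first file: the word list
        {
            for (i = 1; i <= NF; i++)
                if ($i in want) want[$i]++               # second file: tally exact tokens
        }
        END { for (w in want) printf "%s: %d\n", w, want[w] }
    ' "$1" "$2" | sort
}
```

Unlike grep -wc, this counts every occurrence on a line rather than one per matching line, and it treats punctuation attached to a word as part of the token.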

nithinsen · February 19, 2013, 4:42am

for i in $(cat file1); do echo -n "$i ";tr -s ' ' '\n' < file2| grep -c "$i ";done > temp03

change echo -n to echo.

RudiC · February 19, 2013, 6:50am

Using gary_w's files, would this satisfy your needs:

$ grep -of x1.dat x2.dat| sort |uniq -c
      3 word1
      5 word2
      8 word3

---------- Post updated at 12:50 ---------- Previous update was at 12:44 ----------

Changing echo -n to echo wouldn't help. Remove the trailing space in grep's "$i " pattern instead.
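
To see why the trailing space matters: once tr has put every word on its own line, no line contains the word followed by a space, so grep -c has nothing to count. A small illustration with made-up input (the sample function and its text are invented for this example):

```shell
#!/bin/sh
# Made-up sample text, split into one word per line as in the original loop.
sample() { printf 'word1 word2\nword1\n' | tr -s ' ' '\n'; }

# grep exits with status 1 when nothing matches, hence the "|| true".
with_space=$(sample | grep -c 'word1 ' || true)     # pattern ends in a space
without_space=$(sample | grep -c 'word1' || true)   # pattern without the space

echo "with trailing space:    $with_space"
echo "without trailing space: $without_space"
```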

gary_w · February 19, 2013, 8:48am

Note that RudiC's method counts the word if it is part of another word. I do not know if this is the desired result as the original spec did not get that detailed. Just sayin'.

RudiC · February 19, 2013, 10:16am

Well, then, use the -w switch to grep. This then would yield exactly gary_w's result. But I'm not sure that this will work on all grep versions.
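
One caveat worth checking on your own data: grep -c counts matching lines, while grep -o prints every match, so the two approaches only agree when a word never appears more than once as a whole word on the same line. A quick check with a made-up line, assuming a grep that supports -o:

```shell
#!/bin/sh
# Made-up line: the word appears twice as a whole word and once embedded.
line='word1 word1 xword1'

# -wc counts lines that contain at least one whole-word match.
lines=$(printf '%s\n' "$line" | grep -wc 'word1' || true)

# -ow prints each whole-word match on its own line; wc -l counts them.
hits=$(printf '%s\n' "$line" | grep -ow 'word1' | wc -l | tr -d ' ')

echo "grep -wc:         $lines"
echo "grep -ow | wc -l: $hits"
```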

demmel · March 27, 2013, 7:58pm

I actually prefer counting the words even when they appear inside other words, so RudiC's approach seems the best option. All the replies helped me a lot, though, so thanks to all!

---------- Post updated at 06:49 PM ---------- Previous update was at 05:44 PM ----------

Unfortunately, the grep on my server does not support the -o option, and I can't do anything about it.

grep: illegal option -- o

Do you know if it's possible to do something similar?
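
One possible way to get similar counts without grep -o is to do the substring search in awk. This is a rough sketch, assuming a POSIX awk (possibly installed as nawk on older systems) and that the words are plain strings rather than regular expressions; count_substrings is a hypothetical helper name.

```shell
#!/bin/sh
# count_substrings WORDLIST TEXT   (hypothetical helper name)
# Counts non-overlapping occurrences of each WORDLIST line anywhere in TEXT,
# including inside longer words, without relying on grep -o.
# Usage: count_substrings x1.dat x2.dat
count_substrings() {
    awk '
        NR == FNR { if ($0 != "") count[$0] = 0; next }   # first file: load the words
        {
            for (w in count) {
                rest = $0
                while ((pos = index(rest, w)) > 0) {       # plain substring search
                    count[w]++
                    rest = substr(rest, pos + length(w))   # continue after this hit
                }
            }
        }
        END { for (w in count) print count[w], w }
    ' "$1" "$2" | sort -k2
}
```

Each word is counted independently here, so if one listed word is contained in another (say word1 and word10), the totals may differ from what grep -of would report.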

---------- Post updated at 08:58 PM ---------- Previous update was at 06:49 PM ----------

I got the result expected by using the following:

for i in $(cat x1.dat); do echo "$i ";tr -s ' ' '\n' < x2.dat| grep -c "$i";done

However, the result is coming up like this:

word1
3
word2
5
word3
8

But I expected it to look like this:

3 word1
5 word2
8 word3

Can anyone help further?
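
The word and the count land on separate lines because echo and grep each write their own newline. Capturing the count in a variable first allows both to be printed on one line, count first. A minimal sketch along the lines of the loop above; count_tokens is a hypothetical helper name:

```shell
#!/bin/sh
# count_tokens WORDLIST TEXT   (hypothetical helper name)
# Prints "count word" for each line of WORDLIST.
# Usage: count_tokens x1.dat x2.dat
count_tokens() {
    while IFS= read -r word; do
        [ -n "$word" ] || continue                  # skip empty lines in the list
        # Capture the count instead of letting grep print it directly.
        # grep exits 1 on zero matches, hence the "|| true".
        n=$(tr -s ' ' '\n' < "$2" | grep -c "$word" || true)
        echo "$n $word"
    done < "$1"
}
```

Note that grep -c counts the split-out tokens that contain the word, so a token such as word3word3 adds one to the total, not two.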

Yoda · March 27, 2013, 8:57pm

Here is a KSH script using Associative Arrays for counting words:

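#!/bin/ksh

typeset -A word_ARR

# Tally every whitespace-separated word in file.txt.
while read -r line
do
        for word in $line
        do
                (( word_ARR[$word]++ ))
        done
done < file.txt

# Print "count word" for each distinct word seen.
for key in "${!word_ARR[@]}"
do
        print "${word_ARR[$key]} $key"
done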

demmel · March 27, 2013, 10:44pm

I may be doing something wrong, but I'm unable to get any results from this script.

I ran it unchanged, replacing only the input file name, and tried it in two different environments:

1-$ ./array
./array[3]: typeset: bad option(s)
2-$ ./array
bash: ./array: /bin/ksh: bad interpreter: No such file or directory

Any clue where the problem is?

hanson44 · March 28, 2013, 12:15am

In your second environment, the problem is that /bin/ksh apparently does not exist on the system.

Please try the following, assuming it runs on your system:

$ cat file1
word1
word2
word3

$ cat file2
word1
word2 word3
word2 word3 word3 word4

$ cat temp.sh
grep -f file1 file2 > good_lines
sed "s/\<[a-zA-Z0-9_]\+\>/&\n/g" good_lines > split_lines
grep -f file1 split_lines | sed "s/^ *//; s/ *$//" > matched_words
sort matched_words | uniq -c

$ ./temp.sh
      1 word1
      2 word2
      3 word3

I defined a "Word" as the standard [a-zA-Z0-9_].
So this includes "Words" with numbers and underscores.
Alternatively, you could use [a-zA-Z].
Or maybe you want to count "auto-correct" as one word.
In that case, [a-zA-Z-] would work.

Yoda · March 28, 2013, 12:21am

I forgot to mention that you require KSH93 to support this code.

KSH88 does not support the typeset option -A, which defines associative arrays.

demmel · March 28, 2013, 5:59pm

The standard word you defined is great as it is.

I created the temp script, but it did not work as expected on one of my systems:

$ ./temp.sh
sed: Function s/\<[a-zA-Z0-9_]\+\>/& cannot be parsed.

I'm not sure why some sed features are not working or not available here. Any ideas for getting around this error?

On my other system, however, the result was as expected, so thanks a lot!

---------- Post updated at 06:59 PM ---------- Previous update was at 06:47 PM ----------

I was able to prevent the error by using single quotes instead of double quotes, but the result is still wrong; see below:

$ ./temp.sh
   1 word1
   1 word2 word3
   1 word2 word3 word3 word4

This is the content of the file split_lines:

word1
word2 word3
word2 word3 word3 word4

Any ideas?

hanson44 · March 28, 2013, 6:22pm

It's something to do with the sed line. The best way to figure it out is to copy and paste the temp.sh shell script, exactly as it is on your system, and include it with the message. No point in guessing.
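
One thing that can be said without seeing the script: the sed expression uses \<, \>, \+ and \n in the replacement, which are GNU extensions, so a non-GNU sed may reject them or treat them literally. If that turns out to be the cause, a possible workaround is to split the words with tr instead. This is a sketch only, assuming a tr that accepts POSIX-style ranges and a grep with -F and -x; count_listed_words is a hypothetical helper name.

```shell
#!/bin/sh
# count_listed_words WORDLIST TEXT   (hypothetical helper name)
# Splits TEXT into words made of [A-Za-z0-9_] and counts those that are
# exactly equal to a line of WORDLIST.
# Usage: count_listed_words file1 file2
count_listed_words() {
    tr -cs 'A-Za-z0-9_' '\n' < "$2" |   # each run of non-word characters becomes one newline
        grep -Fx -f "$1" |              # keep tokens that exactly equal a listed word
        sort | uniq -c |
        awk '{ print $1, $2 }'          # normalize the padding uniq -c adds
}
```

Changing the character set passed to tr changes the definition of a "word" in the same way as changing the bracket expression in the sed version.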