demmel
February 18, 2013, 5:23pm
1
Hi guys,
I'm struggling with this one, any help is appreciated.
I have a file, file1, with hundreds of unique words, like this:
word1
word2
word3
I want to count each word from file1 in file2 and return how many times each word is found.
I tried something like this:
for i in $(cat file1); do echo -n "$i ";tr -s ' ' '\n' < file2| grep -c "$i ";done > temp03
but it's returning zero for every word, even when the word does exist in file2.
Can you please help?
Thanks a lot!
Yoda
February 18, 2013, 5:51pm
2
Try something like:
while read -r word
do
    c=$( tr ' ' '\n' < file2 | grep -c "${word}" )
    echo "${word} ${c}"
done < file1
gary_w
February 18, 2013, 5:54pm
3
#!/bin/ksh
FILE1=x1.dat
FILE2=x2.dat
while read -r word
do
    # -w makes grep match on word boundaries; -c counts matching lines.
    count=$( grep -wc "$word" "$FILE2" )
    printf "%s: %d\n" "$word" "$count"
done < "$FILE1"
exit 0
$ cat x1.dat
word1
word2
word3
$ cat x2.dat
word1
word2
word3
word1
word2
word3oword1
word2
word2
word2
word3
word3
word3
word3 word3word3
$ ./x
word1: 2
word2: 5
word3: 5
$
I would advise against using tr, since that translation of the entire file is happening for each word in FILE1.
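If the tr step is still wanted, one way to avoid repeating it is to translate the file once and reuse the result. This is only a sketch: count_words is a hypothetical helper name, and it counts tokens that contain the word as a substring, which is what the tr/grep -c loops in this thread do.

```shell
#!/bin/sh
# count_words is a hypothetical helper; $1 = word list, $2 = text file.
count_words() {
    tmp=$(mktemp) || return 1
    # Run tr once: spaces become newlines, repeated newlines are squeezed.
    tr -s ' ' '\n' < "$2" > "$tmp"
    while IFS= read -r word
    do
        # -F: treat the word as a fixed string; -c: count matching lines.
        printf '%s %s\n' "$word" "$(grep -cF -e "$word" "$tmp")"
    done < "$1"
    rm -f "$tmp"
}
```

Called as count_words x1.dat x2.dat, it prints one "word count" line per entry in the word list.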
for i in $(cat file1); do echo -n "$i ";tr -s ' ' '\n' < file2| grep -c "$i ";done > temp03
change echo -n to echo.
RudiC
February 19, 2013, 6:50am
5
Using gary_w's files, would this satisfy your needs:
$ grep -of x1.dat x2.dat| sort |uniq -c
3 word1
5 word2
8 word3
---------- Post updated at 12:50 ---------- Previous update was at 12:44 ----------
That wouldn't help. Remove the space in grep's "$i " parameter...
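A small illustration of why the trailing space matters, using a made-up one-line sample rather than the poster's files: after tr, every token sits alone on its line, so no line contains the word followed by a space.

```shell
#!/bin/sh
sample='word1 word2 word1'
# grep exits non-zero when nothing matches; "|| true" keeps a "set -e" script going.
with_space=$(printf '%s\n' "$sample" | tr -s ' ' '\n' | grep -c "word1 " || true)
without_space=$(printf '%s\n' "$sample" | tr -s ' ' '\n' | grep -c "word1" || true)
echo "pattern with trailing space: $with_space"       # prints 0
echo "pattern without trailing space: $without_space" # prints 2
```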
gary_w
February 19, 2013, 8:48am
6
Note that RudiC's method counts the word if it is part of another word. I do not know if this is the desired result as the original spec did not get that detailed. Just sayin'.
RudiC
February 19, 2013, 10:16am
7
Well, then, use the -w switch to grep. This then would yield exactly gary_w's result. But I'm not sure that this will work on all grep versions.
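One caveat, sketched with a made-up line and assuming a grep that supports both -o and -w (GNU and BSD grep do; neither option is required by POSIX): grep -wc counts matching lines, while grep -ow emits every whole-word occurrence, so the two only agree when no line holds the same word twice.

```shell
#!/bin/sh
line='word3 word3'
# -w with -c counts lines that contain the whole word: one line here.
lines=$(printf '%s\n' "$line" | grep -wc 'word3')
# -w with -o prints each whole-word match on its own line: two matches here.
hits=$(printf '%s\n' "$line" | grep -ow 'word3' | wc -l)
echo "matching lines: $lines"
echo "occurrences: $hits"
```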
demmel
March 27, 2013, 7:58pm
8
I actually prefer counting the words even when they are embedded in other words, so RudiC's seems the best option. Anyway, all the replies helped me a lot, so thanks to all!
---------- Post updated at 06:49 PM ---------- Previous update was at 05:44 PM ----------
Unfortunately, the grep on my server does not support -o, and I can't do anything about it.
grep: illegal option -- o
Do you know if it's possible to do something similar?
---------- Post updated at 08:58 PM ---------- Previous update was at 06:49 PM ----------
I got the result expected by using the following:
for i in $(cat x1.dat); do echo "$i ";tr -s ' ' '\n' < x2.dat| grep -c "$i";done
However, the result is coming up like this:
word1
3
word2
5
word3
8
But I expected it to look like this:
3 word1
5 word2
8 word3
Can anyone help further?
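The two-line output comes from echo ending its line before grep -c prints. A possible fix, sketched here with a hypothetical function name, is to capture the count and print both fields with a single printf:

```shell
#!/bin/sh
# print_counts is a hypothetical name; $1 = word list, $2 = text file.
print_counts() {
    while IFS= read -r word
    do
        # The command substitution captures the count, so it can be
        # printed first, on the same line as the word.
        printf '%s %s\n' "$(tr -s ' ' '\n' < "$2" | grep -c -e "$word")" "$word"
    done < "$1"
}
```

For example, print_counts x1.dat x2.dat prints one "count word" pair per line.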
Yoda
March 27, 2013, 8:57pm
9
Here is a KSH script using Associative Arrays for counting words:
#!/bin/ksh
typeset -A word_ARR
while read -r line
do
    for word in $line
    do
        (( word_ARR[$word]++ ))
    done
done < file.txt
for key in "${!word_ARR[@]}"
do
    print "${word_ARR[$key]} $key"
done
demmel
March 27, 2013, 10:44pm
10
I may be doing something wrong, but I'm unable to get any results from this script.
I ran it as is, only replacing the input file name.
I tried it in two different environments:
1-$ ./array
./array[3]: typeset: bad option(s)
2-$./array
bash: ./array: /bin/ksh: bad interpreter: No such file or directory
Any clue where the problem is?
The problem is that /bin/ksh apparently does not exist on your system.
Please try the following, assuming it runs on your system:
$ cat file1
word1
word2
word3
$ cat file2
word1
word2 word3
word2 word3 word3 word4
$ cat temp.sh
grep -f file1 file2 > good_lines
sed "s/\<[a-zA-Z0-9_]\+\>/&\n/g" good_lines > split_lines
grep -f file1 split_lines | sed "s/^ *//; s/ *$//" > matched_words
sort matched_words | uniq -c
$ ./temp.sh
1 word1
2 word2
3 word3
I defined a "Word" as the standard [a-zA-Z0-9_].
So this includes "Words" with numbers and underscores.
Alternatively, you could use [a-zA-Z].
Or maybe you want to count "auto-correct" as one word.
In that case, [a-zA-Z-] would work.
Yoda
March 28, 2013, 12:21am
12
I forgot to mention that this code requires KSH93.
KSH88 does not support the typeset option -A used to define associative arrays.
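Where KSH93 is not available, awk's built-in associative arrays can do the same job. The following is a rough equivalent, not the script above: count_all_words is a hypothetical name, and the trailing sort is only there because awk's for-in order is unspecified.

```shell
#!/bin/sh
# count_all_words is a hypothetical name; $1 = file to tally.
count_all_words() {
    awk '
        # Increment a counter for every whitespace-separated field.
        { for (i = 1; i <= NF; i++) count[$i]++ }
        # Print "count word" for each distinct word seen.
        END { for (w in count) print count[w], w }
    ' "$1" | sort -k2
}
```

Like the KSH version, this counts every word in the file rather than only the words listed in file1.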
demmel
March 28, 2013, 5:59pm
13
The standard word you defined is great as it is.
I created the temp script, but it did not work as expected on one of my systems:
$ ./temp.sh
sed: Function s/\<[a-zA-Z0-9_]\+\>/& cannot be parsed.
I'm not sure why some sed features are not working on this system. Any ideas for working around this error?
However, on my other system the result was as expected, so thanks a lot!
---------- Post updated at 06:59 PM ---------- Previous update was at 06:47 PM ----------
I was able to prevent the error by using single quotes instead of double quotes, but the result is still not right; see below:
$ ./temp.sh
1 word1
1 word2 word3
1 word2 word3 word3 word4
This is the content of the file split_lines:
word1
word2 word3
word2 word3 word3 word4
Any ideas?
It's something to do with the sed line. The best way to figure it out is to copy and paste the temp.sh shell script, exactly as it is on your system, and include it with the message. No point in guessing.
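Without seeing the script, one thing that might be worth checking is whether that system has a non-GNU sed: \+ in a basic regular expression and \n in the replacement text are not guaranteed by POSIX sed. A sketch that sidesteps sed entirely, assuming the same [a-zA-Z0-9_] word definition (count_listed_words is a hypothetical name):

```shell
#!/bin/sh
# count_listed_words is a hypothetical name; $1 = word list, $2 = text file.
count_listed_words() {
    # tr -cs: replace every run of characters outside the word set with a
    #         single newline, leaving one token per line.
    # grep -F: fixed strings, -x: the whole line must match,
    #      -f: read the patterns from the word list.
    tr -cs 'A-Za-z0-9_' '[\n*]' < "$2" |
        grep -Fxf "$1" |
        sort | uniq -c
}
```

Because of -x, only whole tokens are counted here, so a word embedded inside a longer token is not included.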