fgrep - printing pattern and filename

Hi,

I have a pattern file (patternfile) containing the following patterns:
cat
dog
cow
pig

Let's say I have a thousand files:
file0001
file0002
file0003
.
.
.
file1000

Each pattern can occur multiple times in multiple files. How can I search for the patterns so that each pattern/filename pair is printed only once? I want the output to look like this:

cat file0003 ( cat occurs in file0003 5 times but is printed only once)
cat file0500
cat file0699
dog file0001
dog file1000
pig file0999

and so on.
I used fgrep -f patternfile /directory/file* but that prints one line per match, so a pattern occurring several times in the same file is printed several times.

I didn't try it, so this may need some fixing, but you get the idea:

while read -r a
do
    # count the lines matching pattern P in each file; key is "pattern in filename"
    awk -v P="$a" '$0~P{s[P" in "FILENAME]++}END{for(i in s) print i" appears on "s[i]" lines"}' file*
done <patternfile

But this rescans every file once per pattern, so the total I/O is (number of patterns) times (size of all the files).

For better performance, I would go another way:

First, print every line of every file prefixed with its filename:

awk '{print FILENAME":"$0}' file* >bigone

and then process the bigone output for the calculation:

awk -F: 'NR==FNR{P[$0];next}{for(i in P) s[i":"$1]+=gsub(i,i,$0)}END{for(k in s) print k" appears "s[k]" times"}' patternfile bigone

Store each pattern as an index of the associative array 'P'.
Build an associative array 's', indexed by pattern:FILENAME, that accumulates the number of matches for that pattern (gsub returns the number of substitutions it made).
When the scan is finished, print the results.
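
The same counting idea can probably be done without the intermediate file, since awk already exposes the current file name in FILENAME. The following is only a sketch under a few assumptions: the demo directory and the file contents are made up for illustration, the replacement "&" is used so gsub puts each match back unchanged, and pairs with a zero count are skipped. Note that the patterns are treated as regular expressions here, not as fixed strings the way fgrep treats them.

```shell
#!/bin/sh
# Sketch: count pattern occurrences per file in a single awk pass.
# The temporary directory and sample files are hypothetical demo data.
dir=$(mktemp -d)
printf 'cat\ndog\n'     > "$dir/patternfile"
printf 'cat cat\ndog\n' > "$dir/file0001"
printf 'a cat\n'        > "$dir/file0002"

out=$(cd "$dir" && awk '
    NR == FNR { P[$0]; next }               # first file: load the patterns
    {
        for (p in P) {
            n = gsub(p, "&")                # "&" puts the match back, so $0 is unchanged
            if (n) s[p ":" FILENAME] += n   # key is pattern:filename
        }
    }
    END { for (k in s) print k " appears " s[k] " times" }
' patternfile file0* | sort)

printf '%s\n' "$out"
rm -rf "$dir"
```

The sort at the end is only there because the order of "for (k in s)" is unspecified in awk.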

Alternatively, you can put it all in one pipeline:

awk '{print FILENAME":"$0}' file* | awk -F: 'NR==FNR{P[$0];next}{for(i in P)  s[i":"$1]+=gsub(i,i,$0)}END{for(k in s) print k" appears "s[k]" times"}'  patternfile -

I didn't test the code, so it may not be perfect, but it should give you the idea.
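
If the counts are not needed and only the output format from the original question is wanted (each pattern/filename pair printed once), something along these lines might be enough. It is a sketch, not a tested solution for the real data: the sample files are invented, the array name "seen" is arbitrary, and index() is used so that patterns are matched as fixed strings, as fgrep would do.

```shell
#!/bin/sh
# Sketch: print each "pattern filename" pair once, however often it matches.
# The temporary directory and sample files are hypothetical demo data.
dir=$(mktemp -d)
printf 'cat\ndog\ncow\n'                        > "$dir/patternfile"
printf 'cat\nthe cat and the dog\ncat again\n'  > "$dir/file0001"
printf 'only a dog here\n'                      > "$dir/file0002"

pairs=$(cd "$dir" && awk '
    NR == FNR { P[$0]; next }                   # first file: load the patterns
    {
        for (p in P)
            # index() does a fixed-string search, like fgrep;
            # print each (pattern, file) pair only the first time it is seen
            if (index($0, p) && !seen[p, FILENAME]++)
                print p, FILENAME
    }
' patternfile file0* | sort)

printf '%s\n' "$pairs"
rm -rf "$dir"
```

Each file is read only once, and a pattern that never matches (cow in this demo) produces no output.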

....

$ ls ts*
tst   tst2
$ cat tst
Alpha> lh ru warpA read DL_PM_PA0_C0
Beta> lh ru warpA read DL_PM_PA0_C0
Gamma> lh ru warpA read DL_PM_PA0_C0
Delta> lh ru warpA read DL_PM_PA0_C0
$ lhsh BXP_0_1 warpA read DL_PM_PA0_C0
$ lhsh BXP_0_1 warpA read DL_PM_PA0_C0
$ lhsh BXP_0_1 warpA read DL_PM_PA0_C0
$ lhsh BXP_0_1 warpA read DL_PM_PA0_C0
BXP_0_1: Value 0x01CC9739 (30185273) read from address 0x00000B8F.
BXP_0_1: Value 0x050A2F06 (84553478) read from address 0x00000B8F.
BXP_0_1: Value 0x02563DEF (39206383) read from address 0x00000B8F.
BXP_0_1: Value 0x01CB58B7 (30103735) read from address 0x00000B8F.
$ lhsh BXP_1_1 warpA read DL_PM_PA0_C0
$ lhsh BXP_1_1 warpA read DL_PM_PA0_C0
$ lhsh BXP_1_1 warpA read DL_PM_PA0_C0
$ lhsh BXP_1_1 warpA read DL_PM_PA0_C0
BXP_1_1: Value 0x05033922 (84097314) read from address 0x00000B8F.
BXP_1_1: Value 0x01CCEFB6 (30207926) read from address 0x00000B8F.
BXP_1_1: Value 0x01CED447 (30331975) read from address 0x00000B8F.
BXP_1_1: Value 0x0218E0BA (35184826) read from address 0x00000B8F.
$ lhsh BXP_2_1 warpA read DL_PM_PA0_C0
$ lhsh BXP_2_1 warpA read DL_PM_PA0_C0
$ lhsh BXP_2_1 warpA read DL_PM_PA0_C0
$ lhsh BXP_2_1 warpA read DL_PM_PA0_C0
BXP_2_1: Value 0x0236B631 (37140017) read from address 0x00000B8F.
BXP_2_1: Value 0x01CE0AF3 (30280435) read from address 0x00000B8F.
BXP_2_1: Value 0x050FAD30 (84913456) read from address 0x00000B8F.
BXP_2_1: Value 0x01CCCC5A (30198874) read from address 0x00000B8F.
$ cat tst2
sos,WINXP,1,2,3,4,5,6,7,,9
sos,WINVISTA,1,2,3,4,5,6,7,,9
sos,MAC,1,2,3,4,5,6,7,,9
sos,LINUX,1,2,3,4,5,6,7,,9
tos,winxp,1,2,3,4,5,6,7,winxp,9
tos,winvista,1,2,3,4,5,6,7,winvista,9
tos,mac,1,2,3,4,5,6,7,mac,9
tos,linux,1,2,3,4,5,6,7,linux,9

f4 is my pattern file for testing:

$ cat f4
0
1
$ nawk '{print FILENAME":"$0}' ts* | nawk -F: 'NR==FNR{P[$0];next}{for(i in P) s[i":"$1]+=gsub(i,i,$0)}END{for(k in s) print k" appears "s[k]" times"}' f4 -
1:tst appears 49 times
0:tst2 appears 0 times
1:tst2 appears 8 times
0:tst appears 156 times

Read the colon ":" as "in". For example:
1:tst2 appears 8 times
means that the pattern "1" appears 8 times in the file "tst2".