Grepping frequency for a list of phrases in a separate file

Dear all, I have a big set of data which I would like to summarize to get a better sense of it.
The problem is like this ...
I have more than 200 files, each related to a person, each containing different sentences with specific elements (files in a directory)
e.g. mark (one file in the directory)

I like reading books
I have five books
I love flowers
I am not allergic to flowers (list.txt)
...

then I have a file with 200 or more phrases, like

books
flowers
house pet
cooking skills
...

now I want to create a file like this:

.......|books|flowers|house pets| ...
mark   |  2  |   2   |    0     |...
john   |  5  |   0   |    2     |...

Can someone help me please?

I have tried this

mkdir result
FFILES="People/*"
for U in $FFILES
do
	docU=$(basename $U)
	docpathU=$(dirname $U)
	grep  -o -f accept-list.txt $U | sort | uniq -c | awk 'BEGIN{FS=" ";}{print $2,","$1}' > result/${docU}
done

but this has a few problems I don't know how to address:

  1. it does not work for multi-word entries such as "house pet", and I have many entries which are phrases,
     so I need to get the frequency count by line, not by word
  2. I don't know how to summarize it all into the desired structure, which is more combined and horizontal
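For the first problem, a minimal sketch of per-line counting for multi-word phrases uses `grep -c -F` (count matching lines, fixed-string match) instead of `grep -o`; the sample data below is made up to mirror the thread's example:

```shell
# made-up sample data mirroring the thread's example
mkdir -p People result
printf '%s\n' 'I like reading books' 'I have five books' 'I love flowers' > People/mark
printf '%s\n' 'books' 'flowers' 'house pet' > accept-list.txt

# -F matches each phrase as a fixed string (so "house pet" works),
# -c counts matching LINES rather than individual word occurrences
while IFS= read -r phrase; do
    n=$(grep -c -F -- "$phrase" People/mark)
    printf '%s,%s\n' "$phrase" "$n"
done < accept-list.txt > result/mark

cat result/mark
```

This prints one `phrase,count` pair per line for one person's file (here: books,2 / flowers,1 / house pet,0); the combining into one wide table is the second problem.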

many thanks for the help
A-V

Hi, I made some modifications to your code.
It should now work the way you want, though not all of the columns may print out neatly aligned.
And I must go to sleep now~
Good night.

mkdir result
FFILES="People/*"
awk 'BEGIN{printf "PeopleName"}{printf "|"$0}END{printf "\n"}' accept-list.txt >result/total_result #added
for U in $FFILES
do
 docU=$(basename $U)
 docpathU=$(dirname $U)
# grep  -o -f accept-list.txt  $U | sort | uniq -c | awk 'BEGIN{FS=" ";}{print $2,","$1}' > result/${docU} #comment
 grep -o -f accept-list.txt  $U|sort|uniq -c|awk '{s=$1;$1="";print $0","s}'|sed 's/^[[:blank:]]*//' > result/${docU} #modified
 awk -F',' -vn=${docU} 'NR==FNR{a[i++]=$0;next}{b[$1]=$2}END{printf n;for(i=0;i<length(a);i++){if(a[i] in b){printf "|"b[a[i]]}else{printf "|\t"}}printf "\n"}' accept-list.txt  result/${docU}>>result/total_result #added
done

An awk approach:

awk '
        NR == FNR {
                P[$0]
                next
        }
        {
                if ( ! ( FILENAME in F ) )
                        F[FILENAME]

                for ( k in P )
                {
                        if ( $0 ~ k )
                        {
                                R[FILENAME FS k]++
                        }
                }
        }
        END {
                printf "\t"
                for ( k in P )
                        printf "%s\t", k
                printf "\n"

                for ( j in F )
                {
                        printf "%s\t", j
                        for ( k in P )
                                printf "%s\t\t", (R[j FS k] ? R[j FS k] : 0)
                        printf "\n"
                }
        }
' phrases mark john

Input:

$ cat phrases
books
flowers
house pet
cooking skills

$ cat mark
I like reading books
I have five books
I love flowers
I am not allergic to flowers (list.txt)

$ cat john
I love house pets
I collect books

Output:

        books   cooking skills  house pet       flowers
john    1               0               1               0
mark    2               0               0               2
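One caveat: awk's `for (k in P)` visits keys in arbitrary order, which is why the columns above come out as books/cooking skills/house pet/flowers rather than in accept-list order. If column order matters, a sketch (with made-up toy data, and simplified to a single input file) that stores the phrases in an indexed array instead:

```shell
# toy data mirroring the thread's examples
printf '%s\n' 'books' 'flowers' 'house pet' > phrases
printf '%s\n' 'I like books' 'books and flowers' > mark

awk '
        NR == FNR { ord[++n] = $0; next }   # remember phrase order from the first file
        { for (i = 1; i <= n; i++) if ($0 ~ ord[i]) cnt[i]++ }
        END {
                for (i = 1; i <= n; i++) printf "%s\t", ord[i]
                printf "\n"
                for (i = 1; i <= n; i++) printf "%s\t", cnt[i] + 0   # +0 prints unset as 0
                printf "\n"
        }
' phrases mark
```

For multiple people the counts would be keyed by `FILENAME` as in the original program; only the iteration over indexed `ord[]` instead of `for (k in P)` is the point here.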

Lucas, thanks a lot for the help ... I managed to make it work even without the first line, and it seems to be OK.
I replaced "|" and "|\t" with "," and "0" and saved it as a CSV file.
Now I am only wondering whether I can have the phrases as the header for the file?
cheers

---------- Post updated at 01:36 PM ---------- Previous update was at 12:43 PM ----------

Dear Yoda,
there are a few things I am not sure how to handle === the code is a bit complicated for me to understand

  1. can I remove the count of words right after the people's names --- and have the count of the words in that row at the end of it instead?
  2. I can feed the results into one file, but then the headers keep reappearing:
' accept-all.txt $U 
done > result-here.csv
  3. otherwise I can create a loop around it and do
' accept-all.txt $U > result/${docU}
done 

I have replaced the tabs with ", " and made a CSV out of it

The program will not work if you feed one input file at a time using a for loop.

You could pass them all at once:

awk '
        -- code --
' accept-all.txt People/*

That's what I have done in the end,
but I still haven't managed to figure out how to delete the count number next to the name, or how to add the overall frequency count of existing words in the row at the end

I'm sorry, I didn't get what you are asking. Post what you got and what is expected.

what I get is

               books   cooking skills  house pet       flowers
john     20       1               0               1               0
mark    32       2               0               0               2

which includes the number of words in that file,
but what I would like is

        books   cooking skills  house pet       flowers        count
john    1               0               1               0              2
mark    2               0               0               2             4

Apply these changes:

        END {
                printf "\t"
                for ( k in P )
                        printf "%s\t", k
                printf "\tCount\n"

                for ( j in F )
                {
                        printf "%s\t", j
                        for ( k in P )
                        {
                                i = j FS k
                                if ( i in R )
                                {
                                        printf "%s\t\t", R[i]
                                        c += R[i]
                                }
                                else
                                        printf "%s\t\t", "0"
                        }
                        printf "%s\n", c
                        c = 0
                }
        }
' phrases mark john

Output:

        books   cooking skills  house pet       flowers         Count
john    1               0               1               0               2
mark    2               0               0               2               4

somehow I get the number of lines of each file next to the name, so when my data is added it gets added to that number, and I don't know how to delete it
e.g.

                  books   cooking skills  house pet       flowers       Count
john   20          1               0               1               0               22
mark  10          2               0               0               2               14

my exact code is as follows

#!/bin/bash
mkdir result
FFILES="People/*"
for U in $FFILES
do
 docU=$(basename $U)
 docpathU=$(dirname $U)
 awk '
        NR == FNR {
                P[$0]
                next
        }
        {
                if ( ! ( FILENAME in F ) )
                        F[FILENAME]

                for ( k in P )
                {
                        if ( $0 ~ k )
                        {
                                R[FILENAME FS k]++
                        }
                }
        }
END {
                printf ","
                for ( k in P )
                        printf "%s,", k
                printf "Count\n"

                for ( j in F )
                {
                        printf "%s,", j
                        for ( k in P )
                        {
                                i = j FS k
                                if ( i in R )
                                {
                                        printf "%s,", R[i]
                                        c += R[i]
                                }
                                else
                                        printf "%s,", "0"
                        }
                        printf "%s\n", c
                        c = 0
                }
        }
' accept-list.txt $U > result/${docU}.csv
done