Dear all, I have a big set of data which I would like to summarize to get a better sense of it.
The problem is like this ...
I have more than 200 files in a directory, one per person, each containing different sentences with specific elements.
e.g. mark (one file in the directory)
I like reading books
I have five books
I love flowers
I am not allergic to flowers (list.txt)
...
then I have a file with 200 or more phrases, like:
books
flowers
house pet
cooking skills
...
now I want to create a file that looks like this:
.......|books|flowers|house pets| ...
mark | 2 | 2 | 0 |....
john | 5 | 0 | 2 |...
Can someone help me, please?
I have tried this
mkdir -p result
FFILES="People/*"
for U in $FFILES
do
    docU=$(basename "$U")
    docpathU=$(dirname "$U")
    grep -o -f accept-list.txt "$U" | sort | uniq -c | awk '{print $2 "," $1}' > "result/${docU}"
done
but this has a few problems I don't know how to address:
it does not work for multi-word entries such as "house pet", and many of my lines are phrases,
so I need to get the frequency count by line, not by word,
and I don't know how I can summarize it all into the desired structure, which is more combined and horizontal.
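As a side note, one way to get per-line counts for multi-word phrases is to loop over the phrase list and use `grep -c` (which counts matching lines, not matches) together with `-F` (fixed strings, so "house pet" is matched literally). This is only a sketch under the layout described above (`People/*`, `accept-list.txt`); the sample data below is made up so the snippet can be run as-is:

```shell
# Sketch only: per-line, fixed-string counts for each phrase.
cd "$(mktemp -d)"
mkdir -p People result
printf 'I like reading books\nI have five books\nI love house pets\n' > People/mark
printf 'books\nhouse pet\n' > accept-list.txt

for U in People/*; do
    docU=$(basename "$U")
    while IFS= read -r phrase; do
        [ -n "$phrase" ] || continue        # skip blank lines in the phrase list
        n=$(grep -c -F -- "$phrase" "$U")   # -c counts matching lines, -F matches literally
        printf '%s,%s\n' "$phrase" "$n"
    done < accept-list.txt > "result/$docU"
done
cat result/mark
```

This still produces one vertical file per person; combining them into the horizontal table is a separate step.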
Hi, I made some modifications to your code.
It should now work the way you want, though not all of the columns will print in nicely aligned format.
And I must go to sleep now~
Good night.
awk '
NR == FNR {
    P[$0]
    next
}
{
    if ( ! ( FILENAME in F ) )
        F[FILENAME]
    for ( k in P )
    {
        if ( $0 ~ k )
        {
            R[FILENAME FS k]++
        }
    }
}
END {
    printf "\t"
    for ( k in P )
        printf "%s\t", k
    printf "\n"
    for ( j in F )
    {
        printf "%s\t", j
        for ( k in P )
            printf "%s\t\t", (R[j FS k] ? R[j FS k] : 0)
        printf "\n"
    }
}
' phrases mark john
Input:
$ cat phrases
books
flowers
house pet
cooking skills
$ cat mark
I like reading books
I have five books
I love flowers
I am not allergic to flowers (list.txt)
$ cat john
I love house pets
I collect books
Output:
books cooking skills house pet flowers
john 1 0 1 0
mark 2 0 0 2
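For anyone following along, the `NR == FNR` idiom is what routes the first file (the phrase list) into `P` and every later file into the counting block: `NR` is the global record number while `FNR` resets per file, so they are only equal while the first file is being read. A minimal standalone illustration (made-up file names):

```shell
cd "$(mktemp -d)"
printf 'a\nb\n' > first
printf 'x\ny\nz\n' > second
# NR == FNR is true only for records from the first file on the command line.
awk 'NR == FNR { print "first:", $0; next } { print "second:", $0 }' first second > out.txt
cat out.txt
```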
Lucas, thanks a lot for the help ... I managed to make it work even without the first line and it seems to be OK.
I replaced "|" and "|\t" with "," and "0" and saved it as a CSV file.
Now I am only wondering whether I can have the phrases as the header of the file?
Cheers
---------- Post updated at 01:36 PM ---------- Previous update was at 12:43 PM ----------
Dear Yoda,
there are a few things I am not sure how to handle, as the code is a bit complicated for me to understand.
Can I remove the count of words that appears right after each person's name, and instead have the count of the words in that row at the end of it?
I can feed the results into one file, but then the headers keep reappearing:
' accept-all.txt $U
done > result-here.csv
otherwise I can create a loop around it and
' accept-all.txt $U > result/${docU}
done
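For the reappearing headers, one option is a post-processing pass that keeps only the first header line once everything has been concatenated into one CSV. This is a hypothetical sketch (the file name and header below are made up); it assumes every run emits an identical header line:

```shell
cd "$(mktemp -d)"
# Simulate two runs appended into one file, each repeating the header.
printf ',books,flowers,Count\nmark,2,2,4\n,books,flowers,Count\njohn,1,0,1\n' > result-here.csv
# Keep the first line; drop any later line identical to it.
awk 'NR == 1 { header = $0; print; next } $0 != header' result-here.csv > deduped.csv
cat deduped.csv
```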
I have replaced the tabs with ", " and made a CSV out of it.
That's what I have done at the end,
but I still did not manage to figure out how to delete the count number next to the name, or to add the overall frequency count of existing words at the end of each row.
END {
    printf "\t"
    for ( k in P )
        printf "%s\t", k
    printf "\tCount\n"
    for ( j in F )
    {
        printf "%s\t", j
        for ( k in P )
        {
            i = j FS k
            if ( i in R )
            {
                printf "%s\t\t", R[i]
                c += R[i]
            }
            else
                printf "%s\t\t", "0"
        }
        printf "%s\n", c
        c = 0
    }
}
' phrases mark john
Output:
books cooking skills house pet flowers Count
john 1 0 1 0 2
mark 2 0 0 2 4
Somehow I get the number of lines of each file next to the name, so when my data is added it gets added to that number, and I don't know how to get rid of it,
e.g.
books cooking skills house pet flowers Count
john 20 1 0 1 0 22
mark 10 2 0 0 2 14
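One possible cause (an assumption, since the actual accept-list.txt is not shown here) is a blank line in the phrase list: an empty key in `P` makes `$0 ~ k` match every record, so one invisible "phrase" column silently counts all lines of the file, and that total lands right after the name. Skipping empty records while the list is read avoids it:

```shell
cd "$(mktemp -d)"
printf 'books\n\n' > phrases              # note the stray blank line
printf 'I like books\nI also nap\n' > mark
# "if (NF)" skips blank phrase lines, which would otherwise match every record.
awk 'NR == FNR { if (NF) P[$0]; next }
     { for (k in P) if ($0 ~ k) R[k]++ }
     END { for (k in P) print k "," R[k]+0 }' phrases mark > counts.csv
cat counts.csv
```

Without the `if (NF)` guard, the blank line would add a column counting every line of `mark`.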
my exact code is the following:
#!/bin/bash
mkdir -p result
FFILES="People/*"
for U in $FFILES
do
    docU=$(basename "$U")
    docpathU=$(dirname "$U")
    awk '
    NR == FNR {
        P[$0]
        next
    }
    {
        if ( ! ( FILENAME in F ) )
            F[FILENAME]
        for ( k in P )
        {
            if ( $0 ~ k )
            {
                R[FILENAME FS k]++
            }
        }
    }
    END {
        printf ","
        for ( k in P )
            printf "%s,", k
        printf "Count\n"
        for ( j in F )
        {
            printf "%s,", j
            for ( k in P )
            {
                i = j FS k
                if ( i in R )
                {
                    printf "%s,", R[i]
                    c += R[i]
                }
                else
                    printf "%s,", "0"
            }
            printf "%s\n", c
            c = 0
        }
    }
    ' accept-list.txt "$U" > "result/${docU}.csv"
done