Linguistic project: extract co-occurrences from text corpus

bobylapointe · June 24, 2012, 1:30pm

Hello guys,

I've got a big corpus (a huge text file in which words are separated by one or more spaces). I would like to know whether there is a simple way, using awk for instance, to extract every co-occurrence of a given word that appears at least 3 times in the whole corpus. By co-occurrence I mean, here, any word that appears immediately to the left of the given word.

For instance: "dog"

big dog (appearing 4 times)
mean dog (appearing 3 times)
blue dog (appearing only once, thus excluded)

The output would look something like this:

big dog 4
mean dog 3

The cherry on top would be to add a condition that would exclude any combination separated by "." in the middle to avoid this scenario (for "dogs"):
Shell scripting is hard. Dogs are...
"hard. Dogs" would be rejected.

I could try to do it on my own if you would be kind enough to point me in the right direction.

Thank you very much!

bartus11 · June 24, 2012, 1:39pm

Can you post some sample data?

Scrutinizer · June 24, 2012, 1:43pm

If you use "dog|dogs" as the field separator, then on any line with more than one field, every field boundary sits next to an occurrence of the word dog or dogs.
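
One way to read this suggestion, as a rough sketch only: split each line on the target word, so that every field except the last ends just before a match, then take the last word of that field. The function name `left_neighbours` and the sample sentence are made up for illustration. The sketch matches lowercase dog/dogs only and checks just the left boundary, so a word such as "dogma" would still count as a match.

```shell
# left_neighbours: hypothetical helper. Reads text on stdin and prints
# "count word" for each word found directly to the left of dog/dogs.
left_neighbours() {
  awk -F'dog|dogs' '
  NF > 1 {
    # Every field but the last is the text that precedes one match.
    for (i = 1; i < NF; i++) {
      s = $i
      # Skip matches glued to a preceding word, e.g. "hotdog".
      if (s !~ /[ \t]$/) continue
      sub(/[ \t]+$/, "", s)
      n = split(s, w, /[ \t]+/)
      if (n > 0 && w[n] != "") count[tolower(w[n])]++
    }
  }
  END { for (k in count) print count[k], k }' | sort -k1,1nr -k2
}

printf '%s\n' 'a big dog and a mean dog met a big dog' | left_neighbours
# prints:
# 2 big
# 1 mean
```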

figaro · June 24, 2012, 1:49pm

Are you saying that if "big dog" appears 3 times or more in a given piece of text, it should return the number of occurrences, whereby the user provides the search word, in your example "dog"?
You speak of the period (".") as the delimiter, but you ultimately want to extend this to other punctuation as well, such as ! ? ; , etc?

bobylapointe · June 24, 2012, 6:25pm

I'm sorry, I wasn't really clear in my first post. More concretely, I'm trying to find out which words most commonly appear immediately to the left of a word of my choice within a corpus, that is, in the pattern: word wordofmychoice.

The textfile looks like this:

Are you one of those people who prefer larger dogs? Do you know someone who has told you that they prefer larger dogs because small dogs are yappy and snappy? Whether you are a large-dog person or a small-dog person, one thing we all would agree on is that a larger percentage of small dogs tend to have a different type of temperament than medium and large dogs. Small dogs have earned the reputation of being yappy, snappy, jealous, protective, wary of strangers and not the greatest child companions.

Let's say I'm interested in the word "dogs". The output would be:
larger dogs
small dogs
large dogs

But I want to count how many times each association appears:
larger dogs 2
small dogs 3
large dogs 1

And, I only want to keep (print in a new file) associations appearing at least 3 times. Therefore, the final result (in a new textfile) I want to obtain would be:
small dogs 3

That's basically it. If possible, though this is not a priority, I would also like to make sure that no association contains punctuation in the middle, to avoid what I would call false results. For instance, let's say I'm looking for "small" and its associations (with one word on the left) in the previous text:

"dogs. Small"

This is what I want to avoid. But once again, that's not a priority.

Thanks for your answers, guys. I hope this is a bit clearer.

Chubler_XL · June 24, 2012, 8:37pm

How about this:

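awk -F'[- ]+' -v W=dogs '
# IGNORECASE needs gawk; S is the punctuation that may follow W or end a phrase
BEGIN{IGNORECASE=1;S="[.?)]"}
$0 ~ W {
  p=tolower($1);
  for(i=2;i<=NF;i++) {
    # count p when field i is W (plus optional punctuation) and p carries none
    if ($i ~ "^"W S"*$" && p !~ S) c[p]++;
    p=tolower($i)} }
# keep only the left-hand words seen at least 3 times
END { for(w in c)
  if (c[w] >= 3) print w,W,c[w] }' infile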
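
For anyone who also wants the other punctuation marks figaro mentioned (! ; , and so on) or does not have gawk, here is a hedged variant of the same idea. The function name `cooc`, its optional MIN argument, and the punctuation list are assumptions made for this sketch, not part of the script above. It lowercases every token instead of relying on IGNORECASE, compares the word as a plain string rather than as a regex, and uses awk's default whitespace splitting, so hyphenated forms such as "small-dog" are left alone.

```shell
# cooc: hypothetical helper. Usage: cooc WORD [MIN] < corpus
# Prints "left WORD count" for every left neighbour seen at least MIN times.
cooc() {
  awk -v W="$1" -v MIN="${2:-3}" '
  BEGIN { W = tolower(W) }
  {
    prev = ""                          # pairs never span two lines
    for (i = 1; i <= NF; i++) {
      tok = tolower($i)
      cur = tok
      sub(/[.!?;:,()]+$/, "", cur)     # "dogs?" is treated as "dogs"
      # Reject the pair when the left word carries any punctuation.
      if (cur == W && prev != "" && prev !~ /[.!?;:,()]/) c[prev]++
      prev = tok
    }
  }
  END { for (w in c) if (c[w] >= MIN) print w, W, c[w] }' | sort
}

printf '%s\n' 'Small dogs bark. Small  dogs bite, small dogs run.' \
              'Big dogs. Small cats.' | cooc dogs 3
# prints: small dogs 3
```

To write the result to a new file, redirect the output, for example `cooc dogs 3 < infile > outfile`.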

bobylapointe · June 25, 2012, 1:00am

Thank you Chubler, it's working flawlessly!

standingtree914 · June 26, 2012, 7:30am

You might use this code to count how often each word occurs in a text file:

$ cat filename.txt | tr -cs 'A-Za-z' '\012' | sort | uniq -c | sort -nr | more
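
That pipeline counts single words. A sketch of how the same approach might be stretched to adjacent word pairs follows; the function name `pair_counts` and the sample sentence are made up for illustration. Note that `tr` throws the punctuation away here, so pairs that straddle a sentence boundary are counted as well, unlike what was asked for earlier in the thread.

```shell
# pair_counts: hypothetical helper. Prints "count left right" for every
# adjacent word pair on stdin, most frequent first.
pair_counts() {
  tr -cs 'A-Za-z' '\n' |            # one word per line, punctuation dropped
  tr 'A-Z' 'a-z' |                  # fold to lowercase
  awk 'NF { if (prev != "") print prev, $0; prev = $0 }' |
  sort | uniq -c | sort -k1,1nr -k2 |
  awk '{ print $1, $2, $3 }'        # normalise the padding added by uniq -c
}

printf '%s\n' 'Big dog, big dog, mean dog.' | pair_counts
# first line printed: 2 big dog
```

The pairs for one word can then be filtered with something like `pair_counts < filename.txt | awk '$3 == "dogs" && $1 >= 3'`.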