Hello,
I have a large file of syllables/strings in Urdu, with each word on a separate line.
Example in English:
be
at
for
if
being
attract
I need to identify the frequency of each of these strings in a large corpus (which, unfortunately, I cannot attach because of size limitations).
Is there a Perl or awk script which can do the job?
Many thanks for your help
$
$ # "wordlist.txt" is a list of words that we have to check
$ cat wordlist.txt
be
at
for
if
being
attract
$
$ # "poe_the_gold_bug.txt" is a text file against which we have to
$ # check the words. This file contains the story "The Gold Bug" by
$ # Edgar Allan Poe from the Project Gutenberg website.
$ wc poe_the_gold_bug.txt
1460 13462 76460 poe_the_gold_bug.txt
$
$ # A Perl program to check the frequency of words from "wordlist.txt"
$ # in the file "poe_the_gold_bug.txt"
$ cat -n word_occurrences.pl
1 #!/usr/bin/perl -w
2 use strict;
3 my $wordfile = $ARGV[0];
4 my $testfile = $ARGV[1];
5 my %occurrences;
6 open(WF, "<", $wordfile) or die "Can't open $wordfile: $!";
7 while (<WF>) {
8 chomp;
9 $occurrences{$_} = 0;
10 }
11 close(WF) or die "Can't close $wordfile: $!";
12 open(TF, "<", $testfile) or die "Can't open $testfile: $!";
13 while (<TF>) {
14 chomp;
15 while (/(\w+)/g) {
16 $occurrences{$1}++ if defined $occurrences{$1};
17 }
18 }
19 close(TF) or die "Can't close $testfile: $!";
20 while (my ($k, $v) = each %occurrences) {
21 printf("%-10s occurs %5d times\n", $k, $v);
22 }
$
$ # Execution of the Perl program
$ perl word_occurrences.pl wordlist.txt poe_the_gold_bug.txt
attract occurs 0 times
for occurs 109 times
be occurs 72 times
at occurs 96 times
being occurs 13 times
if occurs 24 times
$
$
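Since the question asked for Perl or awk, a rough awk counterpart of word_occurrences.pl might look like the sketch below (not from the thread above; an untested equivalent written against the same two files). The first file seeds the word table; the second is scanned word by word, splitting each line on non-word characters, much like the Perl /(\w+)/g match.

```shell
# Hypothetical awk equivalent of word_occurrences.pl.
awk '
  NR == FNR { count[$0] = 0; next }          # pass 1: read the wordlist
  {
    n = split($0, words, /[^A-Za-z0-9_]+/)   # pass 2: split line into words
    for (i = 1; i <= n; i++)
      if (words[i] in count)                 # count only listed words
        count[words[i]]++
  }
  END {
    for (w in count)
      printf "%-10s occurs %5d times\n", w, count[w]
  }
' wordlist.txt poe_the_gold_bug.txt
```

The NR == FNR test is the usual awk idiom for treating the first input file differently from the rest; output order is arbitrary, as with the Perl each loop.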
Hello,
I tried the awk script but it does not work.
I created a file called txt, which contains the strings whose frequencies have to be found:
eng
book
shop
writ
and a large file of English words which I am appending as a zip for testing.
The idea is that the script should find the strings provided in the input file and print all words containing them, along with their frequencies.
Thus, 1134 instances of eng were detected in the corpus (I checked this in UltraEdit); a sample of the desired output is provided below:
bash-3.2$ cat list
be
at
for
if
being
attract
bash-3.2$ cat input
at
be
bash-3.2$
bash-3.2$
bash-3.2$ awk 'BEGIN { while((getline line < "input") > 0) { pat[line] = 0 } } { for(x in pat) { if($0 ~ x) { pat[x]++; matched[x,pat[x]]=$0; } } } END { for (x in pat) { print x"="pat[x]; for (c=1; c<=pat[x]; c++) { print matched[x,c] } }}' list
be=2
be
being
at=2
at
attract
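One caveat worth noting about the one-liner above: $0 ~ x treats each search string as a regular expression, so a string containing a character like . or * would match more than intended. A variant using index() for literal substring matching might look like the sketch below (same "input" and "list" files as in the session, with the program spread across lines for readability; untested against the original corpus).

```shell
# Variant of the one-liner above: index() does a literal substring
# match, avoiding any regex interpretation of the search strings.
awk '
  BEGIN { while ((getline line < "input") > 0) pat[line] = 0 }
  {
    for (x in pat)
      if (index($0, x) > 0) {     # literal match, no regex meaning
        pat[x]++
        matched[x, pat[x]] = $0   # remember each matching word
      }
  }
  END {
    for (x in pat) {
      print x "=" pat[x]
      for (c = 1; c <= pat[x]; c++)
        print matched[x, c]
    }
  }
' list
```

For plain alphabetic search strings the output is the same as the regex version; the difference only shows up when a string contains regex metacharacters.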
Many thanks, the script worked, and it was pretty fast: I had around 16,000 strings and a corpus of around 30 MB.
The output took around 4 minutes, but the wait was worth it.