Hello,
I have a large file of syllables/strings in Urdu, with each word on a separate line.
Example in English:
be
at
for
if
being
attract
I need to identify the frequency of each of these strings in a large corpus (which, unfortunately, I cannot attach because of size limitations).
Is there a Perl or awk script which can do the job?
Many thanks for your help
$
$ # "wordlist.txt" is a list of words that we have to check
$ cat wordlist.txt
be
at
for
if
being
attract
$
$ # "poe_the_gold_bug.txt" is a text file against which we have to
$ # check the words. This file contains the story "The Gold Bug" by
$ # Edgar Allan Poe from the Project Gutenberg website.
$ wc poe_the_gold_bug.txt
1460 13462 76460 poe_the_gold_bug.txt
$
$ # A Perl program to check the frequency of words from "wordlist.txt"
$ # in the file "poe_the_gold_bug.txt"
$ cat -n word_occurrences.pl
1 #!/usr/bin/perl -w
2 use strict;
3 my $wordfile = $ARGV[0];
4 my $testfile = $ARGV[1];
5 my %occurrences;
6 open(WF, "<", $wordfile) or die "Can't open $wordfile: $!";
7 while (<WF>) {
8 chomp;
9 $occurrences{$_} = 0;
10 }
11 close(WF) or die "Can't close $wordfile: $!";
12 open(TF, "<", $testfile) or die "Can't open $testfile: $!";
13 while (<TF>) {
14 chomp;
15 while (/(\w+)/g) {
16 $occurrences{$1}++ if defined $occurrences{$1};
17 }
18 }
19 close(TF) or die "Can't close $testfile: $!";
20 while (my ($k, $v) = each %occurrences) {
21 printf("%-10s occurs %5d times\n", $k, $v);
22 }
$
$ # Execution of the Perl program
$ perl word_occurrences.pl wordlist.txt poe_the_gold_bug.txt
attract occurs 0 times
for occurs 109 times
be occurs 72 times
at occurs 96 times
being occurs 13 times
if occurs 24 times
$
$
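Since the question asked for Perl or awk, a rough awk counterpart of word_occurrences.pl might look like the sketch below (not from the thread above; an untested equivalent written against the same two files). The first file seeds the word table; the second is scanned word by word, splitting each line on non-word characters, much like the Perl /(\w+)/g match.

```shell
# Hypothetical awk equivalent of word_occurrences.pl.
awk '
  NR == FNR { count[$0] = 0; next }          # pass 1: read the wordlist
  {
    n = split($0, words, /[^A-Za-z0-9_]+/)   # pass 2: split line into words
    for (i = 1; i <= n; i++)
      if (words[i] in count)                 # count only listed words
        count[words[i]]++
  }
  END {
    for (w in count)
      printf "%-10s occurs %5d times\n", w, count[w]
  }
' wordlist.txt poe_the_gold_bug.txt
```

The NR == FNR test is the usual awk idiom for treating the first input file differently from the rest; output order is arbitrary, as with the Perl each loop.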
Hello,
I tried the awk script but it does not work.
I created a file called txt, which contains the strings whose frequencies have to be found:
eng
book
shop
writ
and a large file of English words which I am appending as a zip for testing.
The idea is that the script should find the strings provided in the input file and print all words containing them, along with their frequencies.
Thus, 1134 instances of eng were detected in the corpus (I checked this in UltraEdit); a sample of the desired output is provided below:
bash-3.2$ cat list
be
at
for
if
being
attract
bash-3.2$ cat input
at
be
bash-3.2$
bash-3.2$
bash-3.2$ awk 'BEGIN { while((getline line < "input") > 0) { pat[line] = 0 } } { for(x in pat) { if($0 ~ x) { pat[x]++; matched[x,pat[x]]=$0; } } } END { for (x in pat) { print x"="pat[x]; for (c=1; c<=pat[x]; c++) { print matched[x,c] } }}' list
be=2
be
being
at=2
at
attract
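One caveat worth noting about the one-liner above: $0 ~ x treats each search string as a regular expression, so a string containing a character like . or * would match more than intended. A variant using index() for literal substring matching might look like the sketch below (same "input" and "list" files as in the session, with the program spread across lines for readability; untested against the original corpus).

```shell
# Variant of the one-liner above: index() does a literal substring
# match, avoiding any regex interpretation of the search strings.
awk '
  BEGIN { while ((getline line < "input") > 0) pat[line] = 0 }
  {
    for (x in pat)
      if (index($0, x) > 0) {     # literal match, no regex meaning
        pat[x]++
        matched[x, pat[x]] = $0   # remember each matching word
      }
  }
  END {
    for (x in pat) {
      print x "=" pat[x]
      for (c = 1; c <= pat[x]; c++)
        print matched[x, c]
    }
  }
' list
```

For plain alphabetic search strings the output is the same as the regex version; the difference only shows up when a string contains regex metacharacters.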
Many thanks, the script worked, and it was pretty fast: I had around 16,000 strings and a corpus of around 30 MB.
The output took around 4 minutes, but the wait was worth it.