I have a file from which I need to extract words of different lengths into different files: 2-letter words into file2, 3-letter words into file3, and so on.
I did it with grep in a shell script:
for (( i=1; i<7; i++))
do
egrep -o '\<\(?[a-zA-Z]{'"$i"'}\)?\>' "$1" | sort -u -f | tr '[:upper:]' '[:lower:]' > "file$i"
done
But it is too slow. Any better ideas? Thanks in advance.
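Much of the slowness comes from scanning the whole file once per word length. A single extraction pass followed by a length-based dispatch avoids that. A minimal sketch, assuming GNU grep and a small sample input written to `infile` (the optional-parenthesis handling from the original regex is dropped for brevity):

```shell
# Sample input (illustrative).
printf 'This is an example to\ntest if\nmy perl program works\nas expected.\n' > infile

# One pass: pull out every word, lower-case, dedupe, then let awk route
# each word to fileN by its length (lengths 1-6, as in the loop above).
grep -oE '[[:alpha:]]+' infile |
tr '[:upper:]' '[:lower:]' |
sort -u |
awk '{ if (length($0) >= 1 && length($0) <= 6) print > ("file" length($0)) }'
```

After this, file2 holds the unique two-letter words (an, as, if, is, my, to), file4 the four-letter ones, and so on; the seven- and eight-letter words are skipped, matching the 1..6 range of the original loop.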
$ cat script.pl
use warnings;
use strict;
@ARGV == 1 or die "Usage: perl $0 <input-file>\n";
my %word_length;
while ( <> ) {
    chomp;
    my @words = split /[^[:alpha:]]+/;
    my %repeated_word;
    for my $word ( @words ) {
        next unless length $word;    # split leaves an empty leading field when a line starts with a non-letter
        push @{ $word_length{ length $word } }, $word unless $repeated_word{ $word }++;
    }
}

for my $length ( keys %word_length ) {
    my $outfile = "file" . $length;
    open my $fh, ">", $outfile or do {
        warn "Cannot open $outfile: $!\n";
        next;
    };
    for my $word ( @{ $word_length{ $length } } ) {
        printf $fh "%s\n", $word;
    }
    close $fh or warn "Cannot close $outfile: $!\n";
}
$ cat infile
This is an example to
test if
my perl program works
as expected.
$ perl script.pl
Usage: perl script.pl <input-file>
$ perl script.pl infile
$ ls -1 file*
file2
file4
file5
file7
file8
#!/usr/bin/awk -f
BEGIN { FS = "[^A-Za-z]" }
{
    for (i = 1; i <= NF; i++)
        if ((len = length($i)) < 7 && len >= 1)
            a[tolower($i)]++
}
END {
    for (e in a)
        print e >> "file" length(e) ".txt"
}
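A usage sketch for the awk answer (the file name split_words.awk and the sample input are illustrative, not from the original post; note the END block appends with >>, so remove any stale fileN.txt before re-running):

```shell
# Save the awk program above as split_words.awk (name is arbitrary).
cat > split_words.awk <<'EOF'
BEGIN { FS = "[^A-Za-z]" }
{
    for (i = 1; i <= NF; i++)
        if ((len = length($i)) < 7 && len >= 1)
            a[tolower($i)]++
}
END {
    for (e in a)
        print e >> "file" length(e) ".txt"
}
EOF

printf 'This is an example to\ntest if\nmy perl program works\nas expected.\n' > infile
awk -f split_words.awk infile

# Each fileN.txt holds the unique, lower-cased N-letter words; the words
# come out of the array in arbitrary order, so sort when inspecting:
sort file2.txt
```

With this input, file2.txt contains an, as, if, is, my, to; the len < 7 guard mirrors the 1..6 range of the loop in the question, so "example" and "expected" are dropped entirely.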