Removing dupes between two delimiters in a large dictionary file

Hello,
I have a very large dictionary file in text format that contains a large number of sub-sections. Each sub-section starts with the following header:

#DATA
#VALID 1   [this could also be 0 instead of 1]

and ends with a footer, as shown below:

#END

The data between the header and the footer consists of words, one word per line.
However, given the volume of data, words end up repeated within a section, so the file contains dupes.
What I need is a Perl or awk script that identifies the header and the footer, finds the data between them, and sorts that data, removing all duplicates.
A sample input and expected output are given below. The examples use English, since the real data is in Perso-Arabic script. Case is not an issue, since the language does not have case. All data is in Unicode (UTF-16), but I can convert it to UTF-8.
Would it be possible to comment the script, so that I can learn how to identify headers and footers in a data file and then sort the data between them, removing dupes?
Many thanks in advance for the help, and also for the learning experience.

_____Sample Input

#DATA
#VALID 1
a
a
a
a
all
an
and
and
and
are
are
are
as
awk
below
case
case
could
data
data
data
data
does
dupes
duplicates
ends
english
examples
file
find
footer
from
given
given
happens
have
header
however
i
identify
in
input
is
is
is
issue
it
language
large
need
not
not
of
or
output
perl
perso-arabic
real
removing
repeated
result
sample
script
script
section
since
since
so
sort
that
the
the
the
the
the
the
the
the
the
them
time
up
what
which
which
with
within
within
words
#END

___________Expected output

#DATA
#VALID 1
a
all
an
and
are
as
awk
below
case
could
data
does
dupes
duplicates
ends
english
examples
file
find
footer
from
given
happens
have
header
however
i
identify
in
input
is
issue
it
language
large
need
not
of
or
output
perl
perso-arabic
real
removing
repeated
result
sample
script
section
since
so
sort
that
the
them
time
up
what
which
with
within
words
#END

_____Sample ends

Perl script:

#!/usr/bin/perl
use strict;
use warnings;

my $in_file  = '/temp/tmp/t';            # file containing your example data
my $out_file = '/temp/tmp/new_file.txt'; # new file with dupes removed
my $line;
my @inla;
my @outla;

open( my $in_file_fh,  '<', $in_file  ) or die "Can't open $in_file: $!\n";
open( my $out_file_fh, '>', $out_file ) or die "Can't open $out_file: $!\n";

DATA: while ( $line = <$in_file_fh> ) {  # Read input file
        # If a line starts with '#DATA', write it to the out file,
        # then also copy the '#VALID x' line that immediately follows it
        if ( $line =~ /^#DATA/ ) {
          print $out_file_fh $line;
          $line = <$in_file_fh>;
          print $out_file_fh $line;
          @inla = ();  # start a fresh word list for this section
          # Read lines until the '#END' line is reached
          while ( $line = <$in_file_fh> ) {
            if ( $line =~ /^#END/ ) {
              # Build an anonymous hash from the input lines in array @inla;
              # duplicate keys collapse, leaving the unique words in @outla
              @outla = keys %{ { map { $_ => 1 } @inla } };
              # Write the sorted array to the out file
              print $out_file_fh ( sort @outla );
              # Write the '#END' line to the out file
              print $out_file_fh $line;
              # Go back to the outer loop and look for the next section
              next DATA;
            }
            # Load the lines between the '#VALID x' line and the '#END' line into the array
            push @inla, $line;
          }
        }
      }
close $in_file_fh;
close $out_file_fh;
$ cat new_file.txt
#DATA
#VALID 1
a
all
an
and
are
as
awk
below
case
could
data
does
dupes
duplicates
ends
english
examples
file
find
footer
from
given
happens
have
header
however
i
identify
in
input
is
issue
it
language
large
need
not
of
or
output
perl
perso-arabic
real
removing
repeated
result
sample
script
section
since
so
sort
that
the
them
time
up
what
which
with
within
words
#END
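
As a side note: the anonymous-hash step can also be written with List::Util's uniq, which some find easier to read. The uniq function appeared in List::Util 1.45, which ships with Perl 5.26 and later; on older Perls you can update List::Util from CPAN. A minimal equivalent of the dedup line above:

use List::Util qw(uniq);

# Same effect as keys %{ { map { $_ => 1 } @inla } }, except that uniq
# preserves first-occurrence order; sorting afterwards makes the two
# approaches interchangeable here.
@outla = sort( uniq(@inla) );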

Many thanks. Am out at present. Will run the Perl script and get back to you.

You may use something like this:

awk '/^#DATA/,/^#END/' file

to get the data between #DATA and #END, and then sort it.
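
For comparison, Perl's flip-flop (range) operator does the same line selection. A rough one-liner sketch (quoting shown for a Unix shell), not a complete solution, since it neither sorts nor dedups yet:

perl -ne 'print if /^#DATA/ .. /^#END/' file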

Hello,
Sorry, my broadband was down and I could not try the Perl script earlier. It works beautifully on ASCII (8-bit) data. As soon as UTF-8 or UTF-16 data is fed to it, no output is produced.
Does Perl have problems with Unicode?
Since my data is in Perso-Arabic, the script does not work for me.
Is there any roundabout way to solve the problem? I am using the latest version of ActiveState Perl and, in despair, even downloaded Strawberry Perl, but the data still produces no output.
I am attaching a zip file containing data in UTF-8 format, with Hindi as an example. There are two files: testdic and testdic.out.
Many thanks for the beautifully commented script. I modified it slightly, as below, to take the input and output file names from the command line:

#!/usr/bin/perl
use strict;
use warnings;

my ( $in_file, $out_file ) = @ARGV;  # input and output file names from the command line
my $line;
my @inla;
my @outla;

The rest of the code remains the same.
I do not think this change would affect reading a UTF-8 file.
Many thanks once again

Hi.

You might want to start with perldoc perlunitut, then man perlunicode.

You seem to be using Windows. I have used the utf8 facilities on GNU/Linux systems, but I have no idea whether they are available with ActiveState Perl.

Doing an advanced search here for perl utf8 yields about 50 hits, some of which may be useful.
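
If your build of Perl supports PerlIO layers (any reasonably modern Perl should), the usual approach is to open files with an explicit UTF-8 encoding layer, so Perl decodes on read and encodes on write instead of passing raw bytes through. A minimal sketch, with placeholder file names:

#!/usr/bin/perl
use strict;
use warnings;

my $in_file  = 'testdic';      # placeholder name
my $out_file = 'testdic.out';  # placeholder name

open( my $in_fh,  '<:encoding(UTF-8)', $in_file )
  or die "Can't open $in_file: $!\n";
open( my $out_fh, '>:encoding(UTF-8)', $out_file )
  or die "Can't open $out_file: $!\n";

# Copy the file through the encoding layers unchanged
print $out_fh $_ while <$in_fh>;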

Best wishes ... cheers, drl

( Edit 1: add note about advanced search )

Hello,
I found the problem: the BOM (byte order mark).
Under Windows, a UTF-8 file normally starts with a BOM (U+FEFF, encoded as the bytes EF BB BF). I concede that it is legal for a file to have one, but it is utterly pointless, since UTF-8 has no byte-order ambiguity to resolve. And, unlike the rest of UTF-8, an initial BOM will confuse tools on a Unix system.

Using a hex editor, I removed the BOM and the script worked like a charm.

On Linux you should have no problem, since this aberration does not exist on a Unix system.
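
For anyone who would rather not reach for a hex editor, a small Perl sketch can strip a leading BOM instead. File names are taken from the command line (e.g. perl strip_bom.pl testdic testdic.nobom, with made-up names), and it assumes a UTF-8 file with at most one BOM at the very start:

#!/usr/bin/perl
use strict;
use warnings;

my ( $in_file, $out_file ) = @ARGV;

open( my $in_fh,  '<:encoding(UTF-8)', $in_file )
  or die "Can't open $in_file: $!\n";
open( my $out_fh, '>:encoding(UTF-8)', $out_file )
  or die "Can't open $out_file: $!\n";

my $first = <$in_fh>;
if ( defined $first ) {
  $first =~ s/\A\x{FEFF}//;         # drop a leading BOM, if present
  print $out_fh $first;
  print $out_fh $_ while <$in_fh>;  # copy the rest unchanged
}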

Many thanks for trying to solve the mystery.

As an aid to all of us who suffer the tyranny of the WinOS system, here is a useful link:

http://www.perlmonks.org/?node_id=599720

This offers two solutions to the problem. Googling turns up more if needed.