Extract lines from files

Hi all,

I have three files.

The first file (FILE_INFO in my code) has four fields on each line:

0.00765600    0.08450704    M3    E3 
0.00441931    0.04878049    M4    E5 
0.01904574    0.21022727    M5    E10
0.00510400    0.05633803    M6    E12
0.00905960    0.10000000    M7    E16
0.00799376    0.08823529    M8    E17
0.00424669    0.04687500    M9    E18

I want to write out the corresponding sentences from the second file (M_IN in my code) and the third file (E_IN in my code), selected by the third and fourth columns of the first file. The M# and E# values are the sentence numbers in the second and third files.

The second and third files have this format, where M# (or E#) is the sentence number:

M4  asd
M4  dfgg
M4  rtyt
M4  rtytry
M4  etrert
M4  EOS
M5  tyuty
M5  ertert
M5  yuyu
M5  EOS
M6  iui
M6  jkjk
M6  EOS

EOS marks the end of a sentence and should be written as a full stop (.).

Please correct the script I have written. M_OUT and E_OUT are the output files to which the corresponding sentences should be written.

while(my $m_text = <$FILE_INFO> ){
        
     @me_text = split /\s+/, $m_text;    
     
     while(my $m_input= <$M_IN>)
     {
    @m_no = split /\s+/, $m_input;
    if($m_no[0] eq $me_text[2])
    {
        chomp; 
        s/^M\d[ ]+//g; 
        s/[ ]*$//; 
        $x .= " ".$_;
            $x =~ s/^ //; 
        $x =~ s/ EOS[ ]*/.\n/g; 
    }
      }    
    print M_OUT $_;
     
        
    while(my $e_input = <$E_IN>)
    {
    @e_no = split /\s+/, $e_input;
    if($e_no[0] eq $me_text[3])
    {
                chomp;
                s/^E\d[ ]+//g;
                s/[ ]*$//;
                $x .= " ".$_;
                $x =~ s/^ //;
                $x =~ s/ EOS[ ]*/.\n/g;
        }
    }
        print E_OUT $_;

Expected output in M_OUT:

asd dfgg rtyt rtytry etrert .
tyuty ertert yuyu .
iui jkjk.

E_OUT should have the same format, with the corresponding sentences picked from the E_IN file according to the fourth column of the FILE_INFO file.

Thanks in advance.

Hi.

Many people don't wish to slog through someone else's code to find logic errors. You can use intermediate prints or the debug facility of perl to see where your code is incorrect. You could also look at provably correct code to see how it works.

Here's a solution in shell:

#!/usr/bin/env bash

# @(#) s1	Demonstrate creation of string from index of strings.

echo
set +o nounset
LC_ALL=C ; LANG=C ; export LC_ALL LANG
echo "Environment: LC_ALL = $LC_ALL, LANG = $LANG"
echo "(Versions displayed with local utility \"version\")"
version >/dev/null 2>&1 && version "=o" $(_eat $0 $1) tr cut grep sed
set -o nounset
echo

FILE1=data1
FILE2=data2

echo
echo " Data file $FILE1:"
cat $FILE1

echo
echo " Data file $FILE2:"
cat $FILE2

echo
echo " Results:"
tr -s ' ' <$FILE1 |
cut -d" " -f3 >t1

for key in $( cat t1 )
do
  # echo
  # echo " File $key:"
  if [ -z "$(grep "$key" $FILE2)" ] 
  then
    echo " Ignoring $key: no match." >&2
    continue
  fi
  grep "$key" $FILE2 |
  tr -s ' ' |
  cut -d" " -f2 |
  paste -d" "  -s |
  sed 's/ EOS/./'
done

exit 0

producing:

% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0 
GNU bash 3.2.39
tr (GNU coreutils) 6.10
cut (GNU coreutils) 6.10
GNU grep 2.5.3
GNU sed version 4.1.5


 Data file data1:
0.00765600    0.08450704    M3    E3 
0.00441931    0.04878049    M4    E5 
0.01904574    0.21022727    M5    E10
0.00510400    0.05633803    M6    E12
0.00905960    0.10000000    M7    E16
0.00799376    0.08823529    M8    E17
0.00424669    0.04687500    M9    E18

 Data file data2:
M4  asd
M4  dfgg
M4  rtyt
M4  rtytry
M4  etrert
M4  EOS
M5  tyuty
M5  ertert
M5  yuyu
M5  EOS
M6  iui
M6  jkjk
M6  EOS

 Results:
 Ignoring M3: no match.
asd dfgg rtyt rtytry etrert.
tyuty ertert yuyu.
iui jkjk.
 Ignoring M7: no match.
 Ignoring M8: no match.
 Ignoring M9: no match.
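
The loop above runs grep twice for every key, so the data file is re-read once per key. As a rough sketch of a single-pass alternative, the same sentences can be assembled in one awk call. The function name build_sentences and its argument order are assumptions made for illustration; they are not part of s1.

```shell
#!/usr/bin/env bash
# Sketch only: build_sentences is a hypothetical helper, not part of s1.
# Arguments: index file, tagged word file, index column (3 for M#, 4 for E#).
build_sentences() {
  awk -v col="$3" '
    # First file: remember every tag found in the chosen column.
    NR == FNR { if (NF >= col) want[$col] = 1; next }
    # Second file: skip lines whose tag was not selected (exact match).
    !($1 in want) { next }
    # EOS closes the sentence: print it with a full stop and reset.
    $2 == "EOS" { print line[$1] "."; delete line[$1]; next }
    # Any other word is appended to the sentence for its tag.
    { line[$1] = (line[$1] == "" ? $2 : line[$1] " " $2) }
  ' "$1" "$2"
}

# Example call with the file names used above:
# build_sentences data1 data2 3
```

Sentences come out in the order they appear in the data file, and tags are compared as whole fields rather than as patterns.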

So now that I think I understand the problem, I do a perl version and try to make sure that I avoid reading the data file more than once (as is done with grep in the shell script):

#!/usr/bin/perl

# @(#) p1	Demonstrate creation of string from index of strings.

use warnings;
use strict;

my ($debug);
$debug = 1;
$debug = 0;

my ( $f1, $f2, $i, @index, $junk, $key, %sentence, $t1 );

open( $f1, "<", "data1" ) || die(" Cannot open data1.\n");

# Get the indices into array "index".

while (<$f1>) {
  $t1 = (split)[2];
  push @index, $t1;
}
print " index is :@index:\n" if $debug;
close $f1;

open( $f2, "<", "data2" ) || die(" Cannot open data2.\n");

# Read data file of words, check for match to anything in array
# "index", add to appropriate sentence hash.

while (<$f2>) {
  chomp;
  print " Working on line :$_:\n" if $debug;
  for ( $i = 0; $i <= $#index; $i++ ) {
    if (/^$index[$i]/) {
      print " Found match for :$_:\n" if $debug;
      $t1 = (split)[1];
      print " Adding :$t1: to sentence.\n" if $debug;
      $sentence{ $index[$i] } .= "$t1 ";
    }
  }
}

# Print the completed hash of sentences.

for $key ( sort keys %sentence ) {
  $sentence{$key} =~ s/ EOS/./;
  print "$sentence{$key}\n";
}

exit(0);

Using the same sample data files, produces:

% ./p1
asd dfgg rtyt rtytry etrert. 
tyuty ertert yuyu. 
iui jkjk.

Note the use of print statements if $debug is true. Simply swapping the position of the assignments turns on and off those debugging outputs. That's useful for a quick program, and the code can be left in, ready to turn on if and when the code is modified ... cheers, drl

PS I eliminated the extra trailing space before the full stop, it looked better that way.
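
The $debug switch described above has a rough shell analogue. This is a minimal sketch, assuming an environment variable named DEBUG and a helper named debug; both names are made up for illustration and do not appear in the scripts in this thread.

```shell
#!/usr/bin/env bash
# Sketch only: DEBUG and debug() are hypothetical names.
# Debugging is off unless the caller sets DEBUG to a non-zero number.
DEBUG=${DEBUG:-0}

debug() {
  # Write the message to stderr so normal output stays clean.
  [ "$DEBUG" -ne 0 ] && printf ' debug: %s\n' "$*" >&2
  return 0
}

debug "Working on key M4"
```

Running the script as DEBUG=1 ./script turns the messages on without editing the file.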

Definitely, Thanks a lot.

---------- Post updated 08-09-09 at 04:32 AM ---------- Previous update was 08-08-09 at 03:17 PM ----------

Hi drl

The Perl script works well for a small number of sentences, but once the number reaches 15 or more, groups of sentences merge into a single line. Also, some of the sentences are printed repeatedly. In fact, I want to run this program on thousands of sentences. Is this problem due to the array or something else?

Hi.

Yes, the lines will be quite long.

Do you want the "EOS" to be full-stop AND end-of-line? ... cheers, drl

I want EOS to mark the end of the sentence and the end of the line, not just a full stop.

Hi.

OK, I changed the way that the sentence data structure is handled. The memory use might be high for a very large file, but for the sample data you have provided, this produces the same output:

#!/usr/bin/perl

# @(#) p1	Demonstrate creation of string from index of strings.

use warnings;
use strict;

my ($debug);
$debug = 1;
$debug = 0;

my ( $f1, $f2, $i, @index, $junk, $key, %sentence, $t1 );

open( $f1, "<", "data1" ) || die(" Cannot open data1.\n");

# Get the indices into array "index".

while (<$f1>) {
  $t1 = (split)[2];
  push @index, $t1;
}
print " index is :@index:\n" if $debug;
close $f1;

open( $f2, "<", "data2" ) || die(" Cannot open data2.\n");

# Read data file of words, check for match to anything in array
# "index", add to appropriate sentence hash.

while (<$f2>) {
  chomp;
  print " Working on line :$_:\n" if $debug;
  for ( $i = 0; $i <= $#index; $i++ ) {
    if (/^$index[$i]/) {
      print " Found match for :$_:\n" if $debug;
      $t1 = (split)[1];
      print " Adding :$t1: to sentence.\n" if $debug;
      $sentence{ $index[$i] } .= "$t1 ";
    }
  }
}

# Print the completed hash of sentences.

for $key ( sort keys %sentence ) {
  $sentence{$key} =~ s/ *EOS */.\n/g;
  print "$sentence{$key}";
}

exit(0);

producing:

% ./p1
asd dfgg rtyt rtytry etrert.
tyuty ertert yuyu.
iui jkjk.

cheers, drl

Hi

After the correction of the code, some of the sentences get printed repeatedly at the output side. What could be the problem? I want the individual sentences to be printed only once. So, what can be done in this regard?
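
A likely cause of the repeats, though it has not been checked against the real data: both the test /^$index[$i]/ in the Perl script and grep "$key" in the shell script match a tag as a prefix, so key M1 also picks up lines tagged M12. With the sample keys M3 to M9 no tag is a prefix of another, which would explain why the small test looks fine. The sketch below uses made-up sample data to show the difference between a prefix test and a whole-field test.

```shell
#!/usr/bin/env bash
# Hypothetical sample: tags M1 and M12 share the prefix "M1".
data=$(printf 'M1  one\nM1  EOS\nM12  twelve\nM12  EOS\n')

# Prefix test, like grep "$key" or /^$index[$i]/: M12 lines match too.
prefix_count=$(printf '%s\n' "$data" | grep -c '^M1')

# Whole-field test: only lines whose first field is exactly M1.
exact_count=$(printf '%s\n' "$data" |
  awk -v key=M1 '$1 == key { n++ } END { print n + 0 }')

echo "prefix: $prefix_count, exact: $exact_count"
# prints "prefix: 4, exact: 2"
```

In the Perl script the equivalent change would be to compare the first field with eq, for example (split)[0] eq $index[$i], instead of using a pattern match. Note also that sort keys compares strings, so M12 sorts before M2.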

Hi

There is a slight change in the problem definition. The first file, me.txt, now has the following format:

0.01474818    0.16279070    M1     E1
0.01081743    0.11940299    M2     E2
0.00765600    0.08450704    M3     E3
0.00441931    0.04878049    M4     E5
0.01904574    0.21022727    M5     E10
0.00510400    0.05633803    M6     E12
0.00905960    0.10000000    M7     E16
0.00799376    0.08823529    M8     E17
0.00424669    0.04687500    M9     E18
0.01317759    0.14545455    M12    E19
0.00403645    0.04455446    M13    E20
0.01041333    0.11494253    M16    E21
0.00683743    0.07547170    M17    E22
0.00734562    0.08108108    M18    E23

I have attached a sample input file (E_Sentences_input.txt) and the expected output file (E_Sentence_expected_out.txt). Using the fourth column of the first file (me.txt), extract the sentences in the expected output format. I want to write the code in Perl. Thanks in advance.

cheers
my_perl

Hi.

Except for some similar terms, this looks like a different problem.

You have 2 lists, essentially of tags, in increasing order, "E" followed by a decimal number, e.g. "E1", "E9" etc. The first list is a column in a file of other data, the second is a prefix to lines (some quite long).

You seem to want the tagged lines in the data file to be copied to STDOUT. Because both lists are in increasing order, this is basically a copy of tagged lines as selected by the first list, with the tag omitted, and with an empty line between them.

Does that describe the situation? ... cheers, drl

Exactly: copy the selected tagged lines (i.e., not all the lines in the tagged file) to STDOUT with the tags omitted, looking up the fourth column of the first list in sequential order (E1, E2, E3, E5, E10, etc.).

Hi.

This shell script drives the perl script, and compares the generated output to your posted expected output:

#!/usr/bin/env bash

# @(#) s2	Demonstrate exercise of selector perl code.

echo
set +o nounset
LC_ALL=C ; LANG=C ; export LC_ALL LANG
echo "Environment: LC_ALL = $LC_ALL, LANG = $LANG"
echo "(Versions displayed with local utility \"version\")"
version >/dev/null 2>&1 && version "=o" $(_eat $0 $1) perl cmp sdiff diff
set -o nounset

echo " Lines in index file: $(wc -l <data1)"
echo " Lines in expected output: $(wc -l <expected-output.txt)"

echo
echo " Results:"
./p2 > t1
echo " Lines in output file: $(wc -l <t1)"
if cmp t1 expected-output.txt
then
  echo " Files are the same."
else
  echo " Files differ."
  sdiff -w78 -s t1 expected-output.txt
fi

exit 0

the perl script:

#!/usr/bin/perl

# @(#) p2	Demonstrate selection of tagged lines.

use warnings;
use strict;

my ($debug);
$debug = 1;
$debug = 0;

my ( $f1, $f2, $i, %selectors, $junk, $key, @parts, $t1 );

open( $f1, "<", "data1" ) || die(" Cannot open data1.\n");

# Get the tags into hash selectors.

while (<$f1>) {
  $t1 = (split)[3];
  $selectors{$t1}++;
}
print " selectors is :", join( " ", sort keys %selectors ), ":\n" if $debug;
close $f1;

open( $f2, "<", "data2" ) || die(" Cannot open data2.\n");

# Read data file of words, check for existence in selectors hash.

while (<$f2>) {
  chomp;
  @parts = split( / /, $_, 2 );
  print " Working on tag $parts[0]\n" if $debug;
  if ( not exists( $selectors{ $parts[0] } ) ) {
    print " Skipping tagged line $parts[0]\n" if $debug;
    next;
  }
  print "$parts[1]\n\n";
}

exit(0);

producing:

% ./s2

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0 
GNU bash 3.2.39
perl 5.10.0
cmp (GNU diffutils) 2.8.1
sdiff (GNU diffutils) 2.8.1
diff (GNU diffutils) 2.8.1
 Lines in index file: 14
 Lines in expected output: 28

 Results:
 Lines in output file: 28
t1 expected-output.txt differ: char 744, line 9
 Files differ.
This was disclosed by ATSUM informati |	This was disclosed by ATSUM informati
During the meeting today, the delegat |	During the meeting today, the delegat

The 2 lines which differ from the expected output do so because there are extra embedded spaces in those specific lines in the expected output file compared to the source file.

Best wishes ... cheers, drl
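
Since the two differing lines vary only in embedded spaces, a comparison that ignores the amount of blank space would report the files as equal. A minimal sketch, assuming a helper named same_ignoring_spaces (a hypothetical name) and the POSIX -b option of diff:

```shell
#!/usr/bin/env bash
# Sketch only: same_ignoring_spaces is a hypothetical helper name.
# Returns 0 when the two files differ at most in the amount of blank space.
same_ignoring_spaces() {
  diff -b "$1" "$2" >/dev/null
}

# Example call with the file names used in s2:
# same_ignoring_spaces t1 expected-output.txt && echo " Files match apart from spacing."
```

The -b option treats runs of blanks as equal, but it does not ignore a space that is present in one file and missing entirely in the other.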

Thanks a lot. It was an excellent piece of work. It worked.

Cheers
my_perl