Reading the file line by line in Perl

filter · January 10, 2012, 10:31am

Hello Everyone,

I have written a Perl script that loads the entire data file into an array and then checks the value of specific columns. Records of interest are written to a good file; the rest are written to a bad file.

The problem is that if the data file is huge, storing it in an array causes a memory utilization issue. So I thought I should read the data file line by line and check the column values as I go.

open(FILE,$file)|| die ("could not open file $file: $!");

my (@whole, @header, @footer, @goodlines, @badlines, @fields);
my $line;
$line = $_;

@whole = <FILE>;

foreach (@whole) {
$line = $_;
@fields = split (/\|/, $line);

if($fields[57] eq " "  ||  $fields[57] eq " ")
{
 push @badlines, $line;
}

elsif( ($fields[32] eq "N.A."  ||  $fields[32] eq " ")  && ($fields[33] eq "N.A."  ||  $fields[33] eq " ") && ($fields[34] eq "N.A."  ||  $fields[34] eq " ") && ($fields[38] eq "N.A."  ||  $fields[38] eq " ") && ($fields[62] eq "N.A." ||  $fields[62] eq " "))
{
push @badlines, $line;
}

else
{
push @goodlines, $line;
}

}

open my $fh, ">", $goodfile;
print $fh @header, @goodlines, @footer;
close $fh;

open my $fh1, ">", $badfile;
print $fh1 @badlines;
close $fh1;


printf(" The New Feed file is located at --------------> '%s'\n" ,   $goodfile);
printf(" The Ignored records are located --------------> '%s'\n\n" , $badfile);

Instead of storing the entire data file in an array (in memory), could someone please advise how I can read the data file line by line so that it doesn't use much memory?

Really appreciate your thoughts and time. Thanks a lot for looking into this.

Corona688 · January 10, 2012, 10:37am

while($LINE=<FILE>)
{
...
}
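
For reference, a minimal runnable sketch of the loop suggested above. It assumes a lexical filehandle and three-argument open rather than the bareword FILE handle from the question; the helper name count_lines and the in-memory sample data are hypothetical and exist only so the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: count the lines of a file while holding only
# one line in memory at a time.
sub count_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "could not open file $path: $!";
    my $count = 0;
    while ( my $line = <$fh> ) {    # reads one record per iteration
        $count++;
    }
    close $fh;
    return $count;
}

# Demo on an in-memory "file" (a reference to a string) so the sketch
# runs without any input data; a real call would pass a file name.
my $sample = "a|b\nc|d\ne|f\n";
print count_lines( \$sample ), "\n";    # prints 3
```

Memory use stays flat only as long as the loop body does not accumulate the lines somewhere else.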

filter · January 10, 2012, 11:09am

Hi Corona688,

Thank you very much for your quick reply...

I have tried the following as you have suggested :

open(FILE,$file)|| die ("could not open file $file: $!");

my (@whole, @header, @footer, @goodlines, @badlines, @fields);
my $line;
$line = $_;

while($line=<FILE>)
{
$line = $_;
@fields = split (/\|/, $line);

if( ( $fields[20] eq "")  && ( $fields[21] == 0 || $fields[21] eq "") && ( $fields[22] == 0  ||  $fields[22] eq "") )

{
push @badlines, $line;
}
else
{
push @goodlines, $line;

}

}
open my $fh, ">", $goodfile;
print $fh @header, @goodlines, @footer;
close $fh;

open my $fh1, ">", $badfile;
print $fh1 @badlines;
close $fh1;

When I try to run it:

[cfgdth987] $ perl create_feedfile_bonds_NAMR_OPTNPX.pl equity_option_namr.px.20120109 diff ignore
Out of memory!

I am still encountering the memory issue. Is there any way that I can read line by line and then increment the counter?

Really appreciate your time and advice.

birei · January 10, 2012, 11:17am

Hi filter,

I think Corona's suggestion is OK, but inside the loop you are still saving each input line to arrays, which fill up the memory. This is the piece of code responsible:

if( ( $fields[20] eq "")  && ( $fields[21] == 0 || $fields[21] eq "") && ( $fields[22] == 0  ||  $fields[22] eq "") )  { 
push @badlines, $line; 
} 
else {
 push @goodlines, $line;  
}

Regards,
Birei

filter · January 10, 2012, 11:54am

Hi birei,

Yes, you are correct. Thanks to you as well.

I am now trying to run the script below:

open(FILE,$file)|| die ("could not open file $file: $!");
open(OUT, ">$goodfile") or die "Can't open $goodfile";
open(OUT1, ">$badfile") or die "Can't open $badfile";

while($line=<FILE>)
{
$line = $_;
@fields = split (/\|/, $line);

if( ( $fields[21] eq "N.A.")  && ( $fields[22] == 0 || $fields[22] eq " ") && ( $fields[23] == 0  ||  $fields[23] eq " ") )

{
print OUT1 $line;
}
else
{
print OUT $line;

}
}
close(FILE);
close(OUT1);
close(OUT);

But there is some issue with the above script: I am not able to check the column values.

Could you please help me solve the issue? Appreciate your thoughts!
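
One likely cause is the `$line = $_;` statement inside the loop. When the condition is written as `while($line=<FILE>)`, Perl stores the record in `$line` and does not set `$_`, so copying `$_` afterwards throws the record away before `split` sees it. A minimal sketch without that statement follows; the helper name column_values and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return field number $idx (zero-based) of every
# line read from the in-memory data in $$data_ref.
sub column_values {
    my ( $data_ref, $idx ) = @_;
    open my $fh, '<', $data_ref or die "cannot open in-memory data: $!";
    my @values;
    while ( my $line = <$fh> ) {
        # No "$line = $_;" here: with an explicit assignment in the
        # condition, $_ is not set, so copying it would discard the line.
        chomp $line;
        my @fields = split /\|/, $line;
        push @values, $fields[$idx];
    }
    close $fh;
    return @values;
}

my $sample = "x|N.A.|0\ny|1.5|7\n";
print join( ",", column_values( \$sample, 1 ) ), "\n";    # prints N.A.,1.5
```

The array here only collects a few demo values; in the real script the lines would be printed straight to the output handles as shown above.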

birei · January 10, 2012, 12:14pm

Can you detail what you want to achieve, and provide a sample input and the expected output? That would help us give more useful answers.

Regards,
Birei.

filter · January 10, 2012, 12:53pm

Sure birei.

I have a data file which contains ~3.5 million records, plus a header and a footer. It has many columns, delimited by a pipe ("|").

Example:

START-OF-FILE
PROGRAMNAME=getdata
DATEFORMAT=yyyymmdd

START-OF-FIELDS
....
... (column list)

END-OF-FIELDS

TIMESTARTED=Mon Dec  9 17:35:23 EST 2011
START-OF-DATA
AAV CN 01/21/12 C10 Equity|0|43|AAV 1 C10|10.000000|Call|January 12 Calls on AAV CN|American|100.0000|20120121|110957|1000|AAV CN|00765F101|CA00765F1018|N.A.|0.040000|0.040000|N.A.|N.A.|N.A.|N.A.|0|0|CN| | |CM|EO110957201201018140000A|CAD|CA|1.073|110957|20120101|AAV CN 01/21/12 C10|N.A.|266.005|266.005|N.A.|4.320000|0.040000|390863923515|AAV| | |BBG001Q89LY8|
....
END-OF-Fields
(footer)

I need to check the values of columns 21, 22 and 23, then write the record to one file (the bad file) if they match, or otherwise to a different file (the good file).

if( ( $fields[21] eq "N.A.")  && ( $fields[22] == 0 || $fields[22] eq " ") && ( $fields[23] == 0  ||  $fields[23] eq " ") )

Once the files are written, I need to add the header and footer to the good file.

Since the data file is rather large, memory is filling up.

Could you please help me solve this?

birei · January 10, 2012, 3:59pm

I'm not sure what the header and footer are, but I hope this lets you avoid the 'out of memory' message:

$ cat script.pl
use warnings;
use strict;

die qq[Usage: perl $0 <input-file> <output-good-file> <output-bad-file>\n] unless @ARGV == 3;

open my $bad_fh, ">", pop @ARGV or die qq[ERROR: $!\n];
open my $good_fh, ">", pop @ARGV or die qq[ERROR: $!\n];
open my $input_fh, "<", pop @ARGV or die qq[ERROR: $!\n];

my ($fields_processed, $flipflop);

while ( my $line = <$input_fh> ) {
        chomp $line;

        ## Header.
        if ( $flipflop = ( $line =~ m/\A(?i)start-of-file/ .. $line =~ m/\A(?i)start-of-fields/ ) ) {
                next if $flipflop == 1 || $flipflop =~ /E0\Z/;
                printf $good_fh qq[%s\n], $line;
                next;
        }

        ## Footer.
        if ( $fields_processed ) {
                if ( $flipflop = ( $line =~ m/\A(?i)end-of-fields/ .. eof ) ) {
                        next if $flipflop == 1;
                        printf $good_fh qq[%s\n], $line;
                }
        }

        my @f = split /\|/, $line, 25;

        if ( @f < 25 ) {
                next;
        }
        else {
                $fields_processed = 1;
        }

        if ( ( $f[21] eq "N.A.")  && ( $f[22] == 0 || $f[22] eq " ") && ( $f[23] == 0  ||  $f[23] eq " ") ) {
                printf $bad_fh qq[%s\n], $line;
        } 
        else {
                printf $good_fh qq[%s\n], $line;
        }
}

Regards,
Birei
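
The header test in the script above relies on Perl's range ("flip-flop") operator in scalar context. As a small illustration of the values it returns, here is a sketch; the helper name flipflop_values and the sample lines are hypothetical, and the end marker START-OF-DATA is an assumption taken from the sample file rather than from the script.

```perl
use strict;
use warnings;

# Hypothetical helper: return the flip-flop value seen for each line, where
# the range opens on START-OF-FILE and closes on START-OF-DATA.
sub flipflop_values {
    my @lines = @_;
    my @seen;
    for my $line (@lines) {
        # Scalar context: a sequence number while inside the range, with
        # "E0" appended on the closing line, and an empty string outside.
        my $ff = ( $line =~ /\ASTART-OF-FILE/ .. $line =~ /\ASTART-OF-DATA/ );
        push @seen, $ff;
    }
    return @seen;
}

my @values = flipflop_values(
    'START-OF-FILE', 'PROGRAMNAME=getdata', 'START-OF-DATA', 'AAV CN|0|43',
);
print join( ",", map { "[$_]" } @values ), "\n";    # prints [1],[2],[3E0],[]
```

This is why `$flipflop == 1` identifies the first line of the range and `$flipflop =~ /E0\Z/` identifies the last one.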

filter · January 10, 2012, 5:31pm

Thank you very much for your reply birei.

The data file contains header information (for example, the names of the columns), and the footer contains the number of data records and a file stamp.

I ran your script and it avoids the "out of memory" issue. But when extracting the header information, the good file doesn't include the strings "START-OF-FILE" and "START-OF-DATA":

       ## Header.
        if ( $flipflop = ( $line =~ m/\A(?i)START-OF-FILE/ .. $line =~ m/\A(?i)START-OF-DATA/ ) ) {
                next if $flipflop == 1 || $flipflop =~ /E0\Z/;
                printf $good_fh qq[%s\n], $line;
                next;
        }

In the footer, the file stamp can stay as it is, but since the number of records has changed in the good file, I need to count the number of records (excluding the header information) and replace the original count with it, i.e.

footer Information:
DATARECORDS=3530288   --> Need to count the number of records in goodfile and put it over here
TIMEFINISHED=Mon Jan  9 19:24:03 EST 2012
END-OF-FILE

When I was using arrays, I used the logic below for this:

my $footer_len = 4;
my $datarec_line = 1;
do {
 $line = shift @whole;
 push @header, $line;
} while $line !~ /^START-OF-DATA/;

my $n = @goodlines;
$n -= grep {/^# PRODUCT/} @goodlines;
$footer[$datarec_line] =~ s/\d+/$n/;

Could you please advise on similar logic for including the header and footer information in the good file?

I would really appreciate your time on this.

birei · January 11, 2012, 4:01am

To add those two lines, comment out this instruction:

#next if $flipflop == 1 || $flipflop =~ /E0\Z/;

Regards,
Birei

filter · January 11, 2012, 11:43am

Thanks a lot for your reply and for all your help.

I need to count the number of lines in the good file, excluding the header and footer, and then substitute that count for the existing number.

Example:
Number of records in good file without header and footer : 1418125

Before: 
END-OF-DATA
DATARECORDS=3530288
TIMEFINISHED=Mon Jan  9 19:24:03 EST 2012
END-OF-FILE

After:
END-OF-DATA
DATARECORDS=1418125
TIMEFINISHED=Mon Jan  9 19:24:03 EST 2012
END-OF-FILE

Really appreciate your time and help on this.
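
Because the footer comes after the data records in the file, one possible approach is to count the kept records while streaming and patch the DATARECORDS line when it is reached, so nothing has to be held in memory. A minimal sketch under that assumption follows; the helper name rewrite_feed, the callback $is_bad and the sample data are all hypothetical, and the output is collected in a string only to keep the demo self-contained (a real script would print each line to the good file instead).

```perl
use strict;
use warnings;

# Hypothetical helper: copy a feed from $$in_ref to a string, dropping data
# records for which $is_bad->($line) is true and fixing DATARECORDS.
sub rewrite_feed {
    my ( $in_ref, $is_bad ) = @_;
    open my $in, '<', $in_ref or die "cannot open in-memory data: $!";
    my ( $out, $kept, $in_data ) = ( '', 0, 0 );
    while ( my $line = <$in> ) {
        if ( $line =~ /\AEND-OF-DATA/ ) {
            $in_data = 0;                # footer starts here
        }
        elsif ($in_data) {
            next if $is_bad->($line);    # skip rejected records
            $kept++;
        }
        elsif ( $line =~ /\ASTART-OF-DATA/ ) {
            $in_data = 1;                # data records follow
        }
        else {
            # Header or footer line: patch the count if this is the one.
            $line =~ s/\A(DATARECORDS=)\d+/$1$kept/;
        }
        $out .= $line;
    }
    close $in;
    return $out;
}

my $feed = join "\n", 'START-OF-FILE', 'START-OF-DATA', 'keep|1', 'drop|2',
    'keep|3', 'END-OF-DATA', 'DATARECORDS=3', 'END-OF-FILE', '';
print rewrite_feed( \$feed, sub { $_[0] =~ /\Adrop/ } );
```

In the demo, the DATARECORDS line comes out as DATARECORDS=2 because one of the three records is rejected.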

SFNYC · January 11, 2012, 12:32pm

In the future you may want to consider using the Tie::File module which can access the lines of a disk file via a Perl array if you cannot read a file into memory because of its size.
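
As a rough illustration of that suggestion, the sketch below ties a file to an array with the core Tie::File module and reads a single line from it. The helper name nth_line and the temporary demo file are hypothetical. For one sequential pass over millions of records, a plain while loop is likely to be faster; Tie::File is more useful when random access to lines is needed.

```perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

# Hypothetical helper: fetch line number $n (zero-based) of the file at
# $path through a tied array instead of reading the whole file.
sub nth_line {
    my ( $path, $n ) = @_;
    tie my @lines, 'Tie::File', $path or die "cannot tie $path: $!";
    my $value = $lines[$n];    # fetched from disk on demand, newline removed
    untie @lines;
    return $value;
}

# Demo file created on the fly; in real use $path would be the feed file.
my ( $tmp_fh, $path ) = tempfile( UNLINK => 1 );
print {$tmp_fh} "first\nsecond\nthird\n";
close $tmp_fh;

print nth_line( $path, 1 ), "\n";    # prints second
```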

birei · January 12, 2012, 10:02am

Script modified:

$ cat script.pl
use warnings;
use strict;

die qq[Usage: perl $0 <input-file> <output-good-file> <output-bad-file>\n] unless @ARGV == 3;

open my $bad_fh, ">", pop @ARGV or die qq[ERROR: $!\n];
open my $good_fh, ">", pop @ARGV or die qq[ERROR: $!\n];
open my $input_fh, "<", pop @ARGV or die qq[ERROR: $!\n];

my ($fields_processed, $flipflop, $good_lines);

while ( my $line = <$input_fh> ) {
        chomp $line;

        ## Header.
        if ( $flipflop = ( $line =~ m/\A(?i)start-of-file/ .. $line =~ m/\A(?i)start-of-fields/ ) ) {
#                next if $flipflop == 1 || $flipflop =~ /E0\Z/;
                printf $good_fh qq[%s\n], $line;
                next;
        }

        ## Footer.
        if ( $fields_processed ) {
                if ( $flipflop = ( $line =~ m/\A(?i)end-of-fields/ .. eof ) ) {
                        next if $flipflop == 1;
                         $line =~ s/\A(?i)(?<=datarecords=)\d*/$good_lines/;
                        printf $good_fh qq[%s\n], $line;
                }
        }

        my @f = split /\|/, $line, 25;

        if ( @f < 25 ) {
                next;
        }
        else {
                $fields_processed = 1;
        }

        if ( ( $f[21] eq "N.A.")  && ( $f[22] == 0 || $f[22] eq " ") && ( $f[23] == 0  ||  $f[23] eq " ") ) {
                printf $bad_fh qq[%s\n], $line;
        } 
        else {
                 ++$good_lines;
                printf $good_fh qq[%s\n], $line;
        }
}

Regards,
Birei

filter · January 13, 2012, 11:01am

Thanks much for your help, birei. It's a really nice idea that you have given.

But the substitution is not happening, i.e.

 if ( $fields_processed ) {
                if ( $flipflop = ( $line =~ m/\A(?i)END-OF-DATA/ .. eof ) ) {
                        $line =~ s/\A(?i)(?<=DATARECORDS=)\d*/$good_lines/;
                        printf $good_fh qq[%s\n], $line;
                }
        }

Footer Information:

END-OF-DATA
DATARECORDS=3530288
TIMEFINISHED=Mon Jan  9 19:24:03 EST 2012
END-OF-FILE

I think something is missing from the regular expression used to substitute the value.

Really appreciate your thoughts and time on this.

birei · January 20, 2012, 3:48pm

Sorry, I had some busy days. Did you solve it?

In any case, substitute:

$line =~ s/\A(?i)(?<=DATARECORDS=)\d*/$good_lines/;

with

$line =~ s/(?i)(?<=DATARECORDS=)\d*/$good_lines/;

and it should work.

Regards,
Birei
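
The reason the first version never matches: `\A` pins the match to the very start of the line, while the lookbehind `(?<=DATARECORDS=)` requires that text to sit immediately before the match position, and nothing can precede the start of the string. A small sketch comparing the two patterns follows; the helper names patch_with_anchor and patch_without_anchor are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: the original substitution, anchored with \A.
sub patch_with_anchor {
    my ( $line, $count ) = @_;
    $line =~ s/\A(?i)(?<=DATARECORDS=)\d*/$count/;
    return $line;
}

# Hypothetical helper: the corrected substitution, without \A.
sub patch_without_anchor {
    my ( $line, $count ) = @_;
    $line =~ s/(?i)(?<=DATARECORDS=)\d*/$count/;
    return $line;
}

my $footer = 'DATARECORDS=3530288';
print patch_with_anchor( $footer, 1418125 ),    "\n";  # prints DATARECORDS=3530288
print patch_without_anchor( $footer, 1418125 ), "\n";  # prints DATARECORDS=1418125
```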

filter · January 20, 2012, 5:43pm

Thanks a lot birei.

I was using the below:

                        $line =~ s/(?<=DATARECORDS=)\d*/$good_lines/;

Really appreciate your help. Below is the final code; it might help anyone else who faces a similar situation.

#!/usr/bin/perl

die qq[Usage: perl $0 <input-file> <output-good-file> <output-bad-file>\n] unless @ARGV == 3;

open my $bad_fh, ">", pop @ARGV or die qq[ERROR: $!\n];
open my $good_fh, ">", pop @ARGV or die qq[ERROR: $!\n];
open my $input_fh, "<", pop @ARGV or die qq[ERROR: $!\n];



my ($fields_processed, $flipflop, $good_lines);

while ( my $line = <$input_fh> ) {
        chomp $line;

        ## Header.
        if ( $flipflop = ( $line =~ m/\A(?i)START-OF-FILE/ .. $line =~ m/\A(?i)START-OF-DATA/ ) ) {
               # next if $flipflop == 1 || $flipflop =~ /E0\Z/;
                printf $good_fh qq[%s\n], $line;
                next;
        }

        ## Footer.
        if ( $fields_processed ) {
                if ( $flipflop = ( $line =~ m/\A(?i)END-OF-DATA/ .. eof ) ) {
                        $line =~ s/(?<=DATARECORDS=)\d*/$good_lines/;
                        printf $good_fh qq[%s\n], $line;
                }
        }

        my @f = split /\|/, $line, 25;

        if ( @f < 25 ) {
                next;
        }
        else {
                $fields_processed = 1;
        }

        if ( ( $f[21] == 0 || $f[21] eq " " || $f[21] eq "N.A." ) && ( $f[22] == 0 || $f[22] eq " " || $f[22] eq "N.A." ) && ( $f[23] == 0 || $f[23] eq " " || $f[23] eq "N.A." ) ) {
                printf $bad_fh qq[%s\n], $line;
        }
        else {
                ++$good_lines;
                printf $good_fh qq[%s\n], $line;
        }
}

Thanks once again for all your effort, help and time.
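
One caveat about the column test in the final code: in Perl, `==` converts its operands to numbers, and a non-numeric string such as "N.A." or "CN" converts to 0, so `$f[21] == 0` is true for any non-numeric field (and produces a warning when warnings are enabled). If the intent is to reject only blank, "N.A." or genuinely zero values, which is an assumption about the requirement, a stricter check could look like the sketch below. The helper name is_empty_value is hypothetical; looks_like_number comes from the core Scalar::Util module.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical helper: true when a field is blank, "N.A.", or numerically
# zero. Non-numeric text such as "CN" is NOT treated as zero here, which
# differs from a bare "== 0" comparison.
sub is_empty_value {
    my ($field) = @_;
    return 1 if !defined $field;
    return 1 if $field =~ /\A\s*\z/;
    return 1 if $field eq 'N.A.';
    return looks_like_number($field) && $field == 0 ? 1 : 0;
}

for my $value ( '0', 'N.A.', ' ', 'CN', '0.040000' ) {
    printf "'%s' => %d\n", $value, is_empty_value($value);
}
```

With such a helper, the condition would read `is_empty_value($f[21]) && is_empty_value($f[22]) && is_empty_value($f[23])`.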