Copying the header and footer information to the output file

filter · August 25, 2011, 5:26pm

Hi

I am writing a Perl script that checks specific column values in a file and writes the lines to an OUT file.

The feed file has header information and footer information.

The header information is around 107 lines, i.e. it starts with

START-OF-FILE
....... 
so on ....

TIMESTARTED=Thu Aug 25 01:03:50 BST 2011
START-OF-DATA
# PRODUCT=Corp/Pfd

After the last line "# PRODUCT=Corp/Pfd" the actual data would start.

The footer information is 4 lines i.e.

END-OF-DATA
DATARECORDS=1275983
TIMEFINISHED=Thu Aug 25 02:27:02 BST 2011
END-OF-FILE

Now, my Perl script is as below:

#!/usr/bin/perl

my $file = 'file';
open(FILE, '<', $file) || die("could not open file $file: $!");
open(OUT1, '>', 'badfile')  || die("could not open badfile: $!");
open(OUT2, '>', 'goodfile') || die("could not open goodfile: $!");
my @fields;
my $line;

while (<FILE>) {
    $line = $_;
    @fields = split(/\|/, $line);
    # <<<<<< 1) Here, before checking the column values, I need to write
    #           the HEADER and FOOTER information to the goodfile. >>>>>>

    if ($fields[32] eq "N.A." && $fields[33] eq "N.A." && $fields[34] eq "N.A."
        && $fields[38] eq "N.A." && ($fields[62] eq "N.A." || $fields[62] eq " "))
    {
        print OUT1 $line;    # ----> badfile
    }
    else
    {
        print OUT2 $line;    # ----> goodfile
    }
}
close FILE;
close OUT1;
close OUT2;

1) Before checking the column values, I need to write the HEADER and FOOTER information to the goodfile.

2) Also, I need to count the records in the goodfile and then change the FOOTER information to:

END-OF-DATA
DATARECORDS=1275983   --> New Rowcount from the Goodfile
TIMEFINISHED=Thu Aug 25 02:27:02 BST 2011
END-OF-FILE

Could anyone please help me solve this? Help would be really appreciated.

yazu · August 25, 2011, 9:54pm

Hi!

The simplest way is to read the whole file into an array, split it into four parts, process them and write the result to the output file. Because it's really simple and quick, perhaps you should do it that way. There are plenty of other things in the world you can do, improve, or learn.

But... There is always a but, you know. It is definitely not the "Unix way". Why?

Well. From the famous "The UNIX Time-Sharing System": "... there have always been fairly severe size constraints on the system and its software. Given the partially antagonistic desires for reasonable efficiency and expressive power, the size constraint has encouraged not only economy, but also a certain elegance of design."

You wouldn't believe me if I told you what resources the first Unix hosts had, so I won't; the word "severe" speaks for itself. That was the time when the famous "Unix philosophy" was born.

Doug McIlroy summarized it in this way: "This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface."

You can read more in the free and good book The Art of Unix Programming.

And how does all this relate to your question? Take a look:

You need the header:

sed -n '/^START-OF-FILE/,/^START-OF-DATA/p' INPUTFILE >/tmp/header.$$

The footer:

sed -n '/^END-OF-DATA/,$p' INPUTFILE >/tmp/footer.$$

You can process your file with your Perl script, but print the name of your good file at the end of the script:

goodfile=$(perl process.pl)

Or you can print both names, the good one and the bad one, and then split them. Or you can pass the name to the script as an argument. You just need to know this name.

How many records (lines) are in the goodfile?

goodrecs=$(wc -l <"$goodfile")  # read from stdin so wc prints only the number, not the file name

The new footer:

sed 's/^DATARECORDS=.*$/DATARECORDS='"$goodrecs"'/' /tmp/footer.$$ >/tmp/newfooter.$$

Now

cat /tmp/header.$$ "$goodfile" /tmp/newfooter.$$ >OUTPUTFILE

And don't forget to clean up after yourself:

rm /tmp/header.$$ /tmp/*footer.$$ # maybe the goodfile too

The beauty of shell programming is that you can work incrementally, in small pieces. You can test and debug your steps separately. And then, when you get the result, you just put your steps together into a small, elegant, and truly Unix program: a shell script.
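Put together, the steps above could look roughly like the script below. It is only a sketch under assumptions: the sample feed, the temporary directory, and the rule that a good record is a line containing a digit (a stand-in for the real Perl script) are all made up for illustration.

```shell
#!/bin/sh
# Sketch: header + filtered data + footer with a corrected record count.
tmp=$(mktemp -d) || exit 1

# Hypothetical sample feed standing in for the real INPUTFILE.
cat >"$tmp/INPUTFILE" <<'EOF'
START-OF-FILE
TIMESTARTED=Thu Aug 25 01:03:50 BST 2011
START-OF-DATA
a
1
b
3
END-OF-DATA
DATARECORDS=1275983
TIMEFINISHED=Thu Aug 25 02:27:02 BST 2011
END-OF-FILE
EOF

# Header: everything up to and including START-OF-DATA.
sed -n '/^START-OF-FILE/,/^START-OF-DATA/p' "$tmp/INPUTFILE" >"$tmp/header"
# Footer: END-OF-DATA through the end of the file.
sed -n '/^END-OF-DATA/,$p' "$tmp/INPUTFILE" >"$tmp/footer"
# Data: the lines strictly between the two markers; the grep is a
# stand-in filter that keeps only lines containing a digit.
sed -e '1,/^START-OF-DATA/d' -e '/^END-OF-DATA/,$d' "$tmp/INPUTFILE" |
    grep '[0-9]' >"$tmp/goodfile"

# Count the good records; tr removes padding that some wc versions add.
goodrecs=$(wc -l <"$tmp/goodfile" | tr -d ' ')

# Put the new count into the footer and assemble the output.
sed 's/^DATARECORDS=.*$/DATARECORDS='"$goodrecs"'/' "$tmp/footer" >"$tmp/newfooter"
cat "$tmp/header" "$tmp/goodfile" "$tmp/newfooter" >"$tmp/OUTPUTFILE"

result=$(cat "$tmp/OUTPUTFILE")
rm -rf "$tmp"
printf '%s\n' "$result"
```

Each step writes a small file, so any of them can be run and inspected on its own before the pieces are joined.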

Regards,
Andrey (yazu)

===

Well. Sorry for my English. This post was really my English exercise.

g.pi · August 25, 2011, 10:23pm

@Andrey (yazu), thanks for the link to the book. I appreciate it.

GP

filter · August 26, 2011, 12:16am

Hi yazu,

I really appreciate your post. Thanks a lot for your answer and thoughts.

"The simplest way is to read the whole file into an array, split it into four parts, process them and write the result to the output file. Because it's really simple and quick, perhaps you should do it that way. There are plenty of other things in the world you can do, improve, or learn."

Yes, you are correct. I did try the logic of saving the entire file into an array and then dividing it into parts.

But I was stuck on the following points inside the script:
1) How to write the footer information into the goodfile inside the Perl script.
2) I thought of using a counter to count the lines, but how do I then substitute that number into the footer information?

I really appreciate your thoughts on using Unix, and I did learn a lot from your post.

Is there any way we can do the same in the Perl script itself?

Thanks a lot for your replies.

yazu · August 26, 2011, 6:28am

OK. Let's take an example file like this:

cat INPUTFILE
START-OF-FILE
....... 
so on ....

TIMESTARTED=Thu Aug 25 01:03:50 BST 2011
START-OF-DATA
# PRODUCT=Corp/Pfd
a
b
1
c
d
3
END-OF-DATA
DATARECORDS=1275983
TIMEFINISHED=Thu Aug 25 02:27:02 BST 2011
END-OF-FILE

Good lines are those that contain a digit; all others are bad lines. So here is a sketch:

perl -e '
use warnings;
use strict;

my $goodfile = "goodfile";
my $footer_len = 4;
my $datarec_line = 1;

my (@whole, @header, @footer, @goodlines, @badlines);
my $line;

@whole = <>;

do {
  $line = shift @whole;
  push @header, $line;
} while $line !~ /^START-OF-DATA/;

@footer = splice @whole, -$footer_len;

for (@whole) {
  if (/\d/) {
    push @goodlines, $_;
  } else {
    push @badlines, $_;
  }
}

$footer[$datarec_line] =~ s/\d+/scalar @goodlines/e;

open my $fh, ">", $goodfile or die "could not open $goodfile: $!";
print $fh @header, @footer, @goodlines;
close $fh;

print @badlines
' INPUTFILE

Good records go to the goodfile and bad ones to stdout. Note that the footer is written before the good records.
You can change this sketch (the definition of good lines, the order of output, the output of bad lines) as you want.

filter · August 26, 2011, 1:40pm

Hi Yazu,

The logic in your code is really excellent. Thank you very much for your time and for your thoughts.

I have modified the logic accordingly and below is the code:

#!/usr/bin/perl

my $file = 'feedfile';
open(FILE, '<', $file) || die("could not open file $file: $!");

my $goodfile = "goodfile";
my $badfile = "badfile";
my $footer_len = 4;
my $datarec_line = 1;

my (@whole, @header, @footer, @goodlines, @badlines, @fields);
my $line;

@whole = <FILE>;
close FILE;

# The header runs up to and including the first "# PRODUCT" line.
do {
    $line = shift @whole;
    push @header, $line;
} while $line !~ /^# PRODUCT/;

# The footer is the last $footer_len lines.
@footer = splice @whole, -$footer_len;

foreach (@whole) {
    $line = $_;
    @fields = split(/\|/, $line);

    if ($fields[57] eq " ")
    {
        push @badlines, $line;
    }
    elsif ($fields[32] eq "N.A." && $fields[33] eq "N.A." && $fields[34] eq "N.A."
           && $fields[38] eq "N.A." && ($fields[62] eq "N.A." || $fields[62] eq " "))
    {
        push @badlines, $line;
    }
    else
    {
        push @goodlines, $line;
    }
}

$footer[$datarec_line] =~ s/\d+/scalar @goodlines/e;

open my $fh, ">", $goodfile or die "could not open $goodfile: $!";
print $fh @header, @goodlines, @footer;
close $fh;

open my $fh1, ">", $badfile or die "could not open $badfile: $!";
print $fh1 @badlines;
close $fh1;

After running the code I found that there are 4 lines in between the data records that divide the data into sections,
i.e.

grep -n "#  PRODUCT"  feedfile
1206675:# PRODUCT=Convertible 
1261566:# PRODUCT=Nationals
1270395:# PRODUCT= Agencies
1274335:# PRODUCT=Regionals

As we can see above, these 4 lines are not valid records.

Now, while calculating the row count we need to ignore these 4 records, i.e. in

$footer[$datarec_line] =~ s/\d+/scalar @goodlines/e;

Here, while calculating the row count and substituting the new count, we have to ignore the above 4 lines (records).

Maybe by reducing the count by 4; I'm not sure, though.

How can we reduce the row count by 4 so that we get the actual count?

I really appreciate your time and thoughts.

---------- Post updated at 01:40 PM ---------- Previous update was at 01:24 PM ----------

Finally,

I did the following :

$footer[$datarec_line] =~ s/\d+/(scalar @goodlines - 4)/e;

Thanks a lot, Yazu. I am really very thankful to you.

yazu · August 26, 2011, 1:44pm

my $n = @goodlines;
$n -= grep {/^# PRODUCT/} @goodlines; # or just $n -= 4 but it's not good
$footer[$datarec_line] =~ s/\d+/$n/;

filter · August 26, 2011, 5:13pm

Hi Yazu,

Thanks a lot for your reply.

When I use the line below:

$n -= grep {/^# PRODUCT/} @goodlines;

The "#" sign is actually commenting out the rest of the line from there, i.e.

# PRODUCT/} @goodlines;

is treated as a comment.

Is there any way to check for the above lines?

Thanks very much for your replies.

yazu · August 26, 2011, 9:58pm

No. # is an ordinary character in string literals and regexes:

$ cat >INPUTFILE
1
# comment
3
#
#
4
$ perl -lne '!/^#/ && print' INPUTFILE 
1
3
4

What error did you get?

filter · August 27, 2011, 12:09am

I am sorry, Yazu. It was my mistake.

You are excellent.

I have learnt a lot more from your posts. I really appreciate your sharing your knowledge.

Many thanks to you. You rock!

filter · August 29, 2011, 11:42am

Hi Yazu/All,

I have been trying to write the same logic in a different way, i.e. instead of loading the file into an array (memory), I would like to read the file line by line and then check the conditions.
The code looks like:

# FILE is opened as in the previous version of the script.
my $goodfile = "goodfile";
my $badfile = "badfile";
my $footer_len = 4;
my $datarec_line = 1;
my (@header, @footer, @goodlines, @badlines, @fields);
my $line;

while (<FILE>)
{
    $line = $_;
    @fields = split(/\|/, $line);

    if ($fields[57] eq " ")
    {
        push @badlines, $line;
    }
    elsif ($fields[32] eq "N.A." && $fields[33] eq "N.A." && $fields[34] eq "N.A."
           && $fields[38] eq "N.A." && ($fields[62] eq "N.A." || $fields[62] eq " "))
    {
        push @badlines, $line;
    }
    else
    {
        push @goodlines, $line;
    }
}

@footer = splice @goodlines, -$footer_len;

my $n = @goodlines;
$n -= grep {/^# PRODUCT/} @goodlines;
$footer[$datarec_line] =~ s/\d+/$n/;

open my $fh, ">", $goodfile or die "could not open $goodfile: $!";
print $fh @header, @goodlines, @footer;
close $fh;

open my $fh1, ">", $badfile or die "could not open $badfile: $!";
print $fh1 @badlines;
close $fh1;

Here:
The script runs fine. It reads line by line and then checks the conditions.
But the header contains 107 lines, i.e.

START-OF-FILE .......  so on ....
# PRODUCT=Corp/Pfd   --> line 107

Since the header is 107 lines, the line count includes the whole header before the grep and the substitution are done; instead it should exclude the header and count only the rest.

Other than that, the footer looks good.

Could you please tell me whether there is any way, after inserting the good lines into an array, to count the lines without the header?

I really appreciate your replies.

filter · December 20, 2011, 4:20pm

Hi Yazu / All,

I am trying to implement and extend the logic further.

The only issue is that memory is reported as full when a huge file is passed to the script.

Currently the script uses the logic below:

while (<FILE>) { $line = $_; @fields = split (/\|/, $line);

Is there any way we can use a hash or some other approach?

Thanks for all your time and thoughts!