Removing Headers and a Column

DerangedNick · January 29, 2008, 5:43pm

I have a text file in unix with a layout like this

Column 1 - 1-12
Column 2 - 13-39
Column 3 - 40-58
Column 4 - 59-85
Column 5 - 86-120
Column 6 - 121-131

The file also has a header on the first 6 lines of each page. Each page is 51 lines long. So I want to remove the header from each page first off and rewrite the file (which I have been able to figure out for the most part, it just isn't very clean yet)

All I have been able to do is rewrite the header with blank spaces, which just gives me a mess in the first place.

But here is the ultimate goal. Remove the first 6 lines of each page and remove column 4 from the entire report. Rewrite the report with a new header and just put all of the data in a new report, excluding column 4.

Honestly I haven't really done anything quite this complex (at least it seems complex to me), so I am not really sure where to get started. The most I have really done is just rewriting strings in a text file or removing specific words.

Any help would be appreciated with this.

One thing I forgot to mention, and I am not even sure that this is possible: at the very end of the report is a totals section. It has a specific word identifying where it starts, and I'd need to reprint from where it starts down at the end of the new report. I am not even sure if this is possible, because removing column 4 would obviously cut into that section. So it seems the only way to save it would be to write that one area into a new file and append it to the end of the new report once it has been completed. I understand the concept; the matter is figuring out how to do it.

Thinking about it a little more, if it was possible to leave the headers and just skip removing column 4 from that section on each page, that would work as well. I actually do not mind the headers there, and it may create printing problems if I do not keep each page 51 lines long. I imagine I'll deal with it, but if there was another way I am sure I could handle that as well. Also, this may work for the final report at the very end: just telling it to ignore the last 12 lines of the report, somehow. Just a thought.

Thank you

Smiling_Dragon · January 29, 2008, 6:35pm

#!/usr/bin/perl -w
$PAGESIZE=51;
$HEADERSIZE=6;
$linenumber=0;
$intotals=0;
while (<>) {
  $linenumber++;
  if (/^whatever line indicates the start of the "totals" section$/) {
    $intotals=1;
  }
  if ($intotals) {
    print $_;
  } elsif ($linenumber % $PAGESIZE >= $HEADERSIZE) {
    if (/^(.{58}).{27}(.*)$/) {
      print "$1$2\n";
    } elsif (/^(.{58}).{1-27}$) {
      print $1\n";
    } else {
      print $_;
    }
  }
}

Untested and you'll have to replace "whatever line indicates the start of the "totals" section" with something sensible.

DerangedNick · January 29, 2008, 7:04pm

I am currently looking at trying to use the script you provided. However, my knowledge of running this against the file is rather slim, since most of the commands I have run in the past do not call a script. If you wouldn't mind providing some more information on how to get this to run against the file, I'd appreciate it. In the meantime I will continue messing with it to see if I can get anything. Thanks for the help. (Ignore above)

I seem to have gotten it to run ok, but I am getting these errors currently.

syntax error at testscript line 12, near "<>"
syntax error at testscript line 15, near "} else"
Execution of testscript aborted due to compilation errors.

Smiling_Dragon · January 29, 2008, 7:08pm

Oops, my bad, have fixed it in the original post
(change the <> to !=)

DerangedNick · January 29, 2008, 7:19pm

The script ran through the file and gave me an output, however it didn't remove the 4th column. Everything seems to be there that was there originally, but it is scattered all over the place instead of in columns. Not really sure.

What part of the request was the script addressing? I will keep playing with it for the time being to see if I can get different results. Thanks for the help

Looking over the file again, it does seem to have removed something, but I am not quite sure at which point yet. Will update once I know. I do know that a lot of the data that I wanted removed is still in place, however.

::Update::
Ok, what it appears to be doing is this: once it removes the columns on the first line, it pulls the second line up onto the first line, then goes to the second line and removes that same section there, and so on down the entire document. The totals section appears to be intact, however it did lose its formatting, so it is rather hard to tell since it is scattered.

Thanks

Smiling_Dragon · January 29, 2008, 7:25pm

Yeah, I had some bugs. It should do everything you are after now (I hope).

Fixed more bugs in the original:
Added \n to the print $1$2 line
Replaced the / symbol in the pagebreak calculation with % (modulo arithmetic)

Edit: Whoops, didn't read your request right - I've been removing the first line of each page, not the first 6... Will fix...

DerangedNick · January 29, 2008, 7:31pm

Ok, this one looks a lot better. Totals are intact, however it needs to start cutting off 1 character earlier (which I think I may be able to change).

The problem now, however, is that some lines do not have data at the beginning of the line, but column 4 does have data in it (so 1,2,3,5,6 are blank). This is still being printed; it is just moving over into what was column 5.

The next part is that it is just cutting sections of the header out. I don't know if this can be fixed or not.

I will try to fix the width issue. I am not sure where to start on getting it to cut out the other parts of column 4, though.

Thanks a lot for all the help.

I'd rather not remove the first 6 lines if we can just ignore those lines somehow. They all start with the same thing (except there are multiple starts to each line of the header).

This is how the first 6 lines of each page look
Line 1: (this has a square control character) I imagine it is used as the page sep
Line 2: XXXXXX (always the same, different word obviously)
Line 3: ALL
Line 4: (blank line)
Line 5: ACCOUNT (4 blank spaces before this)
Line 6: --------- (4 blank spaces before this)

Line 7 is blank and data starts under that. That is how the header begins on each page. If it was possible to ignore that the entire way down that would be ideal.

Last error:

Name "main::HEADERSIZE" used only once: possible typo at testfile line 3.

Smiling_Dragon · January 29, 2008, 7:57pm

It should now remove 6 lines of header at a time ($HEADERSIZE)
If you need to keep the first 6 lines, it's just a matter of changing the header calculation line to:

  } elsif (($linenumber > $HEADERSIZE) && ($linenumber % $PAGESIZE > $HEADERSIZE)) {

It will still remove column 4 from that first header though.

DerangedNick · January 29, 2008, 8:03pm

It seems like it is removing 6 lines from the top, then counting down 51 lines and removing 6, but as if it shifted the lines up first. I can't really tell but as it goes down the report it is slowly removing data and leaving pieces of the header. I double checked the page size and it is still correct. I am not sure.

The only other problem besides that is when there is nothing else in a line but column 4, that is still showing in what was column 5.

Perhaps it'd be easier if we excluded the first 6 lines from each page? I will see what I can come up with. Thanks a lot.

It seems like my page size was off: it is 52, not 51. I changed this and it worked for about half of the report. For some reason, about halfway down it just stopped taking off the headers altogether; I am not really sure why. The rest of the report from about halfway down is intact and it just does not remove any more headers. Odd.

Any advice welcome. Thank you

As for the header, if it removes the header I am fine with that; I will just make a template file to at least put a header at the top of it before it goes to printing. Unfortunately, I think without keeping the header it may just print and not keep any page breaks (since this will end up on a line printer). So keeping all the headers intact would be nice, but removing data from lines may cause the same problem regardless. So it may not be a big deal. Thanks

Smiling_Dragon · January 29, 2008, 8:23pm

Fixed the problem when dealing with short lines (edited original script again)
If it stopped cutting columns out halfway down, it probably encountered the text that suggests you're into the totals section.

The headersize calculation should work better with >= in it instead of >

This will also allow you to turn off the header removal part by setting $HEADERSIZE to 0 at the top of the script.
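
To make the page arithmetic concrete, here is a small sketch (not taken from the script above) of one way to decide whether a 1-based line number falls inside a page header. The helper name is_header_line is made up for illustration, and it assumes every page has exactly the same number of lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if 1-based line $n lies in the first
# $headersize lines of its page, assuming pages of exactly $pagesize lines.
sub is_header_line {
    my ($n, $pagesize, $headersize) = @_;
    my $pos = ($n - 1) % $pagesize;   # 0-based position within the page
    return $pos < $headersize ? 1 : 0;
}

# With 52-line pages and a 6-line header, lines 1-6 and 53-58 are header.
for my $n (1, 6, 7, 52, 53, 58, 59) {
    printf "%3d %s\n", $n, is_header_line($n, 52, 6) ? "header" : "data";
}
```

Subtracting 1 before taking the modulus matters because a plain $linenumber % $PAGESIZE is 0 on the last line of every page, which makes that line look like it sits above the header. With a header size of 0 every line counts as data.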

DerangedNick · January 29, 2008, 8:32pm

Search pattern not terminated at ./testscript line 16.

I will try to find it.

I have tried to find it but I am still not able to. I will keep looking for the time being, but any help would be appreciated.

Ok, I found the error; there appear to have been two: a / missing on line 16 and a " on line 17. It runs, however it is about the same. I do not see much of a difference. Starting about halfway in it does the same thing, and it is still not grabbing the lines that do not have anything in the other columns.

It continues pulling out column 4 for the entire file after it stops removing headers, however, and it even pulls column 4 out of the totals section. Any help appreciated.

Thank you again for all the help

Ok this is what I have left that needs to be resolved.

Resolved.

Syntax errors have been resolved to the best of my abilities.
Headers are still being cut into, but I can replace the text that is being taken out, so I can deal with that.

Not resolved

Totals are still being cut into, it is printing but Column 4 is being taken out of the totals.
Column 4 where Columns 1/2/3/5/6 have something is still being printed. If I could get some type of script that could just remove a line if it contains X that would work. Each of the 4 lines under each record that is still printing in column 4 will always start with the same word. Maybe that would be easier?

Thanks again

Smiling_Dragon · January 29, 2008, 10:56pm

I've tested out the code, (found the same bugs you found and fixed them):

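#!/usr/bin/perl -w
$PAGESIZE=52;    # lines per page
$HEADERSIZE=6;   # header lines at the top of each page (0 keeps the headers)
$linenumber=0;
$intotals=0;
while (<>) {
  $linenumber++;
  if (/^Totals$/) {   # start of the totals section
    $intotals=1;
  }
  if ($intotals) {
    print $_;         # totals are printed untouched
  } elsif (($linenumber - 1) % $PAGESIZE >= $HEADERSIZE) {   # 0-based position within the page
    if (/^(.{58}).{27}(.*)$/) {
      print "$1$2\n";   # full line: drop characters 59-85
    } elsif (/^(.{58}).{1,27}$/) {
      print "$1\n";     # line ends inside column 4
    } else {
      print $_;         # line is shorter than 59 characters
    }
  }
}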

I don't see the same behaviour you report.
What does the line that indicates you are entering the totals section look like?
What expression have you used on line 8 to search for this?

Do you want the headers at the top of each page removed still?
If not, change the headersize to 0.
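
A note on the totals check, in case it helps: /^Totals$/ only matches a line that consists of the word Totals and nothing else. If the marker line carries more text, the pattern needs to be anchored only at the start, or not at all. A small sketch of the difference; the sample lines are invented for illustration and the helper name count_matches is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented sample lines; the real marker text will differ.
my @lines = (
    "Totals\n",
    "Totals      Count   Amount\n",
    "    Grand Totals follow\n",
);

# Hypothetical helper: how many of the sample lines a pattern matches.
sub count_matches {
    my ($re) = @_;
    my $count = grep { /$re/ } @lines;
    return $count;
}

printf "exact line:  %d\n", count_matches(qr/^Totals$/);  # whole line must be "Totals"
printf "starts with: %d\n", count_matches(qr/^Totals/);   # line begins with "Totals"
printf "anywhere:    %d\n", count_matches(qr/Totals/);    # "Totals" anywhere in the line
```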

DerangedNick · January 29, 2008, 11:05pm

I changed the Header Size to 0 - however it still removes Column 4, but I did find a work around using the sed command to fix it. (may not be the easiest but it works).

On line 8 I used "Organization Totals", "Organization" and just "Organ" trying to get it to pick it up (each one by itself). The totals start 13 lines up from the bottom, but it still cuts out column 4 for some reason. The Organization line does contain a lot more information besides that; could that be the problem? I just put what the line starts with.

Also, the 4 lines under each record being left behind is still the only other thing that it is doing.

I am still not sure why.

Thanks

Also, it would help if it was possible to just add a command in there that said: remove all lines that contain "Random Word1", "Random Word2", "Random Word3", and "Random Word 4". None of the phrases in the 4 lines being left behind should ever appear in any other part of this document, so that would resolve that issue.
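
Dropping every line that contains one of a few fixed phrases can be sketched roughly as below. The four phrases are the placeholders from this post, not real data, and the helper name keep_line is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder phrases from the post; substitute the real words.
my @unwanted = ("Random Word1", "Random Word2", "Random Word3", "Random Word 4");

# Build one alternation, quoting each phrase so it is matched literally.
my $pattern     = join "|", map { quotemeta($_) } @unwanted;
my $unwanted_re = qr/$pattern/;

# Hypothetical helper: true if the line contains none of the phrases.
sub keep_line {
    my ($line) = @_;
    return $line =~ $unwanted_re ? 0 : 1;
}

# Only the first sample line is printed.
for my $line ("keep this line\n", "    Random Word2 something\n") {
    print $line if keep_line($line);
}
```

In a real filter the same test would sit inside the while (<>) loop, for example as a next unless keep_line($_) line before anything is printed.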

Would just leave the totals area needing to be fixed.

It seems that the 4 lines being left behind do not actually go to the end of the line. Each is a column 4 by itself, but they are not the full length of column 4 because there is not a column 5. So on these 4 lines, it starts at 59, but if it is only 10 characters the line would end at 69 instead of 85. Perhaps this is causing the problem?
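
The short-line case described here can be sketched as follows. A pattern that insists on exactly 27 characters after position 58 cannot match a line that stops partway through column 4, whereas a range quantifier such as {0,27} can. The column positions are the ones from the first post; the helper name strip_col4 is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove characters 59-85 (column 4) from one line.
# {0,27} means "between 0 and 27 characters", so a line that ends
# partway through column 4 is still trimmed back to 58 characters.
sub strip_col4 {
    my ($line) = @_;
    chomp $line;
    $line =~ s/^(.{58}).{0,27}/$1/;   # lines shorter than 58 are left alone
    return "$line\n";
}

my $full  = ("A" x 58) . ("b" x 27) . ("C" x 10) . "\n";   # all columns present
my $short = (" " x 58) . ("b" x 10) . "\n";                # ends inside column 4
print strip_col4($full);
print strip_col4($short);
```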

ghostdog74 · January 30, 2008, 1:06am

An example speaks more than a thousand words. You can provide some sample input and then describe what you want the output to look like.

DerangedNick · January 30, 2008, 1:36am

Due to the nature of the data I can't provide an actual example. However, I have attached a file in which I placed a T where the text is, an N where numbers are and a b for the data I do not want.

I showed how the data looks originally, what I want it to look like and what it is coming out as.

I also attached the totals section; hopefully it makes sense. If you have any questions let me know.

KevinADC · January 30, 2008, 3:22am

I'm impressed how much Smiling Dragon did with such vague and confusing requirements to go by. With the benefit of seeing the sample data I came up with this.

#!/usr/local/bin/perl
use strict;
use warnings;
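while(<>){
    next if $. < 7; # skip the first 6 lines (the first page's header)
    if (/Organization Totals/) {
        print;
        print <>; # prints until eof
        exit();
    }
    {
        no warnings;
        next if (/^\s{58}/); # skip lines with nothing in columns 1-3
        print substr $_,0,58; # columns 1-3
        print substr $_,85; # columns 5-6, including the newline
        # a line that ends inside column 4 has lost its newline by now
        print "\n" if length($_) > 58 && length($_) <= 85;
    }
}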

The "no warnings" block is in there because I have no idea how the lines of the real data file are formatted. If there are blank lines the substr() function will throw warnings about "substr outside of string at blah blah blah" and the print lines will throw warnings as well. You could check the length of each line to avoid this, but it is probably not necessary unless all lines need to be padded to a certain length.

You say each page is 51 lines long but the sample data only accounts for about 27 lines.
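
For what it is worth, padding each line to a known width first is one way to avoid needing the no warnings block. A minimal sketch, assuming the 131-character record width from the first post; the helper name cut_col4 is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pad a line to the full 131-character width,
# then keep characters 1-58 and 86 onwards, so substr never runs
# off the end of the string.
sub cut_col4 {
    my ($line) = @_;
    chomp $line;
    my $padded = sprintf "%-131s", $line;   # right-pad with spaces
    my $out = substr($padded, 0, 58) . substr($padded, 85);
    $out =~ s/\s+$//;                       # drop trailing padding again
    return "$out\n";
}

print cut_col4("short line\n");
print cut_col4(("1" x 58) . ("4" x 27) . ("5" x 35) . ("6" x 11) . "\n");
```

Note that this version also strips any trailing spaces that were in the original line; if the output has to stay at a fixed width, the s/\s+$// step would be left out.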

I assume you know how to direct the input and output when running the script.

KevinADC · January 30, 2008, 3:27am

the output from using the sample data:

      NNNNN TTTTTTTTT,  T T            TTTT NNN NNN-NNNN  NN NNTTTNN      NN.NN-       NN.NN-  NN       
            NNNNN TTTTTT TTTTTT        TTTT NNN NNN-NNNN  NN NNTTTNN       N.NN+       NN.NN-    
        TNN TTTT TTTTT, TT NNNNN                          NN NNTTTNN      NN.NN-       NN.NN-    
                                                          NN NNTTTNN     NNN.NN-        N.NN     

Organization Totals      tttttt O/D  tttttt  %    t/t ttttttt   %                   $ttttt  tttttt  %    t/t ttttttt   %
                     tttt:  nn Days   nnn  nn.n     n,nnn.nn  nn.n           Thru:  $nn.nn    nn   n.n        nn.nn   n.n
                            nn Days    nn  nn.n     n,nnn.nn  nn.n                  $nn.nn    nn  nn.n       nnn.nn   n.n
                            nn Days    nn   n.n     n,nnn.nn  nn.n                  $nn.nn    nn  nn.n     n,nnn.nn   n.n
                            nn Days    nn   n.n     n,nnn.nn  nn.n                  $nn.nn    nn  nn.n       nnn.nn   n.n
                            nn Days     n               n.nn                        $nn.nn    nn  nn.n     n,nnn.nn   n.n
                           nnn Days     n               n.nn                       $nnn.nn    nn  nn.n     n,nnn.nn  nn.n
                           nnn Days     n               n.nn                       $nnn.nn    nn  nn.n     n,nnn.nn  nn.n
                           nnn Days     n               n.nn                     $n,nnn.nn     n   n.n     n,nnn.nn  nn.n
                    ttttt: nnn Days     n               n.nn              ttttt: $n,nnn.nn     n   n.n     n,nnn.nn  nn.n
                                     ----         ----------                                ----         ----------
                                      nnn         $nn,nnn.nn                                 nnn         $nn,nnn.nn

DerangedNick · January 30, 2008, 8:51am

The script you provided seems to have done the trick for the most part. It does cut into the header (there is a header at the top of each page) but I am fine with that. I replaced all of the text that was cut out using the sed command and it comes out decent. The only thing I can see that may be annoying is while printing it may still use the same amount of pages if I do not cut out the page break that is in the file, but I will have to look at it a little further to see what I may be able to do about that.

The character that separates the pages is causing a problem. I am working on trying to replace it with a space or something, but then the headers will not end up at the top of the page. Not too sure what I am going to do with this one yet.

All in all, thank you both for all the help; I will deal with the minor things I need to do with it from here. On the upside, I did learn a good bit out of doing this.

Side note, the script does seem to be removing the 4th line off of column 5. I will keep playing with it to see if I can figure out why. If there are 4 lines in column 5, it just leaves 3. If there are less than 4 it just keeps whatever is there.

Thanks again.

KevinADC · January 30, 2008, 1:53pm

Your original explanation included the widths for each column, that is what I used. If the width varies then there will be problems. If they are fixed-width columns this is a pretty simple task, but if they are not, trying to verify the data and extract a column looks to be very difficult, very very difficult.

If you want to keep the header remove the line that "skips the header" or adjust the number 7 to 6 or 8 and see if one of those works better.

Good luck.

DerangedNick · January 30, 2008, 1:59pm

The widths of the columns do appear to be fixed. It is just that around a line that has no data except in column 4, it seems to be removing the data in column 5 at the bottom.

Example:

Address is 4 lines long (column 1), so it will keep all 4 lines in column 5. If the address is 3 lines long, it will only keep 3 lines of column 5.

The header does not appear to be an issue though, it just removes the very top, which is fine.
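
One possible explanation, offered as a guess since the real data is not shown: the check next if (/^\s{58}/) skips every line whose first 58 characters are blank, and that includes a line that has nothing in columns 1-3 but still has data in column 5. A sketch of a narrower test that only drops a line when everything outside column 4 is blank (the helper name only_col4 is made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if the line has text in column 4
# (characters 59-85) and nothing but whitespace everywhere else.
sub only_col4 {
    my ($line) = @_;
    chomp $line;
    return 0 if length($line) <= 58;   # line never reaches column 4
    my $before = substr($line, 0, 58);
    my $col4   = substr($line, 58, 27);
    my $after  = length($line) > 85 ? substr($line, 85) : "";
    return ($before !~ /\S/ && $after !~ /\S/ && $col4 =~ /\S/) ? 1 : 0;
}

my $col4_only = (" " x 58) . "bbbbbbbbbb\n";   # would be dropped
my $col5_only = (" " x 85) . "NN NNTTTNN\n";   # would be kept
print "col4_only dropped\n" if only_col4($col4_only);
print "col5_only kept\n"    if !only_col4($col5_only);
```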