How to remove page breaks from a flat file?

kumarsaravana_s · April 16, 2007, 8:37am

Hi All,

I get a flat file whose last field sometimes splits onto a new line. I got this program from Vgersh, which concatenates the split data back onto the end of the previous record. But the program fails when it encounters a page break between the split data and the previous record. If these page breaks are removed, the program works fine.

Program

#!/usr/bin/nawk -f

BEGIN {
  FS=OFS="|"

  FLD_max=11
  
  # diagnostics are piped to this command, which writes them to standard error
  stderr="cat 1>&2" 
}
(fld + NF-1) > FLD_max {
       if (fld == FLD_max)
          print rec
       else
          printf("Incomplete record: [%d] :: [%s]\n", FNR, rec) | stderr
       rec=$0; fld=NF;next
}
NF < FLD_max {printf("Bad record: [%d] :: [%s]\n", FNR, $0) | stderr; rec=(rec != "") ? rec $0 : $0; fld+=(NF-1);next }
{rec=$0; fld=NF}
END {
  if (rec != "" && split(rec, a, FS) >= FLD_max ) print rec
}

Input...

000000|Apr 14 2007 7:59:58:376AM| |ASDFASFSDA |000000|0|0|0|3111|SDFSDF|�PP:?��?
/there is a page break here (shown as a kind of straight line in UltraEdit, but not showing here). This needs to be removed./
��?K
000004|Apr 14 2007 7:59:58:790AM| |ASFASFAS|000000|0|0|0|111|DSFSDF|?e͢��c?
��?�d
000000|Apr 14 2007 7:59:59:970AM| |ASFAFASA |00000|0|0|0|1111|SFDSFSD|?��ק�R��RS?
00000|Apr 14 2007 8:00:01:693AM| |ASFSAFAS |000000|0|0|0|111SDFSDF|�h>`=a�?��N?��H
000000|Apr 14 2007 8:00:02:350AM| |ASFAFA|00000|0|0|0111|SDFSD1|?�
???��?
000000|Apr 14 2007 8:00:02:700AM| |ASFSAFASSA |00000|0|0|0|9964|SDFSD|3`
�"�:`��I�?9V?

Output:

000000|Apr 14 2007 7:59:58:376AM| |ASDFASFSDA |000000|0|0|0|3111|SDFSDF|�PP:?��?��?K
000004|Apr 14 2007 7:59:58:790AM| |ASFASFAS|000000|0|0|0|111|DSFSDF|?e͢��c?
��?�d000000|Apr 14 2007 7:59:59:970AM| |ASFAFASA |00000|0|0|0|1111|SFDSFSD|?��ק�R��RS?
00000|Apr 14 2007 8:00:01:693AM| |ASFSAFAS |000000|0|0|0|111SDFSDF|�h>`=a�?��N?��H
000000|Apr 14 2007 8:00:02:350AM| |ASFAFA|00000|0|0|0111|SDFSD1|?�???��?
000000|Apr 14 2007 8:00:02:700AM| |ASFSAFASSA |00000|0|0|0|9964|SDFSD|3`�"�:`��I�?9V?

Thanks
Kumar

radoulov · April 16, 2007, 9:21am

If I understand the requirement correctly, with GNU Awk (on Linux, for example) you could try something like this, assuming all the records start with 0:

awk '$1=$1' RS="\n0"  inputfile

kumarsaravana_s · April 16, 2007, 9:43am

The records don't start with 0. To mask the actual data, I just put in some dummy values while maintaining the structure of the records. The records start with one of two numeric formats, like 100**** and 99****.

Regards,
Kumar

radoulov · April 16, 2007, 9:55am

So, what about (with GNU Awk):

awk '$1=$1{print $0 RT}' ORS= RS="\n(100|99)" inputfile

vgersh99 · April 16, 2007, 10:23am

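#!/usr/bin/nawk -f

BEGIN {
  FS=OFS="|"

  # number of fields in a complete record
  FLD_max=11

  # the form feed (page break) character
  FF=sprintf("\f")

  # diagnostics are piped to this command, which writes them to standard error
  stderr="cat 1>&2"
}
# strip form feeds; $1=$1 rebuilds the record, so NF is at least 1
# even when the line held nothing but a form feed
$0 ~ FF { gsub(FF, ""); $1=$1 }

# this line does not fit into the record being assembled: a new record starts
(fld + NF-1) > FLD_max {
  if (fld == FLD_max)
    print rec
  else
    printf("Incomplete record: [%d] :: [%s]\n", FNR, rec) | stderr
  rec=$0; fld=NF; next
}
# short line: report it and append it to the record being assembled
NF < FLD_max {printf("Bad record: [%d] :: [%s]\n", FNR, $0) | stderr; rec=(rec != "") ? rec $0 : $0; fld+=(NF-1);next }
# otherwise (e.g. the first line of the file): start a new record
{rec=$0; fld=NF}
END {
  # flush the last record if it is complete
  if (rec != "" && split(rec, a, FS) >= FLD_max ) print rec
}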
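
If changing the awk program is not an option, the form feeds could also be stripped in a separate step before the original program runs. The following is only a minimal sketch under a few assumptions: the sample data is made up, every form feed byte in the file is a page break rather than real data, and the file has no meaningful blank lines (the original program counts fields per line, and an empty line would throw off its running field count).

```shell
#!/bin/sh
# Made-up sample: line 2 holds only a form feed (\f, octal 014).
sample=$(mktemp)
printf 'a|b|c\n\f\nd\ne|f|g\n' > "$sample"

# How many lines contain a form feed?
ff=$(printf '\f')
ff_lines=$(grep -c "$ff" "$sample")
echo "lines with form feed: $ff_lines"

# Delete every form feed byte, then drop lines that ended up empty
# (assumption: the real data has no meaningful blank lines).
cleaned=$(tr -d '\f' < "$sample" | awk 'length($0) > 0')
printf '%s\n' "$cleaned"

rm -f "$sample"
```

On the real file, the cleaned output would be written to a new file and passed to the merge program as before, for example as `nawk -f merge.awk cleaned_file` (both file names here are hypothetical).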

kumarsaravana_s · April 17, 2007, 8:51am

vgersh99

You are an absolute genius, I feel. It works really well. Thank you so much.

Regards,
Kumar