Split long record into csv file

Hi

I receive a mainframe file which has very long records (1100 chars) with no field delimiters. I need to parse each record and output a comma-delimited (csv) file. The record layout is fixed. If there weren't so many fields and records I would read the file into Excel as a "fixed width" file and manually split the record into its separate components, but that is too time-consuming and, anyway, there are way too many records.

I was thinking of doing something in awk, like
read line
a=substr(line,1,5)
b=substr(line,6,2)
etc for each of the 226 fields
write a,b,c.......

but I'm sure there is a better way.
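
To make it concrete, I mean something along these lines (file names are just placeholders, and there would be 226 of those substr lines):

awk '
BEGIN { OFS = "," }
{
  a = substr($0,1,5)
  b = substr($0,6,2)
  # ... one substr per field ...
  print a, b
}' myinput > myoutput.csv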

Any help will be much appreciated.

Do you know the width of every field?
Can you come up with the list?

Because I often have to read through a fixed-record file, I created the following function. It shows a basic way of defining fields and making a csv file. After defining the function, I often do something like:

grep "^abc" myfile | eoc2csv >20090427file.csv

FSvar=","      ### set FS variable for field separator
eoc2csv ()
{
   awk -v FSvar="$FSvar" '
     BEGIN { OFS = FSvar }     ### set the output field separator once
     {
     SI=substr($0,1,13)
     NA=substr($0,14,30)
     AD=substr($0,44,92)
     GR=substr($0,136,9)
     CC=substr($0,164,6)
     UA=substr($0,170,4)
     DT=substr($0,180,3)
     ED=substr($0,183,8)
     DS=substr($0,251,8)
     JN=substr($0,271,10)
     SQ=substr($0,293,8)
     ET=substr($0,397,1)
     GT=substr($0,401,2)
     SD=substr($0,451,10)

     print SI,NA,AD,GR,CC,UA,DT,ED,DS,JN,SQ,ET,GT,SD
   }'
   return
}

I've also seen where someone did:

cut -c1-5 myfile >file001
cut -c6-15 myfile >file002
and so on, then
paste -d"," file001 file002 ... >newfile

Not really any easier way I can think of - one way or another you will have to define all those fields.

My favorite is from comp.lang.awk - that's if you don't have gawk. If you do have gawk installed, it already has the FIELDWIDTHS capability built in.
You can enhance this by passing FIELDWIDTHS on the command line:

function setFieldsByWidth(   i,n,FWS,start,copyd0) {
  # Licensed under GPL Peter S Tillier, 2003
  # NB corrupts $0
  copyd0 = $0                             # make copy of $0 to work on
  if (length(FIELDWIDTHS) == 0) {
    print "You need to set the width of the fields that you require" > "/dev/stderr"
    print "in the variable FIELDWIDTHS (NB: Upper case!)" > "/dev/stderr"
    exit(1)
  }

  if (!match(FIELDWIDTHS,/^[0-9 ]+$/)) {
    print "The variable FIELDWIDTHS must contain digits, separated" > "/dev/stderr"
    print "by spaces." > "/dev/stderr"
    exit(1)
  }

  n = split(FIELDWIDTHS,FWS)

  if (n == 1) {
    print "Warning: FIELDWIDTHS contains only one field width." > "/dev/stderr"
    print "Attempting to continue." > "/dev/stderr"
  }

  start = 1
  for (i=1; i <= n; i++) {
    $i = substr(copyd0,start,FWS[i])
    start = start + FWS[i]
  }
}


#I then call setFieldsByWidth() in my main awk code as follows:

BEGIN {
  #FIELDWIDTHS="7 6 5 4 3 2 1" # for example
  FIELDWIDTHS="1 3 8 8 5 9 1 9" # for example
  OFS="|"
}
!/^[  ]*$/ {
  saveDollarZero = $0 # if you want it later
  setFieldsByWidth()
  # now we can manipulate $0, NF and $1 .. $NF as we wish
  print $0 OFS
  next
}
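
Assuming the function and this main block are saved together in one script - call it split_fixed.awk (name made up) - the whole run is just:

awk -f split_fixed.awk myfixedfile > myfixedfile.out

With OFS="|" as above you get pipe-delimited output; set OFS="," in the BEGIN block if you want a csv.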

Thanks. I do know all the field widths. I'll try vgersh99's method and see how it goes. Unfortunately, we don't have gawk installed, but that just means I'll have to create the function as well.

Finally managed to implement the setFieldsByWidth solution, and got it working once I realised that the function is an awk function, not a shell function (which would reside in my $FPATH) :o. I do however need a little more help if possible - I have a lot of fields (200+) and the awk script is now erroring out because the FIELDWIDTHS variable is (much) longer than 399 characters. Is there an easy way around this?

A second problem is that I have 2 different record types in the file, which would require 2 different FIELDWIDTHS variables. Is it possible to do this, or would it be better to split the input file into 2 separate files before parsing?

Thanks for your help.

If you're on Solaris, try either /usr/bin/nawk or /usr/xpg4/bin/awk - you might get higher limits.

Sure, assuming you can programmatically determine the 'record type'...
Set up 2 FIELDWIDTH variables, FIELDWIDTH1 and FIELDWIDTH2, with the corresponding values. Implement the code to determine the 'record type'. Then:

if (recordType == recordType1)
   setFieldsByWidth(FIELDWIDTH1)
else
   setFieldsByWidth(FIELDWIDTH2)

Your 'setFieldsByWidth' function declaration would change to:

function setFieldsByWidth(FIELDWIDTH,       i,n,FWS,start,copyd0)
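
and the body would then use the argument instead of the global FIELDWIDTHS - roughly like this (an untested sketch, with the sanity checks trimmed down):

function setFieldsByWidth(FIELDWIDTH,       i,n,FWS,start,copyd0) {
  # NB still corrupts $0, same as the original
  copyd0 = $0
  if (length(FIELDWIDTH) == 0) {
    print "You need to pass the field widths as the first argument" > "/dev/stderr"
    exit(1)
  }
  n = split(FIELDWIDTH,FWS)
  start = 1
  for (i=1; i <= n; i++) {
    $i = substr(copyd0,start,FWS[i])
    start = start + FWS[i]
  }
}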

I'm on AIX, so I don't have the xpg4 awk version available. nawk seems to have the same 399 character limit.
Would it be possible to set up FIELDWIDTHS as an array? (in other words, set up the FWS variable directly instead of through split). Are there limits to array sizes as well?

Hm.......hate those artificial limits.
Sure, you can set the array directly, it's just tedious and not very nice to look at.

FIELDWIDTHS="1 3 8 8 5 9 1 9" # for example

#would translate into

FWS[1]=1
FWS[2]=3
FWS[3]=8
FWS[4]=8
FWS[5]=5
FWS[6]=9
FWS[7]=1
FWS[8]=9
n=8      # split() normally sets this, so you'd have to set the field count by hand as well

Another thing you can try is passing the FIELDWIDTHS variable on the awk command line (commenting out the hard-coded one in the script) - like so:

awk -v FIELDWIDTHS='1 3 8 8 5 9 1 9' -f myAWKscript myFile2process

but I bet the result will be the same - doesn't hurt to try.

Thanks. As you say, that wouldn't look very nice. I tried to be a bit creative, and it actually worked :D

  
FIELDWIDTHS1="8 8 18 1 15 6 10 1 8 1 1 3 12 7 9 10 8 2 2 4 8 10 3 2 1 3 3 2 2 3 2 5 1 2 2 1 9 2 3 12 1 1 12 1 2 3 7 2 8 1 2 8 4 6 4 4 4 4 4 4 4 4 4 4 4 4 9 12 9 10 1 8 1 2 2 2 1 1 1 1 1 4 2 5 8 1 2 1 1 1 4 12 12 13 1 4 13 12 4 13 12 4 13 12 4 13 12 4 13 12 4 13 12 4 13 12 4 13 12 4 13 12 3 3 3 3 3 3 3 8 10 14 12 8 2 2 13 8 8 12 12 12 12 12 1 6 12 12 12 10 3 2 11 10 1 13 10"
  FIELDWIDTHS2=" 1 2 6 6 6 5 5 5 1 13 1 1 1 7 9 13 13 29" # for example

This created 2 variables, both under the 399 character limit.

Then, in the setFieldsByWidth function:

  n = split(FIELDWIDTHS1 FIELDWIDTHS2,FWS)
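
One thing to note in case anyone copies this: the concatenation only works because FIELDWIDTHS2 happens to start with a space, otherwise the last width of the first variable and the first width of the second would run together. Putting an explicit space in the split avoids relying on that:

  n = split(FIELDWIDTHS1 " " FIELDWIDTHS2,FWS)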

Thanks for your help.

Ah........ that's a GREAT idea - it totally escaped my mind.
Good thinking!