I need some help creating a tidy shell program, in awk or another language, that will split very large files efficiently.
Here is an example input file, followed by the output files I want from it:
<A001_MAIL.DAT>
0001 Ronald McDonald 01 H81
0002 Elmo St. Elmo 02 H82
0003 Cookie Monster 01 H81
0004 Oscar Grouche 03 H83
0005 Dumb Name 02 H82
0006 Butter Face 04 H84
0007 Ford F150 01 H81
0008 Last One 03 H83
<A001_MAIL_H81.dat>
0001 Ronald McDonald 01 H81
0003 Cookie Monster 01 H81
0007 Ford F150 01 H81
<A001_MAIL_H82.dat>
0002 Elmo St. Elmo 02 H82
0005 Dumb Name 02 H82
<A001_MAIL_H83.dat>
0004 Oscar Grouche 03 H83
0008 Last One 03 H83
<A001_MAIL_H84.dat>
0006 Butter Face 04 H84
This is a very small sample; normally the files are 500 bytes per line and run between a hundred thousand and a hundred million lines.
I'm looking for something that, in a simple single-line command, will pass over the file once and create files like the ones shown above. I'm very new to awk, but I created something that almost accomplishes my goal.
awk '!/^$/{
key=substr($0,28,3)
print $0 > (key ".dat")
}' A001_MAIL.DAT
This takes the file and produces essentially the following:
<H81.dat>
0001 Ronald McDonald 01 H81
0003 Cookie Monster 01 H81
0007 Ford F150 01 H81
<H82.dat>
0002 Elmo St. Elmo 02 H82
0005 Dumb Name 02 H82
<H83.dat>
0004 Oscar Grouche 03 H83
0008 Last One 03 H83
<H84.dat>
0006 Butter Face 04 H84
What I need help with is getting the naming convention corrected and turning this into something I can have others execute in a single line that is as simple as possible. I was thinking of something such as: $ awksplit filename
Help me recycle this or point me in a new direction.
Okay, it's totally my fault for creating a sample that isn't truly accurate to my needs; it does appear to work in this case, but it doesn't fit my actual file. Here is a more accurate sample:
0001 Ronald McDonald 01 H81 0001256 X
0002 Elmo St. Elmo 02 H82 0089621 X
0003 Cookie Monster 01 H81 0887141 X
0004 Oscar Grouche 03 H83 0364471 X
0005 Dumb Name 02 H82 0000233 X
0006 Butter Face 04 H84 0014666 X
0007 Ford F150 01 H81 0000001 X
0008 Last One 03 H83 7741668 X
Is it possible to substring the field instead of using the 'NF' option? In the real files there can be anywhere from 6 to 50 populated fields between the field to split on and the 'X' at the EOL. Sorry that I'm being such a pain, but I really appreciate the help. Also, I'm looking for the output names to be like this: <infile name>_<string>.dat
Is it safe to assume that the 'strings' are always in the form 'H<digit><digit><digit>..' ?
Are the fields TAB separated by any chance?
What 'safe' assumption can be made to get to the string?
Okay, this is a fixed-length file, so no tabs. The string is a 25-character varchar, so it does not conform to my H### example. The 'safe' assumption is that the string to split on is always at the same byte position: in my example, byte 29 for a length of 3. There is no consistent number of fields before or after the split string, as they can be populated or not depending on the data available for the individual represented.
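Given a fixed byte position, something along these lines might do as a starting point. It is only a sketch: the optional START and LENGTH arguments are my own addition (defaulting to the byte 29, length 3 from the example), the output name is built by stripping the last extension from the input name, and padding spaces around the key are trimmed on the assumption that the 25-character field is space-padded.

```shell
# awksplit: split a fixed-width file into one output file per key value.
# Usage: awksplit FILE [START] [LENGTH]
# START and LENGTH are assumed defaults (29 and 3, as in the example above);
# adjust them to the real record layout.
awksplit() {
    infile=$1
    start=${2:-29}
    len=${3:-3}
    base=${infile%.*}    # A001_MAIL.DAT -> A001_MAIL

    awk -v base="$base" -v start="$start" -v len="$len" '
    !/^$/ {                                 # skip empty lines, as in the original attempt
        key = substr($0, start, len)        # key taken by byte position, not by field
        gsub(/^ +| +$/, "", key)            # trim padding (assumes space-padded key)
        print > (base "_" key ".dat")       # e.g. A001_MAIL_H81.dat
    }' "$infile"
}
```

With the function defined in a profile or sourced file, "awksplit A001_MAIL.DAT" should read the file once and write A001_MAIL_H81.dat, A001_MAIL_H82.dat and so on next to the input. Two caveats: each distinct key keeps one output file open until awk exits, so a file with a very large number of distinct keys may hit the open-file limit in some awk implementations, and a key containing a "/" would not make a valid file name as written.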