AWK Shell Program to Split Large Files

mkastin · June 29, 2009, 9:32am

Hi,

I need some help creating a tidy shell program, in awk or another language, that will split very long files efficiently.

Here is an example dump:

<A001_MAIL.DAT>
0001  Ronald   McDonald  01 H81
0002  Elmo     St. Elmo  02 H82
0003  Cookie   Monster   01 H81
0004  Oscar    Grouche   03 H83
0005  Dumb     Name      02 H82
0006  Butter   Face      04 H84
0007  Ford     F150      01 H81
0008  Last     One       03 H83

<A001_MAIL_H81.dat>
0001  Ronald   McDonald  01 H81
0003  Cookie   Monster   01 H81
0007  Ford     F150      01 H81

<A001_MAIL_H82.dat>
0002  Elmo     St. Elmo  02 H82
0005  Dumb     Name      02 H82

<A001_MAIL_H83.dat>
0004  Oscar    Grouche   03 H83
0008  Last     One       03 H83

<A001_MAIL_H84.dat>
0006  Butter   Face      04 H84

This is a very small sample; normally the files are 500 bytes per line and between a hundred thousand and a hundred million lines long.

I'm looking for a simple single-line command that passes over the file once and creates files like the ones shown above. I'm very new to awk, but I created something that almost accomplishes my goal.

awk '!/^$/{
key=substr($0,29,3)
print $0 > key".dat"
}' A001_MAIL.DAT

This takes the file and does essentially the following:

<H81.dat>
0001  Ronald   McDonald  01 H81
0003  Cookie   Monster   01 H81
0007  Ford     F150      01 H81

<H82.dat>
0002  Elmo     St. Elmo  02 H82
0005  Dumb     Name      02 H82

<H83.dat>
0004  Oscar    Grouche   03 H83
0008  Last     One       03 H83

<H84.dat>
0006  Butter   Face      04 H84

What I need help with is getting the naming convention right and turning this into something other people can execute in a single line that is as simple as possible. I was thinking of something such as: $ awksplit filename

Help me recycle this or point me in a new direction.

Thanks for all the help everyone!
Matthew

vgersh99 · June 29, 2009, 9:49am

nawk '{if(out) close(out);out=$NF ".dat"; print >> out}' myFile

mkastin · June 29, 2009, 10:55am

Thanks for the quick reply vgersh99, but I don't have nawk and cannot have it installed.

vgersh99 · June 29, 2009, 11:01am

Why don't you try 'awk' instead?

mkastin · June 29, 2009, 11:09am

I guess I should've mentioned that I did try that. It didn't work: it created a copy of the original file (testmail.dat) as X?.dat.

vgersh99 · June 29, 2009, 11:31am

given mka.txt:

0001  Ronald   McDonald  01 H81
0002  Elmo     St. Elmo  02 H82
0003  Cookie   Monster   01 H81
0004  Oscar    Grouche   03 H83
0005  Dumb     Name      02 H82
0006  Butter   Face      04 H84
0007  Ford     F150      01 H81
0008  Last     One       03 H83

code:

nawk '{if(out) close(out);out=$NF ".dat"; print >> out}' mka.txt

produces 4 files: H81.dat, H82.dat, H83.dat and H84.dat.
E.g. H81.dat:

0001  Ronald   McDonald  01 H81
0003  Cookie   Monster   01 H81
0007  Ford     F150      01 H81

mkastin · June 29, 2009, 11:39am

Okay, it's totally my fault for creating a sample that isn't truly accurate to my needs. It does appear to work in this case, but it doesn't fit my actual file. Here is a more accurate sample:

0001  Ronald   McDonald  01 H81 0001256 X
0002  Elmo     St. Elmo  02 H82 0089621 X
0003  Cookie   Monster   01 H81 0887141 X
0004  Oscar    Grouche   03 H83 0364471 X
0005  Dumb     Name      02 H82 0000233 X
0006  Butter   Face      04 H84 0014666 X
0007  Ford     F150      01 H81 0000001 X
0008  Last     One       03 H83 7741668 X

Sorry for the confusion.

vgersh99 · June 29, 2009, 12:16pm

awk '{if(out) close(out);out=$(NF-2) ".dat"; print >> out}' mka.txt

mkastin · June 29, 2009, 12:24pm

Is it possible to substring the field instead of using the 'NF' option? In the files there can be anywhere from 6 to 50 populated fields between the field to split on and the 'X' at the EOL. Sorry that I'm being such a pain, but I really appreciate the help. Also, I'm looking for the output names to be like this: <infile name>_<string>.dat

Thanks again and again!

vgersh99 · June 29, 2009, 12:26pm

could you provide a couple of varying samples so that we could see a 'pattern' and identify the fields/strings, please.

mkastin · June 29, 2009, 12:34pm

0001  Ronald   McDonald  01 H81 0001256 0100111               X
0002  Elmo     St. Elmo  02 H82 0089621  001  10 11 01 1      X
0003  Cookie   Monster   01 H81 0887141    1  .  0   0  .  1  X
0004  Oscar    Grouche   03 H83 0364471                   110.X
0005  Dumb     Name      02 H82 0000233   111 00 1111 00000000X
0006  Butter   Face      04 H84 0014666 1111111111111111111111X
0007  Ford     F150      01 H81 0000001 00111 110 110  0011 ..X
0008  Last     One       03 H83 7741668 0 1 0 1 0 1 0 1 0 1 0 X

Have I mentioned yet that you're awesome,

vgersh99 · June 29, 2009, 12:41pm

Is it safe to assume that the 'strings' are always in the form 'H<digit><digit><digit>..' ?
Are the fields TAB separated by any chance?
What 'safe' assumption can be made to get to the string?

mkastin · June 29, 2009, 12:48pm

Okay, this is a fixed-length file, so no tabs. The string is a 25-character varchar, so it does not conform to my H### example. The 'safe' assumption is that the string to split on is always at the same byte position; in my example, byte 29 for a length of 3. There is no consistent number of fields before or after the split string, as they can be populated or not depending on the data available for the individual represented.
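To illustrate the point about byte positions, here is a small sketch that runs three of the sample rows above through awk and prints, for each row, what $(NF-2) yields next to what substr($0,29,3) yields. The variable name "result" and the "|" separator are only there for the demonstration.

```shell
# Three rows copied from the varying sample earlier in the thread.
# Left of "|": the whitespace-delimited field $(NF-2).
# Right of "|": the three bytes starting at byte 29.
result=$(printf '%s\n' \
  '0001  Ronald   McDonald  01 H81 0001256 0100111               X' \
  '0002  Elmo     St. Elmo  02 H82 0089621  001  10 11 01 1      X' \
  '0004  Oscar    Grouche   03 H83 0364471                   110.X' |
awk '{ print $(NF-2) "|" substr($0, 29, 3) }')

echo "$result"
```

Because the number of whitespace-separated fields changes from row to row, the left-hand column moves around, while the fixed-position substr stays on the split string.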

vgersh99 · June 29, 2009, 12:52pm

nawk '{if(out) close(out);out=FILENAME "_" substr($0,29,3) ".dat"; print >> out}' mka.txt

mkastin · June 29, 2009, 1:10pm

This is great, thanks! I could use a little more help tweaking this, though. Here are a few things:

can we drop the extension off the filename
can we 'compress' extraneous white space off the substr in the filename
if I were to put this into a script, how can I pass the filename in as an argument

Thanks!!!!!!!!!!!!

vgersh99 · June 29, 2009, 1:16pm

mka.sh

#!/bin/ksh
myFile="${1}"
nawk '{if(out) close(out); f=FILENAME; sub("\\.[^./]*$","",f); k=substr($0,29,3); gsub(/ /,"",k); out=f "_" k ".dat"; print >> out}' "${myFile}"

mka.sh /path2myFile/myFile
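For files in the hundred-million-line range, closing and reopening an output file on every record may get expensive. One possible variant, offered as a sketch rather than a tested production script, switches files only when the key changes, so at most one output file is open at a time. The function name awksplit is borrowed from the original question; the optional second and third arguments (start byte and length, defaulting to 29 and 3) are assumptions added for illustration, as is the temporary working directory used for the demo.

```shell
#!/bin/sh
# Hypothetical wrapper: awksplit FILE [START [LEN]]
# START and LEN default to the byte position and length named in the thread.
awksplit() {
  awk -v start="${2:-29}" -v len="${3:-3}" '
    FNR == 1 {
      base = FILENAME
      sub("\\.[^./]*$", "", base)      # drop a trailing extension such as .DAT
    }
    !/^$/ {                            # skip empty lines, as in the first attempt
      key = substr($0, start, len)
      gsub(/[ \t]+/, "", key)          # remove white space from the key
      out = base "_" key ".dat"
      if (out != prev) {               # switch files only when the key changes
        if (prev != "") close(prev)
        prev = out
      }
      print >> out                     # append, so reopening a file keeps its rows
    }
  ' "$1"
}

# Demo on the small sample from the first post, in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' \
  '0001  Ronald   McDonald  01 H81' \
  '0002  Elmo     St. Elmo  02 H82' \
  '0003  Cookie   Monster   01 H81' \
  '0004  Oscar    Grouche   03 H83' \
  '0005  Dumb     Name      02 H82' \
  '0006  Butter   Face      04 H84' \
  '0007  Ford     F150      01 H81' \
  '0008  Last     One       03 H83' > A001_MAIL.DAT

awksplit A001_MAIL.DAT
ls
```

Since the output is opened in append mode, running the script twice on the same input doubles the rows in each output file, so old output files should be removed before a rerun. If the input is already grouped by key, the file switch happens only once per key.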

summer_cherry · June 29, 2009, 11:51pm

awk '{file=sprintf("%s.txt",$5);print $0 >> file}' yourfile