Help with splitting a large text file into smaller ones

Hi Everyone,

I am using a CentOS 5.2 server as an sFlow log collector on my network. Currently I am using InMon's free sflowtool to collect the packets sent by my switches. I have a bash script running in an infinite loop to stop and start the log collection at set intervals, currently one minute.

I have written some fairly in-depth analysis in bash and PHP to display information from the collected logs using grep, uniq, awk / gawk, sort etc. However, I would like to load this data into a MySQL database to start building historical trends. The problem I have is that the log files are too big for PHP to handle in one piece (5-15 MB), while the shell is able to rip through them effortlessly.
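Since the shell copes with the file size, one possible route is to let awk flatten each sample into a single CSV row before anything reaches PHP or MySQL. The following is only a sketch under assumptions: the directory `sflow_csv_demo`, the file names, and the choice of columns (timestamp, agent, source IP, destination IP, packet size) are illustrative and not part of the original setup. The field names are taken from the example datagrams further down.

```shell
#!/bin/sh
# Sketch only: directory, file names and column choice are assumptions.
set -e
mkdir -p sflow_csv_demo

# Cut-down stand-in for the sflowtool output shown later in the post.
cat > sflow_csv_demo/sample.log <<'EOF'
startDatagram =================================
datagramSourceIP 128.1.8.211
unixSecondsUTC 1247666217
startSample ----------------------
sampledPacketSize 66
srcIP 172.16.1.204
dstIP 172.16.1.202
endSample   ----------------------
startSample ----------------------
sampledPacketSize 172
srcIP 172.16.1.202
dstIP 128.1.8.25
endSample   ----------------------
endDatagram   =================================
EOF

awk '
BEGIN { OFS = "," }
# Datagram-level fields stay set until the next datagram overwrites them.
$1 == "datagramSourceIP"  { agent = $2 }
$1 == "unixSecondsUTC"    { ts = $2 }
# Reset per-sample fields so values never leak between samples.
$1 == "startSample"       { src = dst = size = "" }
$1 == "srcIP"             { src = $2 }
$1 == "dstIP"             { dst = $2 }
$1 == "sampledPacketSize" { size = $2 }
# One CSV row per completed sample.
$1 == "endSample"         { print ts, agent, src, dst, size }
' sflow_csv_demo/sample.log > sflow_csv_demo/samples.csv

cat sflow_csv_demo/samples.csv
```

A file in this shape can usually be bulk-loaded with MySQL's LOAD DATA INFILE, which avoids reading the raw log in PHP at all.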

I would like to split the text file into smaller files, one for each datagram.
Ideally the script would remove the datagram header information before the first "startSample" and insert just the corresponding "datagramSourceIP xxxx" line after each "startSample". But the main thing I am having a problem with is getting all the text between "startDatagram" and "endDatagram" into a separate file, maybe datag_00001 and so on.
If I could get this working, I'm sure I can hack my way through the rest. I have attached two (simplified) example datagrams below, so hopefully this will become clear.
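For reference, the "ideal" behaviour described above (drop the datagram header, keep only the samples, and repeat the datagramSourceIP line after each startSample) could be sketched in awk roughly as follows. This is a sketch under assumptions, not a tested production script: the directory `sflow_demo`, the `datag_NNNNN.txt` naming and the variable names are made up for illustration, and it assumes every sample sits inside a startDatagram / endDatagram pair.

```shell
#!/bin/sh
# Sketch only: directory and file names are illustrative assumptions.
set -e
mkdir -p sflow_demo

# Cut-down stand-in for the two example datagrams in the post.
cat > sflow_demo/sample.log <<'EOF'
startDatagram =================================
datagramSourceIP 128.1.8.211
datagramSize 1332
startSample ----------------------
srcIP 172.16.1.204
endSample   ----------------------
endDatagram   =================================
startDatagram =================================
datagramSourceIP 128.1.8.211
datagramSize 1272
startSample ----------------------
srcIP 172.16.7.8
endSample   ----------------------
endDatagram   =================================
EOF

awk -v dir=sflow_demo '
# New datagram: close the previous output file and start the next one.
/^startDatagram/    { if (out) close(out)
                      out = sprintf("%s/datag_%05d.txt", dir, ++n)
                      insample = 0; next }
# Remember the source IP line instead of printing it with the header.
/^datagramSourceIP/ { src = $0; next }
# Print startSample, then the remembered source IP line right after it.
/^startSample/      { insample = 1; print > out; print src > out; next }
# Lines inside a sample (including endSample itself) are copied through.
insample            { print > out }
/^endSample/        { insample = 0 }
' sflow_demo/sample.log

cat sflow_demo/datag_00001.txt
```

Header lines and the startDatagram / endDatagram markers are never printed, because nothing is written while `insample` is 0.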

Also, if anyone would like some help getting sFlow running, please feel free to contact me.

regards,
Joe

startDatagram =================================
datagramSourceIP 128.1.8.211
datagramSize 1332
unixSecondsUTC 1247666217
datagramVersion 5
agentSubId 0
agent 128.1.8.211
packetSequenceNo 3567929
sysUpTime 3321678884
samplesInPacket 8
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 826932
sourceId 0:302
meanSkipCount 200
samplePool 811594123
dropEvents 2567854
sampledPacketSize 66
strippedBytes 4
dstMAC 0014384cffdb
srcMAC 001438512401
IPSize 48
ip.tot_len 48
srcIP 172.16.1.204
dstIP 172.16.1.202
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 1389905
sourceId 0:299
meanSkipCount 200
samplePool 612447045
dropEvents 2666515
sampledPacketSize 172
strippedBytes 4
dstMAC 00005e000101
srcMAC 0014384cffdb
IPSize 154
ip.tot_len 154
srcIP 172.16.1.202
dstIP 128.1.8.25
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 1389906
sourceId 0:299
meanSkipCount 200
samplePool 612447045
dropEvents 2666515
sampledPacketSize 401
strippedBytes 4
dstMAC 00005e000101
srcMAC 0014384cffdb
IPSize 383
ip.tot_len 383
srcIP 172.16.1.202
dstIP 128.1.8.25
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 1389907
sourceId 0:299
meanSkipCount 200
samplePool 612447045
dropEvents 2666515
sampledPacketSize 110
strippedBytes 4
dstMAC 00005e000101
srcMAC 0014384cffdb
IPSize 92
ip.tot_len 92
srcIP 172.16.1.202
dstIP 128.1.8.25
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 601590
sourceId 0:300
meanSkipCount 200
samplePool 1859342402
dropEvents 187738
sampledPacketSize 1522
strippedBytes 8
dstMAC 00005e000132
srcMAC 001635c47fa6
IPSize 1500
ip.tot_len 1500
srcIP 172.16.128.21
dstIP 172.16.129.21
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 1389908
sourceId 0:299
meanSkipCount 200
samplePool 612447045
dropEvents 2666515
sampledPacketSize 81
strippedBytes 4
dstMAC 00005e000101
srcMAC 0014384cffdb
IPSize 63
ip.tot_len 63
srcIP 172.16.1.202
dstIP 128.1.8.25
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 1389909
sourceId 0:299
meanSkipCount 200
samplePool 612447045
dropEvents 2666515
sampledPacketSize 81
strippedBytes 4
dstMAC 00005e000101
srcMAC 0014384cffdb
IPSize 63
ip.tot_len 63
srcIP 172.16.1.202
dstIP 128.1.8.25
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 372438
sourceId 0:303
meanSkipCount 200
samplePool 2583509807
dropEvents 563863
sampledPacketSize 83
strippedBytes 8
dstMAC 00005e000101
srcMAC 0019bb2efe9d
IPSize 61
ip.tot_len 61
srcIP 172.16.1.156
dstIP 172.16.4.79
endSample   ----------------------
endDatagram   =================================
startDatagram =================================
datagramSourceIP 128.1.8.211
datagramSize 1272
unixSecondsUTC 1247666217
datagramVersion 5
agentSubId 0
agent 128.1.8.211
packetSequenceNo 3567930
sysUpTime 3321679274
samplesInPacket 8
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 11266
sourceId 0:25
meanSkipCount 200
samplePool 75214989
dropEvents 0
sampledPacketSize 110
strippedBytes 4
dstMAC 00005e0001c8
srcMAC 00144f61e63f
IPSize 92
ip.tot_len 92
srcIP 172.16.7.8
dstIP 128.1.100.72
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 826933
sourceId 0:302
meanSkipCount 200
samplePool 811595354
dropEvents 2567854
sampledPacketSize 64
strippedBytes 4
dstMAC 0014384cffdb
srcMAC 001438512401
IPSize 46
ip.tot_len 40
srcIP 172.16.1.204
dstIP 172.16.1.202
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 601591
sourceId 0:300
meanSkipCount 200
samplePool 1859342402
dropEvents 187738
sampledPacketSize 68
strippedBytes 8
dstMAC 0014c240a622
srcMAC 0050568767f4
IPSize 46
ip.tot_len 40
srcIP 172.16.0.79
dstIP 172.16.1.152
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 826934
sourceId 0:302
meanSkipCount 200
samplePool 811595354
dropEvents 2567854
sampledPacketSize 1518
strippedBytes 4
dstMAC 0014385196ab
srcMAC 00143851e23e
IPSize 1500
ip.tot_len 1500
srcIP 172.16.1.204
dstIP 172.16.1.203
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 1389910
sourceId 0:299
meanSkipCount 200
samplePool 612450941
dropEvents 2666515
sampledPacketSize 64
strippedBytes 4
dstMAC 00005e000101
srcMAC 001438505d9c
IPSize 46
ip.tot_len 41
srcIP 172.16.1.205
dstIP 128.1.8.25
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 1389911
sourceId 0:299
meanSkipCount 200
samplePool 612450941
dropEvents 2666515
sampledPacketSize 1518
strippedBytes 4
dstMAC 00143851e23e
srcMAC 0014385196ab
IPSize 1500
ip.tot_len 1500
srcIP 172.16.1.203
dstIP 172.16.1.204
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 826935
sourceId 0:302
meanSkipCount 200
samplePool 811595354
dropEvents 2567854
sampledPacketSize 64
strippedBytes 4
dstMAC 0014385196ab
srcMAC 00143851e23e
IPSize 46
ip.tot_len 40
srcIP 172.16.1.204
dstIP 172.16.1.203
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 3421493
sourceId 0:29
meanSkipCount 200
samplePool 1331420902
dropEvents 0
sampledPacketSize 142
strippedBytes 8
dstMAC 00040d9e7110
srcMAC 001185b99c1b
IPSize 120
ip.tot_len 120
srcIP 172.16.6.3
dstIP 172.16.6.2
endSample   ----------------------
endDatagram   =================================

nawk '/^startDatagram/ {if (out) close(out); out=sprintf("datag_%05d.txt", ++cnt); next} !/^endDatagram/ && out {print >> out}' myHugeFile
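
Spelled out over several lines, the same idea looks roughly like the sketch below. It is an illustration rather than the exact command from the reply: it calls plain `awk` (on CentOS that is normally GNU awk, while nawk is the name used on Solaris and some other systems), writes into a made-up `sflow_split_demo` directory, and skips any lines that appear before the first startDatagram.

```shell
#!/bin/sh
# Sketch only: directory and file names are illustrative assumptions.
set -e
mkdir -p sflow_split_demo

# Cut-down stand-in for the two example datagrams in the post.
cat > sflow_split_demo/sample.log <<'EOF'
startDatagram =================================
datagramSourceIP 128.1.8.211
packetSequenceNo 3567929
startSample ----------------------
srcIP 172.16.1.204
endSample   ----------------------
endDatagram   =================================
startDatagram =================================
datagramSourceIP 128.1.8.211
packetSequenceNo 3567930
startSample ----------------------
srcIP 172.16.7.8
endSample   ----------------------
endDatagram   =================================
EOF

awk -v dir=sflow_split_demo '
/^startDatagram/ {
    # Close the previous file so a large log does not exhaust open files.
    if (out) close(out)
    # Zero-padded counter gives datag_00001.txt, datag_00002.txt, ...
    out = sprintf("%s/datag_%05d.txt", dir, ++cnt)
    next                      # do not copy the startDatagram line itself
}
# Copy everything except the endDatagram marker, once a file is open.
!/^endDatagram/ && out { print > out }
' sflow_split_demo/sample.log

ls sflow_split_demo
```

The close() call is what keeps this usable on logs with thousands of datagrams, since only one output file is open at any time.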

Wow, thanks for a VERY quick response. It worked perfectly the first time, although I needed to use gawk. I was almost certain that the solution lay with awk, but I am surprised at how elegant and concise the code is.

Thanks again

Joe