Split files based on row delimiter count

I have a huge file (around 4-5 GB, containing 20 million rows) with text like:

<EOFD>11<EOFD>22<EORD>2<EOFD>2222<EOFD>3333<EORD>3<EOFD>44<EOFD>55<EORD>66<EOFD>888<EOFD>9999<EORD>

The above is actually a file extracted from SQL Server, with each field delimited by <EOFD> and each row ending with <EORD>. I need to split the file into chunks of maybe 2 million rows each. This is not a normal delimited file: it is basically one single huge line, with <EORD> marking the end of each row.
Can someone please advise how I can proceed?

If you have GNU awk (gawk) or mawk, you could try something like this, which should split the file into chunks (new files ending in "-chunknr") of 20,000,000 rows each, with the last file containing whatever rows remain:

awk -v n=20000000 'BEGIN{ORS=RS="<EORD>"} !(NR%n-1){close(f); f=FILENAME "-" ++c}{print>f}' file
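
Written out over several lines with comments, the same command looks like this (behavior unchanged):

awk -v n=20000000 '
BEGIN { ORS = RS = "<EORD>" }   # split records on <EORD> on input, and put it back on output
!(NR % n - 1) {                 # true for records 1, n+1, 2n+1, ... i.e. the start of each chunk
    close(f)                    # close the previous chunk file, if any
    f = FILENAME "-" ++c        # next output name: <inputfile>-1, <inputfile>-2, ...
}
{ print > f }                   # write the current record to the current chunk file
' file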

Thanks. The command works fine. Just one thing: it takes a really long time to split a file of, say, 3 GB. Is there a workaround for this? And just a small correction - I changed n=20000000 to n=2000000, as I need files in chunks of 2 million rows, not 20 million.

Define 'really huge'. How long for how large a file? Numbers please.

What is your disk speed? Are you reading and writing to the same disk?

For a scenario in which the file has around 20 million records and is around 3 GB in size, the split kept running for more than 10 minutes, at which point I had to close the session because that performance was unacceptable.
And I am reading and writing to the same disk. As for the disk speed, I am a bit of a novice in Unix, so I will have to do a little searching on that.

About how much output data was created in this time?

Reading and writing to the same disk greatly reduces its speed, especially with a spinning disk, which must seek back and forth repeatedly.
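
If you want a rough idea of the disk's sequential throughput, something like the following GNU dd test should give a ballpark figure (dd prints the transfer rate on its last status line; /tmp/ddtest is just an example path on the disk in question, and the test writes about 1 GB of scratch data):

# rough sequential write test: 1 GB of zeroes, flushed to disk so the rate is honest
dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024 conv=fdatasync
# rough sequential read test (may be inflated by the page cache right after writing)
dd if=/tmp/ddtest of=/dev/null bs=1M
rm /tmp/ddtest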

The total data created was around 1.2 million records.

Hi.

In addition to posting numbers instead of the very subjective "really huge time" (to me, that might be a life expectancy, say 80 years), it's always useful to see some comparisons, like the results of time wc <your-file-name>, time grep 'e' <your-file-name>, etc.
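
For example, with your file in place of /path/to/your-file (grep's matches sent to /dev/null so that printing to the terminal does not dominate the timing):

time wc /path/to/your-file
time grep 'e' /path/to/your-file > /dev/null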

Best wishes ... cheers, drl

1.2 million records? How large was that in bytes?

This does seem oddly slow; if the records are as you've shown them, at about 100 bytes each, that would be about 120 MB.