Split files based on row delimiter count

I have a huge file (around 4-5 GB, containing 20 million rows) with text like:

<EOFD>11<EOFD>22<EORD>2<EOFD>2222<EOFD>3333<EORD>3<EOFD>44<EOFD>55<EORD>66<EOFD>888<EOFD>9999<EORD>

The above is actually a file extracted from SQL Server, with each field delimited by <EOFD> and each row ending with <EORD>. I need to split the file into chunks of maybe 2 million rows each. This is not a normal delimited file: it is basically one single huge line, with <EORD> marking the end of each row.
Can someone please advise how I can proceed?

If you have GNU awk (gawk) or mawk, you could try something like this, which should split the file into chunks (new files ending in "-chunknr") of 20,000,000 rows each, with the last file containing whatever rows remain:

awk -v n=20000000 'BEGIN{ORS=RS="<EORD>"} !(NR%n-1){close(f); f=FILENAME "-" ++c}{print>f}' file
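
Written out over several lines with comments, the same command looks like this (behavior unchanged):

awk -v n=20000000 '
BEGIN { ORS = RS = "<EORD>" }   # split records on <EORD> on input, and put it back on output
!(NR % n - 1) {                 # true for records 1, n+1, 2n+1, ... i.e. the start of each chunk
    close(f)                    # close the previous chunk file, if any
    f = FILENAME "-" ++c        # next output name: <inputfile>-1, <inputfile>-2, ...
}
{ print > f }                   # write the current record to the current chunk file
' file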

Thanks. The command works fine. Just one thing: it takes a really long time to split a file of, say, 3 GB. Is there a workaround for this? And just a small correction - I changed n=20000000 to n=2000000, as I need files in chunks of 2 million rows, not 20 million.

Define 'really huge'. How long for how large a file? Numbers please.

What is your disk speed? Are you reading and writing to the same disk?

For a scenario in which the file has around 20 million records and is around 3 GB in size, the split kept running for more than 10 minutes, at which point I had to close the session because that performance was unacceptable.
And I am reading and writing to the same disk. As for the disk speed, I am a bit of a novice in Unix, so I will have to do a little searching on that.

About how much output data was created in this time?

Reading and writing to the same disk greatly reduces its speed, especially with a spinning disk, which must seek back and forth repeatedly.
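
If you want a rough idea of the disk's sequential throughput, something like the following GNU dd test should give a ballpark figure (dd prints the transfer rate on its last status line; /tmp/ddtest is just an example path on the disk in question, and the test writes about 1 GB of scratch data):

# rough sequential write test: 1 GB of zeroes, flushed to disk so the rate is honest
dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024 conv=fdatasync
# rough sequential read test (may be inflated by the page cache right after writing)
dd if=/tmp/ddtest of=/dev/null bs=1M
rm /tmp/ddtest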

The total data created was around 1.2 million records.

Hi.

In addition to posting numbers instead of the very subjective "really huge time" (to me, that might be a life expectancy, say 80 years), it's always useful to see some comparisons, like the results of time wc <your-file-name>, time grep 'e' <your-file-name>, etc.
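
For example, with your file in place of /path/to/your-file (grep's matches sent to /dev/null so that printing to the terminal does not dominate the timing):

time wc /path/to/your-file
time grep 'e' /path/to/your-file > /dev/null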

Best wishes ... cheers, drl

1.2 million records? How large was that in bytes?

This does seem oddly slow; if the records are as you've shown them, at about 100 bytes each, that would be about 120 MB.