Need help splitting huge single record file

I was given a data file that I need to split into multiple lines/records based on a keyword. The problem is that it is 2.5GB or bigger and everything I try in perl or sed causes a Segmentation fault. Can someone give me some other ideas?

The data is of the form:

RANDOMDATA*end*RANDOMDATA*end*RANDOMDATA*end*RANDOMDATA*end*

with no LFs to break it up.
I have tried things such as:

sed "s/\*end\*/\n/g" test1.text > test2.txt
cat test1.text | sed "s/\*end\*/\n/g" > test2.txt
perl -p -e "s/\*end\*/\n/g" test1.text > test2.txt

which all fail with:

Segmentation fault

Any ideas?

Thanks in advance!

Have you played around with the fold command?

I am not sure how I would apply fold here. I only want to insert new lines where the string "*end*" occurs, and the RANDOMDATA can be of varying length and contain *'s.

Those commands are trying to load the entire line into memory at once, which won't work on a 32-bit system: there's a 4GB limit on per-process address space, and enough other things already mapped into it that you probably can't get a contiguous 2.5GB in one chunk.
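
The fix is to avoid ever holding more than one record at a time. Since Perl is already in the mix, one option is to set Perl's input record separator, $/, to the literal delimiter so it reads record by record instead of one giant line. A rough sketch, untested against your data and reusing your test1.text/test2.txt names:

perl -e '$/ = "*end*"; while (<>) { chomp; print "$_\n" if length }' test1.text > test2.txt

chomp strips whatever $/ is set to, so each "*end*" is dropped and replaced by the newline that print appends, and memory use is bounded by the largest single record rather than the whole file.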

How big are the individual records? You can tell awk to use things other than \n as a record separator, changing its definition of lines:

$  awk -v RS='\\*end\\*' '1' < data
RANDOMDATA
RANDOMDATA
RANDOMDATA
RANDOMDATA
$

Use nawk or gawk if you have it.


Ah ha! That's it. In the back of my mind I knew that you could do regex record separators in awk, but for some reason it never occurred to me to try it here. It worked perfectly.
Thanks a lot!

Not in all awk, unfortunately, just new-style awk like gawk or nawk.
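
If you are stuck with an old awk, a portable workaround (again only a sketch, untested) is to keep RS as the single character * and glue the pieces back together until a bare "end" piece shows up, which marks the real delimiter:

awk 'BEGIN { RS = "*" }
$0 == "end" { print rec; rec = sep = ""; next }   # a lone "end" piece means we just crossed "*end*"
{ rec = rec sep $0; sep = "*" }                   # otherwise re-join the pieces, restoring any *s inside the data
END { sub(/\n$/, "", rec); if (rec != "") print rec }' test1.text > test2.txt

Memory use is still bounded by the largest single record rather than the whole file, though some very old awks impose their own record-length limits, so it is worth trying on a small slice of the data first.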