splitting a large text file into paragraphs

Hello all, newbie here. I've searched the forum and found many "how to split a text file" topics but none that are what I'm looking for.

I have a large text file (~15 MB). It contains a variable number of "paragraphs" (for lack of a better word), each of variable length. A paragraph might be 2 lines long, or 2000 lines long, or anything in between. Each paragraph begins with the same string of text in its first line, and is preceded by a blank line. There could be random blank lines throughout each paragraph. The "paragraph start" string ONLY appears at the start of each paragraph and never anywhere else.

I need a script that will read this huge text file, and save each paragraph out as a separate text file with some kind of unique name.

For example, if our big file contains:

Paragraph start. sdfgsdfgsdfggggggggggggggggggggggggggggggggg
dddddddddddddfgsddddddddddddddddddddddddddd
gfdsssssssssssssssssssssssssssssssssssssssssssssssss
33333333333333333333333333333333333333333333333

Paragraph start. gfdsdfgsdfgsdfgsdfdssssssssssssssssssssssssssssffffff
fgfdssssssssssssssssssssssssssssssssssssssssssssssss
gfdsssssssssssssssssssssssssssssssssssssssssssssssss
gfdsssssssssssssssssssssssssssssssssssssssssssssssss

gfdsdrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr

gsssssssssssssssssssssssssssssssssssssssssssssssssss
kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll

Paragraph start. gfdsdfggggggggggggggggggggggggggggggggggggg
5555555555555555555555555555555555555555555555

I need it to read this big file, and produce the following separate text files:

Output file 1:

Paragraph start. sdfgsdfgsdfggggggggggggggggggggggggggggggggg
dddddddddddddfgsddddddddddddddddddddddddddd
gfdsssssssssssssssssssssssssssssssssssssssssssssssss
33333333333333333333333333333333333333333333333

Output file 2:

Paragraph start. gfdsdfgsdfgsdfgsdfdssssssssssssssssssssssssssssffffff
fgfdssssssssssssssssssssssssssssssssssssssssssssssss
gfdsssssssssssssssssssssssssssssssssssssssssssssssss
gfdsssssssssssssssssssssssssssssssssssssssssssssssss

gfdsdrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr

gsssssssssssssssssssssssssssssssssssssssssssssssssss
kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll

Output file 3:

Paragraph start. gfdsdfggggggggggggggggggggggggggggggggggggg
5555555555555555555555555555555555555555555555

It seems like a simple problem, but it is above the reach of my modest shell scripting skills.

Thanks in advance!

Hi

awk '/Paragraph start/{if(NR!=1){for(i=0;i<j;i++)print a[i]>"file"k;j=0;k++;}a[j++]=$0;next}{a[j++]=$0;}END{for(i=0;i<j;i++)print a[i]>"file"k}' i=0 k=1  file

You will have output files created as file1, file2, file3 and so on.

Guru.

Very nice Guru, you are GOOD sir! Thank you.

You should put a 'close' in there unless you want to rapidly run into awk's maximum open file limit. But then you'd need nawk (or another newer awk), since the old awk has no close().
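For what it's worth, here is a rough sketch of what a version with close() might look like. It is not Guru's script: it writes each line straight to the current output file instead of buffering the paragraph in an array, and it closes the previous file whenever a new paragraph starts. The input name big.txt and the sample data are made up for the demonstration, and anchoring the pattern with ^ assumes the marker always sits at the very beginning of the line.

```shell
# Build a small sample input to try the script on (big.txt is a made-up name).
printf '%s\n' \
  'Paragraph start. aaa' \
  'bbb' \
  '' \
  'Paragraph start. ccc' \
  'ddd' \
  '' \
  'eee' \
  '' \
  'Paragraph start. fff' \
  'ggg' > big.txt

awk '
/^Paragraph start/ {
    if (out != "") close(out)   # finished with the previous paragraph file
    k++
    out = "file" k              # file1, file2, file3, ...
}
out != "" { print > out }       # skip anything before the first marker
' big.txt
```

Only one output file is open at any time, so the open file limit should not come into play no matter how many paragraphs there are. As with the one-liner above, the blank line that separates two paragraphs ends up at the end of the earlier file.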
