Split a file at the location of a text pattern

borgeh · September 26, 2007, 9:16am

I have a file that I want to split in two, preferably with the Bourne shell (sh). The file is a configuration file for several elements, so it consists of a repeated configuration pattern like this:

config.txt:
#fruit banana
#color yellow
#surface smooth
size 20cm

#fruit apple
#color green
#surface smooth
size 7cm

#fruit grape
#color green
#surface smooth
size 2cm

I want to split the file into two pieces that are as equal as possible, but the split has to fall at the start of an element (a line starting with #fruit). If the configuration file has an odd number of entries, one of the files should get one more item; otherwise the two resulting files should have the same number of items.

The tags like "#fruit" are unique, so they can be used with e.g. "grep" combined with "wc -l" to find the number of items and the element at which to split.

Is this a typical awk job?

Borgeh

drl · September 26, 2007, 10:09am

Hi.

How is this different from http://www.unix.com/shell-programming-and-scripting/43297-splitting-av-file-in-2-at-specific-place-based-on-textpattern.html#post302137399
and what have you done so far to solve it? ... cheers, drl

borgeh · September 26, 2007, 10:18am

It's not different, but since I received no answer to that query, I decided to describe the problem in a different way, since it may have been difficult to understand what I meant. I have a feeling that this should be an easy task to solve, but I am stuck. I have worked out which element to split at by grep'ing for "#fruit" to count the elements and then using "expr" with "/" to get the closest integer value for the number of the element where I should split. From there I am unsure about the rest. I suspect awk is the way to go, but I am not sure how. Another option is to find the line number of the start of the element where I should cut.
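
The line-number option mentioned at the end of this post could be sketched roughly as follows. This is only an illustration: it assumes a POSIX-style sh, at least two records in the file, and records that always begin with a line starting with "#fruit". The output names part1.txt and part2.txt are made up for the example.

```shell
# Sample data copied from the question (three records).
cat > config.txt <<'EOF'
#fruit banana
#color yellow
#surface smooth
size 20cm

#fruit apple
#color green
#surface smooth
size 7cm

#fruit grape
#color green
#surface smooth
size 2cm
EOF

# Number of records = number of lines starting with "#fruit".
total=`grep -c '^#fruit' config.txt`

# Records for the first file; rounding up gives it the extra one on an odd count.
first=`expr \( "$total" + 1 \) / 2`

# Line number of the "#fruit" line that should open the second file.
next=`expr "$first" + 1`
cutline=`grep -n '^#fruit' config.txt | sed -n "${next}p" | cut -d: -f1`

# Everything before that line goes to part1.txt, the rest to part2.txt
# (both names are hypothetical).
head -n `expr "$cutline" - 1` config.txt > part1.txt
tail -n +"$cutline" config.txt > part2.txt
```

With the sample above, the cut lands on the "#fruit grape" line, so the first file keeps banana and apple and the second file gets grape.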

tomas · September 26, 2007, 10:34am

Grep for #fruit, then get a count with wc -l. Your source is structured with 5 lines for each entry (4 lines of settings plus the blank separator line), so divide the number of #fruit entries found by 2, then multiply that by 5 using bc. You can then use the split -l command to make your two files from that result. I would add a check to make sure none of the lines go missing.
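
A minimal sketch of this suggestion, for illustration only. It assumes every entry occupies exactly 5 lines (blank separator included) and that there are at least two entries; it uses expr in place of bc for the arithmetic, and the "half_" prefix is a made-up name. The cmp step is one possible way to do the "no lines missing" check.

```shell
# Sample data copied from the question (three records).
cat > config.txt <<'EOF'
#fruit banana
#color yellow
#surface smooth
size 20cm

#fruit apple
#color green
#surface smooth
size 7cm

#fruit grape
#color green
#surface smooth
size 2cm
EOF

# Count the entries and work out how many go into the first file (rounded up).
total=`grep -c '^#fruit' config.txt`
first=`expr \( "$total" + 1 \) / 2`

# Each entry is assumed to be 5 lines long, so the first file gets first*5 lines.
lines=`expr "$first" \* 5`

# split names its output half_aa, half_ab, ... ("half_" is a hypothetical prefix).
split -l "$lines" config.txt half_

# Check that the two pieces put back together are identical to the original.
if cat half_aa half_ab | cmp -s - config.txt; then
    status=ok
else
    status=mismatch
fi
echo "split check: $status"
```

If entries can have a varying number of lines, the fixed multiplier no longer lands on an entry boundary and a pattern-based cut is safer.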

borgeh · September 26, 2007, 11:05am

Thanks!
I think this is close to a way to do it. How does multiplying by 5 help me find the correct place to cut?
Something like this might work:

Filter out leading or trailing blank lines to make sure the count will be correct.
Grep for "#fruit" and pipe it through "wc -l" to get the number of "#fruit" elements.
If the number is even, I can split in the middle:

"wc -l"/2.

If the number is odd, I can split at:

("wc -l"/2)+3 lines

Then I probably have to adjust one line up or down to get the split exact.
Hmm... if this works, I need a way to find out whether a number is odd or even.
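
One way to test for odd or even in plain sh is the remainder operator of expr, sketched below. The value of count is a made-up stand-in for the result of the grep and wc -l step. The last line shows that adding 1 before the integer division rounds up in both cases, which may make the odd/even test unnecessary.

```shell
# Hypothetical record count; in practice this would come from grep and wc -l.
count=3

# expr prints the remainder of the division by 2: 0 for even, 1 for odd.
if [ `expr "$count" % 2` -eq 0 ]; then
    parity=even
else
    parity=odd
fi

# Integer division after adding 1 rounds up, so no separate odd/even branch is needed.
first=`expr \( "$count" + 1 \) / 2`

echo "$count records ($parity): first file gets $first"
```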

Borgeh

drl · September 26, 2007, 11:07am

Hi.

An awk script may be useful. There is a special variable "RS", Record Separator, that may be set to read "paragraphs", i.e. groups of lines separated by an empty line:

RS = ""

That would allow you to treat your file as essentially just a number of such records.

With your calculated knowledge of where you want the split to be, the "pattern" part of an awk statement:

  pattern { action }

should allow you to complete the solution with the use of another builtin variable, "NR", the number of the current record. This is because the pattern part may be a logical expression, such as:

NR <= 5 { some-action-for-this-case }

the action might be something as simple as print ... cheers, drl
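
Putting the RS and NR hints together, a minimal sketch might look like the following. It assumes an awk that supports the -v option (POSIX awk, nawk, gawk or mawk) and records separated by blank lines. The variable name first and the output names part1.txt and part2.txt are made up for the example.

```shell
# Sample data copied from the question (three records).
cat > config.txt <<'EOF'
#fruit banana
#color yellow
#surface smooth
size 20cm

#fruit apple
#color green
#surface smooth
size 7cm

#fruit grape
#color green
#surface smooth
size 2cm
EOF

# Number of records, and how many of them go into the first file (rounded up).
total=`grep -c '^#fruit' config.txt`
first=`expr \( "$total" + 1 \) / 2`

awk -v first="$first" '
BEGIN { RS = ""; ORS = "\n\n" }             # paragraph mode; write a blank line after each record
NR <= first { print > "part1.txt"; next }   # the first half of the records
            { print > "part2.txt" }         # everything else
' config.txt
```

Because ORS is set to two newlines, each output file ends with a blank line. Since awk counts whole records here, the number of lines per record does not matter.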

jim_mcnamara · September 26, 2007, 11:26am

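# Count the lines in the file (reading from stdin keeps the file name out of the output)
line=`wc -l < config.txt`
# Halve it; "bc | read value" loses the variable in shells that run pipelines in a subshell
value=`echo "$line / 2" | bc`
# Cut $value lines past the line csplit matches for #fruit; pieces are named fruit00, fruit01
csplit -f fruit config.txt "/#fruit/+$value"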