Splitting a file based on context.

webkid · December 2, 2010, 3:28pm

I have a file as shown below and would like to split it based on the context of the data.
For example, write the content between "---- XXX Info ----" and "---- YYY Info ----" to a separate file.

When I try the command below, the second file contains everything after the first "---- YYY Info ----" line.

csplit -ks pfm.txt '%XXX Info%' '/^---- YYY Info ----/' {2}

Any suggestions on how to split out only the required data, as described above?

---- XXX Info ----
Buuuu xxx bbb
Kmmmm rrr ssss uuuu
Kwwww zzzz ccc
Roooowwww eeee
Bxxxx jjjj dddd
---- YYY Info ----
Kuuuu eeeee nnnn
Rpppp cccc vvvv cccc
Rhhhhhhyyyy tttt
Lhhhh rrrrrssssss
Bffff mmmm iiiii
Ktttt eeeeeee
---- XXX Info ----
Kyyyyy iiiii wwww
Rwwww rrrr sssss eeee
Rnnnnn xxxxxxccccc
Lhhhh rrrrrssssss
Bffff mmmm iiiii
Ktttt eeeeeee
---- YYY Info ----
Lhhhh rrrrrssssss
Bffff mmmm iiiii
Ktttt eeeeeee
hhhh rrrrrssssss
Bffff mmmm iiiii
Ktttt eeeeeee
---- XXX Info ----
Kyyyyy iiiii wwww
Rwwww rrrr sssss eeee
Rnnnnn xxxxxxccccc
Lhhhh rrrrrssssss
Bffff mmmm iiiii
Ktttt eeeeeee
---- YYY Info ----

---------- Post updated at 03:28 PM ---------- Previous update was at 03:26 PM ----------

For clarification:

I need output files like:
file 1:

---- XXX Info ----
Buuuu xxx bbb
Kmmmm rrr ssss uuuu
Kwwww zzzz ccc
Roooowwww eeee
Bxxxx jjjj dddd

file 2:

---- XXX Info ----
Kyyyyy iiiii wwww
Rwwww rrrr sssss eeee
Rnnnnn xxxxxxccccc
Lhhhh rrrrrssssss
Bffff mmmm iiiii
Ktttt eeeeeee

file 3:

---- XXX Info ----
Kyyyyy iiiii wwww
Rwwww rrrr sssss eeee
Rnnnnn xxxxxxccccc
Lhhhh rrrrrssssss
Bffff mmmm iiiii
Ktttt eeeeeee

ctsgnb · December 2, 2010, 4:07pm

awk '/^----/{f="file"(++c)".txt"}{print $0 > f}' input

$ cat in
---- XXX Info ----
Buuuu xxx bbb
Kmmmm rrr ssss uuuu
Kwwww zzzz ccc
Roooowwww eeee
Bxxxx jjjj dddd
---- YYY Info ----
Kuuuu eeeee nnnn
Rpppp cccc vvvv cccc
Rhhhhhhyyyy tttt
Lhhhh rrrrrssssss
Bffff mmmm iiiii
Ktttt eeeeeee
---- XXX Info ----
Kyyyyy iiiii wwww
Rwwww rrrr sssss eeee
Rnnnnn xxxxxxccccc
Lhhhh rrrrrssssss
Bffff mmmm iiiii
Ktttt eeeeeee
---- YYY Info ----
Lhhhh rrrrrssssss
Bffff mmmm iiiii
Ktttt eeeeeee
hhhh rrrrrssssss
Bffff mmmm iiiii
Ktttt eeeeeee
---- XXX Info ----
Kyyyyy iiiii wwww
Rwwww rrrr sssss eeee
Rnnnnn xxxxxxccccc
Lhhhh rrrrrssssss
Bffff mmmm iiiii
Ktttt eeeeeee
$ awk '/^----/{f="file"(++c)".txt"}{print $0 > f}' in
$ ls *.txt
file1.txt       file2.txt       file3.txt       file4.txt       file5.txt
$ cat file4.txt
---- YYY Info ----
Lhhhh rrrrrssssss
Bffff mmmm iiiii
Ktttt eeeeeee
hhhh rrrrrssssss
Bffff mmmm iiiii
Ktttt eeeeeee
$
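Note that the one-liner above writes every section, including the YYY ones, to its own file. If only the XXX sections are wanted, as in the clarification, one possible variation is sketched below. The `keep` flag, the `close()` call and the cut-down sample file are my own assumptions, not part of the original answer, and the sketch assumes a POSIX awk.

```shell
# Work in a scratch directory so the sketch does not clobber real files.
cd "$(mktemp -d)" || exit 1

# Cut-down sample in the same shape as the posted data.
cat > pfm.txt <<'EOF'
---- XXX Info ----
Buuuu xxx bbb
Kmmmm rrr ssss uuuu
---- YYY Info ----
Kuuuu eeeee nnnn
---- XXX Info ----
Kyyyyy iiiii wwww
---- YYY Info ----
Lhhhh rrrrrssssss
EOF

awk '
  /^---- XXX Info ----/ {              # start of a wanted section
      if (f) close(f)                  # release the previous output file
      f = "file" (++c) ".txt"          # next numbered output file
      keep = 1
  }
  /^---- YYY Info ----/ { keep = 0 }   # stop copying at the YYY header
  keep { print > f }                   # copy lines only while inside an XXX section
' pfm.txt

ls file*.txt
```

Each XXX header opens a new numbered file, and everything from a YYY header up to the next XXX header is dropped.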

webkid · December 2, 2010, 5:51pm

Thanks for the reply. This works great on small files. However, I am seeing the following problem with a big file.

# awk '/^----/ {print $2}' testy
Port
RG
LU
# awk '/^----/ {f="file"$2".txt"}{print $0 > f}' testy
awk: can't open file
 record number 1

Any idea what's going on?

---------- Post updated at 05:51 PM ---------- Previous update was at 05:08 PM ----------

Actually, there were a couple of lines at the head of the file before the first line starting with ---- (as shown below). This was causing the problem.
As a workaround, I removed those lines using csplit before running the code you suggested. Is there a better solution for this?

Kwwww zzzz ccc
Buuuu xxx bbb
---- XXX Info ----
Buuuu xxx bbb
Kmmmm rrr ssss uuuu
Kwwww zzzz ccc
Roooowwww eeee
Bxxxx jjjj dddd
---- YYY Info ----
Kuuuu eeeee nnnn
Rpppp cccc vvvv cccc
Rhhhhhhyyyy tttt
Lhhhh rrrrrssssss
Bffff mmmm iiiii
Ktttt eeeeeee
---- XXX Info ----
Kyyyyy iiiii wwww
Rwwww rrrr sssss eeee
Rnnnnn xxxxxxccccc
Lhhhh rrrrrssssss
Bffff mmmm iiiii
Ktttt eeeeeee
---- YYY Info ----
Lhhhh rrrrrssssss
Bffff mmmm iiiii
Ktttt eeeeeee
hhhh rrrrrssssss
Bffff mmmm iiiii
Ktttt eeeeeee
---- XXX Info ----
Kyyyyy iiiii wwww
Rwwww rrrr sssss eeee
Rnnnnn xxxxxxccccc
Lhhhh rrrrrssssss
Bffff mmmm iiiii
Ktttt eeeeeee

ctsgnb · December 2, 2010, 6:33pm

You can ignore the leading lines until the first ^---- appears with this small modification of the code:

awk '/^----/{f="file"(++c)".txt"}c{print$0>f}' input
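Why this helps: before the first header, f is still empty, so the unguarded print has no file to open. The counter c is 0 (false) until the first ^---- line, so using it as a pattern skips the leading lines. A small sketch of the same idea, spelled out with comments (the sample file tt below is made up for illustration, and a POSIX awk is assumed):

```shell
# Work in a scratch directory so the sketch does not clobber real files.
cd "$(mktemp -d)" || exit 1

# Two stray lines ahead of the first header, as in the problem file.
printf '%s\n' \
  'Kwwww zzzz ccc' \
  'Buuuu xxx bbb' \
  '---- XXX Info ----' \
  'Buuuu xxx bbb' \
  '---- YYY Info ----' \
  'Kuuuu eeeee nnnn' > tt

awk '
  /^----/ { f = "file" (++c) ".txt" }   # each header selects the next output file
  c       { print > f }                 # c stays 0 until the first header is seen
' tt

cat file1.txt
```

The two stray lines are never printed because no output file has been chosen yet when they are read.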

webkid · December 2, 2010, 7:24pm

I am getting the following error.

# awk '/^----/{f="file"(++c)".txt"}c{print$0>f}' /tmp/tt
awk: syntax error near line 1
awk: bailing out near line 1

Scott · December 2, 2010, 7:26pm

If you are using Solaris, use nawk or /usr/xpg4/bin/awk
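If the same script has to run on both Solaris and other systems, one way to apply this is to pick the awk binary at run time. This is only a sketch: the AWK variable name and the order of preference are my assumptions.

```shell
# Prefer the POSIX awk shipped with Solaris, then nawk, then plain awk.
if [ -x /usr/xpg4/bin/awk ]; then
    AWK=/usr/xpg4/bin/awk
elif command -v nawk >/dev/null 2>&1; then
    AWK=nawk
else
    AWK=awk
fi

# Quick check that the chosen awk accepts a bare variable as a pattern:
# count the lines from the first header onwards.
printf '%s\n' 'skip me' '---- XXX Info ----' 'Buuuu xxx bbb' |
  "$AWK" '/^----/ { c++ } c { n++ } END { print n }'
```

The rest of the script can then call "$AWK" instead of awk.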

webkid · December 3, 2010, 2:12pm

nawk works. However, what should I use if I have to use $2 instead of ++c?

awk '/^----/ {f="file"$2".txt"}?{print $0>f}' /tmp/tt

instead of

awk '/^----/{f="file"(++c)".txt"}c{print$0>f}' /tmp/tt

Thanks.

Scott · December 3, 2010, 2:53pm

Hi.

Assuming you still want one output file per "section":

awk ' /^----/ { file = "file" $2(++c) }
  c { print > file }
' input.txt
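The counter keeps the names unique. With $2 alone, every XXX section would end up in the same file, because awk keeps writing to a file it already has open. A variation of the above for big inputs is sketched below. The close() call, the ".txt" suffix and the cut-down input.txt are my additions for illustration, and a POSIX awk (nawk or /usr/xpg4/bin/awk on Solaris) is assumed.

```shell
# Work in a scratch directory so the sketch does not clobber real files.
cd "$(mktemp -d)" || exit 1

# Cut-down sample with a stray leading line.
cat > input.txt <<'EOF'
Kwwww zzzz ccc
---- XXX Info ----
Buuuu xxx bbb
---- YYY Info ----
Kuuuu eeeee nnnn
---- XXX Info ----
Kyyyyy iiiii wwww
EOF

awk '
  /^----/ {
      if (file) close(file)            # release the previous file before opening the next
      file = "file" $2 (++c) ".txt"    # section name plus counter, e.g. fileXXX1.txt
  }
  c { print > file }                   # skip anything before the first header
' input.txt

ls file*.txt
```

Closing each file as soon as its section ends should help if the big file has more sections than the awk in use can keep open at once.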

webkid · December 3, 2010, 7:23pm

Great. Thanks for your help.