Non-trivial file splitting, saving with variable filename

samask · June 15, 2013, 11:00am

Hello,

Although I have found similar questions, I could not find advice that could help with our problem.

The issue:

We have a few thousand text files (books).

Each book has many chapters. Each chapter is identified by a cite-key. We need
to split each of those book files by chapters, having each chapter's cite-key as
file name.

Example of book file:

* Chapter 1 -- Branchial or Visceral Arches

  :PROPERTIES:
  :GENRE: biology
  :CITE-KEY: DW:1
  :END:


The Branchial or Visceral Arches and Pharyngeal Pouches. -- In
the lateral walls of the anterior part of the fore-gut five pharyngeal
pouches appear (Fig. 42).



* Chapter 2 -- Dorsal and Ventral Diverticulum

  :PROPERTIES:
  :GENRE: biology
  :CITE-KEY: DW:2
  :END:


Each of the upper four pouches is prolonged into a dorsal and a ventral
diverticulum.

Over these pouches corresponding indentations of the ectoderm occur, forming 
what are known as the branchial or outer pharyngeal grooves.


[etc.]

After splitting, we would have a series of files in the same directory as the source:
dw-1.txt, dw-2.txt, etc., each containing only the proper chapter.

As an example, file dw-2.txt would contain:

* Chapter 2

  :PROPERTIES:
  :GENRE: biology
  :CITE-KEY: DW:2
  :END:


Each of the upper four pouches is prolonged into a dorsal and
a ventral diverticulum.

Over these pouches corresponding indentations of the ectoderm occur,
forming what are known as the branchial or outer pharyngeal grooves.

One may notice those files use Org syntax. We are able to split them by
mapping a function with Emacs' org-map-entries, but the process is far too
slow: the text files change often, and we need to re-split all the books
frequently.

Could anybody give me a hint on how to do that with awk or some other fast
shell scripting?

Thank you very much.

Scrutinizer · June 15, 2013, 11:13am

Hi, try:

awk '/\* Chapter/{close(f); p=x; f=x} /CITE-KEY/{f=tolower($2) ".txt"; $0=p$0 } !f{p=p $0 ORS} f{print >f}' file
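For readers new to awk, the one-liner above can be unpacked roughly as follows. This is only a commented restatement of the same program, run against a small sample; the scratch directory and the input name book.org are assumptions made for illustration.

```shell
# Build a small sample book in a scratch directory (illustrative data only).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > book.org <<'EOF'
* Chapter 1 -- Branchial or Visceral Arches

  :PROPERTIES:
  :GENRE: biology
  :CITE-KEY: DW:1
  :END:

The Branchial or Visceral Arches and Pharyngeal Pouches.

* Chapter 2 -- Dorsal and Ventral Diverticulum

  :PROPERTIES:
  :GENRE: biology
  :CITE-KEY: DW:2
  :END:

Each of the upper four pouches is prolonged into a dorsal and a ventral
diverticulum.
EOF

awk '
# A heading line starts a new chapter: close the previous output file and
# reset both the pending buffer p and the file name f.
# x is never assigned, so it acts as an empty string.
/\* Chapter/ { close(f); p = x; f = x }

# The CITE-KEY line supplies the file name. The buffered lines (heading,
# :PROPERTIES:, :GENRE:) are prepended so that they are written out first.
/CITE-KEY/   { f = tolower($2) ".txt"; $0 = p $0 }

# Until the file name is known, accumulate lines in the buffer.
!f           { p = p $0 ORS }

# Once the file name is known, write each line to that file.
f            { print > f }
' book.org

ls
```

The four pattern-action pairs are tested in order for every input line, which is why the CITE-KEY rule can rewrite $0 before the final rule prints it.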

samask · June 15, 2013, 1:30pm

Hi,

It works beautifully, and it is amazingly fast!

The file names are written with a colon, which is not allowed on OS X:
dw:1.txt. Is there a way to have a dash instead, like
dw-1.txt?

I must add that I had to take away the Chapter part,
because many chapter headings do not include that word in their text.

So, I have been using:

awk '/\* /{close(f); p=x; f=x} /CITE-KEY/{f=tolower($2) ".txt"; $0=p$0 } !f{p=p $0 ORS} f{print >f}' file

I tried:

awk '/\* {close(f); p=x; f=x} /CITE-KEY/{f=tolower($2) ".txt"; $0=p$0 } !f{p=p $0 ORS} f{print >f}' file

But it throws:

awk: syntax error at source line 1
 context is
    /\* {close(f); p=x; f=x} >>>  /CITE-KEY/{ <<<
awk: bailing out at source line 1

Thank you so much. So much elegance in Awk. Truly inspiring.

Scrutinizer · June 15, 2013, 2:47pm

Nice to hear you can appreciate awk's elegance. I am not aware of a restriction whereby colons would not be allowed in file names in OS X, but if you would like to use a dash, try:

awk '$1=="*"{close(f); p=f=x} /CITE-KEY/{f=tolower($2) ".txt"; sub(":","-",f); $0=p $0} !f{p=p $0 ORS} f{print >f}' file
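Since the original question involves thousands of books, with chapter files expected next to their source, one possible way to drive the command over a whole tree is sketched below. The directory layout (./books), the .org extension for book files, the helper name split_book and the sample cite-keys are all assumptions for illustration. The sketch also swaps sub for gsub so that a cite-key containing more than one colon is fully converted; that is a small deviation from the command above.

```shell
# Scratch area with two sample books in different directories (made-up data).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p books/anatomy books/botany

cat > books/anatomy/dw.org <<'EOF'
* Arches
  :PROPERTIES:
  :CITE-KEY: DW:1
  :END:
Five pharyngeal pouches appear.
* Pouches
  :PROPERTIES:
  :CITE-KEY: DW:2
  :END:
Each of the upper four pouches is prolonged.
EOF

cat > books/botany/gr.org <<'EOF'
* Roots
  :PROPERTIES:
  :CITE-KEY: GR:A:7
  :END:
Text of the chapter.
EOF

# split_book is a hypothetical helper: it runs awk from inside the directory
# of the book so that the chapter files land next to their source.
split_book() {
    (
        cd "$(dirname "$1")" || exit 1
        awk '
            # A line whose first field is a lone "*" starts a new chapter.
            $1 == "*"  { close(f); p = f = x }
            # Derive the file name; gsub replaces every colon with a dash.
            /CITE-KEY/ { f = tolower($2) ".txt"; gsub(":", "-", f); $0 = p $0 }
            # Buffer lines until the file name is known, then write them out.
            !f         { p = p $0 ORS }
            f          { print > f }
        ' "$(basename "$1")"
    )
}

# Process every book found under ./books.
find books -type f -name '*.org' | while IFS= read -r book; do
    split_book "$book"
done

find books -type f -name '*.txt' | sort
```

If the books themselves end in .txt, the find pattern would also pick up previously generated chapter files on a second run, so the selection of inputs would need adjusting.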

samask · June 15, 2013, 2:55pm

Thank you, it works perfectly.

I can see it uses a different approach. Now I can learn more.

Such brevity, but at the same time expressivity, that is why I feel AWK is so elegant.

Thank you so much, once again.