Extract text between two specified "constant" texts using awk

Hi all,
As the title suggests, this question has been asked several times before, and I have done a lot of Googling on it.

I have a Wikipedia dump in XML format. All the topics are in a single XML file, and I need to split it into a separate file for each topic. After carefully going through the file, I found that each topic sits between <page> and </page> tags. I want to use awk to write each topic and its description to its own file: the first topic goes into 1.dat, the second into 2.dat, and so on until the end of the file.
This is what the Wikipedia XML file looks like:

<page>
<title>APRIL</title>
.........(text contents that I need to extract and store in 1.dat including the <title> tag)
</page>
<page>
<title>August</title>
....(text contents that I need to store in 2.dat including the <title> tag)
</page>

and so on.

I tried this, but it created havoc:

awk '</page>/{s++}print > "s.dat" s}' wiki.xml

Try something like this:

awk '/<page>/{c++} c{print > (c ".xml")}' file
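
If the dump is large, a slightly longer variant may be safer. This is only a sketch under a few assumptions: every <page> and </page> tag sits on its own line as in the sample above, the output names follow the 1.dat, 2.dat scheme from the question, and the small wiki.xml created here is made-up stand-in data (the <mediawiki> and <text> lines are just filler). The names n, out and inpage are arbitrary. It closes each output file before opening the next one, since some awk versions stop with a "too many open files" error otherwise, and it skips anything outside a <page> ... </page> block.

```shell
# Create a tiny stand-in for the dump in the current directory
# (hypothetical sample data, not real Wikipedia content).
cat > wiki.xml <<'EOF'
<mediawiki>
<page>
<title>APRIL</title>
<text>first topic</text>
</page>
<page>
<title>August</title>
<text>second topic</text>
</page>
</mediawiki>
EOF

# On a <page> line: close the previous output file and pick the next name.
# While inside a page: write the line to the current output file.
# On a </page> line: stop writing until the next <page> line.
awk '
/<page>/   { if (out != "") close(out); out = (++n) ".dat"; inpage = 1 }
inpage     { print > out }
/<\/page>/ { inpage = 0 }
' wiki.xml
```

With the sample above this should leave 1.dat holding the APRIL page and 2.dat holding the August page, each including its <page>, <title> and </page> lines. If a <page> tag ever shares a line with other content, the whole line ends up in that page's file, so a real dump may need more care than this.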