Using AWK to separate data from a large XML file into multiple files

I have a 500 MB XML file from a FileMaker database export. It's formatted horribly (no line breaks at all). The node structure is basically:

<FMPXMLRESULT>
  <METADATA>
   <FIELD att="............." id="..."/>
  </METADATA>
  <RESULTSET FOUND="1763457">
   <ROW att="....." etc="....">
     <COL>.....etc....</COL>
   </ROW>
   <ROW att="....." etc="....">
     <COL>.....etc....</COL>
   </ROW>
   <ROW att="....." etc="....">
     <COL>.....etc....</COL>
   </ROW>
  </RESULTSET>
</FMPXMLRESULT>

There are two things I need to get out of that file:

  1. I'd like to generate an XML file that contains just everything within the <METADATA> node (the <FIELD> nodes), and I'll name it fields.xml.

  2. Then I'd like to generate an XML file for each individual <ROW> node, naming them incrementally: row1.xml, row2.xml, etc.

I'm using AWK via Terminal in OS X Leopard. I'm not sure how to go about item #1, but for #2 I tried the following:

awk '/<ROW/{close("row"c".xml");c++}{print $0 > "row"c".xml"}' db.xml

This produces a syntax error at line 1 when executed.

Can anyone help me out with these issues? What am I doing wrong?

Your help is very much appreciated.
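
A likely cause of that syntax error, stated here as an assumption about the awk that ships with OS X: some awk implementations do not accept an unparenthesized concatenation as the target of ">", so print $0 > "row"c".xml" fails to parse. Building the file name in a variable first avoids the ambiguity. The sketch below is only an illustration: sample.xml is a made-up file with one tag per line, written to the current directory, so run it in a scratch directory.

```shell
# Hypothetical sample with one tag per line (not the real FileMaker export).
printf '%s\n' \
  '<ROW att="1">' '<COL>a</COL>' '</ROW>' \
  '<ROW att="2">' '<COL>b</COL>' '</ROW>' > sample.xml

# On each <ROW line: close the previous file, bump the counter, and
# build the next file name in a variable before redirecting to it.
awk '/<ROW/ { if (f) close(f); c++; f = "row" c ".xml" }
     f      { print > f }' sample.xml
```

With this sample, each of row1.xml and row2.xml receives the three lines of its own row.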

Try this awk code.

 /<METADATA>/ {
        getline
        while ( $0 !~ /<\/METADATA>/ ) {
                print > "fields.xml"
                getline
        }
        count=1
        nextline
}

/<ROW/ {
        rfile="row" count ".xml"
        getline
        while ($0 !~ "<\/ROW" ) {
                print > rfile
                getline
        }
        close(rfile)
        count++
        nextline
}

Thanks for the quick reply. When I try those:

awk '/<METADATA>/ {
        getline
        while ( $0 !~ /<\/METADATA>/ ) {
                print > "fields.xml"
                getline
        }
        count=1
        nextline
}' db.xml

and

awk '/<ROW/ {
        rfile="row" count ".xml"
        getline
        while ($0 !~ "<\/ROW" ) {
                print > rfile
                getline
        }
        close(rfile)
        count++
        nextline
}' db.xml

I get an illegal statement error. Am I doing something wrong? Thank you so much for the help so far!

It wasn't designed to be used as separate clauses. Put the whole thing in a file and use the -f switch.

Sorry, I'm a complete AWK beginner. I've been programming for about 8 years, but only learned of AWK about an hour before I posted.

Let me make sure I understand everything completely. This is what I'm trying, step by step; please correct me where I'm wrong:

  1. I have my working directory, in it I have db.xml file
  2. I create a file called split.awk inside my working directory, in it I put the file contents:
 /<METADATA>/ {
        getline
        while ( $0 !~ /<\/METADATA>/ ) {
                print > "fields.xml"
                getline
        }
        count=1
        nextline
}

/<ROW/ {
        rfile="row" count ".xml"
        getline
        while ($0 !~ "<\/ROW" ) {
                print > rfile
                getline
        }
        close(rfile)
        count++
        nextline
}
  3. I open up Terminal, cd to my working directory, and then execute:
awk -f split.awk db.xml

When I execute that, I just get an error saying awk can't find the file.

Again, sorry for being such a beginner -- now that I know AWK exists, I plan to purchase a few books on it and dive into how I can apply it in my day-to-day programming.

Thank you!

What is the exact output of awk? This mostly seems to happen when invisible characters get introduced while copying the text into the .awk file. Also make sure all the files are readable and the directory is writable by the account you use to open the terminal window.

awk '/<ROW/{close("row"c".xml");c++}c{f="row"c".xml";print $0 > f}' file

The terminal prints out the following:

awk: can't open file split.awk
 source line number 1 source file split.awk
 context is
     >>>  <<< 

Do I need to put in a full file path? I've already navigated to the directory within Terminal, and split.awk is in the same directory as db.xml, which seems to get picked up fine.

When I execute this, I just get an exact copy of my original file with the number 1 appended to it, ex: db1.xml, but it's also a 500 MB file.

Thanks again to both of you for your help so far.

I can't see where the problem is. Can you elaborate?

Starting from your original data sample I get:

# awk '/<ROW/{close("row"c".xml");c++}c{f="row"c".xml";print $0 > f}' file
# ls
file            row1.xml        row2.xml        row3.xml

What happens if you type the awk command in the window that you used to create the split.awk file? The error message suggests that you don't have read access to the split.awk file in the new terminal window.

Ah, I see the issue. If I test with the sample XML I provided in the original post, it works. But when I execute it on the actual XML export from FileMaker, it fails: it stops after creating one file, which contains all of the <ROW> matches.
I think it may be because the FileMaker XML export is really crappy and has absolutely no formatting at all. There are no line breaks, no indenting, nothing; all of the <ROW> ... </ROW> nodes are on the same line.

Would that be an issue and is there a way around it?
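
It is an issue, because awk reads one record per line by default, so a file with no line breaks is a single huge record. One possible workaround, offered as a sketch rather than a tested fix for the real export, is to set the record separator RS to ">" so that every record ends at the close of a tag, and to set ORS to ">" so the text is written back unchanged. The file oneline.xml and its contents below are made up for illustration, and the output files land in the current directory.

```shell
# Hypothetical one-line sample, shaped like the export described above.
printf '%s' '<RESULTSET FOUND="2"><ROW att="1"><COL>x</COL></ROW><ROW att="2"><COL>y</COL></ROW></RESULTSET>' > oneline.xml

# RS=">" makes each record end where a tag ends; ORS=">" puts the ">" back.
awk 'BEGIN { RS = ">"; ORS = ">" }
/<ROW/    { f = "row" (++c) ".xml" }               # an opening <ROW ...> tag starts a new file
f         { print > f }                            # copy records while inside a row
/<\/ROW$/ { printf "\n" > f; close(f); f = "" }    # the closing tag ends the row
' oneline.xml
```

Each rowN.xml then holds one complete <ROW> ... </ROW> node on a single line. The same idea, matching <METADATA and <\/METADATA$ instead, could be used for fields.xml.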

Same result. I also verified that the standard OS X account (my account + the staff account + everyone) has both read and write access to split.awk.

** UPDATE ** So I tried using the touch command via Terminal to create the file split.awk, then edited its contents via CODA (still using the exact code you provided), and when I run the following command:

awk -f split.awk db.xml

I now get the following error:

awk: illegal statement
 input record number 1, file db.xml
 source line number 8

The only thing that I can think of is that split.awk contains more than just the script code. Invisible characters? I've tried this on both a CentOS and a Windows (Cygwin) system...

Here's the exact contents of my file if you want to take a look:
share1t.com File Sharing | Download: split.awk
I still feel like I may be doing something wrong.

Thank you again for your help, I really do appreciate it. :)

I downloaded the script onto both my CentOS (Linux) system and my Windows (Cygwin) system. It worked fine on my Linux system. On the Windows system, there were translation errors. There are hidden characters...
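
One way to check a script for hidden characters such as carriage returns is to dump it with od -c, which prints control characters as visible escapes. The sketch below is an illustration only: crlf.awk and clean.awk are made-up file names, and the carriage return is planted deliberately to show what to look for.

```shell
# Create a one-line awk script with a Windows-style CRLF line ending.
printf '/x/ { print }\r\n' > crlf.awk

# od -c shows every byte; a carriage return appears as \r before the \n.
od -c crlf.awk

# Strip the carriage returns into a clean copy of the script.
tr -d '\r' < crlf.awk > clean.awk
```

If od -c shows nothing unexpected, the error probably comes from the script itself rather than from the editor.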

I really don't know what's left to try then. For another test, I used Terminal to execute a touch command to create split2.awk and then used TextEdit to retype the code character by character with no tabs, and I still have the exact same problem.

Are there any other possible solutions? I never thought it would be this much of a hassle -- again, thank you so much for the help.

Oops, replace 'nextline' with 'next'. This will probably work :)

Ah ha, that works. Thank you, jp!
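
For reference, here is a sketch of the script with that fix applied: next is the awk statement that skips to the next input record, while nextline is not an awk keyword. This is not the exact file from the thread. It also reads with getline into a variable and checks the return value, so the loops stop at end of file instead of spinning forever if a closing tag is missing. The names split_fixed.awk and sample_db.xml are made up, the sample has one tag per line, and everything is written to the current directory.

```shell
cat > split_fixed.awk <<'EOF'
# Copy the lines between <METADATA> and </METADATA> into fields.xml.
/<METADATA>/ {
    while ((getline line) > 0 && line !~ /<\/METADATA>/) {
        print line > "fields.xml"
    }
    close("fields.xml")
    next
}

# Copy the lines between each <ROW ...> and </ROW> into rowN.xml.
/<ROW/ {
    rfile = "row" (++count) ".xml"
    while ((getline line) > 0 && line !~ /<\/ROW/) {
        print line > rfile
    }
    close(rfile)
    next
}
EOF

# Hypothetical line-broken sample in the shape shown in the first post.
printf '%s\n' '<FMPXMLRESULT>' '<METADATA>' \
  '<FIELD att="a" id="1"/>' '<FIELD att="b" id="2"/>' '</METADATA>' \
  '<RESULTSET FOUND="2">' \
  '<ROW att="1">' '<COL>x</COL>' '</ROW>' \
  '<ROW att="2">' '<COL>y</COL>' '</ROW>' \
  '</RESULTSET>' '</FMPXMLRESULT>' > sample_db.xml

awk -f split_fixed.awk sample_db.xml
```

As in the original script, only the lines inside each node are written; the enclosing <METADATA> and <ROW> tags themselves are left out.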