Using AWK to separate data from a large XML file into multiple files

JRy · October 16, 2009, 10:24pm

I have a 500 MB XML file from a FileMaker database export. It's formatted horribly (no line breaks at all). The node structure is basically:

<FMPXMLRESULT>
  <METADATA>
   <FIELD att="............." id="..."/>
  </METADATA>
  <RESULTSET FOUND="1763457">
   <ROW att="....." etc="....">
     <COL>.....etc....</COL>
   </ROW>
   <ROW att="....." etc="....">
     <COL>.....etc....</COL>
   </ROW>
   <ROW att="....." etc="....">
     <COL>.....etc....</COL>
   </ROW>
  </RESULTSET>
</FMPXMLRESULT>

There are two things I need to get out of that file:

1. I'd like to generate an XML file that contains just everything within the <METADATA> node (the <FIELD> nodes); I'll name it fields.xml.

2. Then I'd like to generate an XML file for each individual <ROW> node, named incrementally: row1.xml, row2.xml, etc.

I'm using AWK via Terminal in OS X Leopard. I'm not sure how to go about item #1, but for #2 I tried the following:

awk '/<ROW/{close("row"c".xml");c++}{print $0 > "row"c".xml"}' db.xml

Which produces a syntax error at line 1 when executed.

Can anyone help me out with these issues? What am I doing wrong?

Your help is very much appreciated.

jp2542a · October 17, 2009, 12:05am

Try this awk code.

 /<METADATA>/ {
        getline
        while ( $0 !~ /<\/METADATA>/ ) {
                print > "fields.xml"
                getline
        }
        count=1
        nextline
}

/<ROW/ {
        rfile="row" count ".xml"
        getline
        while ($0 !~ "<\/ROW" ) {
                print > rfile
                getline
        }
        close(rfile)
        count++
        nextline
}

JRy · October 17, 2009, 12:20am

Thanks for the quick reply. When I try those:

awk '/<METADATA>/ {
        getline
        while ( $0 !~ /<\/METADATA>/ ) {
                print > "fields.xml"
                getline
        }
        count=1
        nextline
}' db.xml

and

awk '/<ROW/ {
        rfile="row" count ".xml"
        getline
        while ($0 !~ "<\/ROW" ) {
                print > rfile
                getline
        }
        close(rfile)
        count++
        nextline
}' db.xml

I get an illegal statement error. Am I doing something wrong? Thank you so much for the help so far!

jp2542a · October 17, 2009, 12:23am

It wasn't designed to be used as separate clauses. Put the whole thing in a file and use the -f switch.
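
For anyone new to the -f switch, here is a minimal sketch of the mechanics: the program goes in a file and awk reads it from there. The file names count_rows.awk and two_rows.xml and the sample data are made up for the demonstration; they are not the script from this thread.

```sh
# Save a two-rule program to a file (count_rows.awk is a made-up name).
cat > count_rows.awk <<'EOF'
/<ROW/ { rows++ }
END    { print rows + 0 }
EOF

# Tiny stand-in input with two ROW elements (hypothetical sample data).
printf '%s\n' '<ROW>' '</ROW>' '<ROW>' '</ROW>' > two_rows.xml

# -f reads the program from the file instead of from the command line.
awk -f count_rows.awk two_rows.xml
```

With the two-row sample this prints 2.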

JRy · October 17, 2009, 12:49am

Sorry, I'm a complete AWK beginner. I've been programming for about 8 years, but only learned of AWK about an hour before I posted.

Let me make sure I understand everything completely, this is what I'm trying step by step, please correct me where I'm wrong:

I have my working directory, in it I have db.xml file
I create a file called split.awk inside my working directory, in it I put the file contents:

 /<METADATA>/ {
        getline
        while ( $0 !~ /<\/METADATA>/ ) {
                print > "fields.xml"
                getline
        }
        count=1
        nextline
}

/<ROW/ {
        rfile="row" count ".xml"
        getline
        while ($0 !~ "<\/ROW" ) {
                print > rfile
                getline
        }
        close(rfile)
        count++
        nextline
}

I open up terminal, cd to my working directory and then execute:

awk -f split.awk db.xml

When I execute that, I just get an error saying awk can't find the file.

Again, sorry for being such a beginner -- now that I know AWK exists, I plan to purchase a few books on it and dive into how I can apply it in my day-to-day programming.

Thank you!

jp2542a · October 17, 2009, 12:55am

What is the exact output of awk? This seems to happen mostly when invisible characters are introduced into the awk file while copying the text into the .awk file. Also make sure all the files are readable and the directory is writable by the account you use to open the terminal window...
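
One way to check for invisible characters, sketched here with a deliberately broken file: od -c shows every byte, so carriage returns from Windows line endings or stray bytes at the start of the file stand out. The file names crlf_demo.awk and clean_demo.awk are placeholders, and CRLF line endings are only one possible culprit, assumed here for illustration.

```sh
# Hypothetical script file with Windows (CRLF) line endings, for demonstration.
printf '/<ROW/ { n++ }\r\nEND { print n }\r\n' > crlf_demo.awk

# od -c prints each byte; a carriage return shows up as \r before the \n.
od -c crlf_demo.awk | head -n 5

# Strip the carriage returns into a clean copy.
tr -d '\r' < crlf_demo.awk > clean_demo.awk
```

Running od -c on the real split.awk in the same way should show whether anything other than the visible script text is in the file.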

danmero · October 17, 2009, 1:01am

jry:

I'm using AWK via Terminal in OS X Leopard, I'm not sure how to go about item #1, but for #2 I tried the following:
awk '/<ROW/{close("row"c".xml");c++}{print $0 > "row"c".xml"}' db.xml
Which produces a syntax error at line 1 when executed

awk '/<ROW/{close("row"c".xml");c++}c{f="row"c".xml";print $0 > f}' file
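
A likely reason the original one-liner is rejected: the awk shipped with OS X appears to accept only a single term after >, so an unparenthesized concatenation such as "row"c".xml" in that position is a syntax error there, while some other awks (gawk, for example) accept it. Building the name in a variable first, as above, avoids the problem; so does wrapping the concatenation in parentheses. A minimal sketch of the parenthesized form follows. It uses a small made-up sample file (sample_lines.xml) with one element per line, plus an inrow flag that is not in the one-liner above, so that lines after the last </ROW> are not swept into the final file.

```sh
# Tiny stand-in for db.xml, one element per line (hypothetical sample data).
printf '%s\n' '<RESULTSET>' '<ROW att="x">' '<COL>one</COL>' '</ROW>' \
    '<ROW att="y">' '<COL>two</COL>' '</ROW>' '</RESULTSET>' > sample_lines.xml

awk '
/<ROW/    { c++; inrow = 1 }                      # a new row starts: next file number
inrow     { print > ("row" c ".xml") }            # parentheses keep the name together
/<\/ROW>/ { close("row" c ".xml"); inrow = 0 }    # row finished: release the file
' sample_lines.xml
```

With this sample, row1.xml and row2.xml each hold one complete <ROW> element, three lines long.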

JRy · October 17, 2009, 1:33am

The terminal prints out the following:

awk: can't open file split.awk
 source line number 1 source file split.awk
 context is
     >>>  <<<

Do I need to put in a full file path? I've already navigated to the directory within terminal, it's in the same directory as db.xml, which seems to get picked up fine.

When I execute danmero's command, I just get an exact copy of my original file with the number 1 appended to the name (e.g. db1.xml), and it's also a 500 MB file.

Thanks again to both of you for your help so far.

danmero · October 17, 2009, 1:49am

I can't see where the problem is... can you elaborate?

Starting from your original data sample I get:

# awk '/<ROW/{close("row"c".xml");c++}c{f="row"c".xml";print $0 > f}' file
# ls
file            row1.xml        row2.xml        row3.xml

jp2542a · October 17, 2009, 1:52am

What happens if you type the awk command in the window that you used to create the split.awk file? The error message suggests that you don't have read access to the split.awk file in the new terminal window...

JRy · October 17, 2009, 2:19am

danmero:

I can't see where the problem is... can you elaborate?

Starting from your original data sample I get:
# awk '/<ROW/{close("row"c".xml");c++}c{f="row"c".xml";print $0 > f}' file
# ls
file            row1.xml        row2.xml        row3.xml

Ah, I see the issue. If I test with the sample XML I provided in the original post, it works. But when I run it on the actual XML export from FileMaker, it fails: it stops after creating one file, which contains all of the <ROW matches.
I think it may be because the FileMaker XML export is really crappy and has absolutely no formatting at all: there are no line breaks, no indenting, nothing. All the instances of the <ROW> </ROW> node are on the same line.

Would that be an issue and is there a way around it?

Running the command in the same window gives the same result. I also verified that the standard OS X accounts (my account, the staff account, and everyone) have both read and write access to split.awk.

** UPDATE ** I tried using the touch command via Terminal to create the file split.awk, then edited its contents via CODA (still using the exact code you provided). When I run the following command:

awk -f split.awk db.xml

I now get the following error:

awk: illegal statement
 input record number 1, file db.xml
 source line number 8
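
On the single-line export described above: awk reads input one record at a time, and by default a record is a line, so a file with no line breaks arrives as a single record and every <ROW lands in one output file. One possible workaround, sketched under assumptions, is to make > the record separator so that each tag becomes its own record. The sample file oneline.xml and the variable names are made up, and the sketch assumes that no literal > appears inside attribute values or text and that no other element name begins with ROW or METADATA.

```sh
# One-line stand-in for the real export (hypothetical sample data).
printf '%s' '<FMPXMLRESULT><METADATA><FIELD att="a" id="1"/><FIELD att="b" id="2"/></METADATA><RESULTSET FOUND="2"><ROW att="x"><COL>one</COL></ROW><ROW att="y"><COL>two</COL></ROW></RESULTSET></FMPXMLRESULT>' > oneline.xml

awk '
BEGIN         { RS = ">"; ORS = "" }          # each record ends at a ">"
              { piece = $0 ">" }              # put the delimiter back
/<\/METADATA/ { inmeta = 0; close("fields.xml") }
inmeta        { print piece "\n" > "fields.xml" }   # one FIELD tag per line
/<METADATA/   { inmeta = 1 }
/<ROW/        { n++; out = "row" n ".xml"; inrow = 1 }
inrow         { print piece > out }
/<\/ROW/      { print "\n" > out; close(out); inrow = 0 }
' oneline.xml
```

Because the records are now short, awk should not have to hold the whole 500 MB line in memory at once, and closing each row file as soon as it is complete keeps the number of open files small.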

jp2542a · October 17, 2009, 2:34am

The only thing that I can think of is that split.awk contains more than just the script code. Invisible characters? I've tried this on both CentOS and a Windows (Cygwin) system...

JRy · October 17, 2009, 2:57am

Here's the exact contents of my file if you want to take a look:
share1t.com File Sharing | Download: split.awk
I still feel like I may be doing something wrong.

Thank you again for your help, I really do appreciate it.

jp2542a · October 17, 2009, 4:36am

I downloaded the script onto both my CentOS (Linux) system and my Windows (Cygwin) system. It worked fine on my Linux system. On the Windows system, there were translation errors. There are hidden characters...

JRy · October 17, 2009, 5:04am

I really don't know what's left to try then. For another test, I used Terminal, executed a touch command to create split2.awk, and then used TextEdit to retype the code character by character with no tabs, and I still have the exact same problem.

Are there any other possible solutions? I never thought it would be this much of a hassle -- again, thank you so much for the help.

jp2542a · October 17, 2009, 5:14am

Oops... replace 'nextline' with 'next'. This will probably work.

JRy · October 17, 2009, 8:06pm

Ah ha, that works. Thank you, jp!
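
For reference, a sketch of the script with that change applied. awk has no nextline statement; gawk seems to treat the bare word as an unused variable and carry on, which would explain why the script ran on Linux, while the OS X awk reports an illegal statement. The sketch also adds two things that are not in the original: a check on getline's return value, so the loops end if the input runs out before a closing tag, and a BEGIN rule that initializes count. The sample file sample_next.xml is made up and has one element per line.

```sh
# Line-per-element stand-in for db.xml (hypothetical sample data).
printf '%s\n' '<FMPXMLRESULT>' '<METADATA>' '<FIELD att="a" id="1"/>' \
    '<FIELD att="b" id="2"/>' '</METADATA>' '<RESULTSET FOUND="2">' \
    '<ROW att="x">' '<COL>one</COL>' '</ROW>' \
    '<ROW att="y">' '<COL>two</COL>' '</ROW>' \
    '</RESULTSET>' '</FMPXMLRESULT>' > sample_next.xml

awk '
BEGIN { count = 1 }
/<METADATA>/ {
    # Copy lines up to (not including) the closing tag; stop at end of input too.
    while ((getline) > 0 && $0 !~ /<\/METADATA>/)
        print > "fields.xml"
    close("fields.xml")
    next            # "next" is the awk statement; "nextline" is not one
}
/<ROW/ {
    rfile = "row" count ".xml"
    while ((getline) > 0 && $0 !~ /<\/ROW/)
        print > rfile
    close(rfile)
    count++
    next
}
' sample_next.xml
```

With this sample, fields.xml gets the two FIELD lines, and row1.xml and row2.xml each get one COL line.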