XML Problem

Zarnick · June 3, 2008, 2:07pm

Hello, I need a script to edit a custom XML file. Although I know it should be fairly easy to create such a script, I'm failing miserably.
The script should read a file containing the ids of one tag of the XML (<content contentid="XXX".... for example) and then remove the matching content elements.
For instance, for a simple XML file like this:

<categorygroup categorygroupid="test">
 <category categoryid="test_category1">
  <content contentid="0001" name="content_test">
  ...
  </content>
  <content contentid="0002" name="content_test2">
  ...
  </content>
  <content contentid="0003" name="content_test3">
  ...
  </content>
 </category>
 <categorygroup categorygroupid="test">
 <category categoryid="test_category2">
  <content contentid="0011" name="content_test1">
  ...
  </content>
  <content contentid="0012" name="content_test12">
  ...
  </content>
  <content contentid="0013" name="content_test13">
  ...
  </content>
 </category>
</categorygroup>

If the file contains the codes 0001, 0012 and 0013, the result should be this XML file:

<categorygroup categorygroupid="test">
 <category categoryid="test_category1">
  <content contentid="0002" name="content_test2">
  ...
  </content>
  <content contentid="0003" name="content_test3">
  ...
  </content>
 </category>
 <categorygroup categorygroupid="test">
 <category categoryid="test_category2">
  <content contentid="0011" name="content_test1">
  ...
  </content>
 </category>
</categorygroup>

Now, I'm pretty sure this should be easy, but I'm having a great deal of trouble doing it (I've tried Perl, Ruby, PHP and even sed with grep). Can anyone help me?

Thanks.

era · June 3, 2008, 2:22pm

This appears to work for the sample you posted:

perl -0777 -pe 's%^\s*<content contentid="(0001|001[23])"[^<>]*>(.*?)</content>\s*$%%msg' file.xml

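The ^ and $ decorations (together with the \s* next to them) only tidy up the surrounding whitespace and are probably unnecessary if the result is mainly intended to be machine-readable. The real beef is the -0777 option, which makes Perl read the whole file as a single string, and the non-greedy .*? regex coupled with the /s modifier, which lets . match newlines so that one match can span several lines. See the Perl FAQ for more on these.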
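
To see why the non-greedy .*? matters once the whole file is one string, here is a small sketch in Python rather than Perl (re.DOTALL plays the role of /s). The sample string and the helper name remove_ids are made up for illustration, and the greedy switch exists only to show the failure mode.

```python
import re

# Hypothetical sample, shaped like the XML in the question.
SAMPLE = (
    '<category categoryid="c1">\n'
    '  <content contentid="0001" name="a">\n  ...\n  </content>\n'
    '  <content contentid="0002" name="b">\n  ...\n  </content>\n'
    '</category>\n'
)


def remove_ids(xml, ids, greedy=False):
    """Delete <content> blocks whose contentid is in ids (regex-based sketch)."""
    body = ".*" if greedy else ".*?"  # .*? stops at the first </content>
    alternation = "|".join(re.escape(i) for i in ids)
    pattern = r'[ \t]*<content contentid="(?:%s)"[^<>]*>%s</content>[ \t]*\n?' % (
        alternation, body)
    # re.DOTALL corresponds to Perl's /s: "." also matches newlines.
    return re.sub(pattern, "", xml, flags=re.DOTALL)


lazy = remove_ids(SAMPLE, ["0001"])
greedy = remove_ids(SAMPLE, ["0001"], greedy=True)
```

Here lazy still contains the 0002 block, while greedy loses it, because .* runs on to the last </content> in the string before backtracking.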

Zarnick · June 4, 2008, 10:37am

Hum... that seems good, but where do I put the codes to remove from the XML? (I'm really no expert at regular expressions... yet.)
Also, please remember that these codes are fed in from a file, and honestly, I know absolutely nothing about Perl, or at least not enough to read a file and feed every line (minus the \n) to a specific regexp.

thanks a lot

era · June 5, 2008, 2:13am

That's the entire program. Replace file.xml with the name of the input file. Redirect to a temporary file, or use perl -i to change the original file "in place".
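
For what it's worth, the "write to a temporary file, then swap it in" idea looks roughly like this in Python. This is only a sketch of the general idea, not a description of how perl -i is implemented; rewrite_in_place and its transform argument are hypothetical names.

```python
import os
import tempfile


def rewrite_in_place(path, transform):
    """Replace the file at path with transform(old_text), via a temporary file."""
    with open(path, encoding="utf-8") as src:
        new_text = transform(src.read())
    # Create the temporary file next to the original so os.replace stays on
    # one filesystem and the swap happens in a single step.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as dst:
            dst.write(new_text)
        os.replace(tmp_path, path)
    except BaseException:
        # Something failed before the swap: clean up and re-raise.
        os.unlink(tmp_path)
        raise
```

The original file is only touched once the new text has been written out completely, so a failure halfway through leaves it as it was. Note that the replacement file gets the temporary file's permissions, which may differ from the original's.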

Zarnick · June 5, 2008, 7:55am

This I understood: file.xml is the XML file to remove the content from. But how do I feed the Perl program the codes to remove? I tried creating a big file with all the codes separated by pipes (e.g. 0001|0002|3142|5342|7890....) and then using cat to pull it into the Perl program you posted:

perl -0777 -pe 's%^\s*<content contentid="(`cat codes.txt`)"[^<>]*>(.*?)</content>\s*$%%msg' file.xml

But it didn't work. Am I missing something here?

Thanks.

era · June 5, 2008, 8:24am

The backticks sit inside single quotes, so the shell never runs cat; Perl is looking for literally the text `cat codes.txt`. You need to read the file and process it to make a decent regular expression out of it.

Better do that in Perl directly, too.

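perl -0777 -pe 'BEGIN {
    # Slurp codes.txt and join its whitespace-separated ids into an alternation.
    open (C, "codes.txt") || die "codes.txt: $!";
    local $/; $c = join "|", map quotemeta, split " ", <C>; close C; }
  s%^\s*<content contentid="($c)"[^<>]*>(.*?)</content>\s*$%%msg' file.xml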

This isn't particularly elegant; there is some pressure to put this into a file rather than try to pretend it's still a one-liner. You should probably refactor it a bit then.
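
If you do move this into a file, a rough sketch might look like the following. It is in Python rather than Perl, simply because that may be easier to follow if Perl is unfamiliar; the idea is the same. The names load_ids, remove_contents and remove_content.py are hypothetical. It assumes the ids in the codes file are separated by whitespace (one per line is fine), that contentid is the first attribute of the tag as in your sample, and that <content> elements never nest. Like the one-liner, it is plain text matching, not real XML parsing.

```python
import re
import sys


def load_ids(path):
    """Read whitespace-separated content ids from a file."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().split()


def remove_contents(xml, ids):
    """Drop every <content contentid="..."> block whose id is listed in ids."""
    if not ids:
        return xml  # an empty alternation would match contentid=""
    alternation = "|".join(re.escape(i) for i in ids)
    pattern = re.compile(
        r'^[ \t]*<content contentid="(?:' + alternation + r')"[^<>]*>'
        r'.*?</content>[ \t]*\n?',
        re.MULTILINE | re.DOTALL,  # like Perl's /m and /s
    )
    return pattern.sub("", xml)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python remove_content.py codes.txt file.xml > cleaned.xml
    codes_path, xml_path = sys.argv[1:]
    with open(xml_path, encoding="utf-8") as handle:
        sys.stdout.write(remove_contents(handle.read(), load_ids(codes_path)))
```

Unlike the one-liner, this also swallows the newline after each removed block, so no blank lines are left behind.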