XML Problem

Hello, I need a script to edit a custom XML file. Although I know it should be fairly easy to create such a script, I'm failing miserably.
The script should read a file containing the ids of one tag of the XML (<content contentid="XXX"...>, for example) and then remove the matching content elements.
For instance, given a simple XML file like this:

<categorygroup categorygroupid="test">
 <category categoryid="test_category1">
  <content contentid="0001" name="content_test">
  ...
  </content>
  <content contentid="0002" name="content_test2">
  ...
  </content>
  <content contentid="0003" name="content_test3">
  ...
  </content>
 </category>
 <categorygroup categorygroupid="test">
 <category categoryid="test_category2">
  <content contentid="0011" name="content_test1">
  ...
  </content>
  <content contentid="0012" name="content_test12">
  ...
  </content>
  <content contentid="0013" name="content_test13">
  ...
  </content>
 </category>
</categorygroup>

If the file contains the codes 0001, 0012 and 0013, the result should be this XML file:

<categorygroup categorygroupid="test">
 <category categoryid="test_category1">
  <content contentid="0002" name="content_test2">
  ...
  </content>
  <content contentid="0003" name="content_test3">
  ...
  </content>
 </category>
 <categorygroup categorygroupid="test">
 <category categoryid="test_category2">
  <content contentid="0011" name="content_test1">
  ...
  </content>
 </category>
</categorygroup>

Now, I'm pretty sure this should be easy, but I'm having a great deal of trouble doing it (I've tried Perl, Ruby, PHP and even sed with grep). Can anyone help me?

Thanks.

This appears to work for the sample you posted:

perl -0777 -pe 's%^\s*<content contentid="(0001|001[23])"[^<>]*>(.*?)</content>\s*$%%msg' file.xml

The ^ and $ decorations are probably unnecessary if the result is mainly intended to be machine-readable. The real beef is the -0777 option (which makes Perl read the whole file as a single string) and the non-greedy .*? regex coupled with the /s modifier (which lets . match newlines). See the Perl FAQ for more on these.
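
To see why the non-greedy .*? and the /s modifier matter, here is a small sketch. It is in Python rather than Perl (my substitution, purely for illustration): re.DOTALL plays the role of /s, and holding the whole text in one string plays the role of -0777. The sample string and variable names are made up.

```python
import re

# Hypothetical sample: two content elements, each spanning several lines.
xml = (
    '<content contentid="0001" name="a">\n'
    '  ...\n'
    '</content>\n'
    '<content contentid="0002" name="b">\n'
    '  ...\n'
    '</content>\n'
)

# Greedy .* with DOTALL runs to the LAST </content>, swallowing both elements.
greedy = re.sub(r'<content contentid="0001"[^<>]*>.*</content>\n', '', xml, flags=re.DOTALL)

# Non-greedy .*? stops at the FIRST </content>, so only element 0001 is removed.
lazy = re.sub(r'<content contentid="0001"[^<>]*>.*?</content>\n', '', xml, flags=re.DOTALL)

# Without DOTALL, "." cannot cross a newline, so nothing matches at all.
no_dotall = re.sub(r'<content contentid="0001"[^<>]*>.*?</content>\n', '', xml)
```

Here greedy ends up empty, lazy keeps only the 0002 element, and no_dotall is the unchanged input.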

Hum... that seems good, but where do I put the input codes to remove from the XML? (I'm really no expert at regular expressions... yet)
Also, please remember that these codes are fed in from a file, and honestly, I know absolutely nothing about Perl... or at least not enough to read a file and feed every line (removing the \n) to a specific regexp.

thanks a lot

That's the entire program. Replace file.xml with the name of the input file. Redirect to a temporary file, or use perl -i to change the original file "in place".

This I understood: file.xml is the XML file to remove the content from. But how do I feed the Perl program the codes to remove? I tried creating a big file with all the codes separated by pipes (e.g. 0001|0002|3142|5342|7890...) and then cat it into the Perl program you posted:

perl -0777 -pe 's%^\s*<content contentid="(`cat codes.txt`)"[^<>]*>(.*?)</content>\s*$%%msg' file.xml

But it didn't work. Am I missing something here?

Thanks.

Inside single quotes the shell does not run `cat codes.txt`, so Perl is looking for that literal text rather than the contents of the file. You need to read the file and make a decent regular expression out of it.

Better do that in Perl directly, too.

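perl -0777 -pe 'BEGIN {
    # -0777 undefines $/, so <C> slurps all of codes.txt and chomp would remove nothing
    open(C, "<", "codes.txt") || die "codes.txt: $!"; $c = <C>; close C;
    $c =~ s/\s+\z//; $c =~ s/\s+/|/g; }  # trim the end, then join the codes with |
  s%^\s*<content contentid="($c)"[^<>]*>(.*?)</content>\s*$%%msg' file.xml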

This isn't particularly elegant; there is some pressure to put this into a file rather than try to pretend it's still a one-liner. You should probably refactor it a bit then.
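
If this does get put into a file, a small script might look like the sketch below. It is written in Python rather than Perl (my substitution, not something from the replies above), and it uses the same regex technique rather than a real XML parser, so it assumes content elements are never nested and always have a separate closing tag. The function names load_ids and remove_contents and the script name remove_content.py are made up for illustration.

```python
import re
import sys


def load_ids(path):
    """Return the non-empty, whitespace-stripped lines of the ids file."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def remove_contents(xml, ids):
    """Delete every <content contentid="ID">...</content> whose ID is listed."""
    if not ids:
        return xml  # an empty alternation would also match contentid=""
    # re.escape keeps any regex metacharacters in an id literal.
    alternation = "|".join(re.escape(i) for i in ids)
    pattern = re.compile(
        r'^[ \t]*<content contentid="(?:' + alternation + r')"[^<>]*>.*?</content>[ \t]*\n?',
        re.MULTILINE | re.DOTALL,
    )
    return pattern.sub("", xml)


# Only runs when called with exactly two arguments: the ids file and the XML file.
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[2], encoding="utf-8") as fh:
        sys.stdout.write(remove_contents(fh.read(), load_ids(sys.argv[1])))
```

Run it as python remove_content.py codes.txt file.xml > new.xml, with one code per line in codes.txt. The trailing \n? in the pattern also removes the line break after each deleted element, so no blank lines are left behind.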