Extract strings from XML files and create a new XML

Hello everybody,

I have a two-part task involving some XML files, which is pretty challenging for my current beginner-level UNIX knowledge. I need to extract some strings from multiple XML files and create a new XML file containing the strings that were found.

The original XML files contain the source code for creating PDF files. Here is an abstract example; I explain the challenge after it.

<Header>My favorite restaurant</Header>
   <breakfast_menu>
      <food>
         <name>Belgian Waffles</name>
         <price>$5.95</price>
         <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
         <calories>650</calories>
       </food>
       <food>
         <name>Strawberry Belgian Waffles</name>
         <price>$7.95</price>
         <description>Light Belgian waffles covered with strawberries and whipped cream</description>
         <calories>900</calories>
       </food>
       <food>
         <name>Berry-Berry American Pie</name>
         <price>$8.95</price>
         <description>Light American Pie covered with an assortment of fresh berries and whipped cream</description>
         <calories>900</calories>
       </food>
       <food>
          <name>French Toast</name>
          <price>$4.50</price>
          <description>Thick slices made from our homemade sourdough bread</description>
          <calories>600</calories></food><food><name>Homestyle Breakfast</name>
          <price>$6.95</price>
          <description>Two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>
          <calories>950</calories>
          </food>
   </breakfast_menu>
<Footer>My favorite restaurant</Footer>

So, the UNIX script should extract the Header, the entire rows that contain 'Belgian' and 'American', and the Footer, and put them in a new XML file. The list of search strings is provided through a separate input file. I hope I have stated the requirement clearly. Please let me know if any extra information is needed.

Thank you very much,
Milano

Hello and welcome to the forum, milano.churchil

  1. This is not valid XML.
  2. Please use code tags, as you agreed to when you accepted the forum rules.
  3. What have you tried so far?

Have a nice day.

Is this a homework assignment?

Homework must be posted in the homework & coursework questions forum and must include a fully filled out questionnaire from the homework template.

Hello! This is not homework; it is something that I need for work. Please let me know if it is necessary to change the topic or add more information. Thank you!

Milano

---------- Post updated at 04:59 AM ---------- Previous update was at 04:56 AM ----------

So far I have tried the 'csplit' command, but it doesn't work for what I need, because there are multiple strings to be found and extracted into a new XML file.

What is the pathname of the "separate Input file"?
What is the format of the "separate Input file"?
What is the pathname of your "original XML file"?
What pathnames do you want to be given to the output file (or files) that are to be created?
Show us a sample "separate Input file".
Show us the exact output file (or files) you want to create with the updated XML file you have provided in post #1 in this thread and the separate Input file that you will provide.

And, PLEASE, use CODE tags when displaying all sample input files, all sample output files, and all sample code segments!

Hello,

  1. The pathname of the input file is C:/temp/input.txt
  2. The format of the input file is .txt
  3. The pathname of the XML file is C:/temp/output.txt
  4. The pathname of the output file is C:/temp/output.xml

Input file input.txt:

'Belgian'
'American'

Output file output.xml:

<Header>My favorite restaurant</Header>
         <name>Belgian Waffles</name>
         <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
         <name>Strawberry Belgian Waffles</name>
         <description>Light Belgian waffles covered with strawberries and whipped cream</description>
         <name>Berry-Berry American Pie</name>
         <description>Light American Pie covered with an assortment of fresh berries and whipped cream</description>
<Footer>My favorite restaurant</Footer>

I hope this is better now! Thank you again!

Milano

Better, but still a bit vague. For EXACTLY your setup, this might work:

grep -iE "$(tr -d "'" <C:/temp/input.txt | tr '\n' '|')header|footer" C:/temp/output.txt
<Header>My favorite restaurant</Header>
         <name>Belgian Waffles</name>
         <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
         <name>Strawberry Belgian Waffles</name>
         <description>Light Belgian waffles covered with strawberries and whipped cream</description>
         <name>Berry-Berry American Pie</name>
         <description>Light American Pie covered with an assortment of fresh berries and whipped cream</description>
<Footer>My favorite restaurant</Footer>

Redirect to C:/temp/output.xml if happy.
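
To see what the command substitution contributes, it can help to build the pattern on its own and print it. This is only a minimal sketch, assuming a POSIX-style shell with mktemp available; the scratch directory and the pattern variable are illustrative names that do not appear in the thread.

```shell
# Illustrative scratch copy of input.txt (two quoted search strings).
dir=$(mktemp -d)
printf "%s\n" "'Belgian'" "'American'" > "$dir/input.txt"

# Strip the single quotes, then turn each newline into the ERE alternation bar.
# The final newline becomes the bar that joins the list to "header|footer".
pattern="$(tr -d "'" < "$dir/input.txt" | tr '\n' '|')header|footer"
printf '%s\n' "$pattern"

rm -rf "$dir"
```

Note that if input.txt lacks a final newline, the last search string fuses with "header" (giving "Americanheader"), so neither of those two alternatives would match as intended.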

Hello RudiC,

Thank you for your reply! It doesn't work for me. I assume that the grep command you gave me is missing the XML file from which the information should be extracted.

Milano

You assume incorrectly. The code RudiC provided does exactly what you asked for given the filenames you provided. But, of course we're making assumptions about the utilities you have installed on your system, the shell you're using, and the operating system you're using.

What operating system are you using?
What version of UNIX/Linux utilities are you using?
What shell are you using?
What output did RudiC's code produce on your system?
Are you sure that the filenames you provided contain data in the same format as your sample data? (For instance, does C:/temp/input.txt contain <carriage-return><newline> line terminators instead of the <newline> line terminators expected by UNIX and Linux system utilities?)
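
The line-terminator question matters because a carriage return left in input.txt ends up inside the pattern and then has to match literally. Here is a hedged sketch of one way to check for carriage returns and strip them, assuming a POSIX-style shell with mktemp; the scratch file stands in for the real input.txt, and the variable names are illustrative.

```shell
dir=$(mktemp -d)
# Simulate a file saved with Windows (CRLF) line endings.
printf "'Belgian'\r\n'American'\r\n" > "$dir/input.txt"

# Count the lines that contain a carriage return; non-zero suggests CRLF endings.
cr=$(printf '\r')
crlf_lines=$(grep -c "$cr" "$dir/input.txt")

# Delete carriage returns together with the single quotes before joining lines.
pattern="$(tr -d "'\r" < "$dir/input.txt" | tr '\n' '|')header|footer"
printf '%s lines with CR; pattern: %s\n' "$crlf_lines" "$pattern"

rm -rf "$dir"
```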

Yes, this was my mistake. It works well for my example. I will test it on larger files too. Now I am trying to delete the extracted rows from the original XML. I hope I will manage it.
Thank you!

Milano

---------- Post updated at 05:19 AM ---------- Previous update was at 02:28 AM ----------

I tried this to remove the lines that were extracted from the XML file, but I got an error of

unexpected end of file

when I try to run the script.

sed -i 'iE "$(tr -d "'" </home/qqomtws/kwom/Test_HY/test_hy.txt | tr '\n' '|')" /home/qqomtws/kwom/Test_HY/test_hy.xml' ./infile

I put the command just after the grep command that extracts the lines I am interested in. Any advice?

Thank you!
Milano

You can't just replace one command ( grep ) with another ( sed ), keep the identical parameter set, and hope that it works.
To remove those selected lines from the original file, redirect the grep result to a temp file and try

grep -vfTMP C:/temp/output.txt
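
For comparison, sed expects a script of its own (an address plus a command) rather than grep's options. The following is only a sketch of what a sed-based deletion might look like, assuming a sed that supports -E (GNU and BSD sed do) and search strings that contain no "/" characters; the scratch files and the alt and kept variables are illustrative names.

```shell
dir=$(mktemp -d)
printf "%s\n" "'Belgian'" "'American'" > "$dir/input.txt"
printf '%s\n' '<name>Belgian Waffles</name>' '<name>French Toast</name>' \
    '<name>Berry-Berry American Pie</name>' > "$dir/menu.txt"

# Join the search strings with "|" and drop the trailing bar left by tr.
alt=$(tr -d "'" < "$dir/input.txt" | tr '\n' '|')
alt=${alt%|}

# sed needs an address plus a command: delete (d) the lines matching the ERE.
# Unlike grep -i, this match is case-sensitive.
kept=$(sed -E "/$alt/d" "$dir/menu.txt")
printf '%s\n' "$kept"

rm -rf "$dir"
```

The double quotes around the sed script are what let the shell expand $alt; inside single quotes the variable would be passed to sed unexpanded.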

The obvious simple thing to do (to extract all of the lines that the 1st grep did NOT extract) would be to just rerun that script adding a -v option:

grep -viE "$(tr -d "'" <C:/temp/input.txt | tr '\n' '|')header|footer" C:/temp/output.txt

Or, you could use RudiC's suggestion (but I would add a -F in case some of the strings extracted by the first grep contain characters that are special in a BRE):

grep -vFfTMP C:/temp/output.txt
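
The two-step approach can be tried on a small scratch file first. This is a minimal sketch, assuming a POSIX-style shell with mktemp; the sample data is a shortened stand-in for the real XML, and apart from TMP every file and variable name is illustrative.

```shell
dir=$(mktemp -d)
cat > "$dir/menu.txt" <<'EOF'
<Header>My favorite restaurant</Header>
<name>Belgian Waffles</name>
<price>$5.95</price>
<name>French Toast</name>
<Footer>My favorite restaurant</Footer>
EOF

# Step 1: save the extracted lines (the file called TMP in the thread).
grep -iE 'Belgian|American|header|footer' "$dir/menu.txt" > "$dir/TMP"

# Step 2: print every line that does not contain one of the saved lines.
# -F treats each line of TMP as a fixed string, so "$" and "." stay literal.
remaining=$(grep -vFf "$dir/TMP" "$dir/menu.txt")
printf '%s\n' "$remaining"

rm -rf "$dir"
```

Because -F matches substrings, a short saved line that happens to occur inside a longer, unrelated line would remove that line as well; adding -x restricts the match to whole lines.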

If you are going to be running this script regularly, I would seriously consider rewriting it to use awk instead of grep . If you use awk you could produce both output files in one pass without needing to run grep twice and without needing to read your input XML file twice:

awk -v ERE="$(tr -d "'" <C:/temp/input.txt | tr '\n' '|')header|footer" '
BEGIN {ERE = tolower(ERE)}
      {print > ((tolower($0) ~ ERE) ? "matched.xml" : "unmatched.xml")}
' C:/temp/output.txt

Change matched.xml and unmatched.xml to the pathnames of the files you want to contain the matched lines and the unmatched lines, respectively. I assume that you already know that neither of those output files can be the input file for this awk script!
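
A small end-to-end run of the one-pass awk approach on scratch data can confirm the split before it is pointed at real files. This is a sketch under assumptions: a POSIX-style shell with mktemp, shortened sample data, and output pathnames passed in through two hypothetical awk variables, hit and miss, instead of being written into the script.

```shell
dir=$(mktemp -d)
printf "%s\n" "'Belgian'" "'American'" > "$dir/input.txt"
cat > "$dir/menu.txt" <<'EOF'
<Header>My favorite restaurant</Header>
<name>Belgian Waffles</name>
<price>$5.95</price>
<name>Berry-Berry American Pie</name>
<name>French Toast</name>
<Footer>My favorite restaurant</Footer>
EOF

awk -v ERE="$(tr -d "'" < "$dir/input.txt" | tr '\n' '|')header|footer" \
    -v hit="$dir/matched.xml" -v miss="$dir/unmatched.xml" '
BEGIN { ERE = tolower(ERE) }   # lower-case the pattern once
      { print > ((tolower($0) ~ ERE) ? hit : miss) }
' "$dir/menu.txt"

# Arithmetic expansion strips any padding that wc adds on some systems.
matched=$(( $(wc -l < "$dir/matched.xml") ))
unmatched=$(( $(wc -l < "$dir/unmatched.xml") ))
printf 'matched: %s, unmatched: %s\n' "$matched" "$unmatched"

rm -rf "$dir"
```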

Thank you a lot! It works very well! Great advice about awk, too.

Many thanks,
Milano