Extract strings from XML files and create a new XML

milano.churchil · June 9, 2015, 5:44am

Hello everybody,

I have a two-part task involving some XML files, which is quite challenging given my current beginner-level UNIX knowledge. I need to extract some strings from multiple XML files and create a new XML file containing the matching strings.

The original XML files contain the source code for creating PDF files. Here is an abstract example; I explain the challenge after it.

<Header>My favorite restaurant</Header>
   <breakfast_menu>
      <food>
         <name>Belgian Waffles</name>
         <price>$5.95</price>
         <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
         <calories>650</calories>
       </food>
       <food>
         <name>Strawberry Belgian Waffles</name>
         <price>$7.95</price>
         <description>Light Belgian waffles covered with strawberries and whipped cream</description>
         <calories>900</calories>
       </food>
       <food>
         <name>Berry-Berry American Pie</name>
         <price>$8.95</price>
         <description>Light American Pie covered with an assortment of fresh berries and whipped cream</description>
         <calories>900</calories>
       </food>
       <food>
          <name>French Toast</name>
          <price>$4.50</price>
          <description>Thick slices made from our homemade sourdough bread</description>
          <calories>600</calories></food><food><name>Homestyle Breakfast</name>
          <price>$6.95</price>
          <description>Two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>
          <calories>950</calories>
          </food>
   </breakfast_menu>
<Footer>My favorite restaurant</Footer>

So, the UNIX script should extract the Header, every entire row that contains 'Belgian' or 'American', and the Footer, and put them in a new XML file. The list of search strings is provided in a separate input file. I hope I have managed to state the requirement clearly. Please let me know if any extra information is needed.

Thank you very much,
Milano

sea · June 9, 2015, 6:08am

Hello and welcome to the forum milano.churchil

This is not valid XML.
Please use code tags, as you agreed to when you accepted the forum rules.
What have you tried so far?

Have a nice day.

Don_Cragun · June 9, 2015, 9:43pm

Is this a homework assignment?

Homework must be posted in the homework & coursework questions forum and must include a fully filled out questionnaire from the homework template.

milano.churchil · June 11, 2015, 5:59am

Hello! This is not homework; it is something that I need for work. Please let me know if it is necessary to change the topic or add more information. Thank you!

Milano

---------- Post updated at 04:59 AM ---------- Previous update was at 04:56 AM ----------

So far I have tried the 'csplit' command, but it doesn't work for what I need, because there are multiple strings to be found and extracted into a new XML file.

Don_Cragun · June 11, 2015, 1:17pm

What is the pathname of the "separate Input file"?
What is the format of the "separate Input file"?
What is the pathname of your "original XML file"?
What pathnames do you want to be given to the output file (or files) that are to be created?
Show us a sample "separate Input file".
Show us the exact output file (or files) you want to create with the updated XML file you have provided in post #1 in this thread and the separate Input file that you will provide.

And, PLEASE, use CODE tags when displaying all sample input files, all sample output files, and all sample code segments!

milano.churchil · June 12, 2015, 3:57am

Hello,

The pathname of the input file is C:/temp/input.txt
The format of the input file is .txt
The pathname of the XML file is C:/temp/output.txt
The pathname of the output file is C:/temp/output.xml

Input file input.txt:

'Belgian'
'American'

Output file output.xml:

<Header>My favorite restaurant</Header>
         <name>Belgian Waffles</name>
         <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
         <name>Strawberry Belgian Waffles</name>
         <description>Light Belgian waffles covered with strawberries and whipped cream</description>
         <name>Berry-Berry American Pie</name>
         <description>Light American Pie covered with an assortment of fresh berries and whipped cream</description>
<Footer>My favorite restaurant</Footer>

I hope this is better now! Thank you again!

Milano

RudiC · June 12, 2015, 5:14am

Better, but still a bit vague. For EXACTLY your setup, this might work:

grep -iE "$(tr -d "'" <C:/temp/input.txt | tr '\n' '|')header|footer" C:/temp/output.txt
<Header>My favorite restaurant</Header>
         <name>Belgian Waffles</name>
         <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
         <name>Strawberry Belgian Waffles</name>
         <description>Light Belgian waffles covered with strawberries and whipped cream</description>
         <name>Berry-Berry American Pie</name>
         <description>Light American Pie covered with an assortment of fresh berries and whipped cream</description>
<Footer>My favorite restaurant</Footer>

Redirect to C:/temp/output.xml if happy.
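
To see what the command substitution hands to grep, the pattern can be built step by step. The following is a minimal sketch, assuming a POSIX shell; the scratch directory from mktemp stands in for C:/temp, the sample data is shortened, and the variable names (dir, pattern, matched) are illustrative only.

```shell
# Scratch directory standing in for C:/temp (an assumption for this sketch).
dir=$(mktemp -d)

# Search strings, one per line, wrapped in single quotes as in input.txt.
printf "%s\n" "'Belgian'" "'American'" > "$dir/input.txt"

# A shortened copy of the sample data.
cat > "$dir/output.txt" <<'EOF'
<Header>My favorite restaurant</Header>
         <name>Belgian Waffles</name>
         <price>$5.95</price>
         <name>French Toast</name>
         <name>Berry-Berry American Pie</name>
<Footer>My favorite restaurant</Footer>
EOF

# Strip the quotes, then turn each newline into an ERE alternation bar.
# The bar left by the last newline joins the list onto "header|footer".
pattern="$(tr -d "'" < "$dir/input.txt" | tr '\n' '|')header|footer"
printf '%s\n' "$pattern"

# -i ignores case (so "header" matches "<Header>"); -E enables alternation.
matched=$(grep -iE "$pattern" "$dir/output.txt")
printf '%s\n' "$matched"
```

Printing the pattern before running grep is a quick way to check that the input file contributed what was expected.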

milano.churchil · June 16, 2015, 4:48am

Hello RudiC,

Thank you for your reply! It doesn't work for me. I assume that the grep command you gave me is missing the XML file from which the information should be extracted.

Milano

Don_Cragun · June 16, 2015, 3:09pm

You assume incorrectly. The code RudiC provided does exactly what you asked for given the filenames you provided. But, of course we're making assumptions about the utilities you have installed on your system, the shell you're using, and the operating system you're using.

What operating system are you using?
What version of UNIX/Linux utilities are you using?
What shell are you using?
What output did RudiC's code produce on your system?
Are you sure that the filenames you provided contain data in the same format as your sample data? (For instance, does C:/temp/input.txt contain <carriage-return><newline> line terminators instead of the <newline> line terminators expected by UNIX and Linux system utilities?)
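
The line-terminator question matters because a search-string file saved with carriage-return/newline endings leaves a carriage return at the end of every search string, so those alternatives no longer match data that uses plain newlines. A minimal sketch of the effect and one possible workaround, assuming a POSIX shell and tr; the scratch directory and the variable names (raw, clean, raw_count, clean_count) are illustrative only.

```shell
dir=$(mktemp -d)

# Search-string file written with CRLF endings, as a Windows editor might save it.
printf "'Belgian'\r\n'American'\r\n" > "$dir/input.txt"

# Data file with plain newline endings.
printf '%s\n' '<Header>x</Header>' '<name>Belgian Waffles</name>' \
    '<price>$5.95</price>' '<Footer>x</Footer>' > "$dir/output.txt"

# Without removing \r, each search string ends in a carriage return,
# so only the header|footer alternatives can still match.
raw="$(tr -d "'" < "$dir/input.txt" | tr '\n' '|')header|footer"
raw_count=$(grep -ciE "$raw" "$dir/output.txt")

# Deleting \r together with the quotes restores the intended pattern.
clean="$(tr -d "'\r" < "$dir/input.txt" | tr '\n' '|')header|footer"
clean_count=$(grep -ciE "$clean" "$dir/output.txt")

printf 'raw: %s lines, clean: %s lines\n' "$raw_count" "$clean_count"
```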

milano.churchil · June 19, 2015, 5:22am

Yes, this was my mistake. It is working well for my example. I will test it on larger files too. Now I am trying to delete the extracted rows from the original XML. I hope I will manage it.
Thank you!

Milano

---------- Post updated at 05:19 AM ---------- Previous update was at 02:28 AM ----------

I tried this to remove the lines that were extracted from the XML file, but I got an error of

unexpected end of file

when I try to run the script.

sed -i 'iE "$(tr -d "'" </home/qqomtws/kwom/Test_HY/test_hy.txt | tr '\n' '|')" /home/qqomtws/kwom/Test_HY/test_hy.xml' ./infile

I put the command just after the grep command that extracts the lines I am interested in. Any advice?

Thank you!
Milano

RudiC · June 19, 2015, 6:01am

You can't do that: replace one command ( grep ) with another ( sed ) using an identical parameter set and hope that it works.
To remove those selected lines from the original file, redirect the grep result to a temp file and try

grep -vfTMP C:/temp/output.txt
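
The two steps can be sketched end to end as follows. This is only an illustration, assuming a POSIX shell; the scratch directory stands in for C:/temp, the sample data is shortened, and the variable names (dir, pattern, remaining) are made up for the example.

```shell
dir=$(mktemp -d)
printf "%s\n" "'Belgian'" "'American'" > "$dir/input.txt"
cat > "$dir/output.txt" <<'EOF'
<Header>My favorite restaurant</Header>
   <food>
         <name>Belgian Waffles</name>
         <price>$5.95</price>
   </food>
<Footer>My favorite restaurant</Footer>
EOF

pattern="$(tr -d "'" < "$dir/input.txt" | tr '\n' '|')header|footer"

# Step 1: save the extracted lines in a temp file (called TMP in the post).
grep -iE "$pattern" "$dir/output.txt" > "$dir/TMP"

# Step 2: -f reads one pattern per line from TMP;
# -v keeps only the lines that match none of those patterns.
remaining=$(grep -vf "$dir/TMP" "$dir/output.txt")
printf '%s\n' "$remaining"
```

Each line of TMP is treated as a regular expression here, which is harmless for this sample but worth keeping in mind when the extracted lines contain characters such as . or *.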

Don_Cragun · June 19, 2015, 5:09pm

The obvious simple thing to do (to extract all of the lines that the 1st grep did NOT extract) would be to just rerun that script adding a -v option:

grep -viE "$(tr -d "'" <C:/temp/input.txt | tr '\n' '|')header|footer" C:/temp/output.txt

Or, you could use RudiC's suggestion (but I would add a -F in case some of the strings extracted by the first grep contain characters that are special in a BRE):

grep -vFfTMP C:/temp/output.txt

If you are going to be running this script regularly, I would seriously consider rewriting it to use awk instead of grep. If you use awk, you could produce both output files in one pass without needing to run grep twice and without needing to read your input XML file twice:

awk -v ERE="$(tr -d "'" <C:/temp/input.txt | tr '\n' '|')header|footer" '
BEGIN {ERE = tolower(ERE)}
      {print > ((tolower($0) ~ ERE) ? "matched.xml" : "unmatched.xml")}
' C:/temp/output.txt

Change matched.xml and unmatched.xml to the pathnames of the files you want to contain the matched lines and the unmatched lines, respectively. I assume that you already know that neither of those output files can be the input file for this awk script!
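
If the unmatched lines are meant to replace the original file, one way to respect that restriction is to write to a temporary name and rename it only after awk has finished. A minimal sketch under stated assumptions: a POSIX shell and awk, a scratch directory standing in for C:/temp, shortened sample data, and awk variables M and U plus the file name unmatched.tmp, all of which are invented for this example.

```shell
dir=$(mktemp -d)
printf "%s\n" "'Belgian'" "'American'" > "$dir/input.txt"
cat > "$dir/output.txt" <<'EOF'
<Header>My favorite restaurant</Header>
         <name>Belgian Waffles</name>
         <price>$5.95</price>
         <name>French Toast</name>
<Footer>My favorite restaurant</Footer>
EOF

ere="$(tr -d "'" < "$dir/input.txt" | tr '\n' '|')header|footer"

# One pass: matched lines go to matched.xml, all other lines to unmatched.tmp.
awk -v ERE="$ere" -v M="$dir/matched.xml" -v U="$dir/unmatched.tmp" '
BEGIN { ERE = tolower(ERE) }
      { print > ((tolower($0) ~ ERE) ? M : U) }
' "$dir/output.txt"

# Only after awk has finished reading is it safe to replace the original.
# Note: unmatched.tmp exists only if at least one line did not match.
mv "$dir/unmatched.tmp" "$dir/output.txt"
```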

milano.churchil · June 22, 2015, 6:25am

Thank you very much! It works very well! Great advice about awk, too.

Many thanks,
Milano