Python: Parsing and comparing XMLs with minidom

Bloomy · July 16, 2012, 3:47am

Hi there!

I'd like to parse and compare 2 XML files with the minidom parser as follows:

I have two large XML files. One is in English (the source file); the other is the corresponding French translation (the target file).
E.g.:
source file:

<macro>
       <id> 123</id>
       <string> DOG </string>
       <string>dogs/dog/dog's</string>
       <string>Cross-language reference</string>
       <string>English dog: dogs/dog/dog's</string>
       (..........)
</macro>

target file:
<macro>
       <id> 123</id>
       <string> CHIEN </string>
       <string>chien/chiens</string>
       <string>Cross-language reference</string>
       <string></string>
       (..........)
</macro>

The French target file has an empty cross-language reference, which I'd like to fill with the information from the English source file whenever the two macros have the same ID.
I have already written some code that replaces the string tag name with a unique tag name (cl) so that I can identify the cross-language reference. It also extracts the ID and the cross-language reference from both files. Now I want to compare the two files and, if two macros have the same ID, replace the empty reference in the French file with the one from the English file.

Here is my code:

import re
#import easy to use xml parser called minidom:
from xml.dom.minidom import parseString
#all these imports are standard on most modern python implementations
 
#open the xml file that contains the correct CL references for reading and writing:
file = open('PATH/english.xml','r+')
#open the xml file that contains the missing CL references for reading and writing:
file_target = open('PATH/french.xml','r+')
#convert to string:
data = file.read()
#replace xml tag with a unique name in order to identify it later on
data = re.sub(r"<string>Cross-language reference</string>(\s+)<string>(.*)</string>",r"<cl>Cross-language reference</cl>\1<cl>\2</cl>",data)
file.seek(0)
file.write(data)
#remove old data
file.truncate()
#close file because we don't need it anymore:
file.close()
#convert to string:
target = file_target.read()
#replace xml tag with a unique name in order to identify it later on
target = re.sub(r"<string>Cross-language reference</string>(\s+)<string>(.*)</string>",r"<cl>Cross-language reference</cl>\1<cl>\2</cl>",target)
file_target.seek(0)
file_target.write(target)
#remove old data
file_target.truncate()
#close file because we don't need it anymore:
file_target.close()


#extract CL-reference from source file
dom = parseString(data)
#retrieve the second <cl> element (index 1), which holds the reference text:
xmlTag = dom.getElementsByTagName('cl')[1].toxml()
#strip off the tag (<tag>data</tag>  --->   data):
xmlData=xmlTag.replace('<cl>','').replace('</cl>','')
#print out the xml tag and data in this format: <tag>data</tag>
print (xmlTag)
#just print the data
print (xmlData)

#IdTag = dom.getElementsByTagName('id')[0].toxml()
#IdData = xmlData=xmlTag.replace('<tagName>','').replace('</tagName>','')

#extract CL-reference from target file
dom_2 = parseString(target)
#retrieve the second <cl> element (index 1), which holds the (still empty) reference text:
xmlTag_2 = dom_2.getElementsByTagName('cl')[1].toxml()
#strip off the tag (<tag>data</tag>  --->   data):
xmlData_2=xmlTag_2.replace('<cl>','').replace('</cl>','')
#print out the xml tag and data in this format: <tag>data</tag>
print (xmlTag_2)
#just print the data
print (xmlData_2)


#extract id from source file
xyz = parseString(data)
xmlTag_id_source = xyz.getElementsByTagName('id')[0].toxml()
xmlData_id_source = xmlTag_id_source.replace('<id>','').replace('</id>','')
print ("xmlTag_id_source: "+xmlTag_id_source)
print ("xmlData_id_source: "+xmlData_id_source)

#extract id from target file
abc = parseString(target)
xmlTag_id_target = abc.getElementsByTagName('id')[0].toxml()
xmlData_id_target = xmlTag_id_target.replace('<id>','').replace('</id>','')
print (xmlTag_id_target)
print (xmlData_id_target)


with open(file,'r')as sfile:
    with open(file_target,'w') as tfile:
        lines = sfile.readlines()
        if xmlTag_id_source==xmlTag_id_target:
         # do the replacement in the second line.
         # (remember that arrays are zero indexed)
             lines[1]=re.sub(xmlData_2,xmlData,lines[1])
             tfile.writelines(lines)

print ("DONE")

Replacing the tag and extracting the ID and cross-language reference both work, but the final part, where I try to replace the text in the target file, raises this error:

Traceback (most recent call last):
  File "PATH\test.py", line 74, in <module>
    with open(file,'r') as sfile:
TypeError: invalid file: <_io.TextIOWrapper name='PATH\english.xml' mode='r+' encoding='cp1252'>

I didn't find any useful information on the web that helped me figure out what's wrong.
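
The traceback points at the argument passed to open(): by that line, file is no longer a path but the file object returned by the earlier open('PATH/english.xml','r+') call, and open() expects a path. A minimal sketch of the difference follows, assuming the paths are kept in separate variables. The names source_path and read_lines and the temporary directory are hypothetical and used only for illustration, and the exact wording of the TypeError message varies between Python versions.

```python
import os
import tempfile


def read_lines(path):
    """Open a file by its path (a string) and return its lines."""
    with open(path, "r", encoding="utf-8") as sfile:
        return sfile.readlines()


# Hypothetical stand-in for 'PATH/english.xml', created in a temp directory.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "english.xml")
with open(source_path, "w", encoding="utf-8") as out:
    out.write("<macro>\n<id> 123</id>\n</macro>\n")

# Mimic the original script: 'file' is a (closed) file object, not a path.
file = open(source_path, "r+", encoding="utf-8")
file.close()

try:
    open(file, "r")      # same mistake as in the traceback
    raised = False
except TypeError:
    raised = True        # open() rejects a file object as its first argument

# Passing the path itself works as intended.
lines = read_lines(source_path)
print(raised, lines[1])
```

The same applies to open(file_target,'w'). Note also that mode 'w' empties the target file as soon as it is opened, so if the IDs do not match and nothing is written, the French file would be left empty.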
I know that my code above might look a bit messy, but I'm a beginner, and advice like "you should use a different parser" or "you should do it completely differently" won't really help me. I'd like to know how I can find a solution by using and adjusting my code above :-).
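
Staying with minidom, one possible way to adjust the code above is to walk through the macro elements instead of indexing a single cl tag, so that every macro is matched by its ID. The sketch below is only an illustration under several assumptions: the regex step has already renamed the reference tags to cl, each file wraps its macros in a single root element (called macros here, which is a guess), and IDs are compared after stripping whitespace. The helper names and the sample data are hypothetical.

```python
from xml.dom.minidom import parseString

# Hypothetical sample data; the <macros> root element is an assumption.
SOURCE_XML = """<macros>
<macro>
<id> 123</id>
<string> DOG </string>
<cl>Cross-language reference</cl>
<cl>English dog: dogs/dog/dog's</cl>
</macro>
</macros>"""

TARGET_XML = """<macros>
<macro>
<id> 123</id>
<string> CHIEN </string>
<cl>Cross-language reference</cl>
<cl></cl>
</macro>
<macro>
<id> 456</id>
<cl>Cross-language reference</cl>
<cl></cl>
</macro>
</macros>"""


def text_of(element):
    """Return the text directly inside an element ('' if it is empty)."""
    return "".join(node.data for node in element.childNodes
                   if node.nodeType == node.TEXT_NODE)


def references_by_id(dom):
    """Map each macro's stripped id to its second <cl> element."""
    result = {}
    for macro in dom.getElementsByTagName("macro"):
        ids = macro.getElementsByTagName("id")
        cls = macro.getElementsByTagName("cl")
        if ids and len(cls) >= 2:
            result[text_of(ids[0]).strip()] = cls[1]
    return result


def fill_references(source_xml, target_xml):
    """Copy references from source to target for macros with equal ids."""
    source_refs = references_by_id(parseString(source_xml))
    target_dom = parseString(target_xml)
    for macro_id, target_cl in references_by_id(target_dom).items():
        source_cl = source_refs.get(macro_id)
        if source_cl is None or text_of(target_cl).strip():
            continue  # no matching id, or the reference is already filled in
        while target_cl.firstChild:  # drop any whitespace-only children
            target_cl.removeChild(target_cl.firstChild)
        target_cl.appendChild(target_dom.createTextNode(text_of(source_cl)))
    return target_dom.toxml()


result = fill_references(SOURCE_XML, TARGET_XML)
print(result)
```

The returned string could then be written back with open(path, 'w'), where path is the path of the French file as a string rather than the file object.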

I am grateful for any suggestions! Thanks in advance and kind regards!