XML: parsing of the Google contacts XML file

ripat · December 23, 2012, 1:06pm

I am trying to parse the Google contacts XML file with tools like xmllint, and I even dived into XSL style sheets using xsltproc, but I am getting nowhere.

I cannot supply a sample file because it contains private data, but you can download your own contacts using this script:

#!/bin/sh

# imports Google Contacts
# imported data is stored in contacts.xml file (current directory)

# You will need curl and xmllint tools

LOGIN="your.login@gmail.com"
PASSW="your_passw"

# authenticate and keep only the "Auth=..." line of the response
AUTH=$(curl --silent https://www.google.com/accounts/ClientLogin \
-d "Email=$LOGIN" \
-d "Passwd=$PASSW" \
-d accountType=GOOGLE \
-d service=cp \
-d Gdata-version=3.0 | grep '^Auth')

# fetch the first 5 contacts; the URL is quoted so the shell leaves the "?" alone
curl --silent -o /tmp/contacts.tmp "https://www.google.com/m8/feeds/contacts/default/full?max-results=5" \
--header "Authorization: GoogleLogin auth=${AUTH#*=}" \
--header "GData-Version: 3.0"

# pretty-print the Google output
xmllint --format /tmp/contacts.tmp > contacts.xml

I can get the root node:

$ xmllint --xpath '/' contacts.xml

But it fails when I try the first node below root: <feed>

$ xmllint --xpath '/feed' contacts.xml 
XPath set is empty

fpmurphy · December 23, 2012, 1:47pm

Please understand that you need to supply a sample file if you expect people to help you. Simply obscure your private data or make up some replacement data.

ripat · December 23, 2012, 2:41pm

Here you go. A bit tedious to obscure a Google contacts file. This one contains 3 records.

Yoda · December 23, 2012, 3:23pm

I found this link reporting a similar problem, with a solution suggesting 2 different approaches. I thought I would share it with you, though I am not sure it will help.

ripat · December 24, 2012, 8:42am

Thanks for the hint. I had already seen that post, but it is about a different problem.

There is something wrong with that Google XML file, or with the way I access it. When I grep a record in the xmllint shell, it returns wildcards instead of the node path:

$ xmllint --shell cts.xml 
/ > grep Arthur
/*/*[16]/*[5] : tan        9 Arthur M.
/*/*[16]/gd:name/gd:fullName : tan        9 Arthur M.
/*/*[16]/gd:name/gd:givenName : ta-        6 Arthur
/ >

My understanding is that it should have returned a full path to the node. Something like:

/feed/entry/gd:name/gd:fullName

Could it be that the file is corrupt? Running xmllint --debug doesn't report anything abnormal though.

But I could be onto something:

xmllint --valid cts.xml
cts.xml:2: validity error : Validation failed: no DTD found !
tp://schemas.google.com/g/2005" gd:etag="W/"A0AFRHc4eit7I2A9WhNVEkU.""
                                                                               ^

But all other test files that work all right with xmllint generate the same error on validation...

Again, I am getting nowhere...

fpmurphy · December 24, 2012, 9:21am

No, your sample file is not corrupt. It is valid XML. It just has no DTD - which is fine.

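It has multiple namespaces in it, and <feed> and <entry> sit in the default Atom namespace. That is why an unprefixed XPath such as /feed matches nothing in xmllint, and why the xmllint shell prints * in place of those element names.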
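
If you want to stay with xmllint, there are two common workarounds. What follows is only a sketch, not something tested against a real Google export: the sample file is made up to mimic the namespace layout of a contacts feed, and the variable names (workdir, full_name, shell_out) are just for illustration. The first option matches on local-name() so the namespace is ignored. The second binds prefixes with setns inside the xmllint shell, assuming your xmllint build includes the shell.

```shell
#!/bin/sh
# Made-up sample that mimics the namespace layout of a Google contacts feed.
workdir=$(mktemp -d)
cat > "$workdir/sample.xml" <<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:gd="http://schemas.google.com/g/2005">
  <entry>
    <gd:name>
      <gd:fullName>Arthur M.</gd:fullName>
      <gd:givenName>Arthur</gd:givenName>
      <gd:familyName>M.</gd:familyName>
    </gd:name>
  </entry>
</feed>
XML

# Option 1: ignore namespaces by comparing only the local part of the name.
full_name=$(xmllint --xpath 'string(//*[local-name()="fullName"])' "$workdir/sample.xml")
echo "$full_name"

# Option 2: bind prefixes in the xmllint shell, then use a qualified path.
# The prefix "atom" is an arbitrary choice; only the namespace URI matters.
shell_out=$(xmllint --shell "$workdir/sample.xml" <<'CMDS'
setns atom=http://www.w3.org/2005/Atom
setns gd=http://schemas.google.com/g/2005
cat /atom:feed/atom:entry/gd:name/gd:givenName
CMDS
)
echo "$shell_out"
```

Option 1 is handy for one-liners, but it will also match a fullName element from any other namespace. Option 2 is stricter because the path is tied to the namespace URIs.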

ripat · December 24, 2012, 11:42am

Would xsltproc be able to extract data despite the multiple namespaces?

fpmurphy · December 24, 2012, 12:11pm

Yes, here is an example XSL stylesheet which outputs the name elements:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:atom="http://www.w3.org/2005/Atom"
                xmlns:gd="http://schemas.google.com/g/2005"
                version="1.0">

   <xsl:output method="text" />

   <xsl:template match="atom:feed">
      <xsl:apply-templates select="atom:entry" />
   </xsl:template>

   <xsl:template match="atom:entry">
      <xsl:apply-templates select="gd:name" />
   </xsl:template>

   <xsl:template match="gd:name">
      FULLNAME: <xsl:value-of select="gd:fullName" />
      GIVENNAME: <xsl:value-of select="gd:givenName" />
      FAMILYNAME: <xsl:value-of select="gd:familyName" />
      <xsl:text>
</xsl:text>
   </xsl:template>

   <xsl:template match="*"/>

</xsl:stylesheet>

which produces the following output from your supplied XML:

      FULLNAME: Arthur M.
      GIVENNAME: Arthur
      FAMILYNAME: M.

      FULLNAME: Eric D.
      GIVENNAME: Eric
      FAMILYNAME: D.

      FULLNAME: Jack Ppppppp
      GIVENNAME: Jack
      FAMILYNAME: Ppppppp
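
In case it helps, here is a rough sketch of how a stylesheet like this can be run from a script with xsltproc. It is a trimmed variation, not the stylesheet above: it prints one tab-separated "given name, family name" line per contact. The sample data, the file names (sample.xml, names.xsl) and the variable names are made up for the example.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Made-up sample data with the same namespaces as the contacts feed.
cat > "$workdir/sample.xml" <<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:gd="http://schemas.google.com/g/2005">
  <entry>
    <gd:name>
      <gd:fullName>Arthur M.</gd:fullName>
      <gd:givenName>Arthur</gd:givenName>
      <gd:familyName>M.</gd:familyName>
    </gd:name>
  </entry>
  <entry>
    <gd:name>
      <gd:fullName>Eric D.</gd:fullName>
      <gd:givenName>Eric</gd:givenName>
      <gd:familyName>D.</gd:familyName>
    </gd:name>
  </entry>
</feed>
XML

cat > "$workdir/names.xsl" <<'XSL'
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:atom="http://www.w3.org/2005/Atom"
                xmlns:gd="http://schemas.google.com/g/2005"
                version="1.0">
   <xsl:output method="text" />
   <!-- one "given<TAB>family" line per contact -->
   <xsl:template match="atom:entry">
      <xsl:value-of select="concat(gd:name/gd:givenName, '&#9;', gd:name/gd:familyName, '&#10;')" />
   </xsl:template>
   <!-- drop stray text nodes such as the indentation between elements -->
   <xsl:template match="text()" />
</xsl:stylesheet>
XSL

# xsltproc takes the stylesheet first, then the XML document.
rows=$(xsltproc "$workdir/names.xsl" "$workdir/sample.xml")
printf '%s\n' "$rows"
```

The prefixes declared in the stylesheet (atom, gd) do not have to match the ones used in the XML file. Only the namespace URIs are compared, which is how atom:entry can match the unprefixed <entry> elements.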

ripat · December 24, 2012, 12:50pm

Great! Thanks.

Now I only have to take a deep breath and dive into XSL. It looks like black magic to me. It took me a couple of months to get used to CSS and now XSL...

If you know a good step by step tutorial...

Merry Xmas

fpmurphy · December 25, 2012, 10:07am

XSLT is basically a declarative pattern-matching language with some functional language concepts. If you have no experience with declarative languages, it may take you a while to get your head around the concepts.

If you wish to learn XSLT, you should also study XPath at the same time as they are frequently used together.