python - string encoding error

I'm trying to pull a google calendar (successful) and then put the contents into a mysql db (almost successful). On one of the field I keep getting an encode error:

#!/usr/bin/python   

from xml.etree import ElementTree 
import gdata.calendar.data 
import gdata.calendar.client 
import gdata.acl.data 
import atom.data 
import time 
import MySQLdb

calendar_client = gdata.calendar.client.CalendarClient() username = 'user@gmail.com' visibility = 'public' projection = 'full' 
feed_uri = calendar_client.GetCalendarEventFeedUri(calendar=username, visibility=visibility, projection=projection) 

# define mysql db connection/credentials 
conn = MySQLdb.connect (host = "localhost", user = "test1", passwd = "test1", db = "googlecal") 
cursor = conn.cursor ()  
feed = calendar_client.GetCalendarEventFeed(uri=feed_uri) 

print 'Events on Primary Calendar: %s' % (feed.title.text,)
for i, an_event in enumerate(feed.entry):
     print '\t%s. %s' % (i, an_event.title.text,)
     data_point = {}
     data_point[ 'title' ] = (i, an_event.title.text,)
     print '\t%s. %s' % data_point[ 'title' ]
     for a_when in an_event.when:
         print '\t\t%s. %s' % (i, an_event.content.text,)
         data_point[ 'content' ] = (i, an_event.content.text,)
         print '\t\tStart time: %s' % (a_when.start,)
         data_point[ 'start' ] = (a_when.start,)
         print '\t\tEnd time:   %s' % (a_when.end,)
         data_point[ 'end' ] = (a_when.end,)

         cursor.execute("""insert into events (id, title, content, start, end)
         values (NULL, %s, %s, %s, %s)""",
         (data_point[ 'title' ],
           data_point[ 'content' ],
          data_point[ 'start' ],
          data_point[ 'end' ]))

conn.commit()
cursor.close () 
conn.close ()  

Traceback (most recent call last):
   File "/home/unclecameron/Documents/workspace/google_pull_calendar/src/get_gmail.cal.py", line 45, in <module>
     data_point[ 'end' ]))   
File "/usr/lib/pymodules/python2.7/MySQLdb/cursors.py", line 151, in execute
     query = query % db.literal(args)   
File "/usr/lib/pymodules/python2.7/MySQLdb/connections.py", line 247, in literal
     return self.escape(o, self.encoders)   
File "/usr/lib/pymodules/python2.7/MySQLdb/connections.py", line 185, in unicode_literal return db.literal(u.encode(unicode_literal.charset))
    UnicodeEncodeError: 'latin-1' codec can't encode character u'\u201c' in position 404: ordinal not in range(256)

the db is utf-8, I've also tried inserting it after encoding like this:

data_point[ 'content' ].encode('utf-8', 'ignore')

and

data_point['content'] = i, smart_str(an_event.content.text)

the rest of the fields work fine, what am I doing wrong?

Finding out the actual data that's making it throw up would be a good start. If the data's not actually UTF-8, setting the encoding to UTF-8 won't help.

okay, here's the example that makes it data_point [ 'content' ] break:

0. A group of Rogue River area actors has formed a new
 performance troupe, calling themselves Exit 48. Their first
 production will be an original murder mystery, written by 
Ronda Bagley, a member of the group, to be staged as a dinner
 theater event at The Station Restaurant in Rogue River on
 Friday, September 9th. Dinner in the outdoor setting will begin
 at 6 pm. The interactive show, a western entitled �Welcome
 Wagon�, will begin shortly after that. 

Tickets are $30 per person.  For reservations, please call The
 Station at 541-582-9725. The restaurant is located at 595
 Main Street in Rogue River. 

Guests are invited to participate as much as they like, even
 taking on the roles of some of the minor characters in the
 play. It�s a real whodunit, with lots of clues and plenty of
 suspects. And of course, there�s that wonderful dinner fare for
 which The Station is so well known.

The actors, most of whom live in the Rogue River area, will be
 familiar to patrons of Barnstormers Theater in Grants Pass as
 they have all been involved in a variety of plays for many
 years at that icon of local community theater. 

For information and details, contact:
Eric Epstein, Manager
541-226-6703

(I put some linebreaks in there, the code tags didn't word wrap)

Just copy-pasting it as text doesn't tell me what original encoding it had unfortunately. It also won't show bogus nonprinting characters. Could you save it to a file, so you can hexdump -C < filename and post that?

1 Like
00000000  41 20 67 72 6f 75 70 20  6f 66 20 52 6f 67 75 65  |A group of Rogue|
00000010  20 52 69 76 65 72 20 61  72 65 61 20 61 63 74 6f  | River area acto|
00000020  72 73 20 68 61 73 20 66  6f 72 6d 65 64 20 61 20  |rs has formed a |
00000030  6e 65 77 20 70 65 72 66  6f 72 6d 61 6e 63 65 20  |new performance |
00000040  74 72 6f 75 70 65 2c 20  63 61 6c 6c 69 6e 67 20  |troupe, calling |
00000050  74 68 65 6d 73 65 6c 76  65 73 20 45 78 69 74 20  |themselves Exit |
00000060  34 38 2e 20 54 68 65 69  72 20 66 69 72 73 74 20  |48. Their first |
00000070  70 72 6f 64 75 63 74 69  6f 6e 20 77 69 6c 6c 20  |production will |
00000080  62 65 20 61 6e 20 6f 72  69 67 69 6e 61 6c 20 6d  |be an original m|
00000090  75 72 64 65 72 20 6d 79  73 74 65 72 79 2c 20 77  |urder mystery, w|
000000a0  72 69 74 74 65 6e 20 62  79 20 52 6f 6e 64 61 20  |ritten by Ronda |
000000b0  42 61 67 6c 65 79 2c 20  61 20 6d 65 6d 62 65 72  |Bagley, a member|
000000c0  20 6f 66 20 74 68 65 20  67 72 6f 75 70 2c 20 74  | of the group, t|
000000d0  6f 20 62 65 20 73 74 61  67 65 64 20 61 73 20 61  |o be staged as a|
000000e0  20 64 69 6e 6e 65 72 20  74 68 65 61 74 65 72 20  | dinner theater |
000000f0  65 76 65 6e 74 20 61 74  20 54 68 65 20 53 74 61  |event at The Sta|
00000100  74 69 6f 6e 20 52 65 73  74 61 75 72 61 6e 74 20  |tion Restaurant |
00000110  69 6e 20 52 6f 67 75 65  20 52 69 76 65 72 20 6f  |in Rogue River o|
00000120  6e 20 46 72 69 64 61 79  2c 20 53 65 70 74 65 6d  |n Friday, Septem|
00000130  62 65 72 20 39 74 68 2e  20 44 69 6e 6e 65 72 20  |ber 9th. Dinner |
00000140  69 6e 20 74 68 65 20 6f  75 74 64 6f 6f 72 20 73  |in the outdoor s|
00000150  65 74 74 69 6e 67 20 77  69 6c 6c 20 62 65 67 69  |etting will begi|
00000160  6e 20 61 74 20 36 20 70  6d 2e 20 54 68 65 20 69  |n at 6 pm. The i|
00000170  6e 74 65 72 61 63 74 69  76 65 20 73 68 6f 77 2c  |nteractive show,|
00000180  20 61 20 77 65 73 74 65  72 6e 20 65 6e 74 69 74  | a western entit|
00000190  6c 65 64 20 e2 80 9c 57  65 6c 63 6f 6d 65 20 57  |led ...Welcome W|
000001a0  61 67 6f 6e e2 80 9d 2c  20 77 69 6c 6c 20 62 65  |agon..., will be|
000001b0  67 69 6e 20 73 68 6f 72  74 6c 79 20 61 66 74 65  |gin shortly afte|
000001c0  72 20 74 68 61 74 2e 20  0a 0a 54 69 63 6b 65 74  |r that. ..Ticket|
000001d0  73 20 61 72 65 20 24 33  30 20 70 65 72 20 70 65  |s are $30 per pe|
000001e0  72 73 6f 6e 2e 20 20 46  6f 72 20 72 65 73 65 72  |rson.  For reser|
000001f0  76 61 74 69 6f 6e 73 2c  20 70 6c 65 61 73 65 20  |vations, please |
00000200  63 61 6c 6c 20 54 68 65  20 53 74 61 74 69 6f 6e  |call The Station|
00000210  20 61 74 20 35 34 31 2d  35 38 32 2d 39 37 32 35  | at 541-582-9725|
00000220  2e 20 54 68 65 20 72 65  73 74 61 75 72 61 6e 74  |. The restaurant|
00000230  20 69 73 20 6c 6f 63 61  74 65 64 20 61 74 20 35  | is located at 5|
00000240  39 35 20 4d 61 69 6e 20  53 74 72 65 65 74 20 69  |95 Main Street i|
00000250  6e 20 52 6f 67 75 65 20  52 69 76 65 72 2e 20 0a  |n Rogue River. .|
00000260  0a 47 75 65 73 74 73 20  61 72 65 20 69 6e 76 69  |.Guests are invi|
00000270  74 65 64 20 74 6f 20 70  61 72 74 69 63 69 70 61  |ted to participa|
00000280  74 65 20 61 73 20 6d 75  63 68 20 61 73 20 74 68  |te as much as th|
00000290  65 79 20 6c 69 6b 65 2c  20 65 76 65 6e 20 74 61  |ey like, even ta|
000002a0  6b 69 6e 67 20 6f 6e 20  74 68 65 20 72 6f 6c 65  |king on the role|
000002b0  73 20 6f 66 20 73 6f 6d  65 20 6f 66 20 74 68 65  |s of some of the|
000002c0  20 6d 69 6e 6f 72 20 63  68 61 72 61 63 74 65 72  | minor character|
000002d0  73 20 69 6e 20 74 68 65  20 70 6c 61 79 2e 20 49  |s in the play. I|
000002e0  74 e2 80 99 73 20 61 20  72 65 61 6c 20 77 68 6f  |t...s a real who|
000002f0  64 75 6e 69 74 2c 20 77  69 74 68 20 6c 6f 74 73  |dunit, with lots|
00000300  20 6f 66 20 63 6c 75 65  73 20 61 6e 64 20 70 6c  | of clues and pl|
00000310  65 6e 74 79 20 6f 66 20  73 75 73 70 65 63 74 73  |enty of suspects|
00000320  2e 20 41 6e 64 20 6f 66  20 63 6f 75 72 73 65 2c  |. And of course,|
00000330  20 74 68 65 72 65 e2 80  99 73 20 74 68 61 74 20  | there...s that |
00000340  77 6f 6e 64 65 72 66 75  6c 20 64 69 6e 6e 65 72  |wonderful dinner|
00000350  20 66 61 72 65 20 66 6f  72 20 77 68 69 63 68 20  | fare for which |
00000360  54 68 65 20 53 74 61 74  69 6f 6e 20 69 73 20 73  |The Station is s|
00000370  6f 20 77 65 6c 6c 20 6b  6e 6f 77 6e 2e 0a 0a 54  |o well known...T|
00000380  68 65 20 61 63 74 6f 72  73 2c 20 6d 6f 73 74 20  |he actors, most |
00000390  6f 66 20 77 68 6f 6d 20  6c 69 76 65 20 69 6e 20  |of whom live in |
000003a0  74 68 65 20 52 6f 67 75  65 20 52 69 76 65 72 20  |the Rogue River |
000003b0  61 72 65 61 2c 20 77 69  6c 6c 20 62 65 20 66 61  |area, will be fa|
000003c0  6d 69 6c 69 61 72 20 74  6f 20 70 61 74 72 6f 6e  |miliar to patron|
000003d0  73 20 6f 66 20 42 61 72  6e 73 74 6f 72 6d 65 72  |s of Barnstormer|
000003e0  73 20 54 68 65 61 74 65  72 20 69 6e 20 47 72 61  |s Theater in Gra|
000003f0  6e 74 73 20 50 61 73 73  20 61 73 20 74 68 65 79  |nts Pass as they|
00000400  20 68 61 76 65 20 61 6c  6c 20 62 65 65 6e 20 69  | have all been i|
00000410  6e 76 6f 6c 76 65 64 20  69 6e 20 61 20 76 61 72  |nvolved in a var|
00000420  69 65 74 79 20 6f 66 20  70 6c 61 79 73 20 66 6f  |iety of plays fo|
00000430  72 20 6d 61 6e 79 20 79  65 61 72 73 20 61 74 20  |r many years at |
00000440  74 68 61 74 20 69 63 6f  6e 20 6f 66 20 6c 6f 63  |that icon of loc|
00000450  61 6c 20 63 6f 6d 6d 75  6e 69 74 79 20 74 68 65  |al community the|
00000460  61 74 65 72 2e 20 0a 0a  0a 46 6f 72 20 69 6e 66  |ater. ...For inf|
00000470  6f 72 6d 61 74 69 6f 6e  20 61 6e 64 20 64 65 74  |ormation and det|
00000480  61 69 6c 73 2c 20 63 6f  6e 74 61 63 74 3a 0a 45  |ails, contact:.E|
00000490  72 69 63 20 45 70 73 74  65 69 6e 2c 20 4d 61 6e  |ric Epstein, Man|
000004a0  61 67 65 72 0a 35 34 31  2d 32 32 36 2d 36 37 30  |ager.541-226-670|
000004b0  33 0a                                             |3.|
000004b2

It's choking on the utf-8 encoded (e2 80 9c) left double quotation mark which does not exist (same goes for the right double quotation mark) in the latin-1 encoding.

Regards,
Alister

If that's the raw unconverted data it's definitely UTF8. The 'smart quotes' Microsoft Word likes to substitute for everything under the sun are the only things in there that need it.

The database still seems to believe it has to encode to latin-1. How did you change the database to UTF8?

ALTER  DATABASE  `googlecal`  DEFAULT  CHARACTER  SET utf8 COLLATE utf8_unicode_ci

do I need to do something else? I don't have control of what gets placed on the source google calendar, it could be anything, so it'd be nice to use an encoding that plays nice with others :slight_smile:

My guess would be that the python mysql client has been converting to and sending latin-1 while mysqld has been converting that to utf-8 before writing to the db (since mysqld simultaneously supports different character set/encodings per client connection/per database/server default). This works fine until you have a unicode character that's not part of latin-1.

I suggest fiddling with MySQLdb.connect() parameters. I believe you can set the encoding there and force use of Unicode for returned strings.

Sorry I can't be more specific, but I don't hack with python that much these days and my experience with mysql is quite limited.

Regards,
Alister

1 Like

Thanks alister: yes, it turns out you can specify this in the connection, so I added it to the code like:

#!/usr/bin/python   

from xml.etree import ElementTree 
import gdata.calendar.data 
import gdata.calendar.client 
import gdata.acl.data 
import atom.data 
import time 
import MySQLdb

calendar_client = gdata.calendar.client.CalendarClient()
username = 'user@gmail.com'
visibility = 'public'
projection = 'full'
feed_uri = calendar_client.GetCalendarEventFeedUri(calendar=username, visibility=visibility, projection=projection) 

# define mysql db connection/credentials 
conn = MySQLdb.connect (host = "localhost", user = "test1", passwd = "test1", db = "googlecal", charset = "utf8", use_unicode = True) 
cursor = conn.cursor ()  
feed = calendar_client.GetCalendarEventFeed(uri=feed_uri) 

print 'Events on Primary Calendar: %s' % (feed.title.text,)
for i, an_event in enumerate(feed.entry):
     print '\t%s. %s' % (i, an_event.title.text,)
     data_point = {}
     data_point[ 'title' ] = (i, an_event.title.text,)
     print '\t%s. %s' % data_point[ 'title' ]
     for a_when in an_event.when:
         print '\t\t%s. %s' % (i, an_event.content.text,)
         data_point[ 'content' ] = (i, an_event.content.text,)
         print '\t\tStart time: %s' % (a_when.start,)
         data_point[ 'start' ] = (a_when.start,)
         print '\t\tEnd time:   %s' % (a_when.end,)
         data_point[ 'end' ] = (a_when.end,)

         cursor.execute("""insert into events (id, title, content, start, end)
         values (NULL, %s, %s, %s, %s)""",
         (data_point[ 'title' ],
           data_point[ 'content' ],
          data_point[ 'start' ],
          data_point[ 'end' ]))

conn.commit()
cursor.close () 
conn.close ()  

but now I get syntax error:

Traceback (most recent call last):
  File "/home/unclecameron/Documents/workspace/google_pull_calendar/src/getivdailyview_gmail.cal7.py", line 41, in <module>
    data_point[ 'end' ]))
  File "/usr/lib/pymodules/python2.7/MySQLdb/cursors.py", line 166, in execute
    self.errorhandler(self, exc, value)
  File "/usr/lib/pymodules/python2.7/MySQLdb/connections.py", line 35, in defaulterrorhandler
    raise errorclass, errorvalue
_mysql_exceptions.ProgrammingError: (1064, 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version 
for the right syntax to use near \'), ("\'2011-09-09T21:00:00.000-07:00\'",))\' at line 2')

which I suspect is because the data from 'content' isn't fitting inside data_point [ 'content' ] very well? In awk/sed/bash this would relate to single/double quotes surrounding a content being put into a variable, not sure in python

Looking at your code, the first thing that stands out is that you are using 2-tuples where I suspect it should be a 1-tuple.

data_point[ 'title' ] contains two elements though the intention is to pass only one.

Assuming that the value you want to pass to the query is an_event.title.text, you could either not include i in the tuple or you could index the desired element of the 2-tuple parameter in the execute() call, e.g. data_point['title'][1].

The same goes for data_point[ 'content' ].

Regards,
Alister

---------- Post updated at 05:26 PM ---------- Previous update was at 05:06 PM ----------

Skimming over Writing MySQL Scripts with Python DB-API, it seems that the execute function expects a tuple sequence consisting of each value to be bound, in the order they are to be bound, to the %s placeholders.

Assuming the following code ...

... I believe the proper call to execute() would be:

         cursor.execute("""insert into events (id, title, content, start, end)
         values (NULL, %s, %s, %s, %s)""",
         (data_point[ 'title' ][1],
           data_point[ 'content' ][1],
          data_point[ 'start' ][0],
          data_point[ 'end' ][0]))

Although if it were my code, I'd modify the assignment statements to store strings instead of tuples in the data_point dictionary. Instead of:

data_point[ 'title' ] = (i, an_event.title.text,)
data_point[ 'content' ] = (i, an_event.content.text,)
data_point[ 'start' ] = (a_when.start,)
data_point[ 'end' ] = (a_when.end,)

You could try:

data_point[ 'title' ] = an_event.title.text
data_point[ 'content' ] = an_event.content.text
data_point[ 'start' ] = a_when.start
data_point[ 'end' ] = a_when.end

Which would then result in the cleaner:

         cursor.execute("""insert into events (id, title, content, start, end)
         values (NULL, %s, %s, %s, %s)""",
         (data_point[ 'title' ],
           data_point[ 'content' ],
          data_point[ 'start' ],
          data_point[ 'end' ]))

Regards,
Alister

1 Like

wow, okay, this really helps me understand, thanks again Alister :slight_smile: . Though the pain, I really have learned something, and am very thankful for all the help. The final code, in case anyone wants to do something similar, is listed below:

#!/usr/bin/python   

from xml.etree import ElementTree 
import gdata.calendar.data 
import gdata.calendar.client 
import gdata.acl.data 
import atom.data 
import time 
import MySQLdb

calendar_client = gdata.calendar.client.CalendarClient()
username = 'user@gmail.com'
visibility = 'public'
projection = 'full'
feed_uri = calendar_client.GetCalendarEventFeedUri(calendar=username, visibility=visibility, projection=projection) 

# define mysql db connection/credentials 
conn = MySQLdb.connect (host = "localhost", user = "test1", passwd = "test1", db = "googlecal", charset = "utf8", use_unicode = True) 
cursor = conn.cursor ()  
feed = calendar_client.GetCalendarEventFeed(uri=feed_uri) 

print 'Events on Primary Calendar: %s' % (feed.title.text,)
for i, an_event in enumerate(feed.entry):
     data_point = {}
     data_point[ 'title' ] = an_event.title.text

     for a_when in an_event.when:
         data_point[ 'content' ] = an_event.content.text
         data_point[ 'start' ] = a_when.start
         data_point[ 'end' ] = a_when.end

         cursor.execute("""insert into events (id, title, content, start, end)
         values (NULL, %s, %s, %s, %s)""",
         (data_point[ 'title' ],
           data_point[ 'content' ],
          data_point[ 'start' ],
          data_point[ 'end' ]))

conn.commit()
cursor.close () 
conn.close ()  

You're quite welcome.