My introduction to unicode, and Natural Earth ESRI shapefiles

While working on a project for a friend, I am adding data for different countries to a shapefile of the world. On a tip from Andy Woodruff, a cartographer I met at the Hubway Challenge Awards Ceremony, I switched to using the Natural Earth maps, which are completely in the public domain. Great maps, with the added bonus of country names with non-ASCII characters!

The data I started with is in Excel, with ALL CAPS country names. I needed to create a mapping from the .xls names to the names in the existing shapefile, then add new fields for my data and attach the values to the corresponding country shapes. It turns out that in the Excel file CÔTE D’IVOIRE is the only name with any fancy characters. Note, I am new to Unicode, so I hope that renders as a capital O with a circumflex in your browser too. Read in via the Python csv module and decoded as UTF-8, this name is represented in Python by the string cote_str = u"C\xd4TE D'IVOIRE". The ‘u’ prefix indicates it is a unicode string. When printed to my console using “print cote_str” it is encoded using my terminal’s default encoding, UTF-8, and displays as desired: CÔTE D’IVOIRE. Using the repr() function I can get at the details (u"C\xd4TE D'IVOIRE") and see that the Unicode code point for this character is 0xd4. If I instead encode the string into a UTF-8 byte string, I can see the UTF-8 encoding for this character (the two bytes 0xc3 0x94) as it would be stored in a UTF-8 encoded file; see this unicode/utf-8 table for reference:

>>> cote_str.encode('utf-8')
"C\xc3\x94TE D'IVOIRE"
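To make the difference between a code point and its encoded bytes concrete, here is a minimal sketch. The session above is Python 2; this snippet is written so that it should also run on Python 3, and the variable names other than cote_str are only for illustration.

```python
# Code point vs. UTF-8 bytes for the one non-ASCII character in the name.
cote_str = u"C\xd4TE D'IVOIRE"

# The second character is a single code point, U+00D4.
code_point = ord(cote_str[1])

# Encoding to UTF-8 turns that one code point into two bytes: 0xC3 0x94.
utf8_bytes = cote_str.encode('utf-8')

# Decoding with the same codec round-trips back to the original text.
round_trip = utf8_bytes.decode('utf-8')

print(hex(code_point))
print(repr(utf8_bytes))
print(len(cote_str))    # number of characters
print(len(utf8_bytes))  # number of bytes, one more than the characters
```

The byte string is one element longer than the text because only the Ô needs a second byte; every ASCII character is stored unchanged.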

Had I thought to look I would have seen it clearly documented that “Natural Earth Vector comes in ESRI shapefile format, the de facto standard for vector geodata. Character encoding is Windows-1252.” Windows-1252 is very close to ISO-8859-1 (a.k.a. “latin1”): the two differ only in the range 0x80 to 0x9F, where Windows-1252 puts extra printable characters in place of control codes. However, it didn’t occur to me to check, and I ran into some unexpected problems, since several country names in this shapefile have non-ASCII characters.

The pyshp module doesn’t specify an encoding, so the names come back as raw byte strings (and Python’s own default encoding is “ascii”). So for example I end up with byte strings containing non-ASCII bytes: "C\xf4te d'Ivoire". When printed, the bytes go to the terminal as-is and are interpreted as UTF-8, but since 0xf4 followed by “t” is not a valid UTF-8 sequence it renders as: “C�te d’Ivoire”. More problematic, other operations won’t work; for instance, I need to compare this country name to the ones in the .xls file. Note that I found it confusing at first that Unicode code points 0-255 match the latin1 byte values, while UTF-8 encodes everything from 128 upward differently (because UTF-8 uses a variable number of bytes, a lone byte from the upper half of latin1 is not valid UTF-8 at all; Wikipedia’s description chart shows it well).
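As a small illustration of why that byte string misbehaves, the sketch below decodes it three ways. It should run on Python 2.7 or Python 3; the names utf8_ok, lossy and name are made up for this example.

```python
# Bytes as they are stored in the Windows-1252 encoded shapefile.
raw_c = b"C\xf4te d'Ivoire"

# Strict UTF-8 decoding fails: 0xF4 announces a multi-byte sequence,
# but the following byte ('t') is not a continuation byte.
try:
    raw_c.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# With errors='replace' the bad byte becomes U+FFFD, the replacement
# character that terminals show as a question-mark glyph.
lossy = raw_c.decode('utf-8', 'replace')

# Decoding with the codec the file was actually written in recovers the name.
name = raw_c.decode('cp1252')

print(utf8_ok)
print(repr(lossy))
print(repr(name))
```

Once the name is decoded with the right codec, comparisons against the spreadsheet become straightforward, for example by upper-casing the unicode string before comparing.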

The raw byte string:

raw_c = "C\xf4te d'Ivoire"

can be converted to unicode with the proper encoding:

u_c = unicode(raw_c, 'cp1252')

which is now a unicode string (u"C\xf4te d'Ivoire") and will print correctly to the console (because print encodes it with the console’s encoding).

Just playing about some more.

raw_utf8 = u_c.encode('utf-8')

raw_utf8 now stores "C\xc3\xb4te d'Ivoire"; note that UTF-8 needs two bytes to store the ô. This prints correctly on my Linux console because the console itself is using UTF-8.

On Windows, however, I again get something weird looking, because the Windows command line uses code page 437 as the console encoding. Using u_c.encode('cp437') gives me a byte string that prints correctly in this case: "C\x93te d'Ivoire". Having fun yet?
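To see all of these encodings side by side, here is a short sketch that encodes the same unicode string with each codec mentioned so far. It should run on Python 2.7 or Python 3; the dictionary name encoded is just for illustration, and sys.stdout.encoding reports what Python believes the console uses (it can be None when output is piped).

```python
import sys

u_c = u"C\xf4te d'Ivoire"

# Encode the same text with each codec discussed above.
encoded = {}
for codec in ('latin1', 'cp1252', 'utf-8', 'cp437'):
    encoded[codec] = u_c.encode(codec)
    print('%-7s %r' % (codec, encoded[codec]))

# What Python thinks the console encoding is; a byte string only displays
# correctly when it was encoded with this same codec.
print(sys.stdout.encoding)
```

The text never changes; only the bytes chosen to represent the ô do, which is why the same byte string can look right in one console and wrong in another.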

Moral of the story: debugging Unicode can be confusing at first. Working with unicode strings rather than raw byte strings makes things clearer.

Tired of typing in '\xf4' etc.? You can change the encoding Python assumes for a source file from ASCII to something else by adding a special comment on the first or second line of the file:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# This encoding allows you to use unicode in your source code
st = u'São Tomé and Principe'

Here is a good reference on typing unicode characters in emacs.

Now I am less confused and have all the tools I need to work with these shapefiles in the imperfect but still pretty functional pyshp module.

  1. Convert latin1 byte strings to unicode using unicode(s, 'latin1')
  2. Add the needed custom mapping entries by typing in unicode.
  3. Convert the unicode strings back to latin1 before saving the shapefile.

Hacky, but it works.
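The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: plain lists stand in for the name fields read from and written to the shapefile (no pyshp calls are shown), the helper names and the "SAO TOME & PRINCIPE" spreadsheet spelling are hypothetical, and the codec is taken from the Natural Earth note about Windows-1252. It should run on Python 2.7 or Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical sketch of the decode / map / encode workflow.
SHAPEFILE_ENCODING = 'cp1252'  # per the Natural Earth documentation


def to_unicode(raw):
    """Step 1: byte string from the shapefile -> unicode string."""
    return raw.decode(SHAPEFILE_ENCODING)


def to_bytes(text):
    """Step 3: unicode string -> byte string for the shapefile."""
    return text.encode(SHAPEFILE_ENCODING)


# Stand-in for the raw name fields of three shapefile records.
raw_names = [b"C\xf4te d'Ivoire", b"France", b"S\xe3o Tom\xe9 and Principe"]
names = [to_unicode(n) for n in raw_names]

# Stand-in for the ALL CAPS names from the spreadsheet.
xls_names = [u"CÔTE D'IVOIRE", u"FRANCE", u"SAO TOME & PRINCIPE"]

# Step 2: custom entries, typed in unicode, for names that do not match
# after upper-casing (this particular entry is invented for the example).
custom = {u"SAO TOME & PRINCIPE": u"São Tomé and Principe"}

# Everything else is matched by comparing upper-cased unicode strings.
by_upper = {n.upper(): n for n in names}
mapping = {}
for x in xls_names:
    mapping[x] = custom.get(x, by_upper.get(x))

# Encode back to the shapefile's codec before saving.
out = [to_bytes(mapping[x]) for x in xls_names]
print(out == raw_names)
```

Keeping the codec in one constant means a switch from 'cp1252' to 'latin1' (or to whatever a different shapefile uses) is a one-line change.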