Category Archives: Utils

LibreOffice autocorrect causes JSON.parse() to fail on unicode quote characters

Yikes, more character encoding problems! I am trying to format some demo data in a spreadsheet for a visualization I am using. I often use the technique of saving an “excel” type document as CSV and then using python to convert the CSV to a JSON file to read in the browser. (Clearly this is a non-optimal toolchain, but the .xlsx file usually isn’t changing, so I just run through it once.)

Today, however, I attempted to read a JSON object from a cell and ran into interesting trouble. I needed to capture annotations to the data at specific data points, for example at t = 5.2 sec, “car jumped off the ramp”. In the LibreOffice file I have entered something of this format in a cell:

{ “5.2” : ”Car jumped off the ramp!” , ”10.6” : ”Crossed the finish line” }

This looks like valid JSON to me, and if I type something similar with plain quotes, a JSON evaluator confirms it is valid JSON:

{
   "5.2":"Car jumped off the ramp!",
   "10.6":"Crossed the finish line"
}

However, when parsing this in javascript using JSON.parse() (required because the python csv.DictReader followed by json.dumps() only creates a JSON structure one level deep, so these complex cell contents are stored as a string), I got the following message:

Uncaught SyntaxError: Unexpected token “

Humm… does this font reveal the possible problem to you? It looks like a weird quote character, and the JSON spec is pretty clear: a plain old double quote is needed.

The file is saved as UTF-8 and, printed to the terminal (I verified the terminal encoding is also UTF-8 using “echo $LANG”), my quotes now print as “\xe2\x80\x9c” for the first one and “\xe2\x80\x9d” for the subsequent ones. These correspond to the Unicode code points “\u201C” and “\u201D”, also known as “Left Double Quotation Mark” and “Right Double Quotation Mark”.

I was also able to confirm this by pasting the error message quote character into this online hex converter.
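The same failure can be reproduced outside the browser. Here is a sketch using python’s json module, which rejects curly quotes just like JSON.parse() does (the curly quotes are typed with explicit \u201c/\u201d escapes):

```python
# -*- coding: utf-8 -*-
import json

good = u'{ "5.2": "Car jumped off the ramp!" }'  # plain double quotes
bad = u'{ \u201c5.2\u201d: \u201cCar jumped off the ramp!\u201d }'  # curly quotes

print(json.loads(good)[u"5.2"])  # parses fine

try:
    json.loads(bad)
except ValueError as e:  # the python analog of the browser's SyntaxError
    print("parse failed:", e)

# the left curly quote is three bytes in utf-8, matching what the terminal showed
print(u"\u201c".encode("utf-8"))
```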

Turns out that when editing the cell contents in LibreOffice (to adjust the format to something that *should* be valid JSON) my quotes were being auto corrected. You can disable this “feature” from the menu “Tools >> AutoCorrect Options…” by unchecking the Double quotes Replace option in the lower right of the dialog.

In case your source file isn’t under your control, the following python can be used to replace the unicode quote characters:

# replace Left and Right Double Quotation Marks with a plain "
fixedStr = brokenStr.decode('utf-8').replace(u'\u201c', u'"').replace(u'\u201d', u'"').encode('utf-8')

It also turns out that if you edit your cell contents in a different editor (like emacs) and then paste it in, the auto correction is not applied, which can make debugging a bit more confusing!

Lastly, I found the Unicode confusables site interesting; there are 15 confusable quote characters. So many!

Trying out the d3 matrix plugin by Erik Solen

I missed a d3 meetup event this fall (due to a Hurricane Sandy related scheduling adjustment). I heard great things about Erik’s talk, so I set myself an action item of taking a look at what he presented. Wish I had been there…

Turns out the talk was focused on making a jQuery plugin of some d3 code. As I have yet to architect any complicated sites, a lot of this was well over my head, especially without the audio of the talk. But I did some yak-shaving trying to understand what this code was about and what problems (which I have yet to run into) are being solved.

I got a rudimentary introduction to the following concepts, big and small ones:

  1. Require.js, AMD, “define”
  2. widget factory with jQueryUI, _create() versus _init() etc.
  3. d3 Nested Selection
  4. html table headings <thead>, yeah, I never used them before. :)
  5. unshift(), ditto.
  6. XMLHttpRequest error: can’t load a local file via a file:// url, must use an HTTP request
  7. Python SimpleHTTPServer, serves up a local file through a GET request.
  8. “json” format uses double quotes only (the prepared data file is read as ‘json’ and this threw me because default python output uses single quoted strings). No useful errors; where would I have found them? I guess this is more confusing because the file has a .js extension?
  9. buster javascript unit testing
  10. Fun emacs tricks
  11. Ajax…

Questions that I have that are unanswered:

  1. What is a good way to get your data from python to javascript? I keep writing python code to output valid javascript, which seems pretty similar to what Erik did too. Is there a better practice?
  2. Why use ajax? Requires the HTTP server running, etc…
  3. Why not put the table sideLabels and topLabels in with the data? Humm… in this case it made sense to me so I did, though I guess if they weren’t changing it might make sense to have them separate.
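On question 1, the approach I have been converging on (a sketch of my own guess at a reasonable practice, not necessarily what Erik recommends): let python’s json module produce the file, since json.dumps emits the double-quoted strings that JSON.parse and d3 expect, while python’s default repr prints single quotes (the pitfall from item 8 above). The file name “data.json” is made up:

```python
import json

annotations = {"5.2": "Car jumped off the ramp!", "10.6": "Crossed the finish line"}

print(str(annotations))         # single-quoted python repr - not valid JSON
print(json.dumps(annotations))  # double-quoted strings - valid JSON for the browser

# write it to a file for the page to fetch
with open("data.json", "w") as f:
    json.dump(annotations, f)
```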

Unfortunately I am too much a novice to understand the real value of this talk without the talk. I did use the code though to plot the adjacency matrix of character interactions from Shakespeare’s Antony and Cleopatra, which I was performing with friends on the night of the talk. This is not the best application for the d3 matrix plugin, but I guess I gleaned what I could from this missed opportunity.

Since what I really wanted is a matrix, it would be nice if it were square and the topLabels vertical, in which case it really makes more sense to use svg than to use a table. So while not perfect, I figured this was a good enough stopping point for this investigation.

Unicode with HTML and javascript

OMG, really, character encoding problems again??! The adventure continues now in the browser.

First problem: I tried to use the d3 projection.js module, but including it gives me the error “Uncaught SyntaxError: Unexpected token =” in projection.js line 3. Looking at the file I am initially confused:

(function() {
  var ε = 1e-6,
      π = Math.PI,
      sqrtπ = Math.sqrt(π);

Until I noticed this module does a lot of fancy math with characters like ζ, μ, λ, π and φ. Ah ha! Perhaps this is my problem. Lo and behold, my lazy html didn’t declare a character encoding. The error was resolved by adding the following in the head of index.html:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Fast forward sometime later… Oh no! My old friend Cote d’Ivoire isn’t looking right:

Looks like my shapefile data, now converted to GeoJSON, is still in a non-UTF-8 encoding. Switching the html encoding to <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" /> results in the tooltip rendering as expected, but clearly I am not willing to give up on using 3rd party code that is encoded in UTF-8, am I?

Fortunately there is another easy fix. Simply specify the encoding on the data file when I import it:

<script type="text/javascript" charset="ISO-8859-1" src="non-UTF-8_data_file.js"></script>

And all is well again:

Whew! Previous battles with character encodings came in handy.

My introduction to unicode, and Natural Earth ESRI shapefiles

Working on a project for a friend I am adding data for different countries into a shapefile of the world. On a tip from Andy Woodruff, a cartographer I met at the Hubway Challenge Awards Ceremony, I switched to using the Natural Earth maps which are completely in the public domain. Great maps, with the added bonus of country names with non-ascii characters!

The data I started with is in Excel, with ALL CAPS country names. I needed to create a mapping from the .xls names to the names in the existing shapefile, then add new fields for my data and attach it to the corresponding country shapes. Turns out that in the excel file CÔTE D’IVOIRE is the only name with any fancy characters. Note, I am new to unicode, so I hope that renders as a capital O with a circumflex in your browser too. The python csv module correctly reads the CSV exported from excel as utf-8 encoded, so in python this name is represented by the string cote_str = u"C\xd4TE D'IVOIRE". The ‘u’ prefix indicates it is a unicode string. When printed to my console using "print cote_str" it is rendered using my terminal’s default encoding of utf-8 and displays as desired: CÔTE D’IVOIRE. However, using the repr() method I can get at the details: u"C\xd4TE D'IVOIRE" shows that the unicode code point value for this character is 0xd4. If I encode the string into a utf-8 byte string, I can see the utf-8 encoding for this character (c394) as it would be stored in a utf-8 encoded file; see this unicode/utf-8 table for reference:

>>> cote_str.encode('utf-8')
"C\xc3\x94TE D'IVOIRE"

Had I thought to look I would have seen it clearly documented that “Natural Earth Vector comes in ESRI shapefile format, the de facto standard for vector geodata. Character encoding is Windows-1252.” Windows-1252 is a superset of ISO-8859-1 (a.k.a. “latin1”). However, it didn’t occur to me to check and I ran into some unexpected problems, since several country names in this shapefile had non-ascii characters.

The pyshp module doesn’t specify an encoding, so python’s default of "ascii" is used. So, for example, I end up with byte strings containing non-ascii characters: "C\xf4te d'Ivoire". When printed, the terminal interprets the bytes as utf-8, but since 0xf4 is not a valid utf-8 sequence it renders as: "C�te d'Ivoire". More problematic, other operations won’t work; for instance I need to compare this country name to the ones in the .xls file. Note that I found it confusing at first that unicode and latin-1 share the same values for code points 0-255, while utf-8 uses different byte sequences above 127 (because utf-8 uses a variable number of bytes, the upper part of latin1 is not valid utf-8 at all; wikipedia’s description chart shows it well).
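The mismatch is easy to demonstrate; in this sketch the literal byte string stands in for what pyshp handed back:

```python
raw = b"C\xf4te d'Ivoire"  # bytes as stored in the cp1252-encoded shapefile

# 0xf4 on its own is not a valid utf-8 sequence, so utf-8 decoding blows up
try:
    raw.decode('utf-8')
except UnicodeDecodeError:
    print("not valid utf-8")

# decoding with the encoding the file actually uses works
print(raw.decode('cp1252'))  # Côte d'Ivoire
```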

The raw byte string:

raw_c = "C\xf4te d'Ivoire"

can be converted to unicode with the proper encoding:

u_c = unicode(raw_c, 'cp1252')

which is now a unicode string (u"C\xf4te d'Ivoire") and will print correctly to the console (because print is converting it to the correct encoding for the console).

Just playing around some more:

raw_utf8 = u_c.encode('utf-8')

raw_utf8 now stores "C\xc3\xb4te d'Ivoire"; note that utf-8 needs two bytes to store the ô correctly. This will print correctly on my linux console because the console uses utf-8.

However, on windows I again get something weird looking, because the windows command line uses code page 437 as the console encoding. Using u_c.encode('cp437') gives me a byte string, "C\x93te d'Ivoire", that prints correctly in this case. Having fun yet?
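Those console experiments can be collected into one snippet (all stdlib codecs; the byte values match the ones above):

```python
u_c = u"C\xf4te d'Ivoire"  # unicode string, code point 0xf4 for the o-circumflex

print(repr(u_c.encode('utf-8')))   # \xc3\xb4 - two bytes, right for a utf-8 console
print(repr(u_c.encode('latin1')))  # \xf4 - one byte, what the shapefile stores
print(repr(u_c.encode('cp437')))   # \x93 - one byte, right for the windows console
```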

Moral of the story, debugging unicode can be confusing at first. Using unicode strings is clearer.

Tired of typing ‘\xf4’ everywhere? You can change the encoding python assumes for your source file from ascii to something else by adding a special comment on the first or second line of the file:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# This encoding allows you to use unicode in your source code
st = u'São Tomé and Principe'

Here is a good reference on typing unicode characters in emacs.

Now I am less confused and have all the tools I need to work with these shapefiles in the imperfect but still pretty functional pyshp module.

  1. Convert latin1 binary strings to unicode using unicode(s, 'latin1')
  2. Add the needed custom mapping entries by typing in unicode.
  3. Convert the unicode strings back to latin1 before saving the shapefile.

Hacky, but it works.
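The string handling in those three steps can be sketched as follows (the pyshp read/write calls are omitted, and the variable names are made up):

```python
raw = b"C\xf4te d'Ivoire"  # latin1 bytes as read from the shapefile

name = raw.decode('latin1')                # 1. byte string -> unicode
mapping = {u"C\xd4TE D'IVOIRE": name}      # 2. mapping entry typed in unicode
back = mapping[u"C\xd4TE D'IVOIRE"].encode('latin1')  # 3. back to latin1 to save

print(back == raw)  # True
```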

Picture viewing web page

I wanted to share pictures taken from our Antony and Cleopatra Reading this week and decided to create a page with thumbnails and links to the full pictures. Sure there are programs out there that do it and host for you, but it is so hard to get at the full res images and I guess they just annoy me, so I wanted to do it old school. Also I figure it is a good opportunity to practice some of the skills I have been exploring lately.

1) Write a shell script to shrink the image files and create thumbnails. Well, actually, putting it in a script is overkill, but it’s the first shell script I have created, and it’s pretty simple:

  • create a file with a .sh extension and give it executable permissions with “chmod +x [name].sh”
  • add “#!/bin/sh” at the top
  • add commands that you want to execute from the commandline below

2) Shrink all the files. I used imagemagick (convert) to do this; the -thumbnail option is designed for creating thumbnails, and x300 makes the final image 300 pixels tall while maintaining the aspect ratio. The command I ended up using was:

ls *.JPG | xargs -I {} convert {} -thumbnail x300 thumbs/thumb_{}

3) I used a little JavaScript to generate the web page.

  • first created a list of all the files using underscore.js, then
  • used d3.js to populate the appropriate number of links to display the thumbnails as links to the full size images

The resulting picture page is not fancy, but functional.