Category Archives: Visualization

Using Mapnik to render custom colored Map images

Say I want my output map of the world to be colored with a custom set of specific colors for countries, water, etc. I also want to output at a specific resolution for publication.

Map of World rendered in Mapnik

Natural Earth shapefiles have a built-in “mapcolor” attribute that groups countries appropriately for this purpose, which will be helpful!

Q-GIS has some nice functionality for playing around with the shapefile and exploring the data. But for a final output image it doesn’t meet the following requirements:

  • Controlling output image resolution and cropping (you can save to .png or .jpg, but there are no options for the resolution or dimensions of the saved image).
  • Configuring colors and color ranges can be done (a quick and dirty demonstration that the mapcolor attribute is what I expect is easy). But in Q-GIS this quickly becomes a very manual process: you have to specify colors via separate RGB fields, ugh. You can specify an attribute to use for the color categories, but in that mode it seems hard to configure the specific colors to use.
  • You must reload modified shapefiles and re-configure them anytime the source file changes.

I don’t doubt that there are plugins for saving files (say, if you want a vector image!!) and I could probably create my own symbology configuration for the colors I want, but since I am only catering to myself and don’t want to become an expert in some RSI-inducing graphical interface… time to find a scriptable way to generate high quality output images.

Mapnik is a map renderer that has bindings for lots of languages including python. I started with Getting started with mapnik in python, which walks through a simple example in no time. First problem solved: simply specify the image size when you declare the map:

m = mapnik.Map(600,300)

And specify the area you want to display when you save it:

m.zoom_to_box(mapnik.Box2d(-180, -90, 180, 90)) # or m.zoom_all()  
mapnik.render_to_file(m,'world.png', 'png')

Unfortunately, coloring the country groups is not trivial. Mapnik doesn’t currently have the ability to set the PolygonSymbolizer’s fill color based on a shapefile attribute (a feature in the works). However, as I am now an expert at manipulating shapefiles with pyshp, I hacked together an ugly but straightforward solution in a few minutes.

  1. Using pyshp, read in the shapefile to render and spit out n new files, one for each mapcolor group. See split_mapColors.py
  2. Create a unique mapnik style for each country color group and add each new shapefile to the map with the appropriate style layer. See renderCommon.py
Putting it all together, I used the script renderBaseline.py and other utility functions to generate the colored map of the world above. 
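
To give a flavor of step 2, here is a rough sketch in the Mapnik python API; the palette, style names, and per-group shapefile names are made up for illustration, and renderCommon.py does the real work:

    import mapnik

    # hypothetical palette: one fill color per mapcolor group from the split step
    colors = {1: '#8dd3c7', 2: '#ffffb3', 3: '#bebada', 4: '#fb8072'}

    m = mapnik.Map(1200, 600)
    m.background = mapnik.Color('#b3cde3')   # water color

    for group, fill in colors.items():
        style = mapnik.Style()
        rule = mapnik.Rule()
        rule.symbols.append(mapnik.PolygonSymbolizer(mapnik.Color(fill)))
        rule.symbols.append(mapnik.LineSymbolizer(mapnik.Color('#444444'), 0.2))
        style.rules.append(rule)
        m.append_style('color_%d' % group, style)

        layer = mapnik.Layer('countries_%d' % group)
        # one shapefile per mapcolor group, as produced by the split step
        layer.datasource = mapnik.Shapefile(file='countries_mapcolor_%d.shp' % group)
        layer.styles.append('color_%d' % group)
        m.layers.append(layer)

    m.zoom_to_box(mapnik.Box2d(-180, -90, 180, 90))
    mapnik.render_to_file(m, 'world.png', 'png')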

Using the same techniques, morphed shapefiles can be rendered in the same color palette. Change your mind about the colors? No problem.

Generating a cartogram of access to pain medication around the world.

Okay! Now that I have my data in a shiny new shapefile, it’s time to make some cartograms using ScapeToad.

The Data:  Morphine per Death as reported by the Global Access to Pain Relief Initiative (GAPRI)

I am working on this project with Kim Ducharme for WGBH, The World, for a series on cancer and access to pain medication, which highlights vast disparities in access between the developed and developing world. Below is a snippet of the data obtained from GAPRI, showing the top 15 countries for amount of morphine available/used per death by Cancer and HIV and the bottom 15 for which there were data.

                                   Country   mg Morphine/ Death
                            --------------   -------------------
                             United States   348591.575146
                                    Canada   332039.707309
                               Switzerland   203241.090828
                                   Austria   180917.511535
                                 Australia   177114.495731
                                   Denmark   160465.864664
                Iran (Islamic Republic Of)   149739.130818
                                   Germany   144303.589803
                                   Ireland   140837.280443
                                 Mauritius   121212.701934
                            United Kingdom   118557.183885
                                     Spain   116480.684253
                               New Zealand   112898.731957
                                   Belgium   108881.848319
                                    Norway   106706.195632

And the 15 countries with the least amount of morphine access:

                                   Country   mg Morphine/ Death
                            --------------   -------------------
                                   Burundi       38.261986
                                  Zimbabwe       34.508702
                                     Niger       31.359717
                                    Angola       30.485112
                                   Lesotho       25.998371
                                  Ethiopia       25.323131
                                      Mali       24.713729
                                    Rwanda       23.269946
                                  Cameroon       15.162560
                                      Chad       10.866740
                             Côte D'Ivoire        9.723552
                                  Botswana        9.352994
                                   Nigeria        8.780894
                              Sierra Leone        8.546830
                              Burkina Faso        7.885819

Traditional Cartogram

Based on these morphine/death numbers, in a basic cartogram where each country’s area becomes proportional to the metric, Switzerland would be about 60% of the size of the US. But wait… this wasn’t what I expected. Gosh, that’s ugly and hard to read… And so starts the cartogram study and tweaking experiment. Is there a perfect solution?
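
For a quick sanity check of that ratio, straight from the table above:

    203241.090828 / 348591.575146   # Switzerland / United States ≈ 0.58, i.e. roughly 60%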

Morphine per Death (as Mass)

Note the countries of Europe are too constrained to reach their desired sizes, so there is always some error in these images. Regardless of that, there are two issues: 1) Europe/Africa/Asia are so badly distorted as to become nearly unreadable, bringing the emphasis to a fish-eye view of Europe with weird France and Switzerland shapes. 2) This seems to make the whole story about Europe, de-emphasizing the US and Canada, which have higher usage than any of the European countries, and also taking the focus away from shrunken Africa/Asia and South America.

This seems to be the best that the diffusion based contiguous cartogram is going to be able to do for this data set. ScapeToad has some options for mesh size and algorithm iterations, none of which seem to significantly affect the output image in this case. The other option is to apply your metric to each shape as a “Mass” (as above) or as a “Density”. The ScapeToad documentation explains the Mass/Density distinction pretty well.

In our case Morphine/death is a “Mass/Mass” ratio, which is itself a “Mass”. However, for kicks I ran the “Density” option, which is technically wrong (it scales each country’s existing area by the metric, instead of making the area proportional to the metric as a traditional cartogram should). Lo and behold, the density image is certainly more satisfying and seems to tell a better story, although it over-emphasizes the role of the US, Canada and Australia, which all dwarf Europe:

Morphine per Death (as Density)

Well, this is a quandary: the “correct” image is too confusing to be useful and takes the focus away from the story about the developing world and into what-the-?-is-this-distorted-picture land. But the “density” image is not “correct”.

From here I spent some time trying to generate a less distorted mass based cartogram. By running the cartogram generation on each continent separately I generated much less distorted images of Europe and Africa (Asia still needs some work). Shown here are the raw outputs for these regions in green, purple and pale blue respectively.

Morphine per Death (as Mass) by region, unscaled

To piece the cartogram back together, the continents needed to be scaled and translated to the correct locations. Here is how far I got in that process. Europe is much easier to read and Africa is a huge improvement. Asia/the Middle East are still quite confusing; there is potential for improvement by breaking this into more chunks, but it was becoming a more and more manual process and the output image still isn’t “satisfying”.

Morphine per Death (as Mass) each region calculated separately, then scaled appropriately to maintain more recognizable shapes

Does this cartogram tell the story we want? Does it really make sense to honor country borders and make small countries as large as big countries that have the same morphine/death value? For example, all things remaining equal, if the German- and French-speaking parts of Switzerland split into two new countries, given the same morphine/death number, should each of the two halves have a cartogram area equal to the previous Switzerland, effectively doubling the size because of a political change? That doesn’t make much sense, but that would be considered a technically “correct” cartogram measure. It seems to me that in some ways scaling the area, as in the Morphine/Death as Density image, is more correct, as it doesn’t exaggerate small countries with smaller populations…

In the end it is possible to generate a lot of different cartogram images, some of which are suggestive of the story you want to tell, none of which are easily deciphered to provide actual data numbers. Keeping in mind that a cartogram isn’t a tool for communicating precise data measures, I suggest picking the one that makes sense to you vis-à-vis your data and the story you want to tell, not overstating the accuracy of the image, and providing other means to get at the actual numbers. For example, I created an alternate view in this choropleth map of the same data.

UPDATE 2012/12/03:

The final image is now published as part of PRI’s The World’s new series on Cancer’s New Battleground — The Developing World.

Access to Pain Medication around the World

Adding custom data to a shapefile using pyshp 1.1.4

As part of a cartogram-generating project I need to get data from a .xls file into a shapefile for use with ScapeToad. Reading the Excel file is easy with xlrd. Shapefiles? That is new territory.
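
For context, the xlrd side really is just a few lines. A minimal sketch, assuming a hypothetical morphine.xls with country names in the first column and the metric in the second:

    import xlrd

    book = xlrd.open_workbook('morphine.xls')     # hypothetical file name
    sheet = book.sheet_by_index(0)
    data = {}
    for row in range(1, sheet.nrows):             # skip the header row
        country = sheet.cell_value(row, 0)
        data[country] = float(sheet.cell_value(row, 1))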

Shapefile resources:

ScapeToad requires a high quality shapefile as input. I tested two that worked fine, both include population data suitable for testing:

  1. Thematic Mapping: http://thematicmapping.org/downloads/world_borders.php TM_WORLD_BORDERS_SIMPL-0.3.zip, which is licensed under the Creative Commons Attribution-Share Alike License. This license is not appropriate for commercial use, and the author didn’t respond to my question regarding other licensing options.
  2. Natural Earth: On a tip from Andy Woodruff, I switched to using Natural Earth shapefiles, which are in the public domain and suitable for anything you want! Note, I discovered that the most zoomed-out Natural Earth file, “ne_110m”, didn’t have shapes for some of the smaller island countries in my data set, so I switched to using the “ne_50m” versions, which included everything I needed.
Next step, getting custom data into the shapefile.

Using pyshp to add data to a shapefile

Since I do most of my data processing in python these days, I was happy to find a python module for reading, writing, and editing shapefiles. Unfortunately, pyshp is not the best maintained module. I used pyshp version 1.1.4 because it was packaged in Ubuntu. After discovering a number of bugs I realized they had already been reported, but nothing significant seems to have been fixed in 1.1.6. So I will just document the workarounds I used here.

1st pyshp 1.1.4 workaround: Renamed the shapefiles to remove any dots in the file name (the 0.3 in the case of thematic mapping shapefiles) because pyshp can’t handle extra dots in the file name.

This is kind of a nuisance since there are 4 files in a “shapefile”. This command will rename the files with extensions “dbf”, “prj”, “shp” and “shx” all at once:

 for a in dbf prj shp shx;do mv TM_WORLD_BORDERS-0.3.$a TM_WORLD_BORDERS_dot_free.$a;done

2nd pyshp 1.1.4 workaround: Massage numeric data you are adding to a record to have the correct precision.

My whole reason for using pyshp is to add data from Excel into the shapefile. This means adding fields to identify the data and then adding the data to the record for each shape. The format of the new attributes (a.k.a. fields) is well described here. In my case I want to add numbers, for example: sf.field('MY_DATA', 'N', 6, 3). The numeric args are width and precision, where width is the total number of characters used to represent the number and precision is the number of characters after the decimal point. The above (6, 3) can encode -1.234 and 98.765.

Note, pyshp will error (AssertionError: assert len(value) == size) if you put data into the record with greater precision than specified (it will not truncate for you). I used the simple hack below to force a precision of 3 for my data:

    def precision3(n):
        ''' Force argument to precision of 3'''
        return float('%0.3f'%n)
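
When copying a value from the spreadsheet into a record it gets used something like this (the index variables here are placeholders, not from my actual script):

    sf.records[record_index][field_index] = precision3(value_from_xls)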

3rd pyshp 1.1.4 workaround: When adding a new data field, pad all the records with default data for the new attribute.

pyshp assumes when saving the file that the data is perfectly formatted, but it doesn’t help much when adding or deleting data. Records are stored in python as a list of lists; when the shapefile is written, pyshp assumes that each record’s length equals the number of fields (as it should be). But it is your job to make sure this is true (if not, the records will wrap around and become nonsense). Q-GIS is useful for inspecting shapefile attribute tables to discover issues and verify that your new shapefile works in an independent reader.

In my case data wasn’t available for all countries, so I padded with a default value (appended to the end of all records when adding the field) and then looped through and put the correct data in the records for which data was available.

The example below adds a new field for numeric data and appends default data to all records. All my data is non-negative, so the magic number “1” accounts only for the decimal point.

    def addField(name, widthMinusPrecision, precision = 3, default = 0): 
        sf.field(name, 'N', widthMinusPrecision+precision+1, precision)
        # add default data all the way down to keep the shape file "in shape"
        for r in sf.records:
            r.append(default)
        return
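
For example, a call like the following would make room for values like those in the morphine table (six digits before the decimal point, three after); the field name is the same hypothetical one as above:

    addField('MY_DATA', 6)   # creates an 'N' field of total width 6 + 3 + 1 = 10, precision 3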

4th pyshp 1.1.4 workaround: delete() method doesn’t work, don’t use it.

Each shape is described by two pieces of data, linked together by their index. When deleting a shape, both the record (with the metadata) and the shape (with the coordinates etc.) must be removed. If only one is deleted, pyshp will add a dummy entry at the end and many of your records and shapes won’t line up anymore. The delete() method doesn’t handle this, so don’t use it; do it yourself:

    def deleteShape(n):
        del sf._shapes[n]
        del sf.records[n]
        return

5th pyshp 1.1.4 workaround: Handle non-ascii character encodings yourself

pyshp doesn’t declare a character encoding when reading files, so strings default to “ascii”. The Natural Earth shapefiles have non-ascii characters and are encoded in Windows-1252. (See a previous post for more info about the Natural Earth encoding.) I worked around this by looping over the records and decoding all strings to unicode:

    for r in sf.records:
        for i,f in enumerate(r):
            if isinstance(f, str):
                r[i] = unicode(f,'cp1252')

And then reversed this before saving the file via:

    for r in sf.records:
        for i,f in enumerate(r):
            if isinstance(f, unicode):
                r[i] = f.encode('cp1252')

6th pyshp 1.1.4 workaround: When looking at sf.fields adjust the index by one to ignore ‘DeletionFlag’

pyshp adds an imaginary field for internal state tracking to the beginning of the fields list. If you are looking up field names in this list to find record indexes, you should correct your indexes accordingly; there is not actually a field called ‘DeletionFlag’ in the records.
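
A small sketch of that lookup, using the hypothetical field name from earlier:

    # sf.fields[0] is the internal ('DeletionFlag', 'C', 1, 0) entry; skip it so
    # the resulting index lines up with the actual contents of each record
    field_names = [f[0] for f in sf.fields[1:]]
    field_index = field_names.index('MY_DATA')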

Conclusion:

After working around these bugs and massaging my country names to map from the .xls names to the names in the shapefile (17 cases of “Bolivia (Plurinational State Of)” == “Bolivia”), I was able to use pyshp to generate a new shapefile with my data in it! Next up, cartogram-orama.
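
The name massaging can be as simple as a lookup table applied before matching records; a sketch, with only the example mapping above filled in:

    # map .xls country names to the names used in the shapefile
    NAME_FIXES = {
        'Bolivia (Plurinational State Of)': 'Bolivia',
        # ... the other special cases go here
    }

    def shapefile_name(xls_name):
        return NAME_FIXES.get(xls_name, xls_name)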

Cartogram Basics

I am working on a cartogram of the world with my friend Kim Ducharme. We are looking for something dramatic to show the disparity between affluent countries and developing countries, and a cartogram seems a great way to drive the message home. This is my first foray into the world of cartograms, so here is some useful background.

A cartogram is a map where the areas of regions have been adjusted to represent some other metric of interest. They are intentionally distorted “maps” and yes there is controversy over them :).

Cartograms come in a few different flavors (see indiemaps summary):

  1. Non-contiguous Cartograms: Each object (state/country/etc.) grows or shrinks independently of its neighbors. The result is perfectly accurate, but the original map fills with white space. There is an excellent history of non-contiguous cartograms here.
  2. Dorling Cartograms: Replace the regions with circles or squares. I won’t discuss these further.
  3. Contiguous Cartograms: Keep boundaries connected and distort the shapes (often grossly) in an attempt to scale the country areas according to some metric. The canonical approach seems to be the Gastner/Newman diffusion method.
Our preliminary investigation showed that a non-contiguous cartogram was not a very satisfying image: too much white space, and it just doesn’t have the punch of a wildly distorted contiguous cartogram.
Here is an example of a diffusion-method image generated by Mark Newman; this one represents total spending on healthcare. Check out Mark Newman’s site for more in this family:

Having decided to pursue generating a contiguous area cartogram, the first step was to find out how.

Selecting a Cartogram program:

There are a few options I found for generating a contiguous area cartogram:

  1. Download the source for Gastner/Newman’s cart program, compile and run it. Looks doable, but would need to massage my data into some grid format. Maybe someone else has made it easier…
  2. Apparently ArcGIS has a Cartogram Geoprocessing Tool based on the Gastner/Newman method too. But I don’t have a budget for fancy commercial software.
  3. ScapeToad: At last, a ready to go cartogram generating program also based on the Gastner/Newman method. This one has a GUI and is released under the GPL. Perfect!

Using ScapeToad to generate a cartogram:

Using ScapeToad is easy! It has simple instructions and I only ran into a few issues, all easy to work around. It uses ESRI shapefiles and will output the updated map as an ESRI shapefile (with error data added in). I also found the ScapeToad documentation pretty helpful. The only annoying issue on my system was that the recently-used-files selection silently didn’t work, so when opening a shapefile (via “Add Layer…”) I always had to browse to the correct location. I will discuss the “Mass” and “Density” options in another post. Otherwise there is not much to say; I will let ScapeToad speak for itself.

ScapeToad requires the shapefile to have “perfect contiguity”, so find a suitable shapefile and test it before moving on. As discussed in a previous post, the Natural Earth shapefiles are now my go-to. These files conveniently include some population data you can use to test that ScapeToad is working.

More on the real challenges, adding your own data to a shapefile, coming up…

Reprojecting maps with QGIS

I figured out how to reproject maps with Q-GIS and would like to celebrate with this lovely image, a US Atlas Equal Area Projection, which I thought was the most fun of the built-in projections in Q-GIS 1.8.0:

If the Q-GIS documentation site is working it isn’t too hard to figure out how to reproject. Quick summary:

  • File >> Project Properties: You can set the project Coordinate Reference System (CRS). If you “Enable ‘on the fly’ CRS transformation” you can play around with the different projection options.
  • To set the CRS on a shapefile layer, right-click on the layer name and select “Set Layer CRS”. This is particularly useful if you don’t have ‘on the fly’ transformation enabled: layers in a different CRS than the project’s will not display, and you can change them here so they will.

Note: for practical equal area projections I found the Mollweide options seemed to work well, but I don’t yet understand the difference between the world and sphere options. Anyone?

Funny, it wasn’t until I started looking at these equal area projections that I realized how big Russia is!

A better simple map

I am learning to use pyshp (a pretty buggy but functional python module for reading and writing ESRI shapefiles) and Quantum GIS. As a quick demonstration I replotted the data from an earlier map. Q-GIS makes it pretty easy to adjust the appearance. The world map shapefile is from Natural Earth.

This time I generated a shapefile directly from python, which was super easy (I will highlight problems with pyshp in a future post, but creating this simple file worked fine, although it doesn’t seem to define a CRS; I am pretty sure it is WGS 84). Here is the code:

import shapefile

# 'stations' here is my existing list of station objects with name, lon and lat attributes
w = shapefile.Writer(shapefile.POINT)
max_len = max([len(s.name) for s in stations])
w.field('NAME', 'C', '%i' % max_len)
for s in stations:
    w.record(s.name)
    w.point(s.lon, s.lat)

w.save('stations')

Bikes In / Bikes Out — Hubway Data Viz group submission

Winner of Best Analysis!

Together with Kim Ducharme, Kenn Knowles, and Verena Tiefenbeck, I created an interactive visualization of bike movements throughout the city on a typical weekday. In order for the Hubway bike sharing system to work, there must be a bike available when desired and an empty dock available when returning. At peak hours at some stations, the imbalance introduced by commuters is enough that Hubway has to resupply and remove bikes during rush hour. The visualization we created allows you to see the bikes in and out of a station throughout the day, and using the map you can see the imbalance throughout the day. It is interesting that while the flow is high at lunch time, for example, it is pretty evenly balanced across the city. During prime commuting hours, however, the commuting patterns are clearly visible, and different patterns emerge around entertainment spots like Harvard Square in the evenings.

Check out the interactive Bikes In / Bikes Out visualization and the other entries in the Hubway Challenge.


Many thanks to the challenge organizers for hosting a Hackathon where I met my team for the first time, and to the team members for being awesome to work with and for making this happen in such a short time.

Kim — Thanks for laying out the concept, immediately giving us something to work towards, and for your keen eye for the visual details and color scheme. Sorry we couldn’t implement all the nice-to-haves! Your enthusiasm and encouragement were great as well.

Kenn — Wow, working with you was a great introduction to so many technologies on my “to learn” list and several I hadn’t even heard of yet. It’s almost embarrassing to list them, but here goes; this was the first time I used: CSS, Compass/Sass, git/GitHub, and the javascript libraries d3.js, underscore.js, jquery.js and knockout.js. Thanks for putting the infrastructure and architecture together; there is no way I could have put this together on my own, but I feel well positioned for it next time.

Verena — Thanks for your enthusiasm and insights on data trends especially casual versus registered usage patterns. I’m sorry we didn’t get to highlight that more!

Hubway Data Visualization Submission!

Click on the image below to see the full resolution image of my submission to the Hubway Data Challenge (it’s huge and should be made into a poster; that’s the only reasonable way to view it, sorry!). The explanation of the graphic is included below. Do you like it? Check out this and the other visualizations. This entry came in 16th in the popular vote of 67 entries. Not bad for a day’s work!

Hubway Station Connectivity Matrix

This graphic provides a breakdown of all station to station Hubway traffic, by month and by hour for the first 15 months of Hubway operation.

At the macro level, looking from top to bottom, you can see the growth of Hubway traffic since the system started in July 2011 (top row), to August 2012, on the bottom. Seasonal decline is visible near the winter months when Hubway does not operate (central black band). Notice the top-bottom symmetry around the winter, with fewer evening riders in November and March.

Variation throughout the day is visible from left to right. Quiet activity at night swells into commuter/work-day traffic. If you look very carefully you may see commuter bands at 8am and a more general swell between 4-6pm, which is explained by higher casual use of the system in the afternoons.

At the pixel level you can see the inter-connectivity of the 95 Hubway stations represented in the data set. Each hour/month combination creates a 95×95 pixel matrix. Each pixel in this grid is colored to represent the number of trips that originated at the station given by the pixel’s row and ended at the station given by its column. The stations were organized roughly by neighborhood, and the station list with pixel indexes is provided at the right.

The first things to notice in these small matrices are the diagonal bands running from upper left to lower right. These represent the 7% of total trips that started and ended at the same Hubway station. Clusters around the diagonal indicate traffic within neighborhoods. Boston, with the commuting hot spots of North Station and South Station, is unsurprisingly the brightest region of the matrix. Interestingly, in the evenings the city traffic doesn’t really stand out.

Looking at the top-to-bottom trend you can see the addition of stations throughout Hubway’s operation, most notably several new stations in July and August 2012, which fill in the empty spaces from before those stations were installed. It is interesting how interconnected the stations are, with even the longest station-to-station distances having been traveled by some intrepid rider. It is also interesting to note that approximately half of the colored pixels represent only one trip made between those stations during that month; most of the traffic comes from the expected commuters and tourists.

How I made it

This visualization is based on the station connectivity matrix images I had generated earlier when looking at the Hubway data. Several people thought they were interesting, but I struggled with how to capture the trends. Should I do an animation? Something with slider bars where you could move forward by month or by hour? There are so many stations that it is really hard to label them on a plot, and it looks all lop-sided and imbalanced. The matrix also really flattens the relationship between the station locations, but trying to put it on a map is just a mess as well (although some people have done some great things in the Hubway Visualization Challenge, you should check it out).

Hubway extended their deadline and our team submission was nearly complete, so I decided to see what I could do, in the 24 hours before the deadline, to visualize this data I had gathered. First I made a web page, planning to throw in some slider bars and tooltips/mouse-over information to provide the data and highlight station names. That was going okay. One of the problems is that this is really a large amount of data: each plot has nearly 10,000 data points and I wanted to show the breakdown by hour within each month. Decent looking .pngs were too big. I spent some time thinking about how to store and transmit the data.

Then I took a mini break and finished re-reading Tufte’s Data Visualization book, where I was reminded of the idea of small multiples as a way of presenting data. Whew! Could I do that?

Okay, I created a web page with small multiples (back to my novice web layout nightmares of flowing images). The .pngs at 1″x1″ were looking pretty bad.

Next I tried to generate an .svg with the data from one matrix and then see if I could composite them together. Just one crashed Inkscape. It’s a lot of data and a lot of little vector circles! Although I liked the number of trips being represented by circles, the sheer magnitude of the data pretty much ruled that out.

So a pixel map it is. Then I started getting super excited: I could align the matrices so that a new one starts every 100 pixels, and then the pixel location itself would serve as a legend! In the end I didn’t like the visual padding between the matrices, so they are at 95-pixel intervals, corresponding to the 95 stations.

I ended up generating the main image using the Python Imaging Library; since my data was all in python already, it was extremely easy! I then added the text and legend with the GIMP. I spent an inordinate amount of time trying out different color schemes and wasn’t really happy with any of them. The data distribution is very non-linear: half of the colored pixels represent a single trip. This is definitely an area worth more study.
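
Roughly, the PIL part looks like the sketch below, assuming a hypothetical trips dict keyed by (origin index, destination index) holding the trip count for a single month/hour; the real color ramp took far more fiddling than this:

    from PIL import Image

    N_STATIONS = 95

    def color_for(count):
        # crude non-linear ramp; most colored pixels are a single trip
        if count == 1:
            return (40, 40, 130)
        if count < 10:
            return (70, 130, 200)
        return (220, 60, 60)

    # 'trips' maps (origin_index, dest_index) -> number of trips (assumed to exist)
    img = Image.new('RGB', (N_STATIONS, N_STATIONS), (0, 0, 0))
    pixels = img.load()
    for (origin, dest), count in trips.items():
        pixels[dest, origin] = color_for(count)   # x = destination column, y = origin row

    img.save('matrix_one_month_hour.png')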

I am super excited about how many of the trends we know about were visible in this graphic. I am also excited about the amount of real underlying data I was able to include. Unfortunately the image is huge. But maybe I will have a poster made to celebrate my first visualization entry of all time.

Errata

The pixel count is too high. I removed the data from October 2012, for which the data was incomplete and whose inclusion was visually confusing and misleading in terms of the trends, but I forgot to update the numbers.

The legend doesn’t include all of the colors used. I struggled to find a color scheme that was pleasing and also conveyed the hot spots; I conclude it is hard to illustrate hot spots with only one pixel. Magenta and purple pixels all fall into the red category of very high traffic. A handful of pixels represent over 100 trips in the specified month/hour.

ToDo’s

I would like to try grouping stations by popularity within neighborhoods. I suspect with this change some neighborhoods would be easier to identify and provide a better “legend” of sorts.

Learn a vector drawing package and render the text in a vector form.

Correct issues above, perhaps tweak the color scheme some more or try grayscale.

Rotate month axis markers to be horizontal and adjust paragraph width on description and generally perfect the layout.

Bike Accumulation at Hubway Stations

Looking at the whole dataset gives a sense of how much effort must be spent keeping bikes/empty space available at different stations as needed. To some extent users are responsible for this because they have to leave the bike somewhere, which might mean going further to an empty station.

The following shows the normalized accumulation of bikes as a percentage of the total number of trips from each station:

The following shows the net bike accumulation or depletion:
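
A minimal sketch of how those accumulation numbers can be computed, assuming each trip record has hypothetical start_station and end_station attributes:

    from collections import Counter

    def station_flows(trips):
        arrivals = Counter(t.end_station for t in trips)
        departures = Counter(t.start_station for t in trips)
        stations = set(arrivals) | set(departures)
        # net accumulation: positive means bikes pile up, negative means they drain away
        net = dict((s, arrivals[s] - departures[s]) for s in stations)
        # normalized: net accumulation as a percentage of the trips leaving each station
        normalized = dict((s, 100.0 * net[s] / departures[s])
                          for s in stations if departures[s])
        return net, normalized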