Working with shapefiles
I’ve been working with the City of Toronto’s open data catalogue, specifically the Bikeways route data. The route data is provided as an ESRI Shapefile, a once proprietary format, now an open standard.
If you go to the Toronto open data site, you’ll see there are two shapefiles provided: “MTM 3 Degree Zone 10, NAD27” and “WGS84 (Latitude/Longitude)”. After a short crash course on coordinate systems and projections, I found I’m in luck: WGS84 is what Google Maps uses, so the data is already in the correct format.
There are open source tools you can use to convert from shapefiles to something more usable, such as KML (if you’ve ever used Google Earth or the Maps API, you’re familiar with KML). There’s also a newer format called GeoJSON, which I’ll be using. In a stroke of luck, the Google Maps team just recently (March 2014) announced support for GeoJSON, through it’s new Data Layer functionality.
The first naive approach I took was to just dump the whole file into GeoJSON.
It didn’t take long, less than a second, and I was surprised not to be confronted with a wall of verbose text. I pushed it to Github. They render GeoJSON commits in a map view in the browser, so lacking any other tools to verify this output, I figured it’d be a good smoke test. What I didn’t realize was that the file was 50MB, not the 5MB I thought at first glance. The browser spun and spun, but eventually timed out and couldn’t render the file.
Clearly, it would take a bit more work. I got to reading about PostGIS, which is an spatial / geographical database extension for PostgreSQL.
I already had PostgreSQL installed via homebrew. I haven’t made any notable changes to the configuration. PostGIS can also be installed via Homebrew:
1 2 3
1 2 3 4
Importing the data
I then used the
shp2pgsql tool to generate an SQL script to load the data into PostgreSQL.
1 2 3 4 5 6 7
60,000 INSERT statements! At this point, I had no idea what the data looked like. The Toronto cycling map has a lot going on, but there’s certainly not 60,000 routes.
I loaded the data into PostgreSQL and started looking at it.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
The data will need to be cleaned up significantly. It’s denormalized with a lot of repetition.
Immediately, I noticed some entries that were clearly not bike paths. 427/Gardiner, Brown’s Line, etc. It didn’t take long to figure out the common thread:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Well, that’s a big difference: 60,000 entries to 6,500 entries. It seems the Toronto open dataset also includes all of the roads in the city. I suppose this makes sense, as in order to make a cycling map you would still need to include the roads that are not specifically designated for cycling.
I considered deleting the roads, but they may serve some purpose to me in the future. I created a view to keep the roads out of the way, to focus entirely on bikeways.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Some fun queries
Now, for some simple analysis of the data.
Toronto bikeways, by name, sorted by length
1 2 3 4 5 6 7 8 9
Toronto bikeways, by type, sorted by length
1 2 3 4 5 6 7 8 9 10 11 12 13
I still don’t fully understand projections
Disclaimer: I’ve been using PostGIS for less than a week.
PostGIS can store both geometry and geography. Geometry is as we learned in grade school: 2 dimensional cartesian coordinates, simple to manipulate, familiar formulas for length, area, etc. Geography uses latitude and longitude, which are 3-dimensional spherical coordinates. Asking a
geography for it’s length will return an answer in
degrees, whereas a
geometry responds in
metres. Simply put, the two don’t really mix.
Some things I’ve learned:
- A geography is the “only” way to store latitude/longitudes natively without projections.
- A geometry can only be positioned on the globe when combined with a SRID.
- Geometry has more features than geography and is much faster.
- Geography is more appropriate for doing “global” calculations, e.g. trans-oceanic flight paths.
- Geometry is more appropriate for city, province/county level calculations
Here’s where it gets messy. One thing that worries me is the following quote:
You might be tempted to store latitude and longitude in a geometry type column. That is, to set up your PostGIS column with a geometry type, but use SRID=4326 (which is the EPSG number for WGS 84 latitude and longitude).
Don’t do this.
I’m pretty sure that’s exactly what I’m doing. It’ll take me a bit longer to understand exactly what this means and why it’s a problem, because for now, everything is working fine.
UPDATE: This GIS StackExchange question helps clear things up a bit. More in a future post.
Next step: Creating a Rails app, with migrations to automate the shapefile import and associated data normalization.