Friday, March 14, 2014

Douglas-Peucker simplification... in PHP

Ramer-Douglas-Peuker. These immortal names send a warm thrill into the heart of everybody who works in GIS in any capacity. Simplifying geometries by removing "unused" points can reduce download times substantially, and since the vogue these days is to go heavy on the GeoJSON it helps to keep your GeoJSON as light as it can be (well, until you cut too many corners and have to back it off).

Go Get Some!

I know you're impatient. Head on over and just grab the code if you're in a hurry:
https://github.com/gregallensworth/PHP-Geometry/blob/master/Line_DouglasPeucker.php


The Need

So, there's an ArcGIS service and we want GeoJSON so the web site can load it from static files. This is for 3 reasons: a) the browser-side code is a lot simpler to load GeoJSON from an URL, than to load from ArcGIS services; b) hitting ArcGIS for updated data is silly since it won't be updated often, and loading from a static file means less server overhead; c) there's some post-processing needed on the data, e.g. standardizing "no" and "n" fields as false, converting the Length field from meters to miles and clipping to 2 decimals. One could argue that this is the client's job to massage their data, but I prefer not to argue and to just massage the data.


But there's a problem: The resulting set of trails totals 2.5 MB of GeoJSON. On my crummy household DSL that takes some 10 to 13 seconds to download, more if the PNGs and CSS files and JS files aren't already cached and are being downloaded. And during this time the would-be user is watching a loading spinner, because we can't render the trails to the map until we have them in the browser.

A colleague at GreenInfo Network ran the data through o' Dougie P with 3 meter tolerance, and the results were impressive. No visible change in the trails even when zoomed in pretty well -- the rounding errors we could see were already there due to the client's own digitization, and were no worse. So let's have Los Tres Amigos massage our data as we load it out of ArcGIS and into our GeoJSON file.

Problem: I found two RDP implementations in JavaScript, one of those being JSTS. The JavaScript ones don't do much good since getting the data into the browser is the problem we're trying to solve. I found only one for PHP, part of GeoPHP which presumed upon us using their object model (or in fact running Drupal). ArcGIS supplies the multilines as lists of lists, but nothing could just handle that unless I went all the way with GeoPHP which for this application was major overkill.

So, I wrote my own, and there is is on Github. Scroll to the bottom of this post if you just want the code.

The Results

The results are spectacular.

We fetch the data in WGS84 lat/lon and RDP it a tolerance of 0.00002 degrees (about 3 meters at this latitude). The file size dropped from 2.5 MB to 1.5 MB and now load in about 6  seconds. Since the map is loading tiles and PNGs are still loading, you barely notice.

And quality is as great as ever: again, the cut corners are the client's own data, and not introduced by the simplification.

The Code

So, now some whining about how tough it was. :)

The basic algo is fairly easy to comprehend (the best things often are):
  • Draw a straight line from the start vertex to the end vertex.
  • For all points aside from the start and end, find the distance (in whatever map units, even degrees) from the line. This was the hard part. As you do this, track which vertex is furthest from the line.
  • If none are outside the tolerance, then your segment is simply the first and last vertices and all other vertices in between are dropped.
  • If any are outside the tolerance, split the line into two pieces at that furthest-out vertex. Process each of those halves via the process above (recursively: if a half has a vertex outside the tolerance, split the half in half, ...).
The hardest part for me, was with finding the distance between a point and a line. The GeoPHP implementation uses a unit magnitude technique which I just couldn't follow, and which was giving me completely wrong distances when I tried to do the same. Eventually I settled on this technique using good ol' slope-intercept algebra. That sure blew some of the cobwebs off my high-scool algebra I don't mind saying!

The one lame hack I had to make, is that the slope-intercept technique doesn't work for a segment which runs completely vertical (the start and end vertices have the same longitude) since that's a "divide by zero" slope. So my hack: fudge one of the points' longitude by 0.00001 units, so there's a defined slope and the distance can be calculated.

I figure that this won't be a problem: 0.00001 units is quite small. Even in degrees that comes up to about 4 feet at the equator, and the GPS you used to measure the trail isn't that accurate anyway.

12 comments:

  1. Is venice lat, lon or lon, lat, does it make a difference?

    ReplyDelete
  2. Could you give some sample inputs for single line string and multi king string so I'm clear what your function is expecting? Thanks

    ReplyDelete
  3. Regarding map units: The tolerance uses the coordinates as-is, effectively your map's native units. If your ordinates are in latitude and longitude, this means that 1.0 is perhaps 60 miles at the equator. If you're using meters or feet, then the tolerance is meters or feet.

    If you're using lat/lon coordinates, you can effectively guesstimate using this:
    1.0 degrees = 60 nautical miles = 111120 meters = 364567 feet
    Thus, if you want to use 30ft tolerance, try 0.00001 degrees for a starting approximation.

    ReplyDelete
  4. As to an example: I'm enhancing the documentation right now, to include a specific example. Check it out: https://github.com/gregallensworth/PHP-Geometry/blob/master/Line_DouglasPeucker-Usage.php

    Thanks for the nudge there. And thank you for the interest.

    ReplyDelete
  5. Thank you for working this out. And thank you for publishing it. It is going to be a big help as I sail the world.

    ReplyDelete
    Replies
    1. just one question. The tolerance is in the unit you are working with. I am working with Lat/Lon so followed your calculations at the equator. Does any one know hoe to adjust as I go north or south? Thank you

      Delete
    2. There really isn't a formula for calculating the length of "a degree" at a given latitude. The north-south distance and the east-west distance will change at different rates.

      If you want it simplified by a real distance such as meters, you would need to reproject to a planar SRS first.

      If you're working with fairly regional data, e.g. a city or county, then rough estimates such as "0.00001 degrees is 30 feet" do work reasonably well. But for large areas spanning a lot of latitude, I'd recommend working with a planar SRS instead.

      Delete
  6. I applied a modification to tolerance (using lat lon as "regular" decimals):

    // if a polygon has few points, the algorithm will hardly affect it
    $toleranceFactor = round(count($verts)/10+0.5)*10;
    $tolerance = 0.000001*$toleranceFactor;

    ReplyDelete
    Replies
    1. Under 5 points, the algorithm does not affect the polygon, at 100 points, it removes up to 50%.

      I must now take into account the total line extension or polygon area to further control $toleranceFactor.

      Delete
  7. This comment has been removed by the author.

    ReplyDelete
  8. I populate the $verts array thusly:

    $sqlVerts = "SELECT `Latitude`, `Longitude` FROM `Waypoints` WHERE `PolygonId` = 513";
    //echo $sqlVerts;
    $resultVerts =mysql_query($sqlVerts);
    $count = 0;
    while ($row = mysql_fetch_array($resultVerts)) {
    $verts[$count] = $row;
    $count++;
    }

    ReplyDelete

Note: Only a member of this blog may post a comment.