Monday, December 16, 2013

A PHP proxy to do both GET and POST, and the limitations of JSONP

It's been busy the last few months! Working on a few projects that aren't ready for release yet, so not a lot I'll say about them. But here's something that may come in handy.

The Problem & Solution


This web application needs to fetch data from a remote data source. It's not based on the map per se: just a few dozen checkboxes to filter by this-n-that, some search terms, some date filters, all submitted to the server, and we get back some JSON. Pretty ordinary.

Problem is: it's on another server, so we have cross-domain issues. The browser successfully makes the GET request, but the response body is blank and the XHR returns an error status.

The common solution is to set up a proxy server. That is: a program hosted on the same site as your web application. You make your GET and POST requests to this program instead of to the remote one on the other site; it performs the GET or POST against the remote server and returns the content. Since this proxy lives on your website's own domain, your cross-origin problem is solved.

Three simple steps.

#1, create this PHP program and call it proxy.php
<?php
// a proxy to call the remote ReefCheck server
// we were doing well with JSONP until we tried using too many
// checkboxes, generating URLs that were too long for GET,
// and JSONP doesn't happen over POST

$PROXIES = array(
    '/count'  => 'http://other.remote.site.com/query/endpoint.php',
    '/search' => 'http://my.remote.site/some/endpointc.cgi',
);

$url = @$PROXIES[$_SERVER['PATH_INFO']];
if (!$url) die("Bad proxy indicator: {$_SERVER['PATH_INFO']}");
// are we GETting or POSTing? makes a difference...
if (@$_POST) {
    // compose a cURL request, being sure to capture the output
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_POST, TRUE);
    curl_setopt($curl, CURLOPT_POSTFIELDS,  http_build_query($_POST) );
    curl_setopt($curl, CURLOPT_HEADER, FALSE);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    $output = curl_exec($curl);
    curl_close($curl);

    print $output;
} else {
    // easy way out: the base URL plus ? plus params
    $url = sprintf("%s?%s", $url, $_SERVER['QUERY_STRING'] );
    readfile($url);
}

#2, Adjust the $PROXIES array to match the endpoints you need. Pretty obvious: each entry maps an alias onto a real URL, such as /search to http://some.site.com/api/v1/search.json

#3, change your JavaScript code to use your new proxy.
// old
$.get('http://some.site.com/api/v1/search.json', params, function(reply) { });
// new, note the /search matching the /search in $PROXIES
$.get('proxy.php/search', params, function(reply) { });

Dead simple. You should be able to make your queries, using exactly the same POST or GET parameters, and get back your reply exactly as if you had made the request to the real server... except without the blank response body and error status this time.

Features & Shortcomings

This proxy script supports both GET and POST, which right off is a great start, and it supports multiple endpoints. The GET case is handled as a plain URL via readfile(), which makes debugging easy (cURL can do GET of course, but I find debugging simpler when I can just dump a URL).

It does not support headers in either direction: no enctype, and no Content-type header from the remote source. If you're using jQuery, you'll definitely want to use the dataType parameter (the 4th parameter to $.get and $.post, forcing the interpretation of the data type). Personally I consider that 4th param a good practice anyway... And for file uploads, the missing enctype may be relevant, but there are iframe-based fixes for that anyway, such as ajaxForm().

And keep in mind the most basic issue of proxies like this: it's effectively triple-transiting the traffic: the browser calls your server, your server calls the remote API, your server gets the data back, and your server spits that data back out to the browser. If your server's network throughput is a concern, adding someone else's API to your server's responsibilities may not be pleasant.

A quick recap: JSONP and the need for this proxy

As we started development of this application, we used JSONP instead. To recap if you're not familiar with JSONP:
You're probably used to returning JSON, which is a representation of a structure such as { title:'My Stuff', size:100 }. JSONP takes this a step further and wraps that structure in a function call, forming executable JavaScript, like this: handleResults({ title:'My Stuff', size:100 }). The name of the function being invoked is defined by the &callback= parameter that you sent along with your request, so you can in fact name the function that will be used; e.g. &callback=handleResults is simply one more param in your usual GET request.
This does presume that the API endpoint is programmed to handle the &callback= parameter and wrap the JSON output, and that you're willing to specify this one extra parameter in your request. (Server-side, this really is simple to implement: you're about to spit out json_encode()'d data anyway, so check whether @$_GET['callback'] is set and change the output slightly if so. Client-side, if you use jQuery's $.get function, it can create a random function name for you, bind your callback to it, and supply the callback param for you; very little labor here.)
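Server-side, a minimal sketch of that wrapping might look like the following. (The data being returned here is made up for illustration; only the callback handling is the point.)
<?php
// whatever you were going to json_encode() and return anyway
$data = array('title' => 'My Stuff', 'size' => 100);
$json = json_encode($data);

// if a &callback= param was supplied (and looks like a sane function name), wrap the JSON in it
$callback = @$_GET['callback'];
if ($callback && preg_match('/^[A-Za-z_][A-Za-z0-9_\.]*$/', $callback)) {
    header('Content-type: text/javascript');
    print "{$callback}({$json})";
} else {
    header('Content-type: application/json');
    print $json;
}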
As long as your endpoint is JSONP-aware, and will accept a &callback= parameter and wrap the content in it, this is a great way to make your browser do the work itself without involving your proxy. Slightly faster transfer times, my server not needing to double-transit traffic, everybody wins...

...until the GET params become too much, and we must use POST.
But JSONP doesn't happen over POST!

You see, the spec changed at one point and we had to include A LOT of checkboxes and other such parameters, and the client's endpoint uses text names instead of primary keys, so it was entirely normal to construct a URL like this:
http://example.com/endpoint.json?counties[]=Santa Rosa&counties[]=Alameda&counties[]=San Mateo&species[]=Catfish&species[]=Salmon&species[]=Goldfish&species[]=Dogfish&species[]=Mulkey's Pollock
The URL params were now too long for GET, so remote servers started truncating our queries, hanging up on us, etc., and we had to use POST. But JSONP doesn't work over POST: it's done by injecting a script tag, and script tags can only make GET requests. If you try to combine JSONP with POST in jQuery, the request will automatically be changed to a GET.
As such, as much as I was fond of using JSONP while it lasted... ultimately we had to go for a PHP proxy.

Wednesday, December 4, 2013

PHP: Calculate the centroid of a polygon

A project I've been working on generates point markers onto a map. But the interesting part is how I populate that database of points. It's a series of "drivers" for connecting to the ArcGIS REST API, OGC WFS, CartoDB, and so on. More on that as it develops...

A need that came up today: this particular layer, being served from ArcGIS REST, is polygons, and we need points. Normally I would calculate the centroid and use that... but this specific software is being developed on a plain web host: no GEOS or OGR, no PostGIS... just plain ol' PHP.

Some Reading... then Writing


So I did some reading:
http://en.wikipedia.org/wiki/Centroid#Centroid_of_polygon

And wrote this, a pure PHP function for calculating the signed area of a polygon, and then the centroid of the polygon:
https://github.com/gregallensworth/PHP-Geometry/blob/master/Polygon.php
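If you just want the gist of it, here's a minimal sketch of the textbook formula for a single, simple (non-self-intersecting) ring. This is not the repository code verbatim, just the same math:
// signed area:  A  = 1/2 * sum( x[i]*y[i+1] - x[i+1]*y[i] )
// centroid:     Cx = 1/(6A) * sum( (x[i]+x[i+1]) * (x[i]*y[i+1] - x[i+1]*y[i]) )   and likewise for Cy
// $vertices is a list of array(x,y) pairs forming the ring; it need not repeat the first vertex at the end
function polygon_centroid($vertices) {
    $area = 0; $cx = 0; $cy = 0;
    $count = count($vertices);
    for ($i = 0; $i < $count; $i++) {
        list($x0, $y0) = $vertices[$i];
        list($x1, $y1) = $vertices[ ($i + 1) % $count ];
        $cross = ($x0 * $y1) - ($x1 * $y0);
        $area += $cross;
        $cx   += ($x0 + $x1) * $cross;
        $cy   += ($y0 + $y1) * $cross;
    }
    $area = $area / 2.0;
    if (! $area) return null;   // degenerate ring, no meaningful centroid
    return array( 'x' => $cx / (6.0 * $area), 'y' => $cy / (6.0 * $area), 'area' => $area );
}
// e.g. polygon_centroid(array( array(0,0), array(10,0), array(10,10), array(0,10) )) gives x=5, y=5, area=100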

Caveats


As described in the file, this is specific to our sorts of use cases: park boundaries, city boundaries, and the like. If the polygon is self-intersecting (that's a no-no) it may be wrong. If it's an L shape, the centroid may fall outside the polygon. And if it's a multipolygon and those rings overlap, it'll double-count the area, as it's not smart enough to find the intersecting area between two rings and subtract it out (and this incorrect area may affect the centroid placement).

But, it does what we need it to do, and it may work for you. Enjoy!


Monday, November 4, 2013

Setting up a CDN using Amazon EC2 Instances

Today's topic is setting up a content delivery network (CDN) for map tiles, using Amazon EC2 instances.

Why a CDN?


This particular set of map tiles is in use in dozens of maps that we know of (literally, about 50). The map tiles are very high quality, 20 - 30 KB apiece, and tiled maps being what they are, there can be 16 of them loading at a time for the base layer and 16 more for the labels layer.

They're being served from TileStache, so of course they're cached, and our 10 Gbps network continues to impress... but still, our server has other things to do, and nobody complains when map tiles load too quickly, do they?

The simplest approach would be the most expensive: get another server, set it up with TileStache, etc. That means 500 GB of storage, decent CPU and RAM, and hosting expense, when we just want to host tiles. (Sorry, but we can't do a simple S3 bucket of the PNGs; that presumes that all tiles are already generated, which isn't true.)


Much cheaper is a simple content delivery network (CDN) setup based on Varnish.


What's Varnish?


Varnish is a caching proxy server. You set it up on a server that isn't running a web server, configure it to point at a back-end web server (the real web server where your content resides), then point your browser at Varnish instead of your real web server. Varnish will silently cache things and serve them from its cache, refreshing from the back end web server as necessary.

Varnish can take a lot of tuning if you're outside its presumed default use case, but it's worth the trouble. Any item served from the cache isn't served from your web server, and that saves throughput, disk IO, firewall traffic, etc. for other things.

So the basic idea is:
  • Get a server with moderate stats: fast network, 1 GB of RAM, and a chunk of disk space. The network is the only part that needs to be really performant.
  • Set it up with Varnish, configured to point at the TileStache server. The configuration is minimal, just cache everything since the PNG and JPEG files really are everything.
  • Update my map apps to point to the new CDN URLs, and let them populate the caches. Tiles are served from the cheap server, sparing load on our big server.

The Amazon EC2 Instance


If you'll be running this cache for a while, consider paying for a Reserved Instance, as it's a significant discount.

Go for a Small instance, as you don't need a lot of CPU and RAM. A Small provides 1.5 GB of RAM which is great, and a decent CPU to do the job. Of course, if you want to raise the bar for more RAM, go right ahead! For this plan, we're looking at 50 GB of cached tiles accumulating over the months, so the difference between 1 GB of tiles in RAM and 2 GB of tiles in RAM isn't significant.

Don't go for a Micro instance, though the free tier is tempting. They're very low performance, particularly network throughput which is your premium here.

I gave the instance 60 GB of standard EBS storage (in addition to its 8 GB root filesystem), set to delete on termination. This will form the cache files. It's rather excessive, but our intent is that this cache go literally months without throwing out anything. For your case, you may want to go with a lot less disk space, maybe only 5 or 10 GB.

Choice of OS is up to you. We went with the Amazon Linux 64-bit AMI, which is a Red Hat style distribution (hence the yum commands below), so the instructions below may differ for you as far as pathnames and package management.

Give it a while, log in, run the recommended yum update, and let's get started.


Installing Varnish


Yay for repos, right? This creates the varnish user and group, which will be useful later on when we create the cache files.
sudo yum install varnish

Setting up the EBS Volume


This is standard behavior when adding a new, unformatted disk volume to your server. These commands would partition the disk, format it with a nice, large inode size, register it into fstab, and mount it to get you started.
sudo bash
    cfdisk /dev/sdb

    mke2fs -i 4096 -L CACHE /dev/sdb1
    echo 'LABEL=CACHE   /cache    ext2    defaults    1    1' >> /etc/fstab
    mkdir /cache 
    mount /cache
    df -hT
exit
On the cache volume, we want to dedicate all of the space to Varnish. In order to reduce disk fragmentation, let's preallocate all of the space now. This creates a single 55 GB file out of the 60 GB volume; below we'll configure Varnish for 50 GB, and the extra 5 GB is for overhead.
dd if=/dev/zero of=/cache/varnish.bin bs=1G count=55
chown varnish:varnish /cache/varnish.bin

Setting up tmpfs


A Small instance has 1.5 GB of RAM, which allows us to optimize Varnish a bit. Specifically, we can have Varnish write its non-persistent logfiles to a tmpfs, which means less disk IO.

While we're at it, Amazon Linux comes with a tmpfs already set up: about 800 MB and under /dev/shm. Let's get rid of that, as we won't be using it.

sudo vi /etc/fstab
    # comment out the line for /dev/shm on tmpfs
    # then add this line:
    tmpfs       /var/lib/varnish    tmpfs   size=100M    0   0

sudo umount /dev/shm

sudo mkdir /var/lib/varnish
sudo mount /var/lib/varnish
You should now see your tmpfs listed, showing 100 MB of available space:
df -hT
Why 100 MB? Because Varnish's shared memory log takes up about 85 MB of space. There's no point in creating a larger tmpfs, but a little bit extra won't hurt.



Setting Up Varnish


There are two files of interest:
  • /etc/sysconfig/varnish -- The startup configuration file. This is read by the service command and defines the daemon options, such as the HTTP port on which Varnish should listen and which VCL file to load.
  • /etc/varnish/default.vcl -- This configuration file instructs Varnish on what to cache and how to cache it. There's a lot of documentation on it, but it's still complicated.
The content of /etc/sysconfig/varnish is as follows:

# Configuration file for varnish
# /etc/init.d/varnish expects the variable $DAEMON_OPTS to be set
# from this shell script fragment.

# Maximum number of open files (for ulimit -n)
NFILES=131072

# Locked shared memory (for ulimit -l)  Default log size is 82MB + header
MEMLOCK=82000

# simple configuration
# a good chunk of our RAM (we have 1.5 GB, give it 800 MB) and that big cache file on the EBS volume
DAEMON_OPTS="-a XXX.XXX.XXX.XXX:80 \
             -f /etc/varnish/default.vcl \
             -T localhost:8080 \
             -u varnish -g varnish \
             -p thread_pool_min=200 \
             -p thread_pool_max=4000 \
             -s malloc,800M \
             -s file,/cache/varnish.bin,50G"
The content of /etc/varnish/default.vcl is as follows:
# Varnish VCL file for caching our TileStache tiles
# one named back-end, and some extreme caching for PNG and JPEG files

# our back end TileStache server
backend default {
  .host = "tilestache.myserver.com";
  .port = "http";
}

# extreme TTL! PNGs and JPEGs are kept for a full year,
# on grounds that it only changes once or twice per year so is really never stale,
# and that we would restart Varnish anyway (clearing the cache) when we eventually update
sub vcl_fetch {
    if (beresp.status == 301 || beresp.status == 302 || beresp.status == 404 || beresp.status == 503 || beresp.status == 500) {
        return (hit_for_pass);
    }
    if (beresp.http.cache-control !~ "s-maxage" && (req.url ~ "\.jpg$" || req.url ~ "\.png$")) {
        set beresp.ttl   = 30d;
        set beresp.grace = 365d;
    }

    return (deliver);
}

# add an extra header to the HTTP responses, simply indicating a Hit or Miss
# this is useful for diagnostics when we're wondering whether this is REALLY caching anything
sub vcl_deliver {
    if (obj.hits > 0) {
        set resp.http.X-Cache = "Hit";
    } else {
        set resp.http.X-Cache = "Miss";
    }
}

# Varnish won't cache effectively if there's a cookie, and a lot of our websites do use cookies.
# For tile caching purposes, ditch the cookies.
# Also, standardize the hostname: there are DNS aliases for tilestache, tilestache-1 through tilestache-4, etc.
# If we don't standardize the hostname, Varnish will cache multiple copies, from these multiple hosts.
sub vcl_recv {
    if (req.http.Cookie) {
        remove req.http.Cookie;
    }

    # we use one static hostname for all of these, but may be called as tilestache-1 through -4
    # standardize the hostname so it all goes into the one bucket
    set req.http.Host = "tilestache.myserver.com";
}
In the sysconfig, you need to either enter your own IP address for the XXXs, or else leave it blank for it to listen on all interfaces. On my system when I left it blank, someone hit port 80 on the other interface (on the 10.0.0.0/8 network) causing Varnish to start a second shared memory log. Not only is that useless work, but if you use tmpfs like I described, this second log wouldn't fit into the space available.

In the VCL, we strip out cookies. We can do that in this case, since we specifically only care to cache PNGs and JPEGs, and to cache them indefinitely without regard to sessions or the like. We also standardize the hostname, so Varnish won't cache multiple copies of the same tile for tilestache-1.myserver.com, tilestache-2.myserver.com, tilestache-3.myserver.com, and tilestache-4.myserver.com

The VCL also sets some extremely long caching for the PNGs and JPEGs, since our explicit goal here is cached images that won't be expiring for several months. And we add a custom HTTP header X-Cache which indicates Hit or Miss, which I find useful for debugging whether we're in fact caching anything.


Start It Up!


You should now be able to start the service, and set it to start when the server boots:
sudo service varnish start
sudo chkconfig varnish on
If you run a ps you should see two instances of varnishd, and netstat will show your new network listener on port 80.

Give it a test: get the URL of something that you think should be cached, e.g. one of the tile PNGs. Point your browser at the original and make sure you got it right. Now change the URL so the hostname points to your Amazon EC2 instance, and it should still work, having automatically fetched the remote content.

If you bring up Firebug or some other tool that shows HTTP headers, load up a few cached items, and look for the X-Cache header in the response. Some should say Miss, but subsequent loads of the content should start showing Hit. This X-Cache header is added by our VCL file, as described above, and is a good indicator that you're actually caching.

If you're interested in nuts and bolts, run varnishstat and varnishhist and look for your hit ratio. It will be very low at first, of course, because your cache is empty. But over time, the cache will fill and the hit ratio should go up.



And You're Done!


Looking good, huh? Great. Set up some DNS for your Amazon EC2 instance, so you can refer to it by a static hostname, then start updating your applications to use this new caching server. Bit by bit, the caches will fill, the hit ratio will increase, load on your main server will be reduced, and response times of the site will decrease.


About the Persistent Storage Back-End


An important note: the file storage backend purges the cache whenever you give a service varnish restart. Naturally, after this restart and your cache being empty, your hit rate is going to be awful until the cache fills again.

If you're feeling adventurous, you can edit the sysconfig file and change -s file to -s persistent, and Varnish will keep the cached files in between restarts. I didn't go with this because:
  • It's experimental, which doesn't give me a good feeling. Sorry to cop out on the community, though.
  • The store won't show up in varnishstat, so I can't get statistics on it.
  • These servers do nothing but Varnish, so should reboot annually at most. And giving it a restart is a super easy way to clear the cache, when we do make changes.

Elastic Computing At Its Finest


Some time back, I posted how we moved our web server off of Amazon EC2 and were enjoying improved performance and all that stuff. That's all still true.

But in our need for a remote caching scenario, a content delivery network where network speed was paramount and CPU & disk IO weren't strong needs, we have found an excellent use case well matched to EC2's (mediocre) hard disks and CPUs and (quite good) network.






Thursday, October 24, 2013

Weather Underground for historical tides & weather

In a recent (current) project, volunteers enter the results of surveys along the beach. As part of the survey, they are to note the weather conditions: visibility clear or limited, cloud cover percentage, temperature, precipitation yes/no, etc. They are also to note the height of the tide at that time.

Big bonus points, if I can make it look up the data at that time, date, and location, and have it auto-fill the boxes for them.

Most APIs are for Forecasts

It's easy enough to find a weather forecasting API. NOAA is one of several, and the GFS GRIB2 files can be had if you're hardcore. But those are low-resolution forecasts: what about yesterday's weather or the day before, and for a specific time instead of morning/mid/evening breakdowns?

And what about tides? Tide forecasts exist, but typically as full reports and not a readily-usable API. And you have to request a specific tide station, while we have simply a raw lat & long and no ready way to detect the nearest tide station. Besides, these are tide forecasts when we need yesterday's observations.

Weather Underground

Weather Underground has an API, and unlike the others, theirs goes into the past. Awesome. They also offer both weather observations and tide observations. And they have a free tier, limited to 500 hits per day. For our use case, this is far more than we realistically need, as these are office staff entering forms, not the general public hammering us with every hit.

So, step 1: sign up for an API key. Sign up for one that includes tides, and be sure to enable the History option when you sign up.


Making a Request / Creating an AJAX Endpoint

A request looks like this:
http://api.wunderground.com/api/APIKEY/history_YYYYMMDD/q/LAT,LON.json
Slot in the date, lat & lon, and your API key, and get back JSON. Dead simple.

In our case, I set up an AJAX endpoint written in PHP (CodeIgniter). It looks like this:
public function ajax_conditions($placeid,$date,$starttime) {
    header('Content-type: text/javascript');

    // validation: make sure they have access to this place ID, date and time filled in, etc.
    if (! preg_match('/^\d{8}$/', $date) )        return print json_encode(array('error'=>'Invalid date'));
    if (! preg_match('/^\d{4}$/', $starttime) )   return print json_encode(array('error'=>'Invalid starting time'));

    // code here translates between a $placeid and a lat / lon
    // yours will be very specific to your application

    // make the requests to wunderground for weather conditions and tide conditions
    $output = array();
    $latlon = $place->centroid();
    $time   = mktime((integer) substr($starttime,0,2), (integer) substr($starttime,2,2), 0, (integer) substr($date,4,2), (integer) substr($date,6,2), (integer) substr($date,0,4) );

    $weather_url = sprintf("http://api.wunderground.com/api/%s/history_%s/q/%f,%f.json", $this->config->item('wunderground_api_key'), $date, $latlon->lat, $latlon->lon );
    $weather_info = @json_decode(file_get_contents($weather_url));
    if (! @$weather_info->history) return print json_encode(array('error'=>'Could not get weather history. Sorry.'));

    $tides_url = sprintf("http://api.wunderground.com/api/%s/rawtide_%s/q/%f,%f.json", $this->config->item('wunderground_api_key'), $date, $latlon->lat, $latlon->lon );
    $tides_info = @json_decode(file_get_contents($tides_url));
    if (! @$tides_info->rawtide) return print json_encode(array('error'=>'Could not get tide history. Sorry.'));

    // weather as $weather_observation
    // go over the observations, find the one closest to the given $time
    // step 1: go over them, add a timedelta attribute, push onto a list
    $observations = array();
    foreach ($weather_info->history->observations as $observation) {
        $year = (integer) $observation->date->year;
        $mon  = (integer) $observation->date->mon;
        $mday = (integer) $observation->date->mday;
        $hour = (integer) $observation->date->hour;
        $min  = (integer) $observation->date->min;

        $obstime = mktime($hour, $min, 0, $mon, $mday, $year);
        $observation->timedelta = abs($obstime - $time);
        $observations[] = $observation;
    }
    // step 2: sort by timedelta, best observation is element [0] from the sorted list
    usort($observations,array($this,'_sort_by_timedelta'));
    $weather_observation = $observations[0];

    // tides as $tide_observation
    // step 1: go over them, add a timedelta attribute, push onto a list
    $observations = array();
    foreach ($tides_info->rawtide->rawTideObs as $observation) {
        $obstime = (integer) $observation->epoch;
        $observation->timedelta = abs($obstime - $time);
        $observations[] = $observation;
    }
    // step 2: sort by timedelta, best observation is element [0] from the sorted list
    usort($observations,array($this,'_sort_by_timedelta'));
    $tide_observation = $observations[0];

    // ta-da, we now have one observation for tides and one for weather
    // and they're the closest ones we have to the stated time

    // see below for code which massages the data into the desired output format

    // all set, hand it back!
    return print json_encode($output);
}
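The _sort_by_timedelta() helper isn't shown above; it's just a usort() comparator on the timedelta attribute we tacked onto each observation. A minimal sketch, assuming it lives in the same controller:
// usort() comparator: ascending by timedelta, so element [0] of the sorted list is the closest observation
public function _sort_by_timedelta($a,$b) {
    if ($a->timedelta == $b->timedelta) return 0;
    return ($a->timedelta < $b->timedelta) ? -1 : 1;
}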

Some neat points here:
  • As is my usual fashion, I use sprintf() and preg_match() extensively for validating the input, and check for errors. Otherwise, some wise guy can supply invalid params and make nasty-looking requests to wunderground on my behalf (hack attempts from my server? no thanks!), or even generate an error which causes PHP to tell him what URL was used... including my API key.
  • The return from wunderground is in JSON, and that's just super simple to parse. The return to the client is also in JSON, because it's super simple to generate.
  • The trick to finding the correct forecast for the time I have in mind, is to figure out the "time delta" between each forecast and the target time. One can then use usort() to sort by time delta, and slice off the first element of the array. That being the lowest time delta, it's the closest to the target time.

A Little More On The Endpoint

Now, the endpoint does go a step further. The browser end of the app doesn't want the raw numbers, per se, but the simplified, digested version. They want the following:
  • air temperature in F
  • a simple yes/no about precipitation
  • a simple perfect/limited for visibility
  • a percentage cloud cover, even if estimated
  • the Beaufort measurement of the wind
  • the height of the tide in feet, including a prefixed + if it's >0
In the code above, you see the "code which massages." Well, here it is:
    // compose output: weather
    $output['weather'] = array();
    $output['weather']['airtemperature'] = round( (float) $weather_observation->tempi );
    $output['weather']['precipitation'] = 'no';
    if ( (integer) $weather_observation->fog ) $output['weather']['precipitation'] = 'yes';
    if ( (integer) $weather_observation->rain) $output['weather']['precipitation'] = 'yes';
    if ( (integer) $weather_observation->snow) $output['weather']['precipitation'] = 'yes';
    if ( (integer) $weather_observation->hail) $output['weather']['precipitation'] = 'yes';
    $output['weather']['visibility'] = 'perfect';
    if ( (integer) $weather_observation->fog ) $output['weather']['visibility'] = 'limited';
    $output['weather']['clouds'] = 'clear';
    if ((string) $weather_observation->icon == 'mostlysunny')  $output['weather']['clouds'] = '20% cover';
    if ((string) $weather_observation->icon == 'partlycloudy') $output['weather']['clouds'] = '30% cover';
    if ((string) $weather_observation->icon == 'partlysunny')  $output['weather']['clouds'] = '50% cover';
    if ((string) $weather_observation->icon == 'mostlycloudy') $output['weather']['clouds'] = '80% cover';
    if ((string) $weather_observation->icon == 'cloudy')       $output['weather']['clouds'] = '100% cover';
    $output['weather']['beaufort'] = '1';
    if ( (float) $weather_observation->wspdi >=  4.0) $output['weather']['beaufort'] = '2';
    if ( (float) $weather_observation->wspdi >=  8.0) $output['weather']['beaufort'] = '3';
    if ( (float) $weather_observation->wspdi >= 13.0) $output['weather']['beaufort'] = '4';
    if ( (float) $weather_observation->wspdi >= 18.0) $output['weather']['beaufort'] = '5';
    if ( (float) $weather_observation->wspdi >= 25.0) $output['weather']['beaufort'] = '6';
    if ( (float) $weather_observation->wspdi >= 31.0) $output['weather']['beaufort'] = '7';
    if ( (float) $weather_observation->wspdi >= 39.0) $output['weather']['beaufort'] = '8';
    if ( (float) $weather_observation->wspdi >= 47.0) $output['weather']['beaufort'] = '9';
    if ( (float) $weather_observation->wspdi >= 55.0) $output['weather']['beaufort'] = '10';
    if ( (float) $weather_observation->wspdi >= 64.0) $output['weather']['beaufort'] = '11';
    if ( (float) $weather_observation->wspdi >= 74.0) $output['weather']['beaufort'] = '12';

    // compose output: tides
    // be sure to format it with a + and - sign as is normal for tide levels
    $output['tide'] = array();
    $output['tide']['time'] = date('g:ia', (integer) $tide_observation->epoch );
    $output['tide']['height'] = (float) $tide_observation->height;
    $output['tide']['height'] = sprintf("%s%.1f", $output['tide']['height'] < 0 ? '-' : '+', abs($output['tide']['height']) );
    $output['tide']['site']   = (string) $tides_info->rawtide->tideInfo[0]->tideSite;
The end result is exactly the fields they want, corresponding to the fields in the form.

Speaking of the Form...

Using jQuery, the additions to the form are relatively simple. It's a simple GET request, with the URL contrived to contain the /placeid/date/starttime parameters.
// make an AJAX call to fetch the weather conditions at the given place, date, and times
// along with disclaimer and credits per wunderground's TOU
function fetchWeatherConditions() {
    // remove the : from HH:MM and the - from YYYY-MM-DD
    var starttime = jQuery('#form input[name="time_start"]').val().replace(/:/g,'');
    var date      = jQuery('#form input[name="date"]').val().replace(/\-/g,'');
    var placeid   = jQuery('#form select[name="site"]').val();
    var url = BASE_URL + 'ajax/fetch_conditions/' + placeid + '/' + date + '/' + starttime;

    jQuery('#dialog_waiting').dialog('open');
    jQuery.get(url, {}, function (reply) {
        jQuery('#dialog_waiting').dialog('close');
        if (! reply) return alert("Error");
        if (reply.error) return alert(reply.error);

        jQuery('#form select[name="clouds"]').val(reply.weather.clouds);
        jQuery('#form select[name="precipitation"]').val(reply.weather.precipitation);
        jQuery('#form input[name="airtemperature"]').val(reply.weather.airtemperature);
        jQuery('#form select[name="beaufort"]').val(reply.weather.beaufort);
        jQuery('#form input[name="tidelevel"]').val(reply.tide.height);
        jQuery('#form select[name="visibility"]').val(reply.weather.visibility);

        // show the attribution & disclaimer
        jQuery('#dialog_wunderground_tideinfo').text('Tide information: ' + reply.tide.site + ' @ ' + reply.tide.time);
        jQuery('#dialog_wunderground').dialog('open');
    }, 'json').error(function () {
        jQuery('#dialog_waiting').dialog('close');
        alert("Could not contact the server to load conditions.\nMaybe you have lost data connection?");
    });
}
Notes here:
  • I like to open a "please wait" dialog, because it can take 2-3 seconds to get back a response.
  • The fields returned exactly fit those in the form. It's quite nice.
  • After populating the fields, I open a jQuery UI Dialog showing an attribution to wunderground, and mentioning where the tide data comes from since it may be several miles away from the actual location.

Conclusion

Unlike most other weather and tide prediction APIs, Weather Underground keeps historical records, and supplies both tides and weather in one simple API. And with their free tier, they really made my day... and our clients'.

Friday, October 4, 2013

javascript:void(0)

A common need that we come across (like, multiple times daily), is to have a button or hyperlink which opens a dialog or does some other action in JavaScript.

Way back in the late 90s, our technique was this:
 <a href="javascript:void(0)" onClick="doWhatever()">Click me</a>
But over the years, this was pointed out as being not entirely a good thing. It's a hyperlink to nothing, it can confuse screen readers (really? who has those, and do they really read out the entire URL?), and it's more semantically correct to do this:
<span class="fakelink" onClick="doWhatever();">Click me</span>

So I adopted this technique. A class called fakelink can be constructed which looks like other hyperlinks (cursor:pointer; text-decoration:underline; color:blue;) and now we don't have these semantically-incorrect null-links, just DIVs and SPANs with event handlers.

But, enter mobile...

As you have probably noticed, mobile devices won't necessarily detect these "hotspots", and they give strong preference to hyperlinks. On my six-month-old telephone running Android 2.1, for example, I have a menu of 3 links:
<li><a href="/postings">Postings</a></li>
<li><span class="fakelink" onClick="openSignupDialog();">Sign Up</span></li>
<li><a href="/catalog">Our Catalog</a></li>

On the desktop, this works great: three links, one of which opens the popup dialog. On mobile, not so much: it's impossible to click the middle link. It seems the phone detects the tap and finds the nearest hotspot, and you've tapped on one of the two outside links. Even without a menu of other hyperlinks, the tap events often "just didn't work".

So, back to the old ways...

<li><a href="/postings">Postings</a></li>
<li><a href="javascript:void(0);" onClick="openSignupDialog();">Sign Up</a></li>
<li><a href="/catalog">Our Catalog</a></li>

The desktop doesn't really care about this sort of semantic violation, and it works on phones. And I'd rather have working links, than win an argument about semantics.



Saturday, September 21, 2013

Raster analysis in a web application, using PostGIS Raster


A few weeks ago, I posted some techniques for using Python and agoodle to analyze raster data, clipping to a polygon and then reporting min/max/mean and pixel counts. Another technique, available as of a few months ago, is PostGIS Raster.

PostGIS 2.0 brings a lot of small changes and improvements, among them support for processing raster datasets. Yes, all the wonderful GEOS goodness of PostGIS... can now be done on rasters too.


Importing The Data



PostGIS can in fact import the binary content of your raster file into the database, as a set of binary blobs. We don't do this; we keep the raster in the filesystem (the -R flag) so we can download it, run other command-line programs, etc. Our use of raster2pgsql looks like this:


raster2pgsql -s 3174 -t 1000x1000 -I -C -F -R /projects/maine/rasters/landvalue_2005.tif




You can read the help in raster2pgsql to see the details about each flag, but in summary:

  • The table is created with a reference to a raster in the filesystem, and the SRS is stated.
  • The table will have multiple rows, each one being a 1000x1000 slice of the raster.
  • The column rast refers to the raster content.

Querying A Slice of The Data

So, we have a polygon such as a county boundary, and classified raster data such as NLCD land cover codes, and we want to find the pixel values and pixel counts within that county.

First, a test of the raster and pixel counting:
SELECT ST_ValueCount(rast,1,false)
AS valuecount FROM nlcd2005;

This performs an ST_ValueCount() on the rast column, without any other sort of filtering, clipping, etc. It returns a set of rows, each row being a tuple of 2 elements: the pixel value and the count of pixels having that value.
 (0,23975242)
 (1,442029)
 (2,265327)
 (3,42985)
Let's do this again, but this time clipping to a polygon:
SELECT ST_ValueCount(
    ST_Clip(rast, 1,
        ST_Transform(ST_GeomFromText('POLYGON((521879 5011246,522182 5011140,521751 5010750,521751 5010966,521358 5010966,521879 5011246))',26910), 3174)
    ),
    1, false
) AS valuecount FROM nlcd2005;
All this did was replace "rast", the first parameter to ST_ValueCount(), with a call to ST_Clip(), which clips a raster to a given geometry and returns the resulting raster. Simple.


Processing The Value Counts

A lot of the raster functions return a set of rows, and these are a nuisance to work with. You could tease them apart like this:
SELECT (ST_ValueCount(rast,1,false)).value AS pixel, 
    (ST_ValueCount(rast,1,false)).count AS howmany
    FROM nlcd2005;
But this has two problems: a) it runs the value count twice, which takes twice as long; and b) those field names within the record are poorly documented at this time, so it may take some guessing or reading the source code to find them.

Eventually I gave up on this, and took the easy way out: regular expressions. Given that it's a set of numbers inside parens, it's pretty simple to just process it within PHP where we would be handling these row results anyway:
// $rows is the usual result set from your DB layer, one row per ST_ValueCount() record
$counts = array();
foreach ($rows as $pxcount) {
    $bits = null;
    if (! preg_match('/\((\d+)\,(\d+)\)/', $pxcount->valuecount, $bits) ) continue;
    $pixel = $bits[1];
    $count = (integer) $bits[2];
    if (! array_key_exists($pixel,$counts)) $counts[$pixel] = 0;
    $counts[$pixel] += $count;
}
And now you have an associative array of pixel values onto pixel counts. From here, follow your usual techniques to figure out acres, percentages, or whatever you had in mind.


Pixel Counts To Acres

The easiest way to find acres is to multiply the pixel count by the known pixel size in meters/feet, then by the conversion factor to get acres. Like this:
$acres_nlcd_11 = $counts['11'] * 30 * 30 * 0.000247105;
If you don't know the pixel size of this specific raster, PostGIS Raster can tell you: ST_PixelWidth() and ST_PixelHeight(). It won't tell you the units, though.


Monday, September 2, 2013

agoodle for finding pixels above/below a threshold

In my previous post, I used agoodle to derive an associative array of pixel values onto pixel counts, and used this to calculate land cover statistics for a polygon. This was using classified data: pixels fall into one of these 15 specific pixel codes, period.

For a different project, we needed to calculate the min/max/mean of the values within a given polygon. This isn't classified data but floats, so the histogram idea covered last week isn't the best approach. Second challenge: we need to do this for 12 rasters, and at 1-2 seconds apiece that meant some 20 seconds; what can we do to speed it up?



The Program

Well, let's skip straight to the code.
#!/usr/bin/python
"""
Analyze a set of several rasters, clipping each to a given geometry (reprojected to the rasters' SRS as needed).
Return a JSON structure containing the min/max/mean of values for each raster.

- SRS of the given geometry must be specified as a command-line parameter
- Simplifying the geometry is highly advised, as it makes it run a lot faster.
- agoodle can consume a lot of RAM

# Newberg Oregon
python climatestats.py 26910 "MULTIPOLYGON(((502664 5019820,502895 5019229,504881 5018580,504883 5017770,505596 5017772,505165 5017463,505352 5016998,505830 5016998,505830 5016074,505461 5016065,505460 5015669,504591 5015666,504665 5016153,504352 5016157,504349 5014760,503990 5014719,503990 5014246,503901 5014697,503011 5014503,502600 5014722,501617 5014157,501164 5014523,501661 5014877,501204 5015506,501365 5015770,500791 5016133,501262 5016711,500838 5016596,501159 5016982,501019 5018184,501266 5018179,501010 5018557,501268 5018554,501269 5019571,502664 5019820)))"

# Barlow Oregon
python climatestats.py 26910 "POLYGON((521879 5011246,522182 5011140,521751 5010750,521751 5010966,521358 5010966,521879 5011246))"
"""

# the list of rasters to process, and what key to store them in
# in the output structure, e.g. OUTPUT_STATS['tmin_2010']
TIFFS = {
    'tmin_2010' : '/projects/oregon/rasters/min_temperature_current.tif',
    'tmin_2040' : '/projects/oregon/rasters/min_temperature_2040.tif',
    'tmin_2070' : '/projects/oregon/rasters/min_temperature_2070.tif',
    'tmin_2100' : '/projects/oregon/rasters/min_temperature_2100.tif',
    'tmax_2010' : '/projects/oregon/rasters/max_temperature_current.tif',
    'tmax_2040' : '/projects/oregon/rasters/max_temperature_2040.tif',
    'tmax_2070' : '/projects/oregon/rasters/max_temperature_2070.tif',
    'tmax_2100' : '/projects/oregon/rasters/max_temperature_2100.tif',
    'ppt_2010' : '/projects/oregon/rasters/precipitation_current.tif',
    'ppt_2040' : '/projects/oregon/rasters/precipitation_2040.tif',
    'ppt_2070' : '/projects/oregon/rasters/precipitation_2070.tif',
    'ppt_2100' : '/projects/oregon/rasters/precipitation_2100.tif',
}

# the SRS of the rasters
TIF_EPSG = 900913

# the NODATA value used in the TIFFs
# yeah, all of them must have the same NODATA value
NODATA = -9999

# the structure for the final collected statistics from all TIFFs
# this is a global so it can be written by multiple threads,
# cuz that what we;'ll be doing to make this thing faster
OUTPUT_STATS = {}

##############################################

import simplejson, sys, os
from osgeo import ogr, osr
import numpy as np
import agoodle
import threading
import time

# cribbed from agoodle.py
# given a WKT which we know is a single, simple polygon,
# pull out the vertex points as a numpy-compatible array, and the bounding box in raster SRS
def points_from_wkt(wkt, from_epsg=900913, to_epsg=TIF_EPSG):
    # unpack the WKT into a geometry
    g = ogr.CreateGeometryFromWkt(wkt)

    # knowing that it's a polygon, get the pixels of the outer ring
    ring = g.GetGeometryRef(0)
    pts = []
    for i in range(ring.GetPointCount()):
        x, y, z = ring.GetPoint(i)
        pts.append((x, y))
    pts = np.array(pts)
    bbox = (pts[:, 0].min(), pts[:, 1].min(), pts[:, 0].max(), pts[:, 1].max())
    return pts, bbox


def get_minmaxmean(tiff,wkt,from_epsg):
    # create the agoodle raster handle, reproject the polygon to match the raster SRS
    raster = agoodle.AGoodle(tiff)

    tif_srs = osr.SpatialReference()
    tif_srs.ImportFromEPSG(TIF_EPSG)

    poly_srs = osr.SpatialReference()
    poly_srs.ImportFromEPSG(from_epsg)

    polygon = ogr.CreateGeometryFromWkt(wkt)
    polygon.AssignSpatialReference(poly_srs)
    polygon.TransformTo(tif_srs)

    # collect the rings to analyze; for a single polygon this is dead simple
    rings_wkt = []
    if wkt[0:7] == 'POLYGON':
        # plain polygon, use the (reprojected) polygon's WKT
        rings_wkt.append( str(polygon) )
    elif wkt[0:12] == 'MULTIPOLYGON':
        # a multipolygon, we iterate over the component Polygons and add them to the list
        for i in range(0,polygon.GetGeometryCount()):
            ring = polygon.GetGeometryRef(i)
            rings_wkt.append( str(ring) )

    # go over the rings, each a simple Polygon geometry in WKT and in the raster's SRS
    # clip to the bbox then to the polygon bounds; cribbed from agoodle.py, a variant of what's done in summarize_stats()
    # for each resulting numpy array, pull the min, max, and mean, excluding any NODATA values,
    # and append to the lists. yes, minimum is not a single minimum but a list of minimums, one per ring;
    # at a later phase we find the mean of means, the maximum of maximums, the minimum of minimums, etc.
    minimum = []
    maximum = []
    average = []
    for ring_wkt in rings_wkt:
        pts, bbox = points_from_wkt(ring_wkt)
        ag = raster.read_array_bbox(bbox)
        number_array = ag.mask_with_poly(pts, copy=True, mask_value=NODATA)

        # filter out the NODATA cells
        # (for a temperature dataset you might also filter out 0s here, as they screw up the means; not done in this version)
        # then bail if we in fact have no match at all (outside the area)
        number_array = number_array[number_array != NODATA]
        if not len(number_array):
            continue

        average.append( number_array.mean() )
        maximum.append( number_array.max() )
        minimum.append( number_array.min() )

    # the minimum, maximum, and average are now lists, of the min/max/mean per ring
    # pull the average of the averages, the maximum of the maximums, etc. and we're done
    minimum = min(minimum)
    maximum = max(maximum)
    average = float(sum(average)) / len(average)
    return { 'min':float(minimum), 'max':float(maximum), 'mean':float(average) }


def process_tiff_in_thread(tiff,key,wkt,poly_epsg):
    # the TIFF and SRS are required for analysis; the key is so we can put the results into the output structure
    these_stats = get_minmaxmean(tiff, wkt, poly_epsg)
    OUTPUT_STATS[key] = these_stats['mean']

    # some profiling
    #print repr([ 'Thread done', tiff, time.strftime("%H:%M:%S") ])



if __name__ == '__main__':
    try:
        epsg = int( sys.argv[1] )
    except IndexError:
        print "Specify the numeric ID (SRID) of the SRS of the geometry"
        print "Use only the number, e.g. 3857 for Web Mercator"
        sys.exit(1)
    try:
        wkt = sys.argv[2]
    except IndexError:
        print "Supply a WKT geometry to compare against the raster."
        print "WKT geometry must be in EPSG:%d" % SRID
        sys.exit(1)

    # some profiling
    #print "Processing started %s" % time.strftime("%H:%M:%S")

    for key,tiff in TIFFS.items():
        run_thread = threading.Thread(target=process_tiff_in_thread,args=(tiff, key, wkt, epsg ))
        #print repr([ 'Thread started', tiff, time.strftime("%H:%M:%S") ])
        run_thread.start()
    # keep looping until the last thread exits
    while threading.activeCount() > 1:
        time.sleep(1)

    # some profiling
    #print "Processing finished %s" % time.strftime("%H:%M:%S")

    # done, encode as JSON and spit it out
    print simplejson.dumps(OUTPUT_STATS, True)

Discussion


That command-line program accepts an SRID and a geometry, and will reproject the stated geometry into the SRS of the raster. This is different from my previous posting, which did no reprojection. It will handle both simple polygons and multipolygons, accumulating a list of stats for each ring and then taking the maximum of maximums, minimum of minimums, etc. (does anyone say "maxima" as the plural of maximum?) And the output is again a JSON structure.

Because there are 12 rasters to process and each one takes 1-2 seconds, I made it multithreaded. All 12 rasters will be processed in parallel, so the total time is closer to 4 seconds than to the 18 that would result if they were done sequentially.

Because of the multithreading, OUTPUT_STATS is kept in the global scope, so all threads may write to it.


Thursday, August 22, 2013

Raster analysis in a web application, using agoodle

It's been over 2 months since my last post. Too busy working to stop and write about work. But that sure beats the alternatives, yes?

Today's post is one of two on the topic of doing raster analysis in a web application. Specifically, the need is as follows:
Given some back-end raster such as NLCD land cover codes, and given a polygon such as a custom drawing or a county boundary, fetch statistics from that raster such as the acres of land fitting each land cover code.

agoodle


agoodle is a Python library written by Brent Pedersen (https://github.com/brentp/agoodle). It is based on the GDAL binding for Python and on numpy.

GDAL - The famous library for accessing raster datasources in a variety of formats. Its Python binding allows you to open raster datasets in Python, then do operations on them such as reprojection and querying by bounding box.

NumPy - Numeric Python is a Python library for handling multidimensional arrays of numbers. A raster, being a two-dimensional array of numeric values, seems made for this sort of thing.

agoodle - As Brent says, "Agoodle: GDAL + numpy". A single Python file, providing some useful interfaces for loading a GDAL-accessible raster into numpy, then querying it, e.g. by finding pixels which intersect a polygon.

Installation of agoodle is self-explanatory. Really, it's one .py file, so if you don't want to make a system-wide change you can simply take the one agoodle.py file.


A feature and a limitation


agoodle has a single function that comes in quite handy in this case: summarize_wkt(). This function accepts a geometry in WKT format, which must be in the same SRS as the raster being processed. It returns a dictionary (associative array) of pixel values onto pixel counts. For the NLCD task, this is almost exactly what we want, right off the bat.

There is one shortcoming of summarize_wkt(), though: it only accepts a single POLYGON geometry, not multipolygons. No donuts either. Practically every shape we deal with is a multipolygon, so I had to come up with a workaround.


Second limitation


Agoodle is memory-intensive, and it has a built-in limitation to keep from consuming too much RAM and potentially tanking the server. Look in agoodle.py for the mask_with_poly() function: the assert statement there is the limit; if there are too many pixels, it will fail and terminate.

For ourselves, we raised the assertion to 10,000,000 pixels. Under some conditions, it will consume up to 6 GB of RAM for a few moments as it processes.

So by default, agoodle only works on smaller queries. It can be overridden, but you may want to invest in extra RAM.


A CLI program to print a histogram


The following command-line program accepts a geometry in WKT format, and prints out a JSON-encoded associative array of pixel values and the count of each. This is called from PHP using the shell_exec() function.

#!/usr/bin/env python
"""
A command-line tool to analyze the raster clipped to a geometry,
and print out a JSON-encoded associative array of pixel=>count

- The submitted geometry MUST be in the same SRS as the target raster. See those definitions below.
- The geometry MUST be simplified to only a few hundred vertices, or else you risk tanking the server by consuming 6 GB of RAM.

# Newberg Oregon, in the stated SRS EPSG:26910 (UTM zone 10, NAD83 datum)
python histogram_groundcover.py "MULTIPOLYGON(((502664 5019820,502895 5019229,504881 5018580,504883 5017770,505596 5017772,505165 5017463,505352 5016998,505830 5016998,505830 5016074,505461 5016065,505460 5015669,504591 5015666,504665 5016153,504352 5016157,504349 5014760,503990 5014719,503990 5014246,503901 5014697,503011 5014503,502600 5014722,501617 5014157,501164 5014523,501661 5014877,501204 5015506,501365 5015770,500791 5016133,501262 5016711,500838 5016596,501159 5016982,501019 5018184,501266 5018179,501010 5018557,501268 5018554,501269 5019571,502664 5019820)))"

# Barlow Oregon, in the stated SRS
python histogram_groundcover.py "POLYGON((521879 5011246,522182 5011140,521751 5010750,521751 5010966,521358 5010966,521879 5011246))"
"""

# the raster file
RASTERFILE = "/projects/oregon/rasters/nlcd2005.tif"

# the SRS of the raster

# EPSG:26910 (UTM zone 10, NAD83 datum)
SRID = 26910

##########################################################################################

import sys
from osgeo import ogr, osr
import agoodle, numpy
import simplejson

def process_raster_wkt(rasterfile,wkt):
    raster = agoodle.AGoodle(rasterfile)

    srs = osr.SpatialReference()
    srs.ImportFromEPSG(SRID)

    polygon = ogr.CreateGeometryFromWkt(wkt)
    polygon.AssignSpatialReference(srs)

    histogram = {}

    # if it's already a simple Polygon, we can analyze it as-is
    # this being the only ring-histogram, we can update the total histogram as-is
    if wkt[0:7] == 'POLYGON':
        ringhist = raster.summarize_wkt(wkt)
        for pixel,count in ringhist.items():
            histogram[ str(pixel) ] = count + histogram.get(str(pixel),0)
    # a multipolygon, we iterate over the component Polygons and add them together
    elif wkt[0:12] == 'MULTIPOLYGON':
        for i in range(0,polygon.GetGeometryCount()):
            # go over this ring's vertices...
            # if the input geometry is a simple Polygon, then the ring is in fact a LinearRing already
            # if it's a MULTIpolygon then the ring is really a Polygon, and we dive a level deeper to get at the LinearRing
            ring = polygon.GetGeometryRef(i)
            ringhist = raster.summarize_wkt( str(ring) )
            for pixel,count in ringhist.items():
                histogram[ str(pixel) ] = count + histogram.get(str(pixel),0)
    else:
        print "Unsupported geometry given."
        print "Only POLYGON and MULTIPOLYGON are supported."
        sys.exit(2)

    # cast all of the counts to integers; for some reason agoodle returns the counts as floats
    for key in histogram.keys():
        histogram[key] = int( histogram[key] )

    # done, hand it back
    return histogram


if __name__ == "__main__":
    try:
        wkt = sys.argv[1]
    except IndexError:
        print "Supply a WKT geometry to compare against the raster."
        print "WKT geometry must be in EPSG:%d" % SRID
        sys.exit(1)

    # calculate the pixel counts, and spit out as JSON
    histo = process_raster_wkt(RASTERFILE,wkt)
    print simplejson.dumps(histo)

A few caveats:
  • It does not attempt to reproject your geometry from/to any particular SRS, it is presumed to be in the correct SRS already. For us, this is rarely a real issue, as we can bounce it off of PostGIS if necessary, or take a copy of the raster in the necessary SRS.
  • With very large areas, the memory consumption may be excessive (see above). This solution may not work for you at all if you need to do large regions (large numbers of pixels).
  • It really helps to simplify the polygon before using agoodle on it. The calculation of vertices does take some time, and simplification makes it a lot faster.


Putting it together: Querying to find acreage of land covers

The web application is written in JavaScript and HTML, and server-side components use PostGIS and PHP.
  • When the user picks an area from the list, they are given the (simplified) geometry of the area in WKT format, which is used client-side for highlighting.
  • The user can also draw an area, which has much the same result: an OpenLayers.Feature.Vector exists in the browser.
  • When the user wants to get statistics, this geometry is passed to the server via a POST request.
  • That POST endpoint accepts the geometry, and passes it to PostGIS for two things: 1) a simplified version to 500-meter resolution using ST_SimplifyPreserveTopology() and 2) the ST_Area() of the geometry in local UTM.
  • The POST endpoint uses shell_exec() to call this CLI program, gets back JSON, and parses it. This is simply $stats = json_decode( shell_exec("python histogram.py " . escapeshellarg($geom)), true )
  • The POST endpoint can calculate the total number of pixels by adding up array_values($stats) Calculating acres is easy: the portion of total pixels fitting each pixel value, is the proportion of the total acres calculated earlier. e.g. 1,500,000 pixels total and landcover code 11 is 300,000 pixels, means that 0.20 (20%) of the total acreage is landcover code 11.
An alternative approach to calculating the acres is to know the resolution of the pixels, e.g. 30 meter squares. Given that, you can instead multiply the number of pixels by 30*30 to get square meters, then by the ratio of square meters to acres (0.000247105).
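For completeness, here's a quick sketch of both calculations in PHP. The variable names are made up for illustration: $stats is the decoded histogram from the CLI program, and $total_acres is what you derived from the ST_Area() call.
// method 1: proportional -- this land cover's share of the pixels, times the polygon's total acreage
$total_pixels = array_sum(array_values($stats));
$acres_code11 = $total_acres * ($stats['11'] / $total_pixels);

// method 2: known resolution -- pixel count, times 30x30 square meters apiece, times square-meters-to-acres
$acres_code11 = $stats['11'] * 30 * 30 * 0.000247105;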

Thursday, June 6, 2013

Charting with D3

After months of wishing we had a good project to try out the D3 charting library, it has finally happened. We do a few charts now and then, but now a really big charting project has come up. At this point it's nothing fancy, just some bar charts and pie charts, but it's an opportunity.

For the Impatient


If you're impatient, you can skip straight to my Github repo where I have posted some wrapper functions to make D3 easier to work with for our simple use cases.

https://github.com/gregallensworth/Charts_D3

The motivation behind these wrappers was that I found myself copying and pasting the same 25 lines of code and changing little more than the data and the name of a field. By the time I created a technique for adding tooltips, each function for laying down a pie chart was some 75 lines of code. In our case, every pie chart and every bar chart is basically identical in its needs, and a wrapper made a lot of sense.


D3 is Raw


D3 is tough to work with, in that it's not really a charting library but more of an SVG toolkit. With a charting library, you declare a list of values for the X and Y axes, pick a color, and get a chart. With D3 it's dozens of lines of defining radii, figuring out how much space to trim off, and positioning to the center of the SVG canvas. It's very much a raw nuts-n-bolts experience, which I normally like... but when all I want is a ready-to-draw chart, not so much.

Being a very raw library, you have to write a lot of the math yourself. The placement of labels on pie charts, the height of the rectangles that create the illusion of a bar chart... most of that is written with D3's help, but it's hardly a "here's data, here are colors, draw a chart" sort of function.

Thus, the wrappers.


Tooltips


The existing samples lacked tooltips, so I had to invent those using the mousemove and mouseout events. I cheated and did them in jQuery, instead of learning enough SVG to emulate the same effect in "more pure" D3 & SVG.

We use jQuery in practically everything, so this wasn't a bad way to go.


IE8 and below


D3 does not run in IE8. But R2D3 does. This alternate version of D3 works in IE8, and with some conditionals you can have modern browsers run D3 proper, and have IE8 use the compatible version.

So far we haven't hit a case where R2D3 fails to do what D3 does. I'm sure it's not perfect, but at this point we haven't found an imperfection. Neat, eh?


Printing / Saving to PDF


Being a purely client-side phenomenon, drawing charts in D3 makes it difficult to link directly to the chart graphic, or to embed the graphic into a PDF.

For our "Save as PDF" needs we largely use WKHtmlToPDF, which reads a HTML document and renders it as a PDF via Qt and the Webkit engine. It supports JavaScript too, so we largely write the PDF template in HTML and JavaScript -- which can include D3.


Conclusion


An interesting exercise. D3 is definitely not a ready-to-run chart library, but it has a lot of promise if you have significant development time to write code against SVG. Now that I've made the initial investment in the wrappers, future pies and bars will be a lot easier.

Friday, May 24, 2013

These last two days I've been at FOSS4G NA. As expected it was educational and fun. A few highlights.

LEAFLET - This has been a long time coming: Leaflet Rocks. Some 8 months ago my first posts were about Leaflet's shortcomings in a very complex situation, but Leaflet 0.5 and less-complex apps since then turned the tables entirely. I'm eating my words, and loving it.

OPENLAYERS 3 - Looks exciting. The API doesn't look too gnarly, it's still a slippy map, and the rendering speeds look impressive. It was just a preview, but it's promising.

CESIUM - A modern replacement for the Google Earth Plugin, which even links to Google Chrome Frame so it will work with IE. I gotta take a look at this one; I'd love to see more "spinny maps".

JAVASCRIPT FRAMEWORKS - Travis Webb of NBT gave a talk about a JavaScript framework called Enyo. He compared jQuery to Perl, with which I agree: quick, effective, spaghetti. Enyo looks too far in the other direction: write the app entirely in JavaScript structures and LESS, compile it into HTML and CSS. Still, it warrants some investigation, and perhaps *a little* more structure in our client-side JavaScript apps wouldn't be a bad thing. Maybe Backbone.js would give us the best middle ground of *allowing* structure while still *allowing* us to break the rules when we need to. (Which is what I adore about CodeIgniter: it's the best of both.)

GEOTXT - Frank Hardisty at Penn State turned us on to GeoTxt, a service which parses lengthy narrative text, finds named locations, and geocodes them. There's a REST API returning a JSON structure. We have clients who would find this immediately useful.

My own lightning talk, about the hundreds of pieces in the Cleveland Metroparks app, was comparatively dry and technical. I thought that's what they wanted, but now I know better and my next LT will be less tech, fewer pieces, more animation.

Also, it's just great to be in Minnesota again. I was here for OSG 2005 over at UMN, and it's still a beautiful city with great public transit.

Thank you, FOSS4G NA organizers and presenters. Great time.

Wednesday, March 13, 2013

CodeIgniter sessions: timeouts, AJAX, and Remember Me

I love CodeIgniter. It's among the top 3 most popular PHP frameworks, though the order of those 3 is a matter of opinion and taste.

Back in February I finally got to research and work around an issue that had been driving me nuts. I have an app which requires login and uses CI sessions. On one particular page, after a while my AJAX calls fail because I am no longer logged in. Reload the page: yep, I'm logged out.

Turns out, this is a known issue, going way back:
CodeIgniter periodically regenerates the session ID to prevent session ID hijacking (session fixation attacks). At some point my session ID changes, and my session effectively expires because the AJAX requests aren't setting the new session ID in my browser.

https://github.com/EllisLab/CodeIgniter/issues/154

This goes beyond AJAX, though. If you want the session to stay open over time, e.g. Keep Me Logged In functionality, you will also see this problem: when you open the site tomorrow, your session ID is out of date and you're not kept logged in at all.


Fortunately, the fix for this is not difficult.

Session Expiration & Session ID Regeneration


Open up config.php and look for these two session variables.
The sess_expiration setting defines how long a session is good (idle time since the last hit).
The sess_time_to_update setting defines how often your session ID is regenerated.
$config['sess_expiration'] = 86400 * 30; // 30 days
$config['sess_time_to_update'] = 300; // 5 minutes
Our problem is the session ID regeneration. We can effectively disable this by making sess_time_to_update very, very long:

// sessions are good for 30 days, and IDs will change every year
$config['sess_time_to_update']    = 86400 * 365;




Problem solved. I only need to log in after 30 days of idle time, and even AJAX-heavy pages no longer log me out unexpectedly.

Consequence: Session Fixation

Disabling session ID regeneration does have a consequence for security, though. Now that your session ID doesn't change, it's hypothetically possible for someone to snag your session ID and use it indefinitely (well, for a year if you set it as above).

This can be mitigated by a few other arrangements:

Shorten the session expiration and regeneration time. Do they really need a session to be good for 30 days? Would 7 days suffice? How about 3 days? The shorter it is, the less time you give a sniffer to try hijacking your session before finding it already expired.

Encrypt the cookies! Set sess_encrypt_cookie to TRUE and set a random encryption_key, so the session cookie can't be read or tampered with even if it is intercepted. Do a Google search for "codeigniter encryption key generator" and take your pick.


Enable user agent matching. The sess_match_useragent setting adds a check that you're using the same browser as when the session was opened. User agent strings can be faked, but this does increase the "cost" of brute forcing or hijacking sessions, as the attacker must also guess the right user agent string.

Enable IP address matching. The sess_match_ip setting enables a check that your IP address hasn't changed since your session was last used. This one may not be appropriate if your ISP routes traffic through changing proxy servers (satellite Internet, America Online), and most of us have dynamic IP addresses that change every few days. But if your use case is access from a fixed IP address such as an office intranet, or you can tolerate logging back in after you reset your home router, this can provide a significant layer of security.

Use SSL. This isn't always possible, but when it is, the session cookie can't be sniffed in transit. Combine it with cookie encryption and you have two layers of protection.
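Putting several of these together, here's a sketch of what the relevant settings might look like in application/config/config.php. The values shown are examples to illustrate the ideas above, not recommendations for every app:

// sketch of a hardened config.php; tune the values to your own tolerance
$config['sess_expiration']      = 86400 * 7;   // shorter window: 7 days idle instead of 30
$config['sess_time_to_update']  = 86400 * 365; // effectively never regenerate the session ID
$config['sess_encrypt_cookie']  = TRUE;        // encrypt the session cookie
$config['encryption_key']       = 'PUT-A-LONG-RANDOM-KEY-HERE';
$config['sess_match_useragent'] = TRUE;        // session must present the same user agent string
$config['sess_match_ip']        = FALSE;       // enable only for fixed-IP situations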

Conclusion

This solution to session timeouts works fine, but opens up a security consideration. Fortunately, with cookie encryption, this security consideration doesn't need to be a security problem and your users can Keep Me Logged In without having their sessions hijacked.

Monday, March 11, 2013

MobileMapStarter update: Tile caching and Chrome compatibility

Yesterday I put the finishing touches onto a major revision of MobileMapStarter.

- The previous version used imgcache.js to passively cache map tiles as you browse. The new version uses a new system, allowing you to download a specific area for offline use. No more panning and zooming repeatedly -- just zoom out, center it, and hit Download. It even has a nice status thingy, and skips over tiles that you already have.

- I spent a lot of time getting the caching to work in Chrome too, so you should be able to prototype your app on the desktop.

- There were also some other minor things: typos, minor adjustments, and changing out the Streets basemap for the Mapbox Satellite basemap.

Go git it!
https://github.com/gregallensworth/MobileMapStarter/





Thursday, March 7, 2013

Automating CartoDB and Google Fusion Spreadsheets on GoDaddy

A whole month without a posting. I've been busy on some fun stuff here. Here's a peek.

A client wanted something simple. They have a registration form which logs submissions about trees to a Google Fusion spreadsheet. They want a choropleth map of Census Block Groups (CBGs), indicating how many trees have been registered in each. But it's not that simple: they don't want to pay for geospatial hosting, they want to stick with their $8 GoDaddy hosting.


So, our basic needs are:
  1. Update the Google spreadsheet of trees, populating the lat and lon fields for any which don't have them.
  2. Have a PostGIS table of CBGs someplace for free.
  3. Perform a spatial join between the CBG polygons and the tree points.
  4. Render it as a Leaflet map, using only client-side tech, knowing that the CBGs are too heavy to be sent as raw vectors and rendered as an L.Polygon layer.
Our resources are:
  1. The Google Spreadsheet.
  2. A website at GoDaddy which supports PHP, but not many modules.
  3. Plenty of disk space.
  4. A free account at CartoDB.
It all worked out beautifully: client-side-only mapping, and PHP with no special configuration, suitable for GoDaddy's low-priced web hosting. No server-side mapping infrastructure!

An annotated version of the app's source is available here:
http://tilelab.greeninfo.org/~gregor/cartodb_google/


Step 1: GoogleFusionTable.php


Step 1 is comparatively easy. Google Docs already has a web API for doing SQL-like queries to spreadsheets, and a PHP library already exists for this.

https://github.com/gregallensworth/CloudServices/blob/master/GoogleFusionTable.php

This one was simple: Google's geocoder service, the GFT PHP library, and a few dozen lines of code. Now we just visit the URL and all non-geocoded registrations have their Lat and Lon fields updated.
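For a flavor of it, here's a minimal sketch of the geocoding half, assuming Google's HTTP geocoder. The Address/Lat/Lon field names are hypothetical, and reading and updating the rows goes through the GoogleFusionTable.php wrapper, which isn't shown here:

// geocode a single address via Google's HTTP geocoder (sketch only)
function geocode_address($address) {
    $url  = 'http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=' . urlencode($address);
    $json = json_decode(file_get_contents($url), true);
    if (!$json or empty($json['results'])) return null;
    $loc = $json['results'][0]['geometry']['location'];
    return array('lat' => $loc['lat'], 'lng' => $loc['lng']);
}

// for each row missing Lat/Lon:  $point = geocode_address($row['Address']);
// then write $point['lat'] and $point['lng'] back via the GFT wrapper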



Step 2: Uploading Google to CartoDB


We had been itching to try out CartoDB's offerings, and this was a great opportunity.

CartoDB is a website which offers an easy-to-use interface to PostGIS tables. You upload a spreadsheet or a shapefile, they grok it and load it, and you get a nice table UI and a panel for entering SQL. A step further, they have a web API for performing these SQL queries, so you can get back a JSON object full of rows by submitting a query via GET or POST.

The heart of accessing it via PHP boils down to this wrapper function I whipped up:
https://github.com/gregallensworth/CloudServices/blob/master/CartoDB.php

And the implementation is just a tie-together of Google's SQL API for Fusion Docs, and CartoDB's SQL API to your table:
https://github.com/gregallensworth/CloudServices/blob/master/google2cartodb.php

Remember: Each query really is a hit to CartoDB's web service. If you have 12,000 rows to insert, you really will make 12,000 hits to CartoDB. Be kind and thoughtful. Use a common key between your two tables, so you don't need to re-insert every single record if it's already there. And consider doing the initial load via CartoDB's dashboard.
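If you're curious what a single call looks like without the wrapper, here's a rough sketch against the v2 SQL endpoint as it existed at the time; YOUR_ACCOUNT and YOUR_API_KEY are placeholders:

// sketch of one query against CartoDB's SQL API; rows come back as JSON
function cartodb_query($sql) {
    $url = sprintf('https://YOUR_ACCOUNT.cartodb.com/api/v2/sql?api_key=%s&q=%s',
                   'YOUR_API_KEY', urlencode($sql));
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    $reply = json_decode(curl_exec($curl), true);
    curl_close($curl);
    return isset($reply['rows']) ? $reply['rows'] : null;
}

// e.g. $rows = cartodb_query("SELECT cartodb_id, treecount FROM census_block_groups_treecounts");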


Step 3: Spatial join via CartoDB's web SQL API


Pretty simple here, now that you've already seen the CartoDB web SQL API.
Just run these two queries:

DELETE FROM census_block_groups_treecounts;

INSERT INTO census_block_groups_treecounts (
    the_geom,
    treecount
)
SELECT
    blocks.the_geom,
    SUM(treeshere) AS treecount
FROM census_block_groups AS blocks
JOIN registrations AS trees
ON ST_Contains(blocks.the_geom, trees.the_geom)
GROUP BY blocks.cartodb_id;


Step 4: Client side map... CartoDB again


Further still, CartoDB created a JavaScript library which bundles Leaflet, their web SQL API, mouse behaviors, CartoCSS parsing, and a dozen other great things. So getting your data back out of CartoDB and into Leaflet requires very little code.

http://developers.cartodb.com/documentation/cartodb-js.html

http://tilelab.greeninfo.org/~gregor/cartodb_google/index.js

Not a lot to say here. The docs are less than rich in real-use-case examples, and CartoDB.js brings together a lot of moving parts. But once you figure it out, it's really something quite nice.

BONUS: CartoDB layers can have their CartoCSS and their SQL changed at runtime. This app doesn't make use of that, but in theory filtering features or changing color schemes has never been easier.


Step 5: Nightly Automation


GoDaddy has a facility for setting up cronjobs, right there in the control panel. At 1am it runs the geocode PHP. At 2am it runs the google2cartodb PHP. At 3am it runs the spatial join. Pretty groovy.


Trouble and Surprises


Not a lot in the way of trouble and surprises.

GoDaddy's PHP has an execution time limit. This seems to be 5 minutes, and ini_set() and set_time_limit() don't seem to change it. As such, the steps really are handled as separate PHPs instead of one monolithic script to geocode, upload, and join. This is probably for the best anyway, as it's easier to debug the individual parts this way.

CartoDB's free account limits you to 5 MB of storage. In the case of this client, their registrations came to 1 MB even after we stripped them down to only the Lat & Lon, and they are expecting plenty more registrations. The set of CBGs comes to 2 MB for the SF Bay Area, and that's after I stripped out areas unlikely to get registrations, then simplified, and clipped to fit land. The joined layer displayed on the map will vary in size based on the number of matching CBGs; there's a maximum of 2 MB here too, if some day every single CBG were to have at least one registered tree. As such, hitting the 5 MB limit seems very likely as they get more registrations, and they may end up upgrading anyway.

CartoDB's web SQL API cannot create tables, only populate them. I had to upload a spreadsheet of the first few registrations, for example, and had to manually run the spatial join SQL in CartoDB's dashboard. After the table structure was there, a DELETE FROM followed by later INSERTs worked just perfectly.

CartoDB.js involves a lot of parts, and the docs are good but not always great. Though the final map involves fairly little code, figuring out exactly which code took some time. I had to read up on CartoCSS a bit, and had to read a few times to figure out that CartoDB's API won't return the_geom, but that you must instead query for "ST_ASGEOJSON(the_geom) AS json_geom" to get it, e.g. if you want to highlight features.


Conclusion


A client wanted a geospatial app with some moderate back-end complexity, but without paying for a geo infrastructure. Using a Google-hosted contact "database", CartoDB's free tier plan, and a pinch of PHP to tie them together, it all came together beautifully.

In the meantime, I learned a lot about CartoDB and really admire what they've come up with. The $30 price tier is a bit steep for intermediate use cases: this client may grow to 15 MB of usage, over the free tier but well under the need for the Magellan tier. A smaller paid plan in between would be really excellent.

Friday, February 8, 2013

Shrinking Well Known Text (WKT) Geometries By Rounding Decimals


It's been a good month, full of fun stuff, but nothing that REALLY seemed blogworthy. But today, someone suggested an idea.

Background: When communicating a geometry from the server to the browser, we often use Well Known Text (WKT). They're not overly verbose and they're simple and fast to generate.

Issue: WKT geometries can still be quite large. Simplifying is a great step if you can afford the introduced inaccuracy, but still, can we make them smaller?

Solution: Round the decimals.

If you are using an SRS with meters or feet as units, then you can probably afford for your geometries to use whole numbers instead of decimals. After all, what's the rounding error? A difference of at most 0.999 meters isn't much, considering the accuracy of handheld or automotive GPS units.

Check out this sample WKT, the first 6 vertices of the polygon of Salinas County, California.

-- ST_ASTEXT()
 POLYGON((-13578594.747402 4380792.43128354,-13578592.2992804 4380792.29270892,-13578569.3701029 4380825.00366193,-13578580.720485 4380880.31712633,-13578605.4359713 4380926.761986
-- ST_ASTEXT_ROUNDED()
 POLYGON((-13578594 4380792,-13578592 4380792,-13578569 4380825,-13578580 4380880,-13578605 4380926,-13578607 4380932

Overall, the reduction is just under 50%, from 598 KB to 309 KB. This is cumulative with ST_SIMPLIFY(), too. If you can afford to cut corners (a little geospatial pun there, ha ha), you can simplify and then round to shave roughly 50% off even the simplified payload.


Neat. How do I do it?


I created a handy function, ST_ASTEXT_ROUNDED(), which simply complements ST_ASTEXT(). It works with both PostGIS 1 and PostGIS 2.

Go git it!
https://github.com/gregallensworth/PostGIS

This is really a function wrapper around a super simple regular expression which trims off the decimals. It doesn't round the numbers, it crops them. But then again, with a maximum error of 0.999 meters, does it really matter?
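If you'd rather do the cropping after the WKT reaches PHP, the same regular-expression idea is a one-liner. This is just an illustration of the technique, not the code from the repo:

// crop (not round) the decimals off every coordinate in a WKT string
function wkt_crop_decimals($wkt) {
    return preg_replace('/(\d+)\.\d+/', '$1', $wkt);
}
// POLYGON((-13578594.747402 4380792.43128354 ...  becomes  POLYGON((-13578594 4380792 ...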


When Not To Do This


A key point of this rounding trick is the assumption that rounding the numbers makes no real-world difference in the accuracy of your data. For feet or meters, the difference between 4380792.43128354 and 4380792 is less than an arm's length.

If you're using geographic coordinates (WGS84, lat & lon) then it's different. A degree is roughly 69 miles (111 km) at the equator, so rounding from -122.3 to -122 is a difference of anywhere from 10 to 20 miles depending on your latitude. Don't do that.

Saturday, January 12, 2013

Cordova File API , when getDirectory() and DirectoryReader do NOTHING!


Naturally, within a day of my first successes with PhoneGap Build, I am already trying some decidedly non-trivial stuff. I want map tiles off the Internet via HTTP, but I want them cached so the tiles will still be transparently available when I lose connectivity. My plan is to use Cordova's File storage API.

I almost immediately hit upon an issue:

The getDirectory() method does NOTHING. Neither the success nor failure callback happen. No exception is raised. This happens in the app as it loads, and also in the weinre Console. It doesn't crash the browser, hang the iPad, create a directory, ... Nothing.

Figuring this out wasted 2 hours of my time, and in retrospect the cause seems almost trivial, almost expected, although short of obvious:

The File API's getDirectory() method cannot recursively create directories; it will not create the required parent directories for you. It returns instantly but fails silently, with no callback and no exception.

// the set of callbacks for the two phases: requesting a handle to the filesystem,
// and requesting access to (or creation of) the specified target directory
var getFSsuccess = function(fs) {
    console.debug('Got fshandle');
    FS = fs;
    FS.root.getDirectory(tiledir, {create:true,exclusive:false}, getDIRsuccess, getDIRfail);
};
var getFSfail = function () {
    throw new Error('Could not open filesystem');
};
var getDIRsuccess = function (dir) {
    console.debug('Got dirhandle');
    cachedir = dir;
    fileurl  = FS.root.fullPath + '/' + tiledir;   // FS is the filesystem handle saved above
};
var getDIRfail = function () {
    throw new Error('Could not open directory ' + tiledir);
};


// WORKS
// getFSsuccess will be called,
// will call FS.root.getDirectory with tiles-greeninfo.terrain
// and create the subfolder as expected
var FS, cachedir, fileurl;
var tiledir = "tiles-greeninfo.terrain";
window.requestFileSystem(LocalFileSystem.PERSISTENT, 0, getFSsuccess, getFSfail);

// FAILS
// This will not create tiles/ and then tiles/greeninfo.terrain,
// since it's a subfolder and getDirectory doesn't do parent creation.
// The catch is that it's a SILENT failure:
// the getDIRfail callback will never happen, and no exception will be raised
var FS, cachedir, fileurl;
var tiledir = "tiles/greeninfo.terrain";
window.requestFileSystem(LocalFileSystem.PERSISTENT, 0, getFSsuccess, getFSfail);


If you see this blog posting in Google, and it saves you some time, let me know. I'll be glad to know that someone benefited from my wasted evening. :)

Monday, January 7, 2013

MobileMapStarter


A trivial mobile mapping app, tying together the best of Leaflet, jQuery Mobile, and Phonegap/Cordova. This will form the framework for our future generations of mobile apps at GreenInfo Network.

https://github.com/gregallensworth/MobileMapStarter

For the narrative, I'll simply quote from the README file:

A starting framework for mobile maps using Cordova/Phonegap. A minimal but functional, standalone mobile app from which to build your own creations.
This app is designed for Phonegap/Cordova, therefore it is HTML, JavaScript, and CSS.
Components of this app:
  • HTML/CSS/JS layout -- The app is ready to compile and run via Phonegap.
  • config.xml -- The app is ready to upload to Phonegap Build. The included config.xml specifies permissions, icons and splash screens, and more in an easy-to-edit template.
  • jQuery Mobile -- A mobile-style user interface theme. Includes jQuery which makes JavaScript useful.
  • Leaflet -- Quick, pretty, easy tiled maps.
  • imgcache.js -- Cache Leaflet tiles to device storage, for offline use.