Thursday, August 22, 2013

Raster analysis in a web application, using agoodle

It's been over 2 months since my last post. Too busy working, to stop and write about work. But that sure beats the alternatives, yes?

Today's post, is one of two on the topic of doing raster analysis in a web application. Specifically, the need is as follows:
Given some back-end raster such as NLCD land cover codes, and given a polygon such as a custom drawing or a county boundary, fetch statistics from that raster such as the acres of land fitting each land cover code.

agoodle


agoodle is a Python library, written by Brent Pedersen (https://github.com/brentp/agoodle) It is based on the GDAL binding for Python and on numpy.

GDAL - The famous library for accessing raster datasources in a variety of formats. Its Python binding allows you to open raster datasets in Python, then do operations on them such as reprojection and querying by bounding box.

NumPY - Numeric Python is a Python library for handling multidimensional arrays of numbers. A raster being a two-dimensional array of numeric values, seems made for this sort of thing.

agoodle - As Brent says "Agoodle: GDAL + numpy" A single Python file, providing some useful interfaces for loading a GDAL-accessible raster into numpy, then querying it, e.g. by finding pixels which intersect a polygon.

Installation of agoodle is self-explanatory. Really, it's one .py file, so if don't want to make a system-wide change you can simply take the one agoodle.py file.


A feature and a limitation


agoodle has a single function that comes in quite handy in this case: summarize_wkt()  This function accepts a geometry in WKT format, which must be in the same SRS as the raster being processed. It returns a dictionary (associative array) of pixel values onto pixel counts. For the NLCD task, this is almost exactly what we want, right off the bat.

There is one shortcoming of summarize_wkt() though: it only accepts a single POLYGON geometry, not multipolygons. No donuts either. Practically every shape we deal with is a multipolygon,so I had to come up with a workaround.


Second limitation


Agoodle is memory-intensive, and it has a built-in limitation to keep from consuming too much RAM and potentially tanking the server. Look in agoodle.py, for the mask_with_poly() function. That assert statement is the limit: if there are too many pixels, it will fail and terminate.

For ourselves, we raised the assertion to 10,000,000 pixels. Under some conditions, it will consume up to 6 GB of RAM for a few moments as it processes.

So by default, agoodle only works on smaller queries. It can be overridden, but you may want to invest in extra RAM.


A CLI program to print a histogram


The following command-line program accepts a geometry in WKT format, and prints out a JSON-encoded associative array of pixel values and the count of each. This is called from PHP using the shell_exec() operator.

#!/usr/bin/env python
"""
A command-line tool to analyze the raster clipped to a geometry

and print out a JSON-encoded associative array of pixel=>count
 

- The submitted geometry MUST be in the same SRS as the target raster. See those definitions below.
- The geometry MUST be simplified to only a few hundred vertices, or else you risk tanking the server by consuming 6 GB of RAM.

# Newberg Oregon, in the stated SRSEPSG:26910 (UTM zone 10 WGS84 datum)
python histogram_groundcover.py "MULTIPOLYGON(((502664 5019820,502895 5019229,504881 5018580,504883 5017770,505596 5017772,505165 5017463,505352 5016998,505830 5016998,505830 5016074,505461 5016065,505460 5015669,504591 5015666,504665 5016153,504352 5016157,504349 5014760,503990 5014719,503990 5014246,503901 5014697,503011 5014503,502600 5014722,501617 5014157,501164 5014523,501661 5014877,501204 5015506,501365 5015770,500791 5016133,501262 5016711,500838 5016596,501159 5016982,501019 5018184,501266 5018179,501010 5018557,501268 5018554,501269 5019571,502664 5019820)))"

# Barlow Oregon, in the stated SRS
python histogram_groundcover.py "POLYGON((521879 5011246,522182 5011140,521751 5010750,521751 5010966,521358 5010966,521879 5011246))"
"""

# the raster file
RASTERFILE = "/projects/oregon/rasters/nlcd2005.tif"

# the SRS of the raster

# EPSG:26910 (UTM zone 10 WGS84 datum)
SRID = 26910

##########################################################################################

import sys
from osgeo import ogr, osr
import agoodle, numpy
import simplejson

def process_raster_wkt(rasterfile,wkt):
    raster = agoodle.AGoodle(rasterfile)

    srs = osr.SpatialReference()
    srs.ImportFromEPSG(SRID)

    polygon = ogr.CreateGeometryFromWkt(wkt)
    polygon.AssignSpatialReference(srs)

    histogram = {}

    # if it's already a simple Polygon, we can analyze it as-is
    # this being the only ring-histogram, we can update the total histogram as-is
    if wkt[0:7] == 'POLYGON':
        ringhist = raster.summarize_wkt(wkt)
        for pixel,count in ringhist.items():
            histogram[ str(pixel) ] = count + histogram.get(pixel,0)
    # a multipolygon, we iterate over the component Polygons and add them together
    elif wkt[0:12] == 'MULTIPOLYGON':
        for i in range(0,polygon.GetGeometryCount()):
            # go over this ring's vertices...
            # if the input geometry is a simple Polygon, then the ring is in fact a LinearRing already
            # if it's a MULTIpolygon then the ring is really a Polygon, and we dive a level deeper to get at the LinearRing
            ring = polygon.GetGeometryRef(i)
            ringhist = raster.summarize_wkt( str(ring) )
            for pixel,count in ringhist.items():
                histogram[ str(pixel) ] = count + histogram.get(pixel,0)
    else:
        print "Unsupported geometry given."
        print "Only POLYGON and MULTIPOLYGON are supported."
        sys.exit(2)

    # cast all of the counts to integers; for some reason agoodle returns the counts as floats
    for key in histogram.keys():
        histogram[key] = int( histogram[key] )

    # done, hand it back
    return histogram


if __name__ == "__main__":
    try:
        wkt = sys.argv[1]
    except IndexError:
        print "Supply a WKT geometry to compare against the raster."
        print "WKT geometry must be in EPSG:%d" % SRID
        sys.exit(1)

    # calculate the pixel counts, and spit out as JSON
    histo = process_raster_wkt(RASTERFILE,wkt)
    print simplejson.dumps(histo)

A few caveats:
  • It does not attempt to reproject your geometry from/to any particular SRS, it is presumed to be in the correct SRS already. For us, this is rarely a real issue, as we can bounce it off of PostGIS if necessary, or take a copy of the raster in the necessary SRS.
  • With very large areas, the memory consumption may be excessive (see above). This solution may not work for you at all if you need to do large regions (large numbers of pixels).
  • It really helps to simplify the polygon before using agoodle on it. The calculation of vertices does take some time, and simplification makes it a lot faster.


Putting it together: Querying to find acreage of land covers

The web application is written in JavaScript and HTML, and server-side components use PostGIS and PHP.
  • When the user picks an area from the list, they are given the (simplified) geometry of the area in WKT format, which is used client-side for highlighting.
  • The user can also draw an area, which has much the same result: an OpenLayers.Feature.Vector exists in the browser.
  • When the user wants to get statistics, this geometry is passed to the server via a POST request.
  • That POST endpoint accepts the geometry, and passes it to PostGIS for two things: 1) a simplified version to 500-meter resolution using ST_SimplifyPreserveTopology() and 2) the ST_Area() of the geometry in local UTM.
  • The POST endpoint uses shell_exec() to call this CLI program, gets back JSON, and parses it. This is simply $stats = json_decode( shell_exec("python histogram.py $geom") )
  • The POST endpoint can calculate the total number of pixels by adding up array_values($stats) Calculating acres is easy: the portion of total pixels fitting each pixel value, is the proportion of the total acres calculated earlier. e.g. 1,500,000 pixels total and landcover code 11 is 300,000 pixels, means that 0.20 (20%) of the total acreage is landcover code 11.
An alternative approach to calculating the acres, is to know the resolution of the pixels, e.g. 30 meter squares. Given that, you can instead multiply the number of pixels by 30*30 to get square meters, then by a ratio of square meters to acres (0.000247105).