Saturday, September 21, 2013

Raster analysis in a web application, using PostGIS Raster


A few weeks ago, I posted some techniques for using Python and agoodle to analyze raster data: clipping to a polygon, then reporting min/max/mean and pixel counts. Another technique, available as of a few months ago, is PostGIS Raster.

PostGIS 2.0 brings a lot of small changes and improvements, among them support for processing raster datasets. Yes, much of the wonderful analytic goodness of PostGIS... can now be applied to rasters too.


Importing The Data



PostGIS can in fact import the binary content of your raster file into the database, as a set of binary blobs. We don't do this; we keep the raster in the filesystem (the -R flag) so we can download it, run other command-line programs, etc. Our use of raster2pgsql looks like this:


raster2pgsql -s 3174 -t 1000x1000 -I -C -F -R /projects/maine/rasters/landvalue_2005.tif




You can read the help in raster2pgsql to see the details about each flag, but in summary:

  • The table is created with a reference to the raster in the filesystem (-R), and the SRS is declared (-s 3174).
  • The table will have multiple rows, each one being a 1000x1000 tile of the raster (-t), with a spatial index (-I) and the standard raster constraints applied (-C).
  • The column rast refers to the raster content, and a filename column records which file each tile came from (-F).

Note that raster2pgsql only generates SQL on its standard output; pipe that into psql to actually load the table.
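
To sanity-check the import, you can query the tile table itself. Here's a minimal sketch using psycopg2; the connection string is a placeholder, and the table name assumes raster2pgsql's default of naming the table after the file:

import psycopg2

# hypothetical connection details
conn = psycopg2.connect("dbname=gisdata user=gisuser")
cur = conn.cursor()

# each row is one tile; ST_Width()/ST_Height() should report 1000x1000
# (smaller at the edges) and ST_SRID() should report 3174
cur.execute("""
    SELECT rid, ST_Width(rast), ST_Height(rast), ST_SRID(rast)
    FROM landvalue_2005
    LIMIT 5
""")
for rid, width, height, srid in cur.fetchall():
    print rid, width, height, srid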

Querying A Slice of The Data

So, we have a polygon, such as a county boundary, and classified raster data, such as NLCD land cover codes, and we want to find the pixel values and pixel counts within that county.

First, a test of the raster and pixel counting:
SELECT ST_ValueCount(rast,1,false)
AS valuecount FROM nlcd2005;

This performs an ST_ValueCount() on the rast column, without any other sort of filtering, clipping, etc. It returns a set of rows, each one a two-field record: the pixel value and the count of pixels having that value.
 (0,23975242)
 (1,442029)
 (2,265327)
 (3,42985)
Let's do this again, but this time clipping to a polygon:
SELECT ST_ValueCount(
    ST_Clip(rast, 1,
        ST_Transform(ST_GeomFromText('POLYGON((521879 5011246,522182 5011140,521751 5010750,521751 5010966,521358 5010966,521879 5011246))',26910), 3174)
    ),
    1, false
)
AS valuecount FROM nlcd2005;
All this did was replace rast, the first parameter to ST_ValueCount(), with a call to ST_Clip(), which clips a raster to a given geometry and returns the resulting raster. The ST_Transform() call reprojects our polygon from its own SRS (EPSG:26910) into the raster's (EPSG:3174). Simple.
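
In the web application this query runs like any other SQL. Here's a minimal sketch with psycopg2, passing the polygon's WKT and its SRID as query parameters (connection details are again placeholders):

import psycopg2

conn = psycopg2.connect("dbname=gisdata user=gisuser")  # hypothetical
cur = conn.cursor()

wkt = 'POLYGON((521879 5011246,522182 5011140,521751 5010750,521751 5010966,521358 5010966,521879 5011246))'
cur.execute("""
    SELECT ST_ValueCount(
        ST_Clip(rast, 1, ST_Transform(ST_GeomFromText(%s, %s), 3174)),
        1, false
    ) AS valuecount
    FROM nlcd2005
""", (wkt, 26910))

# psycopg2 has no native type for the valuecount record, so each row
# arrives as a string such as "(0,23975242)" -- which the next section deals with
for (valuecount,) in cur.fetchall():
    print valuecount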


Processing The Value Counts

A lot of the raster functions return a set of composite records, and these are a nuisance to work with. You could tease them apart like this:
SELECT (ST_ValueCount(rast,1,false)).value AS pixel, 
    (ST_ValueCount(rast,1,false)).count AS howmany
    FROM nlcd2005;
But this has two problems: a) it runs the value count twice, which takes twice as long; and b) those field names within the record are poorly documented at this time, so it may take some guessing, or reading the source code, to find them.

Eventually I gave up on this and took the easy way out: regular expressions. Given that each record is just a pair of numbers inside parentheses, it's pretty simple to process it within PHP, where we would be handling these row results anyway:
$counts = array();
foreach ($rows as $pxcount) {  // $rows being the result rows fetched from the query above
    $bits = null;
    preg_match('/\((\d+),(\d+)\)/', $pxcount->valuecount, $bits);
    $pixel = $bits[1];
    $count = $bits[2];
    if (!isset($counts[$pixel])) $counts[$pixel] = 0;
    $counts[$pixel] += $count;
}
And now you have an associative array mapping pixel values onto pixel counts. From here, follow your usual techniques to figure out acres, percentages, or whatever you had in mind.
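
If your application happens to be in Python instead of PHP, the same regex trick looks like this; rows here is a hypothetical stand-in for whatever your database layer handed back:

import re

counts = {}
for valuecount in rows:  # hypothetical list of strings like "(0,23975242)"
    bits = re.match(r'\((\d+),(\d+)\)', valuecount)
    pixel = bits.group(1)
    howmany = int(bits.group(2))
    counts[pixel] = counts.get(pixel, 0) + howmany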


Pixel Counts To Acres

The easiest way to find acres is to multiply the pixel count by the known pixel size in meters/feet, then by the conversion factor to get acres. For 30-meter pixels, like this:
$acres_nlcd_11 = $counts['11'] * 30 * 30 * 0.000247105;
If you don't know the pixel size of this specific raster, PostGIS Raster can tell you: ST_PixelWidth() and ST_PixelHeight(). It won't tell you the units, though.
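
You can fold that lookup into the same workflow. A sketch, with placeholder connection details, assuming the raster's linear unit is meters (hence the square-meters-to-acres factor); counts is the pixel-to-count mapping built above:

import psycopg2

conn = psycopg2.connect("dbname=gisdata user=gisuser")  # hypothetical
cur = conn.cursor()

# pixel size in the raster's own linear units; abs() in case the
# height comes back negative (north-up rasters have a negative Y scale)
cur.execute("SELECT ST_PixelWidth(rast), ST_PixelHeight(rast) FROM nlcd2005 LIMIT 1")
width, height = cur.fetchone()

ACRES_PER_SQUARE_METER = 0.000247105
acres_nlcd_11 = counts['11'] * abs(width) * abs(height) * ACRES_PER_SQUARE_METER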


Monday, September 2, 2013

agoodle for finding pixels above/below a threshold

In my previous post, I used agoodle to derive an associative array of pixel values onto pixel counts, and used this to calculate land cover statistics for a polygon. This was using classified data: pixels fall into one of these 15 specific pixel codes, period.

For a different project, we needed to calculate the min/max/mean of the values within a given polygon. This isn't classified data but floats, so the histogram idea covered last week may not be the best approach. Second challenge: we need to do this for 12 rasters, and at 1-2 seconds apiece that meant some 20 seconds. What could we do to speed it up?
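
As for the threshold in this post's title: once agoodle hands you a numpy array masked to your polygon, counting pixels above or below a cutoff is a one-liner. A sketch, using agoodle's read_array_bbox() and mask_with_poly() the same way the full program below does; the raster path and the 30.0 cutoff are made up, and the vertices (borrowed from the Barlow example) would really need reprojecting into the raster's SRS first:

import numpy as np
import agoodle

NODATA = -9999

# polygon vertices and their bounding box, which must be in the raster's SRS
pts = np.array([
    (521879, 5011246), (522182, 5011140), (521751, 5010750),
    (521751, 5010966), (521358, 5010966), (521879, 5011246),
])
bbox = (pts[:, 0].min(), pts[:, 1].min(), pts[:, 0].max(), pts[:, 1].max())

# read only the bbox'ed slice, mask out everything beyond the polygon, drop NODATA
raster = agoodle.AGoodle('/projects/oregon/rasters/max_temperature_current.tif')
ag = raster.read_array_bbox(bbox)
cells = ag.mask_with_poly(pts, copy=True, mask_value=NODATA)
cells = cells[cells != NODATA]

print (cells > 30.0).sum(), "pixels above 30"
print (cells <= 30.0).sum(), "pixels at or below 30"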



The Program

Well, let's skip straight to the code.
#!/usr/bin/python
"""
Analyze a set of several rasters, clipping to a given geometry in the same SRS as the raster.
Return a JSON structure containing the min/max/mean of values for each raster.

- SRS of the given geometry must be specified as a command-line parameter
- Simplifying the geometry is highly advised, as it makes it run a lot faster.
- agoodle can consume a lot of RAM

# Newberg Oregon
python climatestats.py 26910 "MULTIPOLYGON(((502664 5019820,502895 5019229,504881 5018580,504883 5017770,505596 5017772,505165 5017463,505352 5016998,505830 5016998,505830 5016074,505461 5016065,505460 5015669,504591 5015666,504665 5016153,504352 5016157,504349 5014760,503990 5014719,503990 5014246,503901 5014697,503011 5014503,502600 5014722,501617 5014157,501164 5014523,501661 5014877,501204 5015506,501365 5015770,500791 5016133,501262 5016711,500838 5016596,501159 5016982,501019 5018184,501266 5018179,501010 5018557,501268 5018554,501269 5019571,502664 5019820)))"

# Barlow Oregon
python climatestats.py 26910 "POLYGON((521879 5011246,522182 5011140,521751 5010750,521751 5010966,521358 5010966,521879 5011246))"
"""

# the list of rasters to process, and what key to store them in
# in the output structure, e.g. OUTPUT_STATS['tmin_2010']
TIFFS = {
    'tmin_2010' : '/projects/oregon/rasters/min_temperature_current.tif',
    'tmin_2040' : '/projects/oregon/rasters/min_temperature_2040.tif',
    'tmin_2070' : '/projects/oregon/rasters/min_temperature_2070.tif',
    'tmin_2100' : '/projects/oregon/rasters/min_temperature_2100.tif',
    'tmax_2010' : '/projects/oregon/rasters/max_temperature_current.tif',
    'tmax_2040' : '/projects/oregon/rasters/max_temperature_2040.tif',
    'tmax_2070' : '/projects/oregon/rasters/max_temperature_2070.tif',
    'tmax_2100' : '/projects/oregon/rasters/max_temperature_2100.tif',
    'ppt_2010' : '/projects/oregon/rasters/precipitation_current.tif',
    'ppt_2040' : '/projects/oregon/rasters/precipitation_2040.tif',
    'ppt_2070' : '/projects/oregon/rasters/precipitation_2070.tif',
    'ppt_2100' : '/projects/oregon/rasters/precipitation_2100.tif',
}

# the SRS of the rasters
TIF_EPSG = 900913

# the NODATA value used in the TIFFs
# yeah, all of them must have the same NODATA value
NODATA = -9999

# the structure for the final collected statistics from all TIFFs
# this is a global so it can be written by multiple threads,
# because that's what we'll be doing to make this thing faster
OUTPUT_STATS = {}

##############################################

import simplejson, sys, os
from osgeo import ogr, osr
import numpy as np
import agoodle
import threading
import time

# cribbed from agoodle.py
# given a WKT which we know is a single, simple polygon, already in the raster's SRS,
# pull out the vertex points as a numpy-compatible array, and the bounding box
def points_from_wkt(wkt):
    # unpack the WKT into a geometry
    g = ogr.CreateGeometryFromWkt(wkt)

    # knowing that it's a polygon, get the pixels of the outer ring
    ring = g.GetGeometryRef(0)
    pts = []
    for i in range(ring.GetPointCount()):
        x, y, z = ring.GetPoint(i)
        pts.append((x, y))
    pts = np.array(pts)
    bbox = (pts[:, 0].min(), pts[:, 1].min(), pts[:, 0].max(), pts[:, 1].max())
    return pts, bbox


def get_minmaxmean(tiff,wkt,from_epsg):
    # create the agoodle raster handle, reproject the polygon to match the raster SRS
    raster = agoodle.AGoodle(tiff)

    tif_srs = osr.SpatialReference()
    tif_srs.ImportFromEPSG(TIF_EPSG)

    poly_srs = osr.SpatialReference()
    poly_srs.ImportFromEPSG(from_epsg)

    polygon = ogr.CreateGeometryFromWkt(wkt)
    polygon.AssignSpatialReference(poly_srs)
    polygon.TransformTo(tif_srs)

    # collect the rings to analyze, taking them from the reprojected geometry
    # (not the original WKT, which is still in the polygon's own SRS);
    # for a single polygon this is dead simple
    rings_wkt = []
    if wkt[0:7] == 'POLYGON':
        # plain polygon, use the reprojected geometry as-is
        rings_wkt.append( polygon.ExportToWkt() )
    elif wkt[0:12] == 'MULTIPOLYGON':
        # a multipolygon: iterate over the component polygons and add them to the list
        for i in range(0, polygon.GetGeometryCount()):
            ring = polygon.GetGeometryRef(i)
            rings_wkt.append( ring.ExportToWkt() )

    # go over the rings, each a simple polygon geometry in WKT, in the raster's SRS
    # clip to the bbox, then mask to the polygon bounds; cribbed from agoodle.py, a variant of what's done in summarize_stats()
    # for each resulting numpy array, pull the min, max, and mean, excluding any NODATA values,
    # and append to the lists. yes, minimum is not a single minimum but a list of minimums, one per ring;
    # at a later phase we find the mean of means, the maximum of maximums, the minimum of minimums, etc.
    minimum = []
    maximum = []
    average = []
    for ring_wkt in rings_wkt:
        pts, bbox = points_from_wkt(ring_wkt)
        ag = raster.read_array_bbox(bbox)
        number_array = ag.mask_with_poly(pts, copy=True, mask_value=NODATA)

        # filter out the NODATA cells
        # (for the temperature datasets one might also filter out 0s, which skew the means; that's not done here)
        # then bail if we have no cells at all (the ring fell outside the raster's area)
        number_array = number_array[number_array != NODATA]
        if not len(number_array):
            continue

        average.append( number_array.mean() )
        maximum.append( number_array.max() )
        minimum.append( number_array.min() )

    # the minimum, maximum, and average are now lists of the per-ring min/max/mean
    # pull the average of the averages, the maximum of the maximums, etc. and we're done
    # (if no ring overlapped the raster at all, report nulls rather than crash on the empty lists)
    if not average:
        return { 'min':None, 'max':None, 'mean':None }
    minimum = min(minimum)
    maximum = max(maximum)
    average = float(sum(average)) / len(average)
    return { 'min':float(minimum), 'max':float(maximum), 'mean':float(average) }


def process_tiff_in_thread(tiff,key,wkt,poly_epsg):
    # the TIFF and SRS are required for analysis; the key is so we can put the results into the output structure
    these_stats = get_minmaxmean(tiff, wkt, poly_epsg)
    OUTPUT_STATS[key] = these_stats  # keep the whole min/max/mean dict, per the docstring

    # some profiling
    #print repr([ 'Thread done', tiff, time.strftime("%H:%M:%S") ])



if __name__ == '__main__':
    try:
        epsg = int( sys.argv[1] )
    except IndexError:
        print "Specify the numeric ID (SRID) of the SRS of the geometry"
        print "Use only the number, e.g. 3857 for Web Mercator"
        sys.exit(1)
    try:
        wkt = sys.argv[2]
    except IndexError:
        print "Supply a WKT geometry to compare against the raster."
        print "WKT geometry must be in EPSG:%d" % SRID
        sys.exit(1)

    # some profiling
    #print "Processing started %s" % time.strftime("%H:%M:%S")

    for key,tiff in TIFFS.items():
        run_thread = threading.Thread(target=process_tiff_in_thread,args=(tiff, key, wkt, epsg ))
        #print repr([ 'Thread started', tiff, time.strftime("%H:%M:%S") ])
        run_thread.start()
    # keep looping until the last thread exits
    while threading.activeCount() > 1:
        time.sleep(1)

    # some profiling
    #print "Processing finished %s" % time.strftime("%H:%M:%S")

    # done, encode as JSON and spit it out
    print simplejson.dumps(OUTPUT_STATS, True)

Discussion


That command-line program accepts an SRID and a geometry, and will reproject the stated geometry into the SRS of the raster. This is different from my previous posting, which did no reprojection. It will handle both simple polygons and multipolygons, accumulating a list of stats for each ring and then taking the maximum of maximums, minimum of minimums, etc. (does anyone say "maxima" as the plural of maximum?) And the output is again a JSON structure.

Because there are 12 rasters to process and each one takes 1-2 seconds, I made it multithreaded. All 12 rasters will be processed in parallel, so the total time is closer to 4 seconds than to the 18 that would result if they were done sequentially.

Because of the multithreading, OUTPUT_STATS is kept in the global scope, so all threads may write to it.
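
One design note on that wait loop: an equivalent that many people find tidier is to keep the thread handles and join() each one, which blocks until that thread finishes. A sketch of a drop-in replacement for the program's main loop, using the same TIFFS, wkt, and epsg as above:

import threading

threads = []
for key, tiff in TIFFS.items():
    t = threading.Thread(target=process_tiff_in_thread, args=(tiff, key, wkt, epsg))
    t.start()
    threads.append(t)

# join() returns once the thread has exited, so after this loop
# every raster has been processed and OUTPUT_STATS is complete
for t in threads:
    t.join()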