Monday, September 2, 2013

agoodle for finding pixels above/below a threshold

In my previous post, I used agoodle to derive an associative array of pixel values onto pixel counts, and used this to calculate land cover statistics for a polygon. This was using classified data: pixels fall into one of these 15 specific pixel codes, period.

For a different project, we needed to calculate the min/max/mean of the values within a given polygon. This isn't classified data, but floats, so the histogram idea covered last week may not be the best approach. Second challenge: we need to do this for 12 rasters, and at 1-2 seconds apiece this meant some 20 seconds, and what can we do to speed it up?



The Program

Well, let's skip straight to the code.
#!/usr/bin/python
"""
Analyze a set of several rasters, clipping to a given geometry in the same SRS as the raster.
Return a JSON structure containing the min/max/mean of values for each raster.

- SRS of the given geometry must be specified as a command-line parameter
- Simplifying the geometry is highly advised, as it makes it run a lot faster.
- agoodle can consume a lot of RAM

# Newberg Oregon
python climatestats.py 26910 "MULTIPOLYGON(((502664 5019820,502895 5019229,504881 5018580,504883 5017770,505596 5017772,505165 5017463,505352 5016998,505830 5016998,505830 5016074,505461 5016065,505460 5015669,504591 5015666,504665 5016153,504352 5016157,504349 5014760,503990 5014719,503990 5014246,503901 5014697,503011 5014503,502600 5014722,501617 5014157,501164 5014523,501661 5014877,501204 5015506,501365 5015770,500791 5016133,501262 5016711,500838 5016596,501159 5016982,501019 5018184,501266 5018179,501010 5018557,501268 5018554,501269 5019571,502664 5019820)))"

# Barlow Oregon
python climatestats.py 26910 "POLYGON((521879 5011246,522182 5011140,521751 5010750,521751 5010966,521358 5010966,521879 5011246))"
"""

# the list of rasters to process, and what key to store them in
# in the output structure, e.g. OUTPUT_STATS['tmin_2010']
TIFFS = [
    'tmin_2010' : '/projects/oregon/rasters/min_temperature_current.tif',
    'tmin_2040' : '/projects/oregon/rasters/min_temperature_2040.tif',
    'tmin_2070' : '/projects/oregon/rasters/min_temperature_2070.tif',
    'tmin_2100' : '/projects/oregon/rasters/min_temperature_2100.tif',
    'tmax_2010' : '/projects/oregon/rasters/max_temperature_current.tif',
    'tmax_2040' : '/projects/oregon/rasters/max_temperature_2040.tif',
    'tmax_2070' : '/projects/oregon/rasters/max_temperature_2070.tif',
    'tmax_2100' : '/projects/oregon/rasters/max_temperature_2100.tif',
    'ppt_2010' : '/projects/oregon/rasters/precipitation_current.tif',
    'ppt_2040' : '/projects/oregon/rasters/precipitation_2040.tif',
    'ppt_2070' : '/projects/oregon/rasters/precipitation_2070.tif',
    'ppt_2100' : '/projects/oregon/rasters/precipitation_2100.tif',
]

# the SRS of the rasters
TIF_EPSG = 900913

# the NODATA value used in the TIFFs
# yeah, all of them must have the same NODATA value
NODATA = -9999

# the structure for the final collected statistics from all TIFFs
# this is a global so it can be written by multiple threads,
# cuz that what we;'ll be doing to make this thing faster
OUTPUT_STATS = {}

##############################################

import simplejson, sys, os
from osgeo import ogr, osr
import numpy as np
import agoodle
import threading
import time

# cribbed from agoodle.py
# given a WKT which we know is a single, simple polygon,
# pull out the vertex points as a numpy-compatible array, and the bounding box in raster SRS
def points_from_wkt(wkt, from_epsg=900913, to_epsg=TIF_EPSG):
    # unpack the WKT into a geometry
    g = ogr.CreateGeometryFromWkt(wkt)

    # knowing that it's a polygon, get the pixels of the outer ring
    ring = g.GetGeometryRef(0)
    pts = []
    for i in range(ring.GetPointCount()):
        x, y, z = ring.GetPoint(i)
        pts.append((x, y))
    pts = np.array(pts)
    bbox = (pts[:, 0].min(), pts[:, 1].min(), pts[:, 0].max(), pts[:, 1].max())
    return pts, bbox


def get_minmaxmean(tiff,wkt,from_epsg):
    # create the agoodle raster handle, reproject the polygon to match the raster SRS
    raster = agoodle.AGoodle(tiff)

    tif_srs = osr.SpatialReference()
    tif_srs.ImportFromEPSG(TIF_EPSG)

    poly_srs = osr.SpatialReference()
    poly_srs.ImportFromEPSG(from_epsg)

    polygon = ogr.CreateGeometryFromWkt(wkt)
    polygon.AssignSpatialReference(poly_srs)
    polygon.TransformTo(tif_srs)

    # collect the rings to analyze; for a single polygon this is dead simple
    rings_wkt = []
    if wkt[0:7] == 'POLYGON':
        # plain polygon, use the WKT as-is
        rings_wkt.append(wkt)
    elif wkt[0:12] == 'MULTIPOLYGON':
        # a multipolygon, we iterate over the component Polygons and add them to the list
        for i in range(0,polygon.GetGeometryCount()):
            ring = polygon.GetGeometryRef(i)
            rings_wkt.append( str(ring) )

    # go over the rings,m each a simple Polygon geometry in WKT, and in the raster's SRS
    # clip to the bbox then to the polygon bounds; cribbed from agoodle.py, a variant of what's done in summarize_stats()
    # for each resulting numpy array, pull the min, max, and mean and exclude and 0-NODATA values
    # and append to the array. yes, minimum is not a minimum but an array of minimum values, one per ring
    # at a later phase we find the mean of means, the maximum of maximums, the minimum of minimums, etc.
    minimum = []
    maximum = []
    average = []
    for ring_wkt in rings_wkt:
        pts, bbox = points_from_wkt(ring_wkt)
        ag = raster.read_array_bbox(bbox)
        number_array = ag.mask_with_poly(pts, copy=True, mask_value=NODATA)

        # filter out the NODATA cells
        # then filter out 0s if we're doing a temperature dataset (they screw up the means for temperature only)
        # then bail if we in fact have no match at all (outside the area)
        number_array = number_array[number_array != NODATA]
        if not len(number_array):
            continue

        average.append( number_array.mean() )
        maximum.append( number_array.max() )
        minimum.append( number_array.min() )

    # the minimum, maximum, and average are now lists, of the min/max/mean per ring
    # pull the average of the averages, the maximum of the maximums, etc. and we're done
    minimum = min(minimum)
    maximum = max(maximum)
    average = float(sum(average)) / len(average)
    return { 'min':float(minimum), 'max':float(maximum), 'mean':float(average) }


def process_tiff_in_thread(tiff,key,wkt,poly_epsg):
    # the TIFF and SRS are required for analysis; the key is so we can put the results into the output structure
    these_stats = get_minmaxmean(tiff, wkt, poly_epsg)
    OUTPUT_STATS[key] = these_stats['mean']

    # some profiling
    #print repr([ 'Thread done', tiff, time.strftime("%H:%M:%S") ])



if __name__ == '__main__':
    try:
        epsg = int( sys.argv[1] )
    except IndexError:
        print "Specify the numeric ID (SRID) of the SRS of the geometry"
        print "Use only the number, e.g. 3857 for Web Mercator"
        sys.exit(1)
    try:
        wkt = sys.argv[2]
    except IndexError:
        print "Supply a WKT geometry to compare against the raster."
        print "WKT geometry must be in EPSG:%d" % SRID
        sys.exit(1)

    # some profiling
    #print "Processing started %s" % time.strftime("%H:%M:%S")

    for key,tiff in TIFFS.items():
        run_thread = threading.Thread(target=process_tiff_in_thread,args=(tiff, key, wkt, epsg ))
        #print repr([ 'Thread started', tiff, time.strftime("%H:%M:%S") ])
        run_thread.start()
    # keep looping until the last thread exits
    while threading.activeCount() > 1:
        time.sleep(1)

    # some profiling
    #print "Processing finished %s" % time.strftime("%H:%M:%S")

    # done, encode as JSON and spit it out
    print simplejson.dumps(OUTPUT_STATS, True)

Discussion


That command-line program accepts a SRID and a geometry, and will reproject the stated geometry into the SRS of the raster. This is different from my previous posting, which did no repropjection. It will handle both simple polygons and multipolygons, accumulating a list of stats for each ring and then taking the maximum of maximums, minimum of minimums, etc. (does anyone say "maxima" as the plural of maximum?) And the output is again a JSON structure.

Because there are 12 rasters to process and each one takes 1-2 seconds, I made it multithreaded. All 12 rasters will be processed in parallel, so the total time is closer to 4 seconds than to the 18 that would result if they were done sequentially.

Because of the multithreading, OUTPUT_STATS is kept in the global scope, so all threads may write to it.


No comments:

Post a Comment

Note: Only a member of this blog may post a comment.