Thursday, February 19, 2015


So here was a fun puzzle, to occupy a few hours last week.

There's an agency which does inspections at certain locations, either levying a fine for failing to comply or giving them a thumb-up for following the rules. It's infeasible to visit every location in the state, so they have a sampling methodology.

Clusters and Samples

 Their testing methodology is as follows:
  • Define a set of clusters for the year. A "cluster" is a set of ZIP code boundaries,
  • which are contiguous,
  • containing a total of between 20 and 80 locations.
  • A "sample" is a set of locations (say 500),
  • composed by randomly selecting clusters and putting them together, to achieve that many locations (so maybe 6-25 zipcode clusters).
The rationale here is:
  • The use of clusters, means that locations would be visited and inspected in geographical blocks, which makes travel easier. An individual zipcode is too small, perhaps having 1 or 2 locations, or even none at all, and they prefer to send out a group to do 50-ish locations in an area.
  • The clusters themselves are randomly generated, so one can't claim favoritism and bias. A given zipcode may or may not be clustered in with any particular neighboring zipcodes from one year to the next.
  • To generate the sample, clusters are randomly selected: a cluster of 5 zipcodes and 32 locations  here, 8 zipcodes and 78 locations there, a cluster of 4 zipcodes and 21 locations, ... Again, no favoritism and bias -- it's a random collection of randomly-generated polygons.

Having read the first part, about the clustering system, I thought to myself: "A gerrymander maker!" Starting at a given location, it crawls over neighboring areas, collecting them into one giant polygon until that polygon has the desired characteristics. In this case it's number of locations, but hypothetically could be any other calculation doable on that polygon such as census demographics or voting habits.

Gerrymanders / Clusters

So here's part 1: the generation of those clusters.

  • Zipcode boundary polygons in a table.
  • Each zipcode polygon has an attribute locationcount which is the number of location-points contained inside each zipcode polygon. This was calculated earlier in desktop GIS since they wanted those statistics anyway. (see also
  • The zipcode polygons are reasonably well digitized so their edges touch, but human error means that two neighboring zipcodes may not actually touch. Note the buffer operation below which accommodates this.
  • The examples below are PHP using Kohana's DB API, with PostgreSQL and PostGIS as the database and GIS heavy-lifting.
  • Take a list of all zipcode polygons, iterate over them.
  • For each one, do a "while true" collecting neighboring polygons and neighboring-neighboring polygons and neighboring-neighboring-neighboring polygons... until we have a list of polygon ID#s whose summed locationcount is in range.
  • Keep an associative array of polygon ID#s that have already been used in a cluster. If this polygon is already in a cluster, skip it. If not, well, it is now.
  • Having accumulated a set of clusters (cluster 1234 = polygons  [ 56, 789, 1023, 45, 67, 8910 ]) get the geom union of those selected zipcode polygons, to get the geometry of the cluster as a whole. This is done in a second pass, for performance and for debugging: the two steps can easiy be debugged separately from each other.

// the preferred range of locations in a cluster
// the maximum is not enforced: if we go from 19 to 81 it's still >=20 and we call it
// however, in the final checking/reporting stage we do tally up clusters whose
// location population is outside this range, and report them as being under-populated or over-populated just FYI
$cluster_locations_min = 20;
$cluster_locations_max = 80;

// create our registry of clusters: uniqueID => cluster-info ( name, polygonids, geom, locationcount )
//     the unique ID for a cluster is not umportant, as long as it's unique
//     we use the ID# of the first zipcode polygon
$clusters    = array();

// create a quick-reference of zipcode polygons which have already been assigned to a cluster
// so we can skip over them quickly
// remember: associative array lookups are swift: array-within-array lookups are slow
$already     = array( '0'=>TRUE ); // don't let this be empty, makes for invalid NOT IN () clauses

// make up some prepared statements
// to fetch the total number of locations out of a given set of polygon ID#s, so we can construct clusters
// to fetch a random neighbor of the geomunion of a given set of polygon ID#s, so we can construct clusters (the cluster will sprawl and crawl, neighbors of neighbors)
//      tip: buffering a little bit, allows us to make up for not-quite-precise digitization
// to fetch the total union geometry of a set of polygon ID#s, used when we have a final set of IDs and want to grab the final geometry
$findclusterlocationcount  = DB::query(Database::SELECT, "SELECT SUM(locationcount) AS locationcount FROM zipcodeboundaries WHERE id IN :clusteridlist");
$findclusterrandomneighbor = DB::query(Database::SELECT, "SELECT id FROM zipcodeboundaries WHERE ST_INTERSECTS( the_geom, (SELECT ST_BUFFER(ST_UNION(the_geom),0.0001) FROM zipcodeboundaries WHERE id IN :clusteridlist ) ) AND id NOT IN :clusteridlist AND id NOT IN :alreadylist ORDER BY RANDOM() LIMIT 1");
$findclustergeomunion      = DB::query(Database::SELECT, "SELECT ST_MULTI(ST_UNION(the_geom)) AS the_geom FROM zipcodeboundaries WHERE id IN :clusteridlist");

// pass 1: collect the polygon ID#s to form clusters

// the list of all Polygon ID#s so we can iterate over them and construct a cluster from each one
// the name of the polygons used merely for output/display/debugging
$all_polygons = DB::select('name,id')->from(zipcodeboundaries)->execute();

// start in on a cluster, and recursively keep adding RANDOMLY-SELECTED neighboring polygons to it
// until we collect a location count in the acceptable range
foreach ($all_polygons as $polygon) {
    // first and easiest: if this polygon has already been assigned to a cluster, we don't need to process it at all
    if (array_key_exists($polygon['id'],$already)) continue;

    // start by assuming that this cluster is this 1 polygon and has no locations
    // then use the while loop to keep spreading outward
    $polygon_ids_in_this_cluster = array($polygon['id']);
    $locationcount               = 0;
    while (TRUE) {
        $locationcount = $findclusterlocationcount->param(':clusteridlist',$polygon_ids_in_this_cluster)->execute();
        $locationcount = $locationcount[0]['locationcount'];

        // if we exceed the minimum then we have a cluster
        if ($locationcount >= $cluster_locations_min) {
            // verbose output
            print "     Complete Cluster: ID {$polygon['id']} with $locationcount locations. Components are: " . implode(',',$polygon_ids_in_this_cluster);
            // tag all component polygons as taken
            foreach ($polygon_ids_in_this_cluster as $id) $already[$id] = TRUE;
            // log the cluster
            $clusters[ $polygon['id'] ] = array( 'name'=>$polygon['name'], 'polygonids'=>$polygon_ids_in_this_cluster, 'locationcount'=>$locationcount );
            // done

        // got here so we don't have enough locations yet
        // pick a random neighbor and add it to the $polygon_ids_in_this_cluster and try again
        // if we come up with 0 neighbors (all neighbors already in a cluster) then we give up,
        //     and accept that this cluster is undersized
        $new_neighbor_id = $findclusterrandomneighbor->param(':clusteridlist',$polygon_ids_in_this_cluster)->param(':alreadylist',array_keys($already))->execute();
        if (! @$new_neighbor_id[0]['id']) {
            // verbose output
            print "     Undersized Cluster: ID {$polygon['id']} with $locationcount locations. Components are: " . implode(',',$polygon_ids_in_this_cluster);
            // tag all component polygons as taken
            foreach ($polygon_ids_in_this_cluster as $id) $already[$id] = TRUE;
            // log the cluster
            $clusters[ $polygon['id'] ] = array( 'name'=>$polygon['name'], 'polygonids'=>$polygon_ids_in_this_cluster, 'locationcount'=>$locationcount );
            // done
        $polygon_ids_in_this_cluster[] = $new_neighbor_id[0]['id'];
printf("Done calculating %d clusters.", sizeof($clusters) );

// pass 2: we have polygon IDs, use geom union to create geometries for each cluster
// $clusters is an assocarray of a cluster-ID mapped onto info: list of polygon IDs, location count

printf("Calculating geometry union for %d clusters.", sizeof($clusters) );
foreach (array_keys($clusters) as $clusterid) {
    $unions = $findclustergeomunion->param(':clusteridlist',$clusters[$clusterid]['polygonids'])->execute();
    $clusters[$clusterid]['the_geom'] = $unions[0]['the_geom'];
    $clusters[$clusterid]['the_geom'] = $unions[0]['the_geom'];

// save to database

print "Saving to database.";
foreach (array_values($clusters) as $cluster) {
    $fields = array_keys($cluster);
    $values = array_values($cluster);
    DB::insert('sampling_clusters', $fields)->values($values)->execute();
Database::instance()->query(NULL, "VACUUM FULL ANALYZE sampling_clusters");
print "Done saving.";

// generate some statistics about the generated cluster set
// not vital, but informative
$stats = array();
$stats['total_clusters'] = sizeof($clusters);
$stats['notenoughlocations'] = 0;
$stats['toomanylocations']   = 0;
$stats['locationcount_min'] = array(); // these are lists during the loop and we use min() and max() below; faster than doing "if X > min, min=X" here in the loop
$stats['locationcount_max'] = array(); // these are lists during the loop and we use min() and max() below; faster than doing "if X > min, min=X" here in the loop
$stats['locationcount_avg'] = 0;       // this is summed up in the loop; then divide by number of clusters for average
foreach (array_values($clusters) as $cluster) {
    $stats['locationcount_min'][] = $cluster['locationcount'];
    $stats['locationcount_max'][] = $cluster['locationcount'];
    $stats['locationcount_avg']  += $cluster['locationcount'];

    if ($cluster['locationcount'] < $cluster_locations_min) $stats['notenoughlocations']++;
    if ($cluster['locationcount'] > $cluster_locations_max) $stats['toomanylocations']++;

$stats['polygoncount_min'] = min($stats['polygoncount_min']);
$stats['polygoncount_max'] = max($stats['polygoncount_max']);
$stats['polygoncount_avg'] /= sizeof($clusters);

$stats['locationcount_min'] = min($stats['locationcount_min']);
$stats['locationcount_max'] = max($stats['locationcount_max']);
$stats['locationcount_avg'] /= sizeof($clusters);

foreach ($stats as $keyword=>$value) {
    print "STATS: $keyword = $value";


These excerpts from three runs, show the sort of randomness and variation we're looking for. Polygon number 76 (I don't know its ZIP code, doesn't matter) sometimes got joined up to 1 other zipcode, sometimes to 4, and the resulting cluster was right on target for number of locations.

Tallied up a Cluster: 76 : 23 Locations / Polygon IDs 76,167,102,107
Tallied up a Cluster: 100 : 31
Locations / Polygon IDs 100,94,99
Tallied up a Cluster: 392 : 39
Locations / Polygon IDs 392,16

Tallied up a Cluster: 76 : 20
Locations / Polygon IDs 76,75
Tallied up a Cluster: 100 : 21
Locations / Polygon IDs 100,106,101,117
Tallied up a Cluster: 392 : 28
Locations / Polygon IDs 392,393

Tallied up a Cluster: 76 : 27
Locations / Polygon IDs 76,167,173,102,145
Tallied up a Cluster: 100 : 26
Locations / Polygon IDs 100,133,101,104,267
Tallied up a Cluster: 392 : 30
Locations / Polygon IDs 392,395

Even the undersized clusters follow the pattern we hope for, of not being consistent. Polygon 204 has no locations at all. In one case it was stuck into a corner and resulted in a cluster with 0 locations (not a problem, means nobody will be visiting that area) but on other runs, Polygon 204 got paired up with 6 or 4 or 10 others.

Undersized Cluster: 204 : Locations 0 / Polygon IDs 204
Tallied up a Cluster: 199 : 22 Locations / Polygon IDs 199,210,304,201,209,204,289
Tallied up a Cluster: 207 : 30 Locations / Polygon IDs 207,211,210,194,198,204,209,199,201,305,304,289
Not too shabby!

Next: Sampling Those Clusters

So there it is, a gerrymander maker. Our criteria for a cluster being acceptable, is a given number of locations contained within them. But it could just as easily have been based on income, race, voting habits... Point is, keep accumulating areas until you get the demographic you want, and there's your district / cluster / whatever.

And the neat thing is that even for a good-sized state in the western USA, this thing executes in about a minute. So if you aren't happy with the results, you can grind through it as many times as you like.

No comments:

Post a Comment