Saturday, July 28, 2012

Routing is quick, finding the node is slow!

The last few days have been intense work, trying to get the routing stuff to work quickly now that we have a whole lot of nodes in there. The tricks that worked last week, are now hopelessly slow, so it's been a lot of scrabbling to catch up. And in the process I've come up with a few "outside the box" tricks for PostgreSQL's indexes, and some "obvious but it missed the first draft" methods for speeding up node selection.


THE BASIC NEED

Someone wants to route from a LatLng to another LatLng. Step 1 is to find the closest node to the starting point and the closest node to the ending point. This was done pretty simply:

$point = sprintf("ST_Transform(ST_GeometryFromText('POINT(%f %f)',4326),3734)", $_GET['lng'], $_GET['lat'] );

SELECT *, ST_Distance(the_geom, $point) AS distance FROM route_trails WHERE type!='Road' ORDER BY distance LIMIT 1

Now that we've loaded up the full dataset, this takes way too long, often a few minutes. No surprise: there are some 381,000 segments in all and 22,000 of them are not Road, so that's 22,000 distances to calculate.

So, what can we do to speed it up?
  • Speed up the filter for non-road nodes, so we get our subset more quickly.
  • Add a bounding box filter (the && operator) so we don't calculate distances for all 22,000 nodes, but only those within some acceptable distance. This does fit a real-world need too: if there's no routing node within a few miles of me, I shouldn't expect to be able to route here.

FINDING NON-ROADS: INDEXES ON TEXT FIELDS

The "type" field is indexed but is still quite slow. This means that even the first filter will take some time, before we even get to distances. Look at these costs: the indexed varchar field is still very expensive, and if we create a boolean field the index scan still isn't significantly faster.
explain SELECT * FROM routing_trails WHERE type='Road';
-------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on routing_trails  (cost=8687.13..364331.17 rows=359073 width=1376)
   Recheck Cond: ((type)::text = 'Road'::text)
   ->  Bitmap Index Scan on routing_trails_idx_type  (cost=0.00..8597.36 rows=359073 width=0)
         Index Cond: ((type)::text = 'Road'::text)

explain SELECT * FROM routing_trails WHERE type!='Road';
--------------------------------------------------------------------------
 Seq Scan on routing_trails  (cost=0.00..553578.76 rows=21708 width=1376)
   Filter: ((type)::text <> 'Road'::text)

explain SELECT * FROM routing_trails WHERE is_road=true;
-------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on routing_trails  (cost=9519.48..363155.52 rows=357052 width=1374)
   Filter: is_road
   ->  Bitmap Index Scan on routing_trails_idx_is_road  (cost=0.00..9430.21 rows=357052 width=0)
         Index Cond: (is_road = true)
But, I tinkered around a bit and found that the IS NULL operator is very inexpensive, cheaper than the index scan matching values. Check this out:
-- make a boolean field, but don't set true/false, set instead true/NULL
ALTER TABLE routing_trails ADD COLUMN is_road boolean;
UPDATE routing_trails set is_road=true WHERE type='Road';
CREATE INDEX routing_trails_idx_is_road ON routing_trails (is_road);

explain SELECT * FROM routing_trails WHERE is_road=true;
-------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on routing_trails  (cost=9519.48..363155.52 rows=357052 width=1374)
   Filter: is_road
   ->  Bitmap Index Scan on routing_trails_idx_is_road  (cost=0.00..9430.21 rows=357052 width=0)
         Index Cond: (is_road = true)

explain SELECT * FROM routing_trails WHERE is_road IS NULL;
------------------------------------------------------------------------------------------------------------
 Index Scan using routing_trails_idx_is_road on routing_trails  (cost=0.00..35491.92 rows=24034 width=1374)
   Index Cond: (is_road IS NULL)
The first rewrite, now looks like this. The distance stuff is still ahead of me, but at least the "not a road" stuff now executes in a second instead of a minute.
SELECT *, ST_Distance(the_geom, $point) AS distance FROM route_trails WHERE is_road IS NULL ORDER BY distance LIMIT 1

BOUNDING BOX: MAKE A POLYGON FROM THE POINT

Narrowing down to 22,000 non-road nodes was a great start, how can we narrow it down further? A simple bounding box filter. The vast majority of nodes are on the opposite end of the city and thus uirrelevant to me. And besides, for this use case it's acceptable to say "Sorry, no route" if someone's over a mile from the nearest node.

So, simple stuff: define a "radius" around the point, build a polygon, do a && test.
    $point = sprintf("ST_Transform(ST_GeometryFromText('POINT(%f %f)',4326),3734)", $_GET['lng'], $_GET['lat'] );
    $boxbuffer = 0.05; // 3 miles or so at this latitude?
    $box = sprintf("ST_Transform(ST_GeometryFromText('POLYGON((%f %f, %f %f, %f %f, %f %f, %f %f))',4326),3734)",
        $_GET['lng']-$boxbuffer, $_GET['lat']-$boxbuffer,
        $_GET['lng']-$boxbuffer, $_GET['lat']+$boxbuffer,
        $_GET['lng']+$boxbuffer, $_GET['lat']+$boxbuffer,
        $_GET['lng']+$boxbuffer, $_GET['lat']-$boxbuffer,
        $_GET['lng']-$boxbuffer, $_GET['lat']-$boxbuffer
    );

SELECT *, ST_Distance(the_geom, ST_Transform($point) AS distance FROM routing_trails WHERE is_road IS NULL AND the_geom && $box ORDER BY distance LIMIT 1
Voila, the planner now says that the cost of this function is 14,000 instead of 550,000 and I can verify that this reduces 22,000 nodes to about 500. This 40-to-1 speedup isn't purely hypothetical either: it's now finding my closest node in a few seconds rather than several minutes.

2 comments:

  1. Nice post,
    Have you thought about using the <-> operator as part of PostGIS 2.0? This does a sorted list from a GiST index, ala http://blog.opengeo.org/2011/09/28/indexed-nearest-neighbour-search-in-postgis/

    Best,

    ReplyDelete
  2. Very nice! I hadn't seen that new filter.

    This new method is significantly simpler, which is great for code maintenance:

    SELECT * FROM routing_trails WHERE is_road IS NULL ORDER BY the_geom <-> $point LIMIT 1

    I implemented this at the project being discussed, and in those cases it executes in 2ms instead of 60ms. It's not a huge difference, but it does mean less load on the server, which is always good.

    ReplyDelete