The last few days have been intense work, trying to get the routing stuff to work quickly now that we have a whole lot of nodes in there. The tricks that worked last week, are now hopelessly slow, so it's been a lot of scrabbling to catch up. And in the process I've come up with a few "outside the box" tricks for PostgreSQL's indexes, and some "obvious but it missed the first draft" methods for speeding up node selection.
THE BASIC NEED
Someone wants to route from a LatLng to another LatLng. Step 1 is to find the closest node to the starting point and the closest node to the ending point. This was done pretty simply:
$point = sprintf("ST_Transform(ST_GeometryFromText('POINT(%f %f)',4326),3734)", $_GET['lng'], $_GET['lat'] );
SELECT *, ST_Distance(the_geom, $point) AS distance FROM route_trails WHERE type!='Road' ORDER BY distance LIMIT 1
Now that we've loaded up the full dataset, this takes way too long, often a few minutes. No surprise: there are some 381,000 segments in all and 22,000 of them are not Road, so that's 22,000 distances to calculate.
So, what can we do to speed it up?
- Speed up the filter for non-road nodes, so we get our subset more quickly.
- Add a bounding box filter (the && operator) so we don't calculate distances for all 22,000 nodes, but only those within some acceptable distance. This does fit a real-world need too: if there's no routing node within a few miles of me, I shouldn't expect to be able to route here.
FINDING NON-ROADS: INDEXES ON TEXT FIELDS
The "type" field is indexed but is still quite slow. This means that even the first filter will take some time, before we even get to distances. Look at these costs: the indexed varchar field is still very expensive, and if we create a boolean field the index scan still isn't significantly faster.
explain SELECT * FROM routing_trails WHERE type='Road';
-------------------------------------------------------------------------------------------------
Bitmap Heap Scan on routing_trails (cost=8687.13..364331.17 rows=359073 width=1376)
Recheck Cond: ((type)::text = 'Road'::text)
-> Bitmap Index Scan on routing_trails_idx_type (cost=0.00..8597.36 rows=359073 width=0)
Index Cond: ((type)::text = 'Road'::text)
explain SELECT * FROM routing_trails WHERE type!='Road';
--------------------------------------------------------------------------
Seq Scan on routing_trails (cost=0.00..553578.76 rows=21708 width=1376)
Filter: ((type)::text <> 'Road'::text)
explain SELECT * FROM routing_trails WHERE is_road=true;But, I tinkered around a bit and found that the IS NULL operator is very inexpensive, cheaper than the index scan matching values. Check this out:
-------------------------------------------------------------------------------------------------
Bitmap Heap Scan on routing_trails (cost=9519.48..363155.52 rows=357052 width=1374)
Filter: is_road
-> Bitmap Index Scan on routing_trails_idx_is_road (cost=0.00..9430.21 rows=357052 width=0)
Index Cond: (is_road = true)
-- make a boolean field, but don't set true/false, set instead true/NULLThe first rewrite, now looks like this. The distance stuff is still ahead of me, but at least the "not a road" stuff now executes in a second instead of a minute.
ALTER TABLE routing_trails ADD COLUMN is_road boolean;
UPDATE routing_trails set is_road=true WHERE type='Road';
CREATE INDEX routing_trails_idx_is_road ON routing_trails (is_road);
explain SELECT * FROM routing_trails WHERE is_road=true;
-------------------------------------------------------------------------------------------------
Bitmap Heap Scan on routing_trails (cost=9519.48..363155.52 rows=357052 width=1374)
Filter: is_road
-> Bitmap Index Scan on routing_trails_idx_is_road (cost=0.00..9430.21 rows=357052 width=0)
Index Cond: (is_road = true)
explain SELECT * FROM routing_trails WHERE is_road IS NULL;
------------------------------------------------------------------------------------------------------------
Index Scan using routing_trails_idx_is_road on routing_trails (cost=0.00..35491.92 rows=24034 width=1374)
Index Cond: (is_road IS NULL)
SELECT *, ST_Distance(the_geom, $point) AS distance FROM route_trails WHERE is_road IS NULL ORDER BY distance LIMIT 1
BOUNDING BOX: MAKE A POLYGON FROM THE POINT
Narrowing down to 22,000 non-road nodes was a great start, how can we narrow it down further? A simple bounding box filter. The vast majority of nodes are on the opposite end of the city and thus uirrelevant to me. And besides, for this use case it's acceptable to say "Sorry, no route" if someone's over a mile from the nearest node.
So, simple stuff: define a "radius" around the point, build a polygon, do a && test.
$point = sprintf("ST_Transform(ST_GeometryFromText('POINT(%f %f)',4326),3734)", $_GET['lng'], $_GET['lat'] );Voila, the planner now says that the cost of this function is 14,000 instead of 550,000 and I can verify that this reduces 22,000 nodes to about 500. This 40-to-1 speedup isn't purely hypothetical either: it's now finding my closest node in a few seconds rather than several minutes.
$boxbuffer = 0.05; // 3 miles or so at this latitude?
$box = sprintf("ST_Transform(ST_GeometryFromText('POLYGON((%f %f, %f %f, %f %f, %f %f, %f %f))',4326),3734)",
$_GET['lng']-$boxbuffer, $_GET['lat']-$boxbuffer,
$_GET['lng']-$boxbuffer, $_GET['lat']+$boxbuffer,
$_GET['lng']+$boxbuffer, $_GET['lat']+$boxbuffer,
$_GET['lng']+$boxbuffer, $_GET['lat']-$boxbuffer,
$_GET['lng']-$boxbuffer, $_GET['lat']-$boxbuffer
);
SELECT *, ST_Distance(the_geom, ST_Transform($point) AS distance FROM routing_trails WHERE is_road IS NULL AND the_geom && $box ORDER BY distance LIMIT 1