Schedule - PGConf.EU 2022
What happens when stack overflow doesn’t have an answer? comparing ST_within & H3 for spatial queries
Room: Berlin 2+3
Who we are & What we do
Blue Sky Analytics is a climate tech startup, building Bloomberg for green data using AI, satellite data & cloud-native technologies. We deal with a lot of large geospatial datasets and run many spatial-temporal queries over them; the spatial join is one of the bottlenecks we encountered with some queries simply failing to resolve.
Our use case
There are some very interesting use cases that we have to build for, eg “if you want to know the number of fires in a particular area during a particular time period aggregated over a sub-area and a temporal bucket” = “All US biomass emissions by state, on a monthly basis over last 5 years”. For this, we have to make complex Spatio-temporal joins to visualize fires across the globe over several years and attribute them to various shapes to build country & province-level estimates on a table that updates daily.
We use timescale & PostGIS to handle temporal and spatial components respectively. While this setup works well for most kinds of queries, but it’s not the best solution out there as we scale. And that’s where the Uber’s H3 comes in!
One very common query we have is “point in polygon”, while the
ST_Within function works but it’s very slow when there are a millions of points & the shapes are complex.
ST_Within is awesome but too slow to scale. It took
ST_Within 1 minute to sort a million points into 28 polygons ( just one country over 1 month). We have 10s of million points & 100s of shapes. Fine-tuning the database by adding more
work_mem and increasing CPU cores did help but barely. There’s only so much brute force you can apply; we knew we had to optimise at the lowest level.
One of our datasets is fire emissions (global, since the year 2012 & roughly 21 million points/year). We need to know where on earth these points lie, more precisely which admin boundaries to attribute to, from country to province level; The simplest way to bucket them spatially was to check each point against each shape using
ST_Within which turned out to be impossibly slow. And even if we pre-computed this, it was still impossible to build that pre-compute.
H3 indexing & pre-computed aggregations. With H3 indexing, we pre-computed the H3 value for each point and poly-filled all the shape polygons with the
h3_polyfill. Then we got two arrays (tables), which could be joined with just an equality operator (=) and it was almost 100X faster than using
ST_Within. The only limit we hit was the amount of data we were working with, which ultimately didn’t fit in the memory and the disk read-writes slowed down the query in real-time analysis.
Integrating the H3 extension into our setup was a bit challenging & also understanding how we could leverage this H3 index correctly. Initially, we thought we could just index the table using the H3 & the spatial query might become faster using it, but we were wrong.
Working with H3 required some trial & error as to what resolution to implement; As it had a major impact on our results in terms of speed and error. If we choose a very high resolution like 9, it took a lot of time to polyfill the polygon, and there were simply more computations to be made. With lower resolution, the points aggregated over a lesser number of grids but it leads to errors in results.
With our use case, we found resolution 7 was acceptable both in terms of speed & accuracy.
Overall, We were successfully able to reduce the time taken by the spatial query drastically. It went from ~ 15 minutes using
ST_Within (for the queries that ran at all) to 30 seconds using h3. We want to share our learnings with the larger PSQL community and help them save on emissions by cutting query times.
Along those lines, we would like to showcase our work with Postgres as a spatial-temporal database with TimescaleDB, PostGIS and H3.