Over the summer of 2024, Naurt has been rapidly increasing the amount of US data fed into the final destination API. In May, Naurt's parking spot and building entrance data covered just shy of 60 million addresses. Fast forward to November and over 155 million addresses have been catalogued across all 50 states.
The most difficult aspect of running a geocoder is acquiring data. For Naurt, parking spots and building entrances are the easy part. Addresses, however, are not as easy. Good-quality, complete address data is hard to come by, so we’ve been busy collecting datasets published at the county, state, and nationwide level. This has presented us with a set of challenges we hadn’t faced when rolling out our technology in the UK and Singapore; namely, that addresses can look wildly different even when they describe the same location.
The UK and Singapore essentially have single sources of high-quality, standardised addresses. We found this just wasn’t the case in the US. Street names may include or exclude cardinal directions for no apparent reason, drop ordinal suffixes such as st, th or rd, and generally be shortened, for example Boulevard → Blvd or Mountain View → Mtn. View. This can lead to a single location having multiple different, correct addresses. For this reason, we developed a pipeline in which addresses are sanitised, corrected, and standardised before being combined with our parking spot and building entrance data. The output from this pipeline is also fed into our accuracy metric, which is returned with every geocoding result.
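To give a flavour of what standardisation involves, here is a minimal Python sketch. It is not our production pipeline: the `standardise` function and the tiny `ABBREVIATIONS` table are hypothetical, made up purely for illustration, and a real system needs far larger tables and many more rules.

```python
import re

# Hypothetical abbreviation table for illustration only.
# "st" is deliberately left out because it is ambiguous (street or saint).
ABBREVIATIONS = {
    "blvd": "boulevard",
    "rd": "road",
    "ave": "avenue",
    "mtn": "mountain",
    "n": "north",
    "s": "south",
    "e": "east",
    "w": "west",
}

# Matches a number followed by an ordinal suffix, e.g. "5th" or "3rd".
ORDINAL = re.compile(r"^(\d+)(st|nd|rd|th)$")


def standardise(address: str) -> str:
    """Return a lower-cased, expanded form of an address string."""
    # Lower-case and split on anything that is not a letter or digit,
    # which also discards punctuation such as "." and ",".
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    cleaned = []
    for token in tokens:
        ordinal = ORDINAL.match(token)
        if ordinal:
            # "5th" -> "5": keep the number, drop the suffix.
            cleaned.append(ordinal.group(1))
        else:
            # Expand known abbreviations, leave everything else alone.
            cleaned.append(ABBREVIATIONS.get(token, token))
    return " ".join(cleaned)
```

Under these assumptions, `standardise("123 N. 5th Blvd")` and `standardise("123 North 5 Boulevard")` both come out as `123 north 5 boulevard`, so the two spellings can be treated as the same address. Note that blindly expanding single letters such as `n` would misfire on unit letters, and a direction that is missing entirely cannot be recovered this way.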
A lack of address standardisation doesn’t just cause problems when ingesting data. It also makes it harder to find and rank suitable address matches when searching. As we mentioned in our previous blog, Naurt Update: Faster and Global, we’ve switched our search system from Postgres & PgVector to OpenSearch. The main reason for this was rising search latency caused by an additional 100 million American addresses. However, it’s also enabled us to be more intelligent with our full-text search, as we’re now able to use synonyms efficiently, such as rd → road. It’s also helped us handle the sticky situations where an abbreviation could have multiple meanings, such as st → street or saint. Overall, the road to full US coverage has left us with a quicker, more accurate search, not to mention the benefit of being able to quickly scale our system horizontally.
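The idea behind synonym handling can be sketched in a few lines of Python. This is only an illustration of the concept, not how OpenSearch or our search system implements it: the `SYNONYMS` table and the `expand_query` function are hypothetical names. An ambiguous abbreviation such as st simply maps to more than one expansion, and every combination is kept as a candidate for the ranking step to sort out.

```python
from itertools import product

# Hypothetical synonym table for illustration only.
# An ambiguous abbreviation lists every meaning it could have.
SYNONYMS = {
    "rd": ["road"],
    "blvd": ["boulevard"],
    "st": ["street", "saint"],
}


def expand_query(query: str) -> list[str]:
    """Return every variant of the query with abbreviations expanded."""
    tokens = query.lower().replace(".", "").split()
    # Each token becomes a list of possible expansions (or just itself).
    options = [SYNONYMS.get(token, [token]) for token in tokens]
    # The Cartesian product enumerates every combination of choices.
    return [" ".join(choice) for choice in product(*options)]
```

With this table, `expand_query("St Johns St")` yields four variants, one of which is `saint johns street`. The nonsensical variants cost little, because they are unlikely to match anything in a dense address index.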
We realised early on that comprehensive coverage is the bedrock of all good geocoders. Often, we find a customer’s use of a geocoder is business-critical: for a delivery company, an incorrect or unavailable geocoder would result in a lot of lost profit. This leads to complex systems where multiple geocoders are used, either as backups or as substitutes depending on which region the delivery is in. At Naurt we believe the best way to ensure a system is robust is by keeping it simple. A single geocoder with comprehensive coverage therefore becomes the obvious choice.
A side effect of bad coverage is bad search results. If you’re searching for an address in the US against a list of 100 addresses, it’s almost guaranteed the most relevant address will be a bad match. On the other hand, if you have hundreds of millions, it’s likely the most relevant address will be the one you’re looking for. No matter how many sanity checks you put in place to ensure the correct addresses are returned, the best way to avoid this problem is by increasing the density of address coverage.
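A toy Python example makes the point concrete. It uses the standard library’s `difflib.SequenceMatcher` as a stand-in similarity score, which is an assumption for illustration and not the ranking our search uses; `best_match` and its `min_score` threshold are likewise hypothetical.

```python
from difflib import SequenceMatcher


def best_match(query, candidates, min_score=0.85):
    """Return the most similar candidate, or None if nothing is close enough."""
    if not candidates:
        return None
    # Score every candidate by character-level similarity to the query.
    scored = [
        (SequenceMatcher(None, query.lower(), c.lower()).ratio(), c)
        for c in candidates
    ]
    score, candidate = max(scored)
    # A sanity check: reject the top result if it is still a poor match.
    return candidate if score >= min_score else None


# A sparse index that does not contain the address being searched for.
sparse = ["12 Oak Avenue, Austin, TX", "98 Pine Road, Boise, ID"]
# A denser index that does contain it.
dense = sparse + ["410 Main Street, Denver, CO"]
query = "410 Main Street, Denver, CO"
```

Against `sparse`, the top-ranked candidate is still a wrong address, so the threshold can only turn a bad match into no match. Against `dense`, the same query returns the right address without any help from the threshold.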
With excellent availability in the US, UK, and Singapore, Naurt is currently working with its partners to expand into continental Europe, Australia, and Canada. Alongside expanding coverage, Naurt continuously prioritises accuracy, both in the underlying final destination data and in search. We’ll be looking to make partial address matching more accurate, as well as improving the rate at which we reject searches. Often a bad match can be worse than no match at all!