Naurt Update: Faster and Global

Author - Indigo Curnick

September 27, 2024
Updates

A big focus for Naurt over the last few months has been response speed. We felt the service was too slow, and global response times in particular could be very poor. We attacked this problem from two sides:

  1. How quickly the servers can deal with a request, once received
  2. Deploying copies of the Naurt infrastructure around the world

Let’s look at what we did for server response time first. We made three main changes to speed things up:

  • Introduced OpenSearch
  • Adopted Lambda streaming responses
  • Improved caching

Originally, Naurt used only one kind of database: Postgres. We used it to store billing and account data as well as the location data itself. Billing and account queries are lightning fast in Postgres. However, we always had problems with the location data. We went through many possible designs for storing it; the final iteration in Postgres used PgVector. For a time we got unbelievably fast results. Then we added another 150 million addresses, and that was no longer true. On top of that, we could never get very satisfactory search accuracy. While we love Postgres, it isn’t designed for full text search.

The solution was to split our data across two databases: billing and account data stayed in Postgres, and the location data moved to OpenSearch. This has scaled much better to the hundreds of millions of addresses we now hold, and both search speed and accuracy are much better. It’s hard to give a specific number for how much time this saved, as it varies massively from query to query; we’ll see a broader breakdown of response times shortly.
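
To give a flavour of what that looks like in practice, here is a minimal sketch of the kind of full-text query OpenSearch is built for, sent to its _search REST endpoint from Rust with reqwest and serde_json. The index name, field name, and endpoint are hypothetical placeholders rather than Naurt’s actual schema.

```rust
// A minimal sketch (hypothetical index, field, and endpoint, not Naurt's
// real schema) of a full-text `match` query against an OpenSearch
// `_search` endpoint, using reqwest and serde_json.
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::new();

    // Full-text match query: OpenSearch analyses the input text and
    // scores documents by relevance, which is the kind of search
    // Postgres/PgVector struggled with here.
    let query = json!({
        "query": {
            "match": {
                "full_address": "10 Downing Street, London"
            }
        },
        "size": 1
    });

    let response = client
        .post("http://localhost:9200/addresses/_search") // placeholder endpoint
        .json(&query)
        .send()
        .await?
        .text()
        .await?;

    println!("{response}");
    Ok(())
}
```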

We were also able to move Postgres onto a smaller instance, since it was no longer responsible for any major search operations. As a result, even accounting for the new OpenSearch cost, our total database costs fell by about 50%.

In short, for full text search across a large number of documents, OpenSearch is faster, cheaper, and more accurate than PgVector. Implementing the change was also not as big an engineering task as we first imagined; the majority of the work took only a few days. We had also amassed a significant suite of API tests, which we could run automatically to verify that nothing broke.

Lambda is a brilliant AWS service. Since almost all of the work in the final destination API is done by the database, now OpenSearch, we don’t actually need much compute power in the server itself. We have always written the Lambdas in Rust for speed and reliability. Generally, a Lambda only exists until it sends a response, whereas a traditional server can keep working after it has responded, for example by spawning a thread to update API key usage. Lambda has more recently added support for streaming responses, which lets us respond to the user and then keep the Lambda alive afterwards to handle usage updates. This change shaved around 7 ms off each response.
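
To illustrate the pattern, here is a minimal sketch of the traditional-server version described above, written with axum and tokio; it is a hypothetical example rather than Naurt’s production handler. The handler returns its response straight away while a spawned task records API key usage in the background. With Lambda streaming responses the same idea applies inside the Lambda: flush the response to the stream first, then finish the bookkeeping before the function exits.

```rust
// A minimal sketch (hypothetical routes and helpers, not Naurt's actual
// code) of the "respond first, do bookkeeping afterwards" pattern in a
// traditional server, using axum 0.7 and tokio.
use axum::{extract::Path, routing::get, Router};

// Stand-in for writing a usage record to Redis or Postgres.
async fn record_api_key_usage(api_key: String) {
    println!("recording usage for key {api_key}");
}

async fn geocode(Path(api_key): Path<String>) -> &'static str {
    // Kick off the usage update without waiting for it; the handler
    // returns immediately and the task runs in the background.
    tokio::spawn(record_api_key_usage(api_key));

    // Placeholder for the real geocoding response body.
    "{\"result\": \"...\"}"
}

#[tokio::main]
async fn main() {
    // axum 0.7 path-parameter syntax.
    let app = Router::new().route("/geocode/:api_key", get(geocode));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```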

We also improved caching. We now use a Redis instance to cache key usage and provide a buffer between the database and the Lambda services. Another Lambda periodically synchronises the Redis instance with Postgres.
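
As a rough sketch of how such a buffer can work, assuming hypothetical key names and a stub in place of the real Postgres write: each request increments a counter in Redis, and a periodic job flushes the counts into Postgres and resets them.

```rust
// A minimal sketch (hypothetical key names and a stub Postgres call, not
// Naurt's actual code) of using Redis as a usage buffer. Requires the
// `redis` crate with the "tokio-comp" feature.
use redis::AsyncCommands;

/// Called on every request: bump the usage counter for this API key.
async fn record_usage(
    con: &mut redis::aio::MultiplexedConnection,
    api_key: &str,
) -> redis::RedisResult<i64> {
    con.incr(format!("usage:{api_key}"), 1).await
}

/// Called periodically (e.g. by a scheduled Lambda): read the counter,
/// persist it, then reset it. There is a small race window between the
/// read and the reset; an atomic GETDEL or a Lua script would close it.
async fn flush_usage(
    con: &mut redis::aio::MultiplexedConnection,
    api_key: &str,
) -> redis::RedisResult<()> {
    let count: i64 = con.get(format!("usage:{api_key}")).await.unwrap_or(0);
    if count > 0 {
        write_usage_to_postgres(api_key, count).await; // hypothetical helper
        let _: () = con.set(format!("usage:{api_key}"), 0).await?;
    }
    Ok(())
}

// Stand-in for the real Postgres write.
async fn write_usage_to_postgres(api_key: &str, count: i64) {
    println!("persisting {count} uses of {api_key}");
}

#[tokio::main]
async fn main() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut con = client.get_multiplexed_async_connection().await?;

    record_usage(&mut con, "demo-key").await?;
    flush_usage(&mut con, "demo-key").await?;
    Ok(())
}
```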

What were the effects on response times? We ran a small experiment to measure this. The methodology was simple: on an identical copy of the infrastructure with no other traffic, we sent requests at various rates for 5 minutes each and measured the median response time. We also randomised the locations and addresses used, to prevent the database from caching a small set of responses. A computer in the UK performed the tests against infrastructure hosted in London, so the figures include time spent on the internet and are representative of usage in the UK.
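
For illustration, a stripped-down version of that kind of test harness might look like the following; the endpoint and request rate are placeholders rather than the actual benchmark.

```rust
// A rough sketch of the benchmark methodology described above
// (hypothetical endpoint and rate, not Naurt's actual harness): send
// requests at a fixed rate for a fixed duration, record each latency,
// and report the median.
use std::time::{Duration, Instant};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::new();
    let requests_per_second: u64 = 10;
    let test_duration = Duration::from_secs(300); // 5 minutes

    let mut latencies = Vec::new();
    let mut ticker = tokio::time::interval(Duration::from_millis(1000 / requests_per_second));
    let start = Instant::now();

    while start.elapsed() < test_duration {
        ticker.tick().await;

        // Placeholder request; in the real test the address or location
        // is randomised on every call so the database can't cache a
        // small set of responses. Requests here are sent sequentially;
        // a real load test would issue them concurrently to hold the
        // target rate even when responses are slow.
        let sent = Instant::now();
        let _ = client
            .get("https://api.example.com/geocode?address=randomised")
            .send()
            .await?;
        latencies.push(sent.elapsed());
    }

    latencies.sort();
    let median = latencies[latencies.len() / 2];
    println!("median latency: {median:?}");
    Ok(())
}
```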

Reverse geocoding response times were improved by up to 82% under load. Forward geocoding response times were improved by up to 66% under load.

We also improved global response times, specifically in the Asia-Pacific region, with a new deployment in Singapore. Previously, these requests were routed to London, which meant customers in Asia-Pacific would frequently see response times in excess of 500 ms. Much of the work that went into speeding up the service generally also helped with deploying to multiple regions. For example, splitting our data between Postgres and OpenSearch means we can keep one Postgres instance for billing and accounts and create a copy of the location data in every region we operate in. The extra caching also speeds up key validation in regions far from the billing database.

Asia-Pacific customers should now experience response times comparable to European customers.

Unfortunately, no system can scale forever. As traffic grows and more data is added, the system will inevitably begin to slow again. Nevertheless, before releasing large amounts of data into production we always test response times on an identical, independent system first to make sure we don’t cause delays for our users. We also continuously monitor response times and will intervene if they begin to slow.
