Postgres performance lessons - part 1: missing indexes and random keys

September 16, 2024

At Channable we recently released a dynamic image editor to empower our users to batch modify the images in their feed. It comes with an editor that supports placeholders for dynamic elements like a text box with the item's price, or a photo of the item sourced from e.g. an image_link field.

One of the services enabling this feature is the image downloader. Its task is to provide a cached copy of a recent download of the image at a provided URL. This gives the rest of our stack fast and reliable access to these images without relying on external servers or having to deal with the failures that come with them. Similarly, by capturing failures and storing them centrally, we can give the customer useful feedback about why we aren't able to process some of their images.

Essentially, the service only needs to perform two major tasks: downloading images and scheduling re-downloads. The latter ensures that we eventually pick up on changes and that we retry in case of transient errors. Downloading itself is familiar territory: after many years of downloading feeds at scale, we've built up a significant body of knowledge on how to most effectively use curl. Similarly, building a scheduler that distributes work over multiple machines is a task we've grown accustomed to, e.g. managing export jobs to external APIs like Amazon, bol, Shopify, etc.

However, shortly after the beta release we found ourselves with an ever-growing backlog of images to download, and customers having to wait hours before they could finally use all image related features. The culprit? A database that was only barely responsive as a result of dramatically increased load.

In short, our solution worked but did not scale. In this article we describe how we ended up solving the scalability issues and what we learned along the way.

1. Background: building a sound job queue in Postgres

It is important for scalability that our system is able to grow horizontally (more machines) rather than just vertically (a more powerful machine). In order to share work in such a distributed system, we employ a database as a parallel queue from which each worker can grab images to download whenever it sees fit. Some might consider this a controversial use case for a relational database, as there are many alternative pieces of software, like Redis or RabbitMQ, that claim to be more optimized for such tasks. In spite of that, we've found that more often than not, Postgres is best suited for our parallel queuing needs, for multiple reasons:

  1. Most services already need a database anyway, so it doesn't require any new external dependencies or tools.
  2. Inspectability and consistency comes cheap; everybody can just run a few simple queries to understand what is going on.
  3. Customizability: Postgres allows you to easily deviate from a strict FIFO model.
  4. Finally, developing features, debugging issues locally and testing on CI is greatly simplified because running a local database is already a well established part of our workflow.

To get a better idea of how the system overall interacts with the database - and thus the queue - observe the diagram below. It shows an example of the architecture when there are 3 instances running. Each instance is a process with 1 polling thread that attempts to fill a bounded channel from which a pool of worker threads eagerly take work. The number of threads in the threadpool can exceed the number of physical cores to maximize efficiency as downloading is mostly IO bound.

system overview
System overview - A high level sketch of the system when there are 3 instances running.

With the overall structure of the service laid out, let us zoom in on the database schema. The table containing the work looks like the following:

CREATE TABLE image_links (
  id INT8 NOT NULL PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
  image_link TEXT NOT NULL,
  download_after TIMESTAMPTZ,
  queued_at TIMESTAMPTZ
);

-- Allows for fast access to first N elements with the lowest value
CREATE INDEX image_links_download_after ON image_links(download_after);
  • download_after indicates where the image link lives in the schedule.
    If download_after < NOW() then the work should be claimed by one of the instances the next time it polls for new work.
    If download_after is NULL, the image link is unscheduled and will never be downloaded again.
    This is a useful terminal state in case we've unsuccessfully retried to download an image many times.

  • queued_at acts as a mutex between multiple consumers (NOT NULL = LOCKED).
    The timestamp is to allow the detection of 'stuck' work. If an image link has been queued for more than a few minutes for example, we might assume that a worker has somehow shutdown ungracefully and that the image link has become stuck.
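
For example, a periodic watchdog that releases such stuck work could look something like the query below. The ten-minute threshold and the query itself are an illustrative sketch, not necessarily what we run in production.

UPDATE image_links
SET queued_at = NULL                              -- release the 'mutex'
WHERE queued_at < NOW() - interval '10 minutes';  -- claimed a while ago but never finished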

Now that we have a table with work in it, let's have our instances claim some of it for their respective threadpools. This approach is called 'pull-based work scheduling'. Given that obtaining work has some overhead (like network communication and query planning), and downloading an image is usually not a particularly long-running task, we should amortize that overhead by grabbing a bunch of work at the same time. For this example we aim to grab 1000 image links at a time. The optimal value is something you need to tune experimentally; it depends on how much time it takes to poll, how fast the work is processed, how many threads are in the threadpool, etc. To claim the links we first isolate them in a sub-query and subsequently update them, returning the id of the updated records:

UPDATE image_links
SET queued_at = NOW()
FROM (
  SELECT id as inner_id
  FROM image_links
  WHERE download_after < NOW()
  -- don't grab links that are already queued
  AND queued_at IS NULL
  -- we want fairness by processing the backlog in order
  ORDER BY download_after
  LIMIT 1000
  FOR UPDATE SKIP LOCKED
) AS selection
WHERE inner_id = image_links.id
RETURNING image_links.id;

In case you are wondering why the locking clause is necessary, we now show how its omission can lead to a race condition that causes one image link to be queued by multiple instances when running in Postgres' default isolation level, READ COMMITTED. We first make sure that the database contains a dummy value:

INSERT INTO image_links (image_link, download_after, queued_at)
VALUES ('example.com', NOW(), NULL);

We then open two database sessions side by side, each representing a concurrently running instance of the service, and execute the following commands:

race condition
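
In textual form, the interleaving from the screenshot boils down to the following sketch (with the locking clause deliberately omitted):

-- Session 1:
BEGIN;
UPDATE image_links SET queued_at = NOW()
FROM (
  SELECT id AS inner_id FROM image_links
  WHERE download_after < NOW() AND queued_at IS NULL
  LIMIT 1000                      -- note: no FOR UPDATE SKIP LOCKED
) AS selection
WHERE inner_id = image_links.id
RETURNING image_links.id;         -- claims 'example.com', row lock held until COMMIT

-- Session 2, concurrently: runs the exact same statement.
-- Its inner SELECT still sees queued_at IS NULL because session 1 has not
-- committed yet; the outer UPDATE then blocks on session 1's row lock.

-- Session 1:
COMMIT;

-- Session 2 resumes: it only re-checks the outer condition (inner_id = image_links.id)
-- against the new row version, not the inner WHERE clause, so it overwrites
-- queued_at and returns the same id. The link is now queued twice.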

The crux of the problem is that Postgres does not synchronize reads by default. As a result, session 2 is allowed to first also SELECT the old value as part of its inner query, and only when it executes the UPDATE in the outer query does it wait for session 1 to release its lock, after which it simply overrides any changes already made. The solution is to obtain a FOR UPDATE lock while SELECT-ing the value, ensuring that two sessions cannot both select (and thus claim) the same row at the same time. Finally, the SKIP LOCKED clause ensures that a session never has to wait on another transaction that is currently holding a lock on an image link, which also reduces any avenues for deadlocks.

In summary, building a concurrent queue using Postgres - while requiring some deliberate thought about locking strategies - is a simple and effective way to get your distributed system up and running quickly and reliably. Furthermore, this design is capable of handling an impressive amount of throughput, as we will discover throughout the article.
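
To round off the picture, here is a hypothetical sketch of the other half of the queue's life cycle: what a worker could run once a download attempt has finished, releasing its claim and scheduling the next attempt. The one-day interval is purely illustrative.

UPDATE image_links
SET queued_at = NULL,                           -- release the claim
    download_after = NOW() + interval '1 day'   -- schedule the next re-download
WHERE id = $1;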

2. Project kickoff: the situation in numbers

During development, a stress test was run with 1 million images, and it was found that 1000 downloads per second was achievable. Of course, these images were hosted under ideal conditions, but it gave us the confidence that the design itself could handle large volumes of throughput. However, once up and running, the metrics showed that we were merely achieving ~30 downloads per second, which was insufficient to keep the backlog contained:

backlog december
Backlog over time - stalled refers to the number of image links whose download_after is in the past and should be attempted as soon as possible. In an ideal world this number stays as close to zero as possible. Queued refers to image links that have been picked up by a poller thread but are not yet processed by the threadpool.


Moreover, the database had been showing serious signs of distress. To fix the immediate outages we chose to simply upgrade the machine's CPU. This was enough to keep things operational, but the metrics were still worrying:

cpu usage december
Database metrics over time - from top to bottom: cpu usage, memory pressure, disk usage - all three at worryingly high levels.

CPU usage was always high, disk utilization was basically always at a constant 100%, and the memory pressure was an order of magnitude higher compared to any of our other databases. In other words, high time to start addressing these issues!

3. A tale of a single missing index and a wrong assumption

From the backlog metrics it became clear that our overall downloading speed was not bottlenecked by a slow or throttled network; the downloader threads mostly reported themselves as inactive due to a lack of available work. We came to this conclusion because the yellow line, indicating how many image links are currently queued, was often zero. But of course there was plenty of work, so what was causing it to not reach the downloader threads?

Before we started this project, a suspect was already named: the rate limiter. This was an extension of the concurrent queue I described in the previous chapter that was supposed to prevent us from spamming customers' servers with too many requests. The Haskell pseudocode that every instance of the downloader service executed looked like this:

-- Spawn 64 downloader threads and setup communication
channel <- newTQueueIO
forM_ [0..63] $ \_ -> do
    spawnDownloaderThread channel

-- Polling loop
forever $ do
    work <- pollFromBacklog (NrItems 1000)
    targetRate <- getMaxDownloadSpeedBasedOnErrorRate
    let (downloadNow, deferForLater) = classifyWork work targetRate

    -- Send work to the downloaders threads
    forM_ downloadNow $ \link -> do
        atomically (writeTQueue channel link)

    -- Reschedule deferred work
    forM_ deferForLater $ \link -> do
        execSql (unlines [
            "UPDATE image_links",
            "SET download_after = NOW() + RANDOM() * interval '1 min'",
            "WHERE id = $1"
        ]) link

So effectively we had built a funnel that would let just a bit of work through and would push the rest into the near future. Kind of like a snowplow that had raised its plow only slightly above the road:

snowplow

The theory was simple: pushing the work forward required so many updates on the table that it had become the bottleneck. In other words, our snowplow was pushing too much snow and could hardly drive anymore!

It was a compelling theory; it explained why the database usage was high (on the CPU as well as the disk) which in turn suggested that the polling thread was mostly waiting on Postgres to reschedule work instead of providing for the worker threads. But we made a classic mistake: we had conflated theory with facts, and we did not revisit our assumptions until much later. The theory had been around from before this project started, and we had already taken some measures to try and mitigate the problem:

  1. We made sure that the targetRate would be higher so that less work needed to be deferred (raising the snowplow).
  2. We increased the random interval over which we rescheduled the deferred work (flattening the mountain of snow).

Yet the throughput of the system remained mostly unchanged. It was only after measuring how long the rescheduling took - with merely the intention to put a number to this thing that was obviously bad - that we discovered that we spent hardly any time on these updates.

This finding finally helped us crawl out of the pit of assumptions and made us actively start searching for an alternative explanation. Coincidentally, our database admin had just set up pgbadger, which would be able to help us identify performance problems.
There it was, sticking out like a sore thumb as the very first entry on the list of "Slow queries":

queryduration
pgbadger's list of slow queries - UPDATE image_links SET queued_at = ... (23s616ms)

It was the polling query we discussed at length, but how was that possible? After all, we had specifically made sure to create an index on that download_after column and verified that it was being used.

Treacherously, our setup had subtly changed since then. image_links had been given an extra timestamp column called referenced_at, which tracked the last time an image link was encountered in an import. This would allow us to manage the lifetime of the image links by deleting any and all that had not been referenced in the last 7 days. But the real issue was that we had decided that any image link that had not been referenced for 36 hours could already be excluded from downloading, reducing the amount of work we potentially had to do. So our polling query had become:

UPDATE image_links
SET queued_at = NOW()
FROM (
  SELECT id as inner_id
  FROM image_links
  WHERE download_after < NOW()
  AND referenced_at > NOW() - interval '36 hours' -- newly added clause
  AND queued_at IS NULL
  ORDER BY download_after
  LIMIT 1000
  FOR UPDATE SKIP LOCKED
) AS selection
WHERE inner_id = image_links.id
RETURNING image_links.id;

Inadvertently, we had not considered how this would drastically change the performance of the query since we had not built an index on referenced_at. As a result, Postgres' query planner decided that the best thing it could do was a sequential scan over the entire table.

Postgres generally cannot effectively use two separate indexes for a particular sub-query, so we needed a new composite index on both download_after and referenced_at to eliminate the costly sequential scan.
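
A minimal sketch of such an index; the name and exact column order here are illustrative rather than the exact definition we deployed:

CREATE INDEX image_links_download_after_referenced_at
  ON image_links (download_after, referenced_at);

This improved the situation so considerably that we finally got the backlog under control: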

backlog nosedive
Backlog metrics over time - showing that from the moment the index was added, the backlog was worked away rapidly and remained generally very low.

The database server itself was still very preoccupied, however:

cpu during nosedive
Database cpu usage over time - showing a minor reduction of iowait as a stall reason, but no significant decrease in overall usage.

This was concerning as the image editor feature had only been released for about three months, and we were expecting a significant influx of customers over the coming months; we needed more headroom to scale. But at least the system was now able to take on the current demand, and we gained some important things:

  1. The peace of mind to research and make better informed decisions.
  2. The realization that the rate limiter - while far from ideal - was not currently a bottleneck.
  3. pgbadger was going to make it much easier to identify bottlenecks.
  4. Analyzing queries had become a first class citizen in our development workflow.

4. Random keys considered harmful

Similar to the previously unmentioned referenced_at column, image_links had another critical difference from the example in the introduction. Namely, the primary key was not an auto incrementing 8 byte integer, but rather the SHA256 hash of the image link stored as a BYTEA.
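
Concretely, the key side of the table looked more like the sketch below, where the key is computed by the application as the SHA256 digest of the image link (illustrative, not the full production schema):

CREATE TABLE image_links (
  id BYTEA NOT NULL PRIMARY KEY,  -- sha256(image_link), 32 bytes
  image_link TEXT NOT NULL,
  download_after TIMESTAMPTZ,
  queued_at TIMESTAMPTZ
);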

Why was this done? After all, given that an image link is unique in our system, it is already a candidate to be the primary key. Then again, an image link is not a particularly small bit of data. In some cases, it can be a piece of text with hundreds of characters. Using that as a key would mean that whenever we refer to an image link in another table, this sizeable chunk of data has to be duplicated. Therefore, we chose the SHA256 of the image link as a key because it was much smaller than most image links but could also be computed offline. This meant that we could directly find records that refer to an image link without actually having to join with the image link table. But we were not yet aware of the trade-off we had just made: a lack of data locality.

Postgres stores records that are inserted close to each other in time close to each other in memory. Moreover, records that were inserted close to each other in time are also much more likely to be queried close to each other in time, which is why caching pages is such an effective optimization. As a result, an index on a column that contains completely 'random' values - such as a SHA256 hash - has to scatter the entries for these otherwise coherent records across incoherent locations. Therefore, querying an index with random keys is detrimental to various caching optimizations and will drastically reduce the performance of all sorts of operations. Additionally, the performance of the index scan operation is compromised further by increased lock contention. An in-depth explanation of that phenomenon can be found in this enlightening blogpost on cybertec by Ants Aasma.

When we migrated our primary key - as well as all referencing foreign keys - to an autogenerated sequential INT8, we observed a significant drop in query times and cpu usage. Moreover, the 'cache hypothesis', as it were, is supported by the fact that the disk usage went down considerably after the migration of each key:

disk usage dropping
Database disk utilization over time - showing a convincing drop in disk usage as the result of migrating to smaller, sequential keys.
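
For illustration, a heavily simplified sketch of the direction such a key migration can take; the real migration was performed in multiple steps and also had to rewrite every referencing foreign key.

-- Add the new sequential key; Postgres fills it in for existing rows.
ALTER TABLE image_links
  ADD COLUMN new_id INT8 GENERATED BY DEFAULT AS IDENTITY;

-- (repoint all foreign keys in referencing tables from the old BYTEA key to new_id)

-- Swap the primary key over to the new column.
ALTER TABLE image_links DROP COLUMN id;             -- also drops the old primary key
ALTER TABLE image_links RENAME COLUMN new_id TO id;
ALTER TABLE image_links ADD PRIMARY KEY (id);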

Keep in mind that the complete effects of such a migration are only fully observable once you completely repack the table and indexes by running VACUUM FULL. In a production environment, where a long table lock is usually not an option, we use the pg_repack utility instead to achieve the same effect.
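
Locally, or anywhere an exclusive lock on the table is acceptable, that boils down to:

VACUUM FULL image_links;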

Another contributing factor to the improved query speed is the reduced cost of comparing INT8 values over BYTEA values. Especially because BYTEA values are of arbitrary length, comparisons require some form of looping rather than a single comparison of 64-bit integers. Determining precisely how much this played a role in the decrease of CPU usage is rather difficult, as our metrics do not differentiate between reasons for CPU stalls.
Regardless, every time we migrated one of the keys, the database's CPU got happier:

cpu drops with int8 key
Database CPU usage over time - lining this graph up with the disk utilization graph above, it becomes apparent that cpu usage also dropped as a result of smaller, sequential keys. Note that the drop around 02/28 was due to an unrelated optimization that we will cover in a future post.

A nice side effect is that tables with a relation to image_links, especially many-to-many relations, now require significantly less storage. After all, each foreign key was now only 8 bytes instead of 33 (32 bytes for the SHA256 + 1 byte for the length of the BYTEA).

The lesson learned is that randomly generated keys should be avoided if possible. If it is really necessary for you to generate keys offline, then at least pick a generator that can be naturally sorted by a time element - like UUID V6 - to maintain as much cache coherence as possible. That said, it is important to acknowledge that the severity of the negative effects of random keys is proportional to how much data is in the table. So if you are expecting merely a couple of thousand entries, it simply isn't worth thinking about. You should, however, also consider the size of any other tables, especially many-to-many relations, in which the random key occurs as a foreign key.

5. End of part 1

This was the first half of a two-part story about the Postgres performance lessons we've learned building Channable's image editor. To summarize, we first established a brief overview of the system and showed how to build a correct concurrent queue in Postgres. We then explained how a missing index was causing a serious performance regression and how we prevent such issues going forward. Lastly, we discussed the negative impact of random keys.

Now that we've established a lot of the context and addressed the more obvious performance bottlenecks, we are ready to dive into more advanced concepts and optimizations in the next edition.

Continue to part 2.

Hugo Peters - Software Development
