Tech
April 1, 2025
Deployments are scary, or at best, a slow process. At least, that is what I’m led to believe when I listen to friends who work in all sorts of tech roles. One of them even has the job title of ‘deployment coordinator’. Once every four-week release cycle, he stays up through the night to watch some Ansible playbooks crawl to completion, and maybe run some tests. This sounds like a task that should be automated! Aren’t there plenty of deployment tools available? Still, deployments being scary and time-consuming seems to be the industry standard. But it doesn’t have to be that way.
Note: we wrote about automated deployments earlier. This blogpost is about our in-house developed deployment coordination software specifically.
Back when the Channable development team could still fit in a single room, we had a ‘deployment hat’ as a means of coordinating who was deploying. Whoever was wearing the hat was deploying, and nobody else was allowed to touch the deployment scripts at the same time. Of course, this approach doesn’t scale, but we kept doing something very similar to it for a long time. Until about two years ago, deployments would look something like this: a developer, wanting to deploy the fruits of their labor, would merge their PR. Then, they would go to the release coordination channel on our internal chat platform and announce that they were about to deploy their changes. If other developers had merged any changes since the previous release, this developer would notify them that their code was about to go live, and ask whether they needed to check anything on staging first. Only when the other developers had responded would the developer actually deploy their code.
At the time, we believed that we had automated quite a bit of the deployment process. Creating a git tag, generating release notes from commit messages, building and uploading release artifacts, and pushing the release to the relevant environment(s) were all steps that were automated, or available from a single command. We released multiple versions per day. That isn’t too bad, right?
However, as the Engineering department grew, the process seemed to become more and more sluggish. Developers started to get scared or tired of deploying, and more and more PRs would be merged without a new release being crafted. We brought down CI wait times, improved deployment tooling, and made stricter rules about what code could be merged without a new release being created. However, the problems persisted.
We set out to build a simple tool with a simple API and the ability to execute deployment scripts. We immediately realized the need for some sort of composable, configurable blocks that make up the deployment definitions. Examples are steps to execute Ansible commands, perform database migrations, and apply Terraform plans. Each deployment plan consists of some of these configurable steps.
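To make the idea of composable steps concrete, here is a minimal sketch. It is not our actual implementation: the names `Step`, `ansible_step`, `migration_step`, and `run_plan` are hypothetical, and the steps only pretend to do work instead of invoking Ansible or a migration tool.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Step:
    """One configurable building block of a deployment plan (hypothetical)."""
    name: str
    run: Callable[[str], bool]  # takes a release version, returns True on success


def ansible_step(playbook: str) -> Step:
    # Assumption: a real step would invoke Ansible here; this one always succeeds.
    return Step(f"ansible:{playbook}", lambda version: True)


def migration_step(database: str) -> Step:
    # Assumption: a real step would run database migrations; this one always succeeds.
    return Step(f"migrate:{database}", lambda version: True)


def run_plan(steps: List[Step], version: str) -> List[str]:
    """Run the steps in order and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for step in steps:
        if not step.run(version):
            break
        completed.append(step.name)
    return completed


# A deployment plan is just an ordered list of configured steps.
plan = [migration_step("app_db"), ansible_step("deploy.yml")]
```

Because a plan is only a list, a project can reuse the same step types with different configuration, and the runner does not need to know what any individual step does.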
When CI has finished building a new version, it will call the deployment API with information about the built release, which triggers a deployment if requested by the developer. Some of the information about the deployment would be inserted by our merge bot (Hoff[1]) after receiving the specific merge command from the developer. The tool would report successes and failures to our internal chat platform. This tool served well for deployment automation, but we quickly realized that we hadn't solved all of our deployment problems. Developers would not always use the new tool, either because it didn't support their workflow or because they didn't trust it. Communication would still be synchronous in the chat, and deployment velocity didn't increase significantly.
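As an illustration of the hand-off from CI to the deployment tool, the sketch below shows what such a notification could look like. The field names (`project`, `version`, `commit`, `deploy_requested`) are assumptions made for this example and not the real API of our tool.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BuildNotification:
    """Information CI could send once a release has been built (hypothetical fields)."""
    project: str
    version: str
    commit: str
    deploy_requested: bool  # assumed to originate from the developer's merge command


def to_request_body(notification: BuildNotification) -> str:
    # Serialize to JSON; sorted keys keep the output stable for logging and tests.
    return json.dumps(asdict(notification), sort_keys=True)


def should_trigger_deployment(body: str) -> bool:
    # A release is only deployed when the developer asked for it.
    return bool(json.loads(body).get("deploy_requested", False))
```

Keeping the "deploy or not" decision in the payload means CI itself stays unaware of deployment policy: it reports what was built, and the deployment tool decides what to do with it.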
Because we continuously released the tool while building it, we discovered during development that what we needed was not (just) deployment automation, but a tool to automate the coordination around deployments. We found that we needed the following:
We tried to find existing tools that fit those requirements, but we couldn't find any that fit, so we had to figure out our own solutions to these problems.
Quite early in the development process, we implemented a simple feature that turned out to be crucial to solving these problems. The feature in question was a pause button. The deployment automation tool was nothing more than a script running somewhere in our cluster, but being able to prevent the script from picking up new deployments made it feel a lot more reliable. Developers find deployments scary, and for good reason: the moment their code hits production, any bugs in it will start to affect customers. We built all sorts of checks into the tool to ensure that every deployment matches the expectations. And if it doesn’t? The deployment tool pauses itself, calls out to chat, and a human can resolve the issue in relative peace. This makes the tool feel reliable: it will never do something unexpected. Apart from automatic pauses, developers could also manually pause the tool, which gives additional control, and can be very helpful during stressful situations like outages.
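The self-pausing behaviour can be sketched in a few lines. This is a simplified illustration, not our production code: the `Coordinator` class and its methods are hypothetical, checks are modelled as functions that return a problem description or `None`, and the chat notification is a plain list standing in for a real chat integration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# A check inspects a release and returns a problem description, or None if all is well.
Check = Callable[[str], Optional[str]]


@dataclass
class Coordinator:
    paused: bool = False
    pause_reason: Optional[str] = None
    notifications: List[str] = field(default_factory=list)  # stand-in for chat messages

    def pause(self, reason: str) -> None:
        """Pause manually or automatically, and tell the humans why."""
        self.paused = True
        self.pause_reason = reason
        self.notifications.append(f"paused: {reason}")

    def resume(self) -> None:
        self.paused = False
        self.pause_reason = None

    def try_deploy(self, version: str, checks: List[Check]) -> bool:
        """Deploy only when not paused and every check passes."""
        if self.paused:
            return False
        for check in checks:
            problem = check(version)
            if problem is not None:
                # Anything unexpected pauses the tool instead of pushing ahead.
                self.pause(problem)
                return False
        # Assumption: the actual deployment steps would run here.
        return True
```

The important property is that a failed check never results in a partial or surprising deployment: the tool stops, explains itself, and waits for someone to call `resume`.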
After introducing this pause button, we quickly realized that we needed something else: a queue that developers could control. Without one, deployments that came in while the tool was paused would not be executed in order. So we introduced an actual queue. Later, we exposed control of this queue to the developers. This means that developers could now add deployments to either end of the queue (priority deployments go to the front) and remove deployments from the queue.
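A queue with these operations maps naturally onto a double-ended queue. The sketch below is a minimal, assumed design (the `DeploymentQueue` class and its method names are made up for this example); it shows regular and priority enqueueing, removal, and how a pause stops releases from being picked up without losing their order.

```python
from collections import deque
from typing import List, Optional


class DeploymentQueue:
    """A pausable deployment queue (hypothetical sketch)."""

    def __init__(self) -> None:
        self._items: deque = deque()
        self.paused = False

    def enqueue(self, release: str, priority: bool = False) -> None:
        # Priority deployments (for example hotfixes) jump to the front.
        if priority:
            self._items.appendleft(release)
        else:
            self._items.append(release)

    def remove(self, release: str) -> bool:
        """Remove a pending release; returns False if it was not queued."""
        try:
            self._items.remove(release)
            return True
        except ValueError:
            return False

    def next_release(self) -> Optional[str]:
        """Hand out the next release, or None while paused or empty."""
        if self.paused or not self._items:
            return None
        return self._items.popleft()

    def pending(self) -> List[str]:
        return list(self._items)
```

While the queue is paused, releases keep accumulating in arrival order, so nothing is lost or reordered by the time someone resumes it.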
Once we had this in place, developers started to get on board rather quickly, even though we didn’t have much deployment automation yet (unsupported scenarios would just pause the queue). Feature requests for specific deployment workflows started pouring in over the course of months, and we implemented a whole slew of features we didn’t anticipate at first. Nowadays, we enjoy per-project queues, a beautiful UI (well, beauty is in the eye of the beholder), and automated database migrations.[2] Most deployments go through this tool, and almost no PR gets merged without being deployed immediately. You’d almost have to go out of your way to merge without deploying: a developer can enqueue a PR for merge and deploy with a single message to our merge bot.
The result is that we deploy more often, with less overhead. And when we say often, we mean it: we do about 30 deployments per day, with peaks up to 50.
Of course, no system is perfect, and our deployment coordination tool is no exception. Coordination of hotfixes and rollbacks is still messy, especially since the integration between our merge bot and the deployment coordination tool is not perfect. The merge bot doesn't pause when the deployment tool is paused, which can lead to a big buildup of undeployed releases in the queue. This is normally no problem, but it becomes one when we want to apply a hotfix or revert a release. Our current solution is to apply the hotfix or revert commit on master, and release the latest master. This means that all the changes in the in-between releases are deployed at once, instead of one by one. As a consequence, we have to ask the creators of each PR that was merged in between if they want to check their code on staging, just like before we automated the deployment process. At least now this only happens when we really break things!
We started building a deployment automation tool, but somewhere along the way, we ended up with a deployment coordination tool. At some point, we stopped and looked around to see if we could find such software off-the-shelf. To our surprise, we couldn't find anything like it (even ignoring all of the integrations we did with our specific stack). So what if you want to solve deployment coordination problems in your organization? As far as we are concerned, you just need a queue and a pause button.
1: https://github.com/channable/hoff
2: To make this safe and reliable, we drew inspiration from https://benchling.engineering/move-fast-and-migrate-things-how-we-automated-migrations-in-postgres-d60aba0fc3d4