System Design Lessons from a Discord Voice Outage

As someone who is as interested in the organizational forces of failures as the technical ones, I find studying incidents — and root cause analysis generally — highly meaningful. They expose causes at multiple layers of the stack: there's the trigger, the design choices that make it catastrophic, the observability gaps that increase time-to-detect, operational gaps that slow down recovery. But also in the bigger organizational elements that create the environment in which they happen.

This is a week-long deep dive into a Discord incident where a routine Kubernetes config change took down their voice systems for about three hours. It's an attempt to use it as a lens into seeing how distributed systems are designed, where they're vulnerable, and how they fail at scale.

Read on.

What Happened

Discord's Realtime Infrastructure team was migrating Elixir services to Kubernetes. To reduce CPU utilization, they decided to vertically scale pods — fewer pods, more resources each. When deployed, Kubernetes terminated 50% of pods in one zone immediately due to the reduced replica count. A safety check that was supposed to facilitate graceful handoffs consumed the entire termination grace period without completing those handoffs. Kubernetes force-killed the pods ungracefully.

Since the sessions service ran across three equally balanced zones, this took out 17% of all Discord sessions simultaneously.

That drop fired a chain reaction through other systems, each failing for distinct reasons. Engineers spent three hours fighting fires. Voice calls were unavailable for most of that time.

Cascading Failures

At the core of the problem was cascading failures.

Cascading failures are scenarios where one failure triggers more failures until the whole system collapses, often in unplanned ways. Specifically, the failure of some components increases the load on surviving components, pushing them toward failures. The cascade is load-driven.

Two conditions make a system vulnerable to such failures:

Running near saturation: When the surviving systems are running at near saturation, an increased load causes them to go above their allowed/planned capacity. If your cluster is running at 70% capacity under normal conditions, it can probably absorb a 30% node loss. If it's running at 90%, it cannot.

Insufficient N+X redundancy. Redundancy is defined as having more than one copy of an entity that does the same thing — either a database, a micro-service, or a cache — as a way to make a system fault tolerant. But the question isn't just about more, but also how much more. In other words, it requires asking how much failure the system should be able to tolerate, and how much more capacity that requires in terms of redundancy.

Your system should be sized to handle normal load plus the redistributed load from X failed nodes. If you have 15 instances and each holds roughly 1/15th of the traffic, losing 3 means the remaining 12 each absorb an extra 25% on top of their current load. Whether they survive that depends on their headroom.

This makes defining X, the number of redundant instances, really important, which in turn requires asking how many components are likely to fail together, which is interesting because we often assume that in a distributed system each instance is independent of the others. In reality, they share failure boundaries in subtle ways, both physically and logically. AWS architecture guidelines offer a good mental model for the hierarchy to keep in mind: process → container → node → rack → AZ → region → cloud provider. Then there's the logical domains — deployment unit, shard, and shared dependency.

The Discord incident is a good example of a logical failure domain. The reduced pod count caused an abrupt change in cluster topology in a single deployment region.

Redundancy and Its Nuances

The layered redundancy insight from the previous section means that you can have a system that's redundant at the macro level while having a single point of failure at the micro level, which is one of the most insidious lessons in this incident.

Discord had 15 voice syncer instances. But inside each instance, every outgoing HTTPS connection to an SFU (Selective Forwarding Unit) had to pass through two single-threaded supervisor processes: a Holster.Pool DynamicSupervisor and a gun connection supervisor.

Since the shared, non-redundant resourcse existed in each instance, adding more instances would have given more instances with the same bottleneck replicated in each one of them.

What makes any single-threaded system a bottleneck is the amount of time it spends on a task before relinquishing control. In this case, what made the supervisors take more time on each task was a pattern called selective receive. It's Erlang's way of allowing a process to pick specific messages out of its mailbox based on pattern matching, leaving non-matching messages in the queue until a matching clause is found.

When an Erlang supervisor spawns a child process, internally it sends the spawn instruction, then waits for an ack from the new child before proceeding. That wait is selective receive. It's looking for a particular reply among whatever else might be in the queue.

In Discord's case, the supervisor's mailbox was being flooded with thousands of connection requests (the reconnection requests from the failed sessions). Each time it needed to spawn a connection and wait for the ACK, it had to scan through all those pending requests to find the ACK. Discord's engineers found that at ~100k messages in the mailbox, a single selective receive scan adds roughly 1ms. At 1 million messages, each spawn takes so long that requests pile up faster than they're served, making every subsequent spawn even slower.

The fix was PartitionSupervisor — instead of one supervisor, N independent supervisors each handling 1/N of the load in parallel. A flood in one partition can no longer stall the others. Each partition supervisor is still itself a GenServer process. But gun supervisor was still an SPOF. To solve this, the Discord team moved the gun connection process lifecycle management directly into Holster.Pool, eliminating the second bottleneck.

This solution builds redundancy at the process level, specifically, active-active redundancy, because all processes are at work at the same time. But it also follows another design principle: bulkhead. We'll come back to bulkhead later in the post, with more nuances.

The Actor Model

While selective receive is Erlang's implementation of how a process consumes messages from others, the underlying pattern which necessitates reading messages is the actor model.

An actor is a process with three properties: its own private state, communication exclusively through message passing, and one message processed at a time. The benefit it gives you is that you can reason about each actor in isolation. In a traditional concurrent system, shared mutable state means any thread can change any value at any time, which makes it harder to reason about correctness. Erlang's GenServer is its generic implementation of the model. Java's Akka framework is another.

But it's not without trade-offs. The "one message at a time" property is both the feature and the constraint.

In most systems this doesn't matter because the actor handles messages fast enough that the queue stays shallow. It becomes a problem when either the work per message increases, or messages arrive faster than the actor can process them. In Discord's case, both happened at once: each spawn took longer as the mailbox grew (the selective receive scan gets slower with queue depth), and the reconnection flood meant new requests arrived faster than the queue could drain.

PartitionSupervisor doesn't give any single actor the ability to process messages concurrently, it just creates more actors. This is comparable to horizontal scaling at the process level.

Creating actors with a strict requirement of communicating between each other using messages creates the perfect environment for another failure mode: high fan-out.

The Fan-out Pattern

We talked about how failures in some components cause an increase in load on others and about the two characteristics that make a system vulnerable to it. But the increase in load to the other components of the system itself can be of two broad categories: the same load that was handled by the failing components, redistributed, or new work that gets created due to other components failing.

A barista quitting at a busy time in a café creates more work for the remaining baristas, but also means more work for the manager to deal with frustrated customers.

The first one is obvious. Let's look at the latter.

Process monitors is Erlang's way of coordinating GenServer processes. Any failing process sends a {:DOWN} signal to the process monitoring it. When 17% of the session service pods failed, given how much context and state were stored in the session systems at Discord, it "generated a flurry of {:DOWN} signals throughout the cluster", flooding the mailboxes of the processes monitoring them.

This is a pattern, a failure mode, that becomes evident in any system with asynchronous message-passing between components.

A celebrity posting a tweet or a video fans out to all their millions of subscribers. A highly awaited event like a new season of Office or Stranger Things means a high volume of marketing emails going out at the same time.

In system design problems we typically talk about high fan-out at the macro level, like in Kafka, but in this incident, high fan-out happened at the process level - processes consuming messages from each process they monitored, some within the same instance, some in remote instances.

Each service/instance runs thousands of processes, and process monitors are designed to be N-to-N. Which means if hundreds of processes in the same instance all monitor the same remote process, they each get notified separately, almost at the same instant.

The most conceptually intuitive fix is coalescing at the fan-out boundary. Coalescing is a pattern of, well, coalescing multiple messages into one. Rather than every event propagating individually, an intermediary aggregates them before delivery. Instead of 10,000 individual {:DOWN} signals hitting downstream processes simultaneously, a proxy accumulates them and delivers a batch. The cost is latency.

This is what Discord built with ZenMonitor. ZenMonitor inserts a single local proxy per remote node: rather than each local process monitoring remote processes directly, the proxy does it on their behalf. When the remote node fails, the proxy absorbs the notification and fans it out locally in controlled batches.

Coalescing is also the pattern used to defend against a cache stampede: when multiple requests all miss the same cache key and trigger database lookups all at once. Request coalescing at the caching layer means that all but one requests wait while the cache does one database lookup.

Stateless and Stateful Systems

When in system design you think about a system failing, and what to do about it, one of the things to ask yourself is whether the system is stateless or stateful. The root cause, in Discord's case, wasn't the Kubernetes config change in isolation. It was what that config change did to a specific kind of system — the stateful session services.

Services that are stateless(or those that keep their state externally) keep no meaningful state in process memory. All durable state lives in a database, cache, or object store, outside the process. Terminating a pod in this model is effectively free. Since the state lives elsewhere, the pod is disposable. This is the model most web API services follow, and it's what makes Kubernetes deployments effortless for those services.

In in-memory stateful services, the state lives in process memory, often because of the business needs and constraints. Either the data is too dynamic, or too latency-sensitive to be able to afford a round-trip to an external store on every operation. Discord's sessions layer is a good example, as their engineering team described, "each host runs thousands of in-memory stateful processes — session state, presence, guild memberships, voice state." They were the source of truth.

When you terminate one of these pods, you're evicting potentially thousands of pieces of live state information. Unless those processes have explicitly handed off their state to another node before the pod dies, that state is gone, which has a direct impact on user experience.

Discord built the entity count monitor to solve this. Each pod runs a monitor that tracks how many live entity processes are running on it. But this didn't quite help during the incident because the entity monitor operated at the application-level, so Kubernetes had no awareness of it, which meant that it ran its own lifecycle management and terminated the pods when it deemed fit.

Discord's fix was to make K8s aware of it, but before we look into that, let's understand how K8s handles pod terminations.

Pod Termination Lifecycle in Kubernetes

The termination sequence in Kubernetes looks like this:

1. Deletion request. A kubectl command, a deployment rollout, a scale-down, or a controller triggers a DELETE on the pod object at the Kubernetes API server.

2. Pod transitions to Terminating. The API server marks the pod as Terminating. The moment this happens, the pod is removed from Service Endpoints (so new traffic stops routing to it via kube-proxy), and the pod's lifecycle hooks begin executing.

3. Endpoint propagation. kube-proxy on each node updates its iptables or ipvs rules to remove the terminating pod from the routing table. This propagation is asynchronous — it takes a non-zero and unpredictable amount of time to reach all nodes. During this window, the pod may have already received SIGTERM but is still receiving live traffic. The preStop hook exists partly to compensate for this race.

4. preStop hook. If a lifecycle.preStop is defined, Kubernetes executes it before sending SIGTERM. A common pattern is simply sleep 15, or an HTTP GET to a local endpoint. The hook runs synchronously: SIGTERM is not sent until it exits. This makes preStop useful for two things: letting endpoint propagation complete before the process starts shutting down connections, and performing any explicit de-registration from upstream services.

5. SIGTERM. The container's main process (PID 1) receives SIGTERM. This is the signal that graceful shutdown should begin: drain in-flight work, close connections, flush buffered state, de-register from service discovery. In Discord's case, it should also trigger the entity handoff sequence — transferring thousands of in-memory processes to other nodes before the pod dies.

6. terminationGracePeriodSeconds countdown. The clock starts when Kubernetes initiates termination. Discord had built the entity count monitor to ensure Kubernetes killed only the pods that had completely handed off their state. But a safety check designed to wait for other in-flight events prevented the handoff from even starting, which caused them to exceed the grace period.

7. SIGKILL. When the grace period expires, kubelet sends SIGKILL to any remaining processes. This is unconditional and uninterruptible. This is what Discord's pods hit. State that hadn't been handed off was lost.

Now, the fix.

Admission Webhooks

Entity counting and handoff happen during the termination lifecycle and at the application level.

Discord's fix was to create a webhook that intercepts any scale-down request for their Elixir workloads and checks whether all affected pods have entity counts at zero. If any pod still holds live processes, the request is rejected. The scale-down does not happen until the condition is met.

Admission webhooks operate before the termination lifecycle even begins.

Compare that to increasing terminationGracePeriodSeconds or adding a preStop hook. Those operate within the termination lifecycle and buy time, but they are still at the mercy of the ticking clock. The admission webhook means the termination clock never starts ticking until it's safe.

State Transfer Patterns

The fundamental idea behind state transfer is externalizing the state, but where exactly it's externalized distinguishes the various approaches.

The most intuitive approach is writing every state mutation synchronously to an external store — Redis, a database — before acknowledging the operation. Shutdown becomes trivial, since the store already has the authoritative copy of everything, and a new pod picks up where the old one left off. The obvious cost is write latency. Every operation now carries a network round trip. It works well for state that changes infrequently, or where that latency is acceptable. This essentially makes the system stateless.

For systems that can't afford that, the alternative is a replica pod — a peer that receives a real-time stream of mutations. On SIGTERM, the primary promotes the replica, updates the service registry so traffic routes there, then drains its own connections. The in-memory nature of the replica means state transfer is fast, but it introduces complexity: you need replica lag monitoring, promotion logic, and protection against split-brain scenarios where both primary and replica believe they're authoritative.

A third path is event log replay. Every mutation is written as an event to an append-only log. The in-memory state is a materialized view derived from replaying those events. On crash or restart, a new process replays from the last snapshot plus the delta. Recovery time is bounded by how frequently you snapshot. The overhead is replay time on startup and the infrastructure needed to maintain the log.

The Heartbeat Pattern and TTLs

Discord's voice syncers used etcd for service discovery, announcing their health periodically with a 60-second TTL(Heartbeat pattern).

If a node doesn't announce itself within the TTL window, other nodes assume it's gone and stop routing traffic to it. This lets the system automatically clean up crashed, orphaned, or stale nodes without any explicit de-registration. The TTL determines how quickly the system reacts to failures.

In Discord's case, the same bottlenecked supervisor processes that handled outgoing SFU connections also handled etcd health announcements. When the supervisors' mailboxes were overwhelemed with connection requests, heartbeats couldn't get through. Voice syncer instances missed their TTL windows, fell out of the etcd registry, and their traffic redistributed to the remaining healthy instances, which got more load and hit the same fate. Another instance of cascading failures.

This highlights the difference between the thing that does the work and the thing that says whether or not it can do the work. In some cases, both might have to give the same answer. In other cases, not necessarily.

Separation of Concerns, Single Responsibility, and Bulkheads

The coupling Discord described — heartbeats and SFU connections sharing the same supervisor processes — is a violation of a couple of things at once.

Separation of concerns. Outbound media connections to SFUs are data-plane work. In simpler terms, they are the work. They perform the actual business of relaying voice. Service registration with etcd is control-plane work, or the thing that tells whether it they can do the work. Routing both through the same supervisor means the control plane is coupled with the data plane.

Single responsibility principle is another angle of looking at it. The Holster/gun supervisor had two distinct responsibilities. Responsibilities that share a process share its failure modes.

Another is something we briefly touched on before: the Bulkhead pattern.

The fix with PartitionSupervisor addresses the load dimension of this: split the pool so a flood in one partition can't stall the others. So it introduced separation of failure boundaries on the load axis.

But the more fundamental problem is that heartbeating and connection pooling probably shouldn't share supervisors at all. A dedicated supervisor for health registration, isolated from the connection pool, would have kept beating even as the data plane saturated. The instance wouldn't have dropped from the ring.

So with the bulkhead pattern, there's partitioning of the load, and there's partitioning of the concern, which the principles of separation of concerns and single-responsibility principle speak to.

The Thundering Herd and Mitigation

When Discord tried to restart the voice syncer cluster, they all rushed to recreate their state at once. This is the thundering herd problem — a large number of processes simultaneously hammering a system, causing a well-functioning system to fail, or preventing a recovering one from recovering.

The standard client-side answer is exponential backoff with jitter: rather than all clients retrying at the same moment, each waits a random interval drawn from an exponentially growing window. The jitter breaks the synchrony.

On the server side, rate limiting is the most common technique. Discord had rate limits in place, but they hadn't been re-tuned after migrating from VMs to Kubernetes with a higher pod count. The per-host limit was set against the old VM architecture's lower pod count, so it was too permissive. Explicitly lowering syncer spawn limits across all call-owning services was one of the things that finally let restarts hold.

But this highlights the nuances of rate-limiting. They're usually tuned to a certain state of the world with assumptions about the characteristics of the system - users, clients, instances/nodes, and so on. In this case, per-instance limits, which were tuned to the reality of the VM-based environment, became stale and inaccurate to mirror the K8s cluster, where each pod could handle less than a VM instance.

Load shedding is another server-side technique: dropping requests when the system is overwhelmed, rather than queuing them indefinitely. The trade-off here is a system that rejects some requests and stays alive is more useful and one that accepts everything and collapses.

Rate-limiting and load-shedding seem identical, but there are important differences. Conceptually, load shedding is a survival mechanism, while rate limiting is primarily a fairness enforcement technique. Rate-limiting is generally client-aware: the server tells the clients that it's rejecting the requests along with hints about the quota to adhere to and when to retry. Load shedding just drops the requests without informing the clients.

Global rate limits can be used as a load shedding technique, but it's primarily used as a fairness enforcement tool, per user, per device, IP, region or whatever.

Which brings us to the next pattern. If rate limiting gives clients some signal about what's happening and what to do about it and load shedding doesn't, how do the clients handle it?

Using the circuit breaker pattern, another client-side technique. An upstream caller monitors the error rate from a downstream service. When errors exceed a threshold, the circuit "opens" — the caller stops sending requests, giving the downstream time to recover. After a timeout, it enters "half-open" state, sending a trickle of probe requests. If those succeed, the circuit closes.

Feedback Loops and Control Lag

Two patterns from the previous section, load shedding and circuit breaker, are both reactive mechanisms. Autoscaling is another example of it. They continuously monitor a certain dimension, and act based on whether or not they exceed a certain threshold. At the core of it is a feedback loop: an action happens, the impact of the action is assessed, the next steps are determined. Their effectiveness depends on how fast they execute relative to how fast the system is changing. If the system degrades faster than the feedback loop can respond, the correction happens too late.

That's what happened during the incident: a significant change like dropping of 50% pods in a single region(17% of all sessions) hardly gave any time for the reactive measures to kick in on time. Specifically, because the A/V infrastructure was still on VMs, scaling the voice syncers cluster was somewhat manual, which means increased control lag.

Conclusion

There are more foundational concepts and patterns to be learned from this, like the ideas of immutable infrastructure and policy as code and turns out you can actually learn a full-semester worth of distributed systems lessons from a single incident, but I decided to stop here. Onto the next incident analysis.