Preparing for the worst: Our core database failover test

  • Thread starter: Matheus Fernandes, Matthew Binshtok
Many engineering teams have disaster recovery plans. But unless those plans are regularly exercised on production workloads, they don’t mean much. Real resilience comes from verifying that systems remain stable under pressure, not just in theory but in practice.

On July 24, 2025, we successfully performed a full production failover of our core control-plane database from Azure West US to East US 2 with zero customer impact.

The test covered all control-plane traffic: every API request, every background job, every deployment and build operation. Preview and development traffic routing was affected, though our production CDN traffic, served by a separate globally replicated DynamoDB architecture, remained completely isolated and unaffected across our 19 regions.

This operation was a deliberate, high-stakes exercise. We wanted to ensure that if the primary region became unavailable, our systems could continue functioning with minimal disruption. The result: a successful failover with zero customer downtime, no degraded performance in production, and no postmortem needed.
