10 Key Improvements from Cloudflare's 'Fail Small' Initiative: A More Resilient Network

Over the past several months, Cloudflare has completed an intensive engineering effort code-named "Code Orange: Fail Small." The project targeted the root causes of two major global outages in late 2025, with the goal of keeping your traffic flowing and your data secure. Resilience work is never truly finished, but the team has shipped concrete improvements that make our infrastructure harder to break, faster to recover, and more transparent when things do go wrong. Below are the ten most important changes: how they work, why they matter, and what they mean for your experience on Cloudflare.

Safer Configuration Changes
Snapstone: Unified Progressive Rollout
Reducing the Blast Radius of Failures
Revised Break Glass Procedures
Enhanced Incident Management
Preventing Drift and Regression
Better Customer Communication
Continuous Improvement Culture
Cross-Team Adoption of Best Practices
Real-Time Health Monitoring Integration

1. Safer Configuration Changes with Health-Mediated Deployment

Previously, internal configuration changes could propagate instantly across Cloudflare's entire network, risking global impact if something went wrong. Now, high-risk configuration pipelines are identified and subjected to health-mediated deployment. This means changes are rolled out in small increments while real-time health metrics are monitored. If a metric degrades, the rollout pauses and automatically reverts—often before any customer traffic is affected. This approach, previously reserved for software releases, is now applied to configuration changes as well. For you, this translates to fewer surprise outages and a more stable experience, even when our teams are actively improving the network.
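
To make the idea concrete, here is a minimal sketch of a health-mediated rollout loop. It is an illustration only, not Cloudflare's actual implementation: the function `staged_rollout`, its callbacks, and the stage fractions are all hypothetical names and values chosen for this example.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[float],
    apply: Callable[[float], None],
    healthy: Callable[[], bool],
    revert: Callable[[], None],
) -> bool:
    """Apply a change stage by stage; revert and stop if health degrades.

    `stages` lists the cumulative fraction of the fleet reached at each step.
    Returns True if the change reached every stage, False if it was reverted.
    """
    for fraction in stages:
        apply(fraction)
        if not healthy():
            revert()
            return False
    return True


# Simulated fleet (assumption): health degrades once 25% of it has the change.
state = {"fraction": 0.0}


def apply(fraction: float) -> None:
    state["fraction"] = fraction


def healthy() -> bool:
    return state["fraction"] < 0.25


def revert() -> None:
    state["fraction"] = 0.0


completed = staged_rollout([0.01, 0.05, 0.25, 1.0], apply, healthy, revert)
print(completed, state["fraction"])  # prints: False 0.0
```

In this simulation the change never reaches the final stage: the health check fails at the third increment and the fleet is returned to its previous state.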

Source: blog.cloudflare.com

2. Snapstone: Unified Progressive Rollout for Config Changes

A major technical achievement is the new internal component Snapstone. This system bundles any unit of configuration (a data file, a control flag, or a policy update) into a package that can be released gradually with automatic health checks. Before Snapstone, each team had to build its own progressive rollout mechanism, which led to inconsistency. Now, Snapstone provides a unified platform that makes health-mediated deployment the default for all configuration changes. Because it is not tied to any one type of configuration, it can be extended as new failure modes emerge. For Cloudflare customers, this is a structural safeguard: errors are far more likely to be caught early and fixed fast.
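
Snapstone's internals are not described here, so the following is only a guess at the general shape of such a package. The class `ConfigPackage`, its fields, the default stages, and the example names are all hypothetical. The point it illustrates is that very different kinds of configuration can share one rollout-ready wrapper.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class ConfigPackage:
    """Hypothetical wrapper that gives every config unit the same rollout shape."""

    kind: str        # e.g. "data_file", "control_flag", "policy"
    name: str
    payload: Any     # must be JSON-serializable in this sketch
    stages: Tuple[float, ...] = (0.01, 0.10, 0.50, 1.0)  # cumulative fleet share

    def version(self) -> str:
        """Content-derived ID, so identical packages get identical versions."""
        body = json.dumps(
            {"kind": self.kind, "name": self.name, "payload": self.payload},
            sort_keys=True,
        )
        return hashlib.sha256(body.encode("utf-8")).hexdigest()[:12]


# Three unrelated kinds of configuration, packaged the same way.
packages = [
    ConfigPackage("data_file", "bot-signatures", {"entries": 1200}),
    ConfigPackage("control_flag", "new-parser", True),
    ConfigPackage("policy", "rate-limit", {"rps": 100}),
]
for pkg in packages:
    print(pkg.kind, pkg.name, pkg.version(), pkg.stages)
```

A single rollout engine could then consume any `ConfigPackage` without knowing what the payload means, which is one plausible way to avoid per-team rollout mechanisms.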

3. Reducing the Blast Radius of Failures

Even with safer rollouts, failures can still occur. The "Fail Small" initiative introduced architectural changes to limit the blast radius when something does break. By decomposing monolithic services into smaller, isolated components, and by implementing circuit breakers and rate limits, we ensure that a single misconfiguration or software bug affects only a tiny fraction of traffic. For your applications, this means that even in the worst case, only a minimal subset of requests might see an error, rather than a full global outage. We've also added automated failover mechanisms that reroute traffic around failing components within seconds.
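
A circuit breaker is a well-known pattern, and a minimal version can be sketched as follows. This is a generic illustration, not Cloudflare's code: the class names, the failure threshold, and the cool-down period are assumptions made for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when calls are being rejected to isolate a failing dependency."""


class CircuitBreaker:
    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency is isolated")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single further failure will reopen the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

Callers wrap each request to a dependency in `breaker.call(...)`. Once the dependency fails repeatedly, later calls fail fast with `CircuitOpenError` instead of piling onto the broken component, which keeps the failure contained.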

4. Revised Break Glass Procedures for Emergency Access

Sometimes engineers need emergency access to bypass normal safeguards, a so-called "break glass" scenario. We've revised these procedures to add more stringent controls: every break-glass action now requires multiple approvals, is logged keystroke by keystroke, and automatically triggers a post-mortem review. The goal is to preserve the ability to act quickly in genuine emergencies while reducing the risk of accidental misconfiguration. For you, this means that even during a crisis, our teams follow safe practices that minimize collateral damage.
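
The controls described above (multiple approvals, full logging, and an automatic post-mortem) can be sketched as a small state object. Everything here is hypothetical, including the class `BreakGlassRequest`, the default of two approvals, and the rule that requesters cannot approve their own request, which is an added assumption rather than something stated in this article.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Set


@dataclass
class BreakGlassRequest:
    requester: str
    reason: str
    required_approvals: int = 2  # assumption: "multiple" modeled as two
    approvers: Set[str] = field(default_factory=set)
    audit_log: List[str] = field(default_factory=list)
    postmortem_required: bool = False

    def _log(self, event: str) -> None:
        stamp = datetime.now(timezone.utc).isoformat()
        self.audit_log.append(f"{stamp} {event}")

    def approve(self, approver: str) -> None:
        # Assumption: self-approval is not allowed.
        if approver == self.requester:
            raise PermissionError("requesters cannot approve their own request")
        self.approvers.add(approver)
        self._log(f"approved by {approver}")

    def execute(self, action: str) -> None:
        if len(self.approvers) < self.required_approvals:
            self._log(f"denied: {action}")
            raise PermissionError("not enough approvals")
        self._log(f"executed: {action}")
        self.postmortem_required = True  # every use triggers a review


request = BreakGlassRequest("alice", "restore routing during incident")
request.approve("bob")
request.approve("carol")
request.execute("disable safety check on config push")
print(len(request.audit_log), request.postmortem_required)  # prints: 3 True
```

Denied attempts are logged as well, so the audit trail covers actions that were blocked and not only those that went through.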

5. Enhanced Incident Management Protocols

Our incident management process has been overhauled to be more proactive and data-driven. New playbooks specify clear roles for incident commanders, scribes, and subject-matter experts. We've integrated real-time dashboards that show the health of every critical subsystem, and we conduct regular drills to keep teams sharp. This means when an incident does occur, we can diagnose and mitigate it faster than ever before. For your services, that translates to shorter downtimes and more precise internal communication about root causes.

6. Preventing Drift and Regression Over Time

Resilience isn't a one-time fix; it has to be maintained. To prevent "drift" (where temporary workarounds become permanent) and regression (where old bugs reappear), we've introduced automated configuration audits and CI/CD pipeline checks. Every change must pass security and reliability gates before reaching production. Additionally, we've implemented a new tool that compares current configurations against known-good baselines and alerts teams to any unauthorized changes. This helps the improvements from "Fail Small" remain effective years into the future.
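
Comparing a live configuration against a known-good baseline can be as simple as a keyed diff. The sketch below assumes configurations are flat key-value mappings; the function name `find_drift` and the example keys are hypothetical, and the real tool is not described in this article.

```python
from typing import Any, List, Mapping


def find_drift(baseline: Mapping[str, Any], current: Mapping[str, Any]) -> List[str]:
    """Return one human-readable finding per key that differs from the baseline."""
    findings: List[str] = []
    for key in sorted(set(baseline) | set(current)):
        if key not in current:
            findings.append(f"missing: {key}")
        elif key not in baseline:
            findings.append(f"unexpected: {key}")
        elif baseline[key] != current[key]:
            findings.append(
                f"changed: {key} ({baseline[key]!r} -> {current[key]!r})"
            )
    return findings


baseline = {"timeout_s": 30, "retries": 2, "strict_mode": True}
current = {"timeout_s": 30, "retries": 5, "debug_bypass": True}

for finding in find_drift(baseline, current):
    print(finding)
# prints:
# unexpected: debug_bypass
# changed: retries (2 -> 5)
# missing: strict_mode
```

An empty result means the live configuration matches the baseline; anything else is a candidate for an alert.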

7. Better Customer Communication During Outages

Transparency is critical. We've redesigned our customer communication templates to provide earlier, more frequent, and more useful updates during incidents. Status pages now include timestamps, expected impact, and links to detailed post-mortems. Customers can also subscribe to granular alerts for specific services or regions. The goal is to make you feel informed, not ignored, during even minor disruptions. We've also trained our support teams to deliver consistent messaging across all channels.

8. Continuous Improvement Beyond Code Orange

The "Fail Small" project is not a finish line; it's a mindset. Cloudflare has embedded resilience reviews into the product development lifecycle, and every new feature now requires a documented failure-mode analysis before launch. We've also established a dedicated resilience engineering team that continually audits the network for weak points. This commitment means that even after the project's official completion, you will continue to see incremental improvements in reliability as we apply lessons learned to future changes.

9. Cross-Team Adoption of Best Practices

One of the key outcomes of "Fail Small" is that best practices once used only by certain teams—like progressive rollouts and automated rollbacks—are now mandated across all product teams. This includes the teams responsible for the November and December outages. Standardized tooling, shared playbooks, and regular cross-team reviews ensure that no team operates in isolation. For you, this means that every part of the Cloudflare network benefits from the same rigorous safety standards, reducing the likelihood of a single team's mistake causing a global outage.

10. Real-Time Health Monitoring Integration

Central to all the above improvements is a new layer of real-time health monitoring that feeds into Snapstone and incident management. This system tracks hundreds of metrics—latency, error rates, CPU usage, and more—and automatically correlates changes with configuration updates. If a metric crosses a threshold, an alert fires within seconds, and a rollback can begin automatically. This integration turns our network into a self-healing system. For you, it means that even if a misconfiguration slips through, the impact is minimal and short-lived.
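
One simple way to correlate a metric breach with configuration updates is to look for the most recent change inside a time window before the breach. The sketch below illustrates that idea only: the types, the function `suspect_change`, the five-minute window, and the threshold values are assumptions, not details of Cloudflare's monitoring system.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence


@dataclass(frozen=True)
class ConfigChange:
    change_id: str
    applied_at: float  # seconds since some shared epoch


@dataclass(frozen=True)
class Sample:
    metric: str
    value: float
    at: float  # same time base as ConfigChange.applied_at


def suspect_change(
    sample: Sample,
    thresholds: Mapping[str, float],
    changes: Sequence[ConfigChange],
    window: float = 300.0,
) -> Optional[ConfigChange]:
    """Return the latest change applied within `window` seconds before a breach.

    Returns None if the sample is within its threshold, the metric has no
    threshold, or no change falls inside the window.
    """
    limit = thresholds.get(sample.metric)
    if limit is None or sample.value <= limit:
        return None
    recent = [c for c in changes if 0 <= sample.at - c.applied_at <= window]
    return max(recent, key=lambda c: c.applied_at, default=None)


thresholds = {"error_rate": 0.01, "p99_latency_ms": 250.0}
changes = [ConfigChange("cfg-101", 100.0), ConfigChange("cfg-102", 900.0)]

breach = Sample("error_rate", 0.05, at=1000.0)
culprit = suspect_change(breach, thresholds, changes)
print(culprit.change_id if culprit else None)  # prints: cfg-102
```

The returned change is only a suspect, not a proven cause, but it gives an automated rollback a concrete target to revert first.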

The completion of "Code Orange: Fail Small" marks a significant milestone in Cloudflare's journey toward more resilient infrastructure. But resilience is never a destination; it is a continuous effort. We've built new systems, revised old processes, and embedded a culture of safety across every team. The result is a stronger, more reliable network that learns from its mistakes and adapts automatically. As we move forward, you can expect fewer outages, faster recoveries, and clearer communication when issues arise. Thank you for trusting us with your traffic; we take that responsibility seriously, and these ten improvements are our promise to keep earning it.