
Strengthening Cloudflare's Network: Inside the Code Orange: Fail Small Initiative

Last updated: 2026-05-04 19:31:40 · Education & Careers

Cloudflare recently completed a major internal engineering project known as "Code Orange: Fail Small." This effort was designed to make the network more resilient, secure, and reliable, addressing the root causes of global outages that occurred in November and December 2025. Below, we answer key questions about the changes, what they mean for your traffic, and how Cloudflare is now better prepared to prevent and recover from failures.

What was the Code Orange: Fail Small project and why was it undertaken?

The project was an intensive engineering initiative over two and a half quarters, focused on improving Cloudflare's infrastructure after two significant global outages on November 18 and December 5, 2025. The goal was to prevent similar incidents by making the network more resilient and secure. It wasn't a one-time fix but a shift in how Cloudflare approaches reliability across its development lifecycle. The project targeted safer configuration changes, reducing failure impact, revising emergency procedures, improving incident management, preventing drift over time, and strengthening customer communication during outages.

Source: blog.cloudflare.com

What key areas did the Code Orange: Fail Small project address?

The project focused on four main pillars: safer configuration changes, reducing the impact of failure, revising "break glass" procedures and incident management, and preventing drift and regressions. Additionally, Cloudflare improved how it communicates with customers during outages. These areas were chosen to directly tackle the weaknesses exposed by the 2025 outages. For example, configuration changes were previously deployed instantly across the network, which could cause widespread issues. Now, changes are rolled out progressively with health monitoring. Emergency access procedures were also updated to ensure rapid, safe intervention when needed.

What is Snapstone and how does it improve configuration changes?

Snapstone is a new internal system that brings health-mediated deployment to configuration changes. Previously, applying this methodology to configuration was possible but required significant per-team effort and wasn't used consistently. Snapstone bundles configuration changes into packages and releases them gradually while monitoring real-time health metrics. If a problem is detected, the system automatically rolls back the change before it can affect customer traffic broadly. Because a package can hold any unit of configuration, whether a data file like the one that caused the November outage or a control flag like the one involved in the December incident, Snapstone covers both kinds of change behind the 2025 outages.
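To make the idea of a package that holds "any unit of configuration" concrete, here is a minimal sketch in Python. Snapstone's real data model is not public, so every name below (`ConfigUnit`, `ConfigPackage`, the `kind` labels, the version scheme) is a hypothetical stand-in chosen for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ConfigUnit:
    """One unit of configuration. All field names here are hypothetical."""
    name: str
    kind: str       # illustrative labels such as "data_file" or "flag"
    payload: bytes  # raw content; the package does not care what it means


@dataclass
class ConfigPackage:
    """A bundle of units that would be released and rolled back together."""
    units: list[ConfigUnit] = field(default_factory=list)

    def add(self, unit: ConfigUnit) -> None:
        self.units.append(unit)

    def version(self) -> str:
        """Content hash of the bundle, independent of insertion order."""
        digest = hashlib.sha256()
        for unit in sorted(self.units, key=lambda u: u.name):
            # Include the payload length so field boundaries stay unambiguous.
            header = f"{unit.name}\0{unit.kind}\0{len(unit.payload)}\0"
            digest.update(header.encode())
            digest.update(unit.payload)
        return digest.hexdigest()[:12]
```

Treating a data file and a flag as the same kind of object is the point of the sketch: whatever rolls the package out only needs to handle one type, regardless of what the configuration contains.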

How does health-mediated deployment work and why is it important?

Health-mediated deployment is a methodology that ensures configuration changes are not applied instantly across the entire network. Instead, they are rolled out incrementally to a subset of servers or regions. During this rollout, observability tools continuously monitor system health and traffic metrics. If anything deviates from expected behavior, the deployment is halted and automatically reverted. This approach, already used for software releases, is now applied to configuration changes as well. It confines a bad change to the small part of the network that received it first, rather than letting it cause a widespread outage like those seen in 2025. Snapstone makes this process easy and consistent across all relevant teams.
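The rollout loop described above can be sketched in a few lines of Python. This is an illustration of the general technique under simple assumptions, not Cloudflare's implementation: the function name `staged_rollout`, the callback parameters, and the server names are all hypothetical.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[Sequence[str]],
    apply: Callable[[str], None],
    revert: Callable[[str], None],
    healthy: Callable[[Sequence[str]], bool],
) -> bool:
    """Apply a change one stage at a time; revert everything on bad health.

    Returns True if every stage completed, False if the change was rolled back.
    """
    applied: list[str] = []
    for stage in stages:
        for server in stage:
            apply(server)
            applied.append(server)
        # Health gate: later stages are only reached if this check passes.
        if not healthy(applied):
            for server in reversed(applied):
                revert(server)
            return False
    return True


if __name__ == "__main__":
    touched: list[str] = []
    ok = staged_rollout(
        stages=[["canary-1"], ["a-1", "a-2"], ["b-1", "b-2", "b-3"]],
        apply=touched.append,
        revert=lambda server: print("reverting", server),
        healthy=lambda applied: len(applied) < 3,  # simulated failure in stage 2
    )
    print(ok, touched)
```

In the demo, the simulated health check fails during the second stage, so the three servers in the last stage never receive the change. A production system would also wait for metrics to settle between stages, which this sketch omits.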

What measures prevent drift and regressions over time?

To ensure the improvements last, Cloudflare introduced processes that detect and prevent configuration drift and regressions. These include automated checks that compare current configurations against known-safe baselines and alert teams when anomalies appear. Regular audits and testing cycles are now built into the development pipeline. Additionally, the same health-mediated deployment principles apply to all changes, so any future modification goes through the same validation. The result is a system that catches regressions on its own, with far less reliance on manual oversight.
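A baseline comparison of this kind can be sketched as a simple dictionary diff. The article does not describe how Cloudflare's checks are built, so the function `find_drift`, the `ABSENT` sentinel, and the example keys below are assumptions made for illustration.

```python
from typing import Any, Mapping

# Sentinel marking a key that exists on one side only (hypothetical helper).
ABSENT = object()


def find_drift(
    baseline: Mapping[str, Any], current: Mapping[str, Any]
) -> dict[str, tuple[Any, Any]]:
    """Return {key: (expected, actual)} for every key that differs."""
    drift: dict[str, tuple[Any, Any]] = {}
    for key in sorted(baseline.keys() | current.keys()):
        expected = baseline.get(key, ABSENT)
        actual = current.get(key, ABSENT)
        if expected != actual:
            drift[key] = (expected, actual)
    return drift


if __name__ == "__main__":
    baseline = {"rollout_mode": "staged", "max_batch": 5}
    current = {"rollout_mode": "instant", "max_batch": 5}
    for key, (expected, actual) in find_drift(baseline, current).items():
        print(f"drift in {key}: expected {expected!r}, found {actual!r}")
```

An automated check would run a comparison like this on a schedule and raise an alert whenever the result is non-empty, rather than printing it.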

How did Cloudflare improve communication during outages?

Cloudflare revised its incident management and customer communication protocols. The new procedures ensure faster, more transparent updates during any network disruption. Communication is now structured to provide clear, actionable information about the ongoing issue, expected resolution time, and steps being taken. Internal roles and responsibilities for communication have been clarified, and tools have been upgraded to deliver updates more reliably. This builds trust and helps customers understand what to expect during an incident.

How does this work make Cloudflare's network stronger for customers?

For most customers, the immediate impact is invisible but significant: internal configuration changes no longer reach the whole network instantly. Instead, they are deployed progressively with real-time health monitoring, so problems are caught before they affect traffic at scale. This reduces the risk of outages caused by human error or misconfiguration. The improved resilience means fewer disruptions, faster recovery when issues do occur, and better communication throughout. Ultimately, Cloudflare's network is now more reliable and secure, giving customers confidence that their traffic is protected by a system designed to fail small and recover quickly.