How Cloudflare's 'Code Orange: Fail Small' Made the Network Stronger for You

Cloudflare recently completed an intensive engineering initiative known internally as “Code Orange: Fail Small”. The goal was to build a more resilient, secure, and reliable infrastructure for every customer. By focusing on safer configuration changes, reducing the impact of failures, revising incident management procedures, and preventing long-term drift, Cloudflare tackled the root causes of the global outages on November 18, 2025 and December 5, 2025. Below, we answer key questions about what changed and how it benefits you.

What is ‘Code Orange: Fail Small’ and why did Cloudflare need it?

‘Code Orange: Fail Small’ was an internal project spanning more than two quarters of engineering work. Its purpose was to make Cloudflare’s network more resilient by ensuring that when something goes wrong, the impact remains small and contained. The project was prompted by two global outages in late 2025 — one on November 18 caused by a problematic configuration data file, and another on December 5 related to a control flag in the global configuration system. These incidents highlighted the need for safer, more gradual configuration rollout mechanisms, as well as improved incident communication. The project is now complete, and the new systems are in place to prevent similar widespread failures.

Source: blog.cloudflare.com

How did Cloudflare make configuration changes safer?

Cloudflare introduced health-mediated deployment for all configuration changes that affect customer traffic. Instead of rolling out changes instantly across the entire network, teams now release them progressively, with real-time health monitoring at every step. If the monitoring tools detect a problem — such as increased error rates or latency spikes — the system automatically reverts the change before it can impact your traffic. To make this possible at scale, Cloudflare built a new internal component called Snapstone (see next question). This shift means that high-risk configuration pipelines are now managed with the same caution as software releases, dramatically reducing the chance of a bad configuration affecting all customers at once.
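To make the idea concrete, here is a minimal sketch of a staged rollout with a health gate and automatic revert. It is not Cloudflare's implementation: the stage names, the error-rate threshold, and every function and class name (`Stage`, `progressive_rollout`, `is_healthy`) are assumptions chosen for illustration, and the "metrics" are simulated values.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Stage:
    name: str        # hypothetical stage label, e.g. "canary"
    fraction: float  # share of the network covered once this stage is applied


@dataclass
class RolloutResult:
    completed: bool
    applied: List[str] = field(default_factory=list)
    reverted_at: Optional[str] = None


def progressive_rollout(
    stages: List[Stage],
    apply: Callable[[Stage], None],
    revert: Callable[[Stage], None],
    is_healthy: Callable[[Stage], bool],
) -> RolloutResult:
    """Apply a change stage by stage; undo everything on the first unhealthy stage."""
    applied: List[Stage] = []
    for stage in stages:
        apply(stage)
        applied.append(stage)
        if not is_healthy(stage):
            # Revert in reverse order so the most recent stage is undone first.
            for done in reversed(applied):
                revert(done)
            return RolloutResult(False, [s.name for s in applied], stage.name)
    return RolloutResult(True, [s.name for s in applied])


# Simulated run: the second stage shows an elevated error rate (made-up numbers).
STAGES = [Stage("canary", 0.01), Stage("region", 0.10), Stage("global", 1.0)]
ERROR_RATE = {"canary": 0.001, "region": 0.08, "global": 0.001}
MAX_ERROR_RATE = 0.01  # assumed threshold, not a published value
live: List[str] = []   # stages where the change is currently active

result = progressive_rollout(
    STAGES,
    apply=lambda s: live.append(s.name),
    revert=lambda s: live.remove(s.name),
    is_healthy=lambda s: ERROR_RATE[s.name] <= MAX_ERROR_RATE,
)
print(result.completed, result.reverted_at, live)
```

In this simulation the change never reaches the "global" stage: the health gate trips at "region" and both applied stages are reverted, which is the "fail small" behavior the article describes.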

What is Snapstone and how does it work?

Snapstone is a new system that brings health-mediated deployment to configuration changes. Before Snapstone, applying progressive rollout and auto-rollback to configs was possible but required significant per-team effort and was not consistently used. Snapstone provides a unified platform where any team can define a configuration unit — whether it’s a data file, a control flag, or any other configuration artifact — and then release it gradually. The system monitors health metrics in real time and automatically rolls back if anomalies appear. Snapstone is flexible, allowing teams to update the monitoring criteria as their services evolve. By making it easy to adopt safe deployment practices, Snapstone closes a key gap in Cloudflare’s resilience strategy.
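Snapstone's actual interface is not public, so the following is only a sketch of what a "configuration unit" with updatable monitoring criteria might look like. The class names (`ConfigUnit`, `HealthCriterion`), the metric names, and the thresholds are all hypothetical and are not taken from Snapstone.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HealthCriterion:
    metric: str       # hypothetical metric name
    max_value: float  # rollout is considered unhealthy above this value


@dataclass
class ConfigUnit:
    name: str
    kind: str  # e.g. "data_file" or "control_flag"
    criteria: List[HealthCriterion] = field(default_factory=list)

    def set_criterion(self, metric: str, max_value: float) -> None:
        """Add a criterion, or update the threshold if the metric already exists."""
        for criterion in self.criteria:
            if criterion.metric == metric:
                criterion.max_value = max_value
                return
        self.criteria.append(HealthCriterion(metric, max_value))

    def violations(self, observed: Dict[str, float]) -> List[str]:
        """Return the metrics that breach their threshold.

        A metric that was not reported at all counts as a violation, so that
        missing telemetry blocks a rollout instead of silently passing it.
        """
        failed = []
        for criterion in self.criteria:
            value = observed.get(criterion.metric)
            if value is None or value > criterion.max_value:
                failed.append(criterion.metric)
        return failed


unit = ConfigUnit("example-data-file", "data_file",
                  [HealthCriterion("http_5xx_rate", 0.01)])
unit.set_criterion("p99_latency_ms", 250.0)  # criteria can evolve with the service

print(unit.violations({"http_5xx_rate": 0.002, "p99_latency_ms": 180.0}))  # []
print(unit.violations({"http_5xx_rate": 0.05}))
# ['http_5xx_rate', 'p99_latency_ms']
```

Treating missing metrics as failures is a design choice made for this sketch; a real system might instead wait, retry, or alert a human.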

How does health-mediated deployment benefit customers?

For most customers, the most visible benefit is increased reliability. Configuration changes that used to be applied network-wide in seconds now roll out in stages. If a particular node or region shows signs of trouble, the rollout pauses and the change is reversed, so the problem is contained before it can spread across the network. This means fewer unplanned outages, shorter recovery times when issues do occur, and a more predictable service. Additionally, Cloudflare now has better observability tools that can catch problems early, sometimes before any customer-facing impact occurs. The result is a more stable network for your critical applications.

What other areas did Cloudflare improve besides configuration changes?

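While configuration safety was the biggest focus, the project also addressed the other areas named at the outset: reducing the impact of failures, revising incident management procedures (including the ‘break glass’ procedures used when a change must bypass normal safeguards), and preventing long-term drift.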

How does this work prevent future outages like those in November and December 2025?

The November and December 2025 outages were both caused by configuration changes that went wrong and affected the entire network instantly. With the new health-mediated deployment system powered by Snapstone, those specific change types would no longer be deployed globally in one shot. Instead, they would be rolled out gradually, and the monitoring would detect the early symptoms — such as a spike in errors or latency — and automatically revert the change before it could cause widespread disruption. Additionally, the improved ‘break glass’ procedures and incident management mean that even if a change bypasses normal safeguards, the team can respond faster and more effectively. While no system can guarantee zero outages, ‘Code Orange: Fail Small’ directly addresses the root causes of those two major incidents.

What measures ensure that these improvements remain in place over time?

Cloudflare has introduced automated regression detection and drift prevention measures. Configuration pipelines are now subject to ongoing health checks, and any deviation from expected stability thresholds triggers alerts. Teams are also required to apply the same health-mediated deployment principles to all new configuration workflows. Regular audits and stress tests validate that the Snapstone system and other safeguards are still operating correctly. Because security and reliability are never “job done,” Cloudflare has embedded these practices into its development lifecycle, so every new feature or configuration change automatically inherits the protections. This ensures that the network stays strong as it grows and evolves.
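One way to picture drift prevention is as a recurring audit that flags any configuration pipeline missing the required safeguards. The sketch below assumes a simple inventory of pipelines; the record fields, pipeline names, and `find_drift` function are invented for illustration and do not describe Cloudflare's tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PipelineRecord:
    name: str             # hypothetical pipeline name
    staged_rollout: bool  # does the pipeline release changes progressively?
    auto_rollback: bool   # does it revert automatically on bad health signals?


def find_drift(pipelines: List[PipelineRecord]) -> List[str]:
    """Return one finding per pipeline that lacks a required safeguard."""
    findings = []
    for pipeline in pipelines:
        missing = []
        if not pipeline.staged_rollout:
            missing.append("staged rollout")
        if not pipeline.auto_rollback:
            missing.append("auto-rollback")
        if missing:
            findings.append(f"{pipeline.name}: missing {', '.join(missing)}")
    return findings


inventory = [
    PipelineRecord("example-flags", staged_rollout=True, auto_rollback=True),
    PipelineRecord("example-new-workflow", staged_rollout=True, auto_rollback=False),
    PipelineRecord("example-legacy-files", staged_rollout=False, auto_rollback=False),
]

for finding in find_drift(inventory):
    print(finding)
# example-new-workflow: missing auto-rollback
# example-legacy-files: missing staged rollout, auto-rollback
```

Run on a schedule, a check like this would surface a new workflow that skipped the safeguards before it could cause an incident.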
