Uncovering a Hidden ClickHouse Bottleneck That Slowed Cloudflare's Billing Pipeline
Cloudflare's daily billing pipeline, which processes hundreds of millions of dollars in usage revenue, suddenly slowed after a migration. The usual suspects—I/O, memory, rows scanned—showed no issues. This article explains the discovery of a hidden bottleneck deep within ClickHouse's internals and the three patches that resolved it. Below, we answer key questions about the incident, the systems involved, and the solutions applied.
What caused the sudden slowdown in Cloudflare's billing pipeline?
The slowdown occurred after a migration of the Ready-Analytics system, which powers daily aggregation jobs in ClickHouse. These jobs are critical for generating accurate invoices for Cloudflare's customers. When they slowed, invoices became difficult to reconcile, posing major downstream risks to revenue, fraud detection, and other systems. The team initially checked standard performance metrics like I/O, memory usage, and rows scanned, but all appeared normal. The bottleneck turned out to sit deeper, inside ClickHouse's internal query execution: the way the database handled indexID-based filtering over the composite primary key, a cost that only became visible under heavy concurrent load.

How does Cloudflare use ClickHouse for billing and analytics?
Cloudflare is a heavy user of ClickHouse, an open-source OLAP database, and stores over a hundred petabytes of data across dozens of clusters. For billing, millions of calls are made daily to ClickHouse to calculate usage-based charges for products like CDN, DNS, and security services. The system also feeds fraud detection and other critical pipelines. Data is organized using a custom system called "Ready-Analytics," which streams records into a single massive table. Key fields include a namespace (identifying the owning team), an indexID (which controls sort order within the namespace), and a timestamp. The primary key is (namespace, indexID, timestamp), so each namespace's rows are stored together in sorted order and queries scoped to one namespace can skip the rest of the table.
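To make the effect of that key order concrete, here is a minimal Python sketch rather than anything taken from Cloudflare's code. It assumes rows are kept sorted by (namespace, indexID, timestamp); the sample rows and the `key_range` helper are hypothetical. Because the data is sorted, all rows for one (namespace, indexID) pair form a contiguous slice that a binary search can locate.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows, stored sorted by the primary key (namespace, indexID, timestamp).
rows = sorted([
    ("billing", "zone-1", 100),
    ("billing", "zone-1", 160),
    ("billing", "zone-2", 120),
    ("dns", "zone-1", 90),
    ("dns", "zone-9", 130),
    ("fraud", "acct-7", 110),
])

def key_range(rows, namespace, index_id):
    """Return the contiguous slice of rows matching a (namespace, indexID) prefix."""
    # A 2-tuple prefix sorts before every 3-tuple that starts with it.
    lo = bisect_left(rows, (namespace, index_id))
    # Every integer timestamp sorts below float("inf"), so this bounds the prefix from above.
    hi = bisect_right(rows, (namespace, index_id, float("inf")))
    return rows[lo:hi]

matches = key_range(rows, "billing", "zone-1")
```

Here `matches` holds only the two `("billing", "zone-1", ...)` rows, found without examining rows that belong to other namespaces.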
What is the Ready-Analytics system and its core design?
Launched in early 2022, Ready-Analytics simplifies onboarding for internal teams. Instead of designing custom tables, teams stream data into one giant table with a standard schema: 20 float fields, 20 string fields, a timestamp, and an indexID. The indexID is a string that becomes part of the primary key, allowing each namespace to have its data sorted optimally for its queries. This design proved popular—by December 2024, the system held over 2 PiB of data and ingested millions of rows per second. However, it had a critical flaw: a single retention policy that dropped partitions after 31 days, which didn't accommodate teams needing longer or shorter retention periods.
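The standard schema described above can be pictured with a short Python sketch. This is an illustration under stated assumptions, not Cloudflare's actual table definition: the `Record` class, the `make_record` helper, the integer timestamp, and the padding defaults are all hypothetical. Only the field counts and the key order come from the description.

```python
from dataclasses import dataclass, field

NUM_FLOATS = 20   # from the schema description
NUM_STRINGS = 20  # from the schema description

@dataclass
class Record:
    namespace: str
    index_id: str
    timestamp: int  # assumed Unix seconds; the real column type is not stated
    floats: list = field(default_factory=lambda: [0.0] * NUM_FLOATS)
    strings: list = field(default_factory=lambda: [""] * NUM_STRINGS)

    def __post_init__(self):
        # Every record must fill the same fixed set of generic slots.
        if len(self.floats) != NUM_FLOATS or len(self.strings) != NUM_STRINGS:
            raise ValueError("expected exactly 20 float and 20 string fields")

    def sort_key(self):
        """Primary key order: (namespace, indexID, timestamp)."""
        return (self.namespace, self.index_id, self.timestamp)

def make_record(namespace, index_id, timestamp, floats=(), strings=()):
    """Pad a team's values into the fixed generic slots (hypothetical helper)."""
    f = list(floats) + [0.0] * (NUM_FLOATS - len(floats))
    s = list(strings) + [""] * (NUM_STRINGS - len(strings))
    return Record(namespace, index_id, timestamp, f, s)

example = make_record("dns", "zone-1", 100, floats=[1.5], strings=["NXDOMAIN"])
```

The trade-off this shows is that teams give up named, typed columns in exchange for not having to design a table: each team decides for itself what slot 0 or slot 7 means.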
What was the hidden bottleneck in ClickHouse that caused the slowdown?
After extensive investigation, the root cause was traced to ClickHouse’s internal handling of primary key indexes under high concurrency. When multiple queries filtered by different indexID values, the database struggled to efficiently narrow down the relevant parts (data chunks). The bottleneck wasn't in raw I/O or memory but in the granule skipping mechanism—a key part of ClickHouse’s query optimization. Under load, the cost of evaluating which granules to read became disproportionate, especially because the indexID field was a string in a composite primary key. The migration had changed the data distribution pattern, exacerbating this inefficiency.
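Granule skipping is easier to reason about with a toy model. The Python sketch below is a simplified simulation, not ClickHouse's implementation: it assumes a sparse index that stores only the first key of each granule, uses a granule size of 4 for readability (ClickHouse's default `index_granularity` is 8192), and the function names and data are hypothetical. Given a (namespace, indexID) filter, it returns the granules that might contain matching rows.

```python
from bisect import bisect_left, bisect_right

GRANULE_SIZE = 4  # tiny for illustration only

def build_marks(sorted_keys, granule_size=GRANULE_SIZE):
    """Sparse index: keep only the first key of every granule."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granule_size)]

def granules_for_key(marks, namespace, index_id):
    """Indices of granules that may hold rows for (namespace, indexID)."""
    lo = (namespace, index_id)
    hi = (namespace, index_id, float("inf"))
    # The granule just before the first mark >= lo can still end inside the
    # range, so it has to be kept (the selection is conservative).
    first = max(bisect_left(marks, lo) - 1, 0)
    # Granules whose first key is already above hi cannot contain a match.
    last = bisect_right(marks, hi) - 1
    return list(range(first, last + 1))

# 3 namespaces x 4 indexIDs x 4 timestamps = 48 rows, i.e. 12 granules.
keys = sorted(
    (ns, f"id-{i}", ts)
    for ns in ("billing", "dns", "fraud")
    for i in range(4)
    for ts in range(4)
)
marks = build_marks(keys)
selected = granules_for_key(marks, "dns", "id-2")
```

In this example `selected` contains 2 of the 12 granules. The selection itself is cheap here, but it is work done per query and per part, which is the kind of cost that does not show up in I/O or memory metrics.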

How did Cloudflare fix the bottleneck with three patches?
The team developed three targeted patches. The first optimized the granule skipping logic to more efficiently use bloom filters and min-max indexes for string fields like indexID. The second patch improved concurrency management by reducing lock contention in the part selection code path. The third patch adjusted the data sorting strategy within each partition, grouping rows by namespace and indexID more tightly to reduce the number of granules that needed scanning. These changes were applied incrementally, resulting in a dramatic performance recovery—query times returned to normal and the billing pipeline resumed its timely execution.
What were the downstream implications of the slowdown and the fix?
The slowdown had immediate consequences: invoices were delayed, making it harder to reconcile revenue, and fraud detection systems faced added latency. Given that the pipeline powers hundreds of millions of dollars in usage revenue, even a few hours of delay could cascade into billing disputes and operational problems. After the three patches were applied, the system stabilized and the daily aggregation jobs again finished within the required window. The incident also spurred a redesign of the retention system, moving from a one-size-fits-all 31-day policy to a per-namespace approach, which reduced the need for workarounds and improved overall flexibility.
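The difference between the two retention models can be sketched in a few lines of Python. This is an illustration only: the override table, the namespace names, the retention values, and the assumption of daily partitions are all hypothetical. Only the 31-day default comes from the description above.

```python
from datetime import date, timedelta

DEFAULT_RETENTION_DAYS = 31  # the original one-size-fits-all policy

# Hypothetical per-namespace overrides; names and values are illustrative only.
RETENTION_OVERRIDES = {"billing": 90, "debug_logs": 7}

def is_expired(namespace, partition_day, today, overrides=RETENTION_OVERRIDES):
    """True if a (hypothetically daily) partition is past its namespace's retention."""
    days = overrides.get(namespace, DEFAULT_RETENTION_DAYS)
    return partition_day < today - timedelta(days=days)

today = date(2025, 1, 31)
old_day = date(2024, 12, 15)     # 47 days before `today`
recent_day = date(2025, 1, 20)   # 11 days before `today`
```

With these values, `is_expired("billing", old_day, today)` is false because of the longer override, while the same partition day is expired for a namespace that falls back to the 31-day default.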
What lessons did Cloudflare learn from this incident?
This experience highlights that performance bottlenecks can hide in unexpected layers, even when standard metrics appear healthy. The team learned the importance of deep profiling of ClickHouse’s internal query execution, especially for composite primary keys with string fields. They also reinforced the need for granular retention policies to avoid systemic workarounds. Moving forward, Cloudflare plans to invest in more comprehensive monitoring of internal query phases, such as granule skipping efficiency, and to profile changes under realistic concurrency loads. The three patches have been contributed upstream where applicable, benefiting the broader ClickHouse community.
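Profiling under concurrency, as opposed to timing one query at a time, might look like the following Python sketch. It is a generic harness, not Cloudflare's tooling: `profile_concurrent` and `fake_query` are hypothetical names, and the stub query would be replaced by a real database call in practice.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles

def profile_concurrent(run_query, n_queries, concurrency):
    """Run `run_query(i)` for i in range(n_queries) on a thread pool and
    report latency percentiles in seconds. Needs at least 2 queries."""
    def timed(i):
        start = time.perf_counter()
        run_query(i)
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed, range(n_queries)))

    # n=100 yields 99 cut points; index 49 is the median, index 98 is p99.
    cuts = quantiles(latencies, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98], "max": latencies[-1]}

def fake_query(i):
    # Stand-in for a real query; sleeps briefly to simulate work.
    time.sleep(0.001)

report = profile_concurrent(fake_query, n_queries=20, concurrency=4)
```

Comparing the report at `concurrency=1` with the report at production-like concurrency is one way to surface contention that single-query benchmarks miss.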