Meta's AI-Driven Efficiency: How Unified Agents Optimize Hyperscale Performance
Meta's Capacity Efficiency Program leverages a unified AI agent platform that automates the detection and resolution of performance issues across its hyperscale infrastructure. By encoding the expertise of senior efficiency engineers into reusable, composable skills, these agents save power and free engineers to focus on innovation rather than manual fixes. The program combines defensive tools like FBDetect with offensive AI-assisted opportunity resolution, recovering hundreds of megawatts of power and compressing investigation times dramatically. Below, we answer key questions about how this system works and its impact.
What is Meta's Capacity Efficiency Program?
Meta's Capacity Efficiency Program is a strategic initiative designed to optimize performance across the company's massive infrastructure, which serves over 3 billion people. Given that even a 0.1% performance regression can lead to significant additional power consumption at this scale, the program focuses on both offensive and defensive approaches. Offensively, it proactively searches for code changes that can make existing systems more efficient. Defensively, it monitors production to quickly detect regressions, identify the root cause down to a specific pull request, and deploy fixes. The program traditionally relied on human engineers to resolve issues, but as scale grew, it introduced a unified AI agent platform that automates much of this work. These agents encode domain expertise into reusable skills, enabling faster resolution and scaling efficiency without proportionally increasing headcount.

How do AI agents help find and fix performance issues?
The AI agent platform at Meta encodes the domain expertise of senior efficiency engineers into standardized, reusable skills. These agents operate within a unified tool interface, allowing them to automatically investigate and resolve performance issues across the infrastructure. For defense, agents work with FBDetect to catch thousands of regressions weekly and automate mitigation, reducing wasted power. For offense, they assist in identifying optimization opportunities and creating ready-to-review pull requests. By compressing hours of manual investigation into just 30 minutes, AI agents dramatically speed up the process. This not only recovers hundreds of megawatts of power—enough to power hundreds of thousands of homes—but also frees engineers from repetitive tasks so they can innovate on new products. The platform is designed to be composable, meaning skills can be combined and reused across different scenarios, making the system scalable and adaptable.
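The composability described above can be pictured as a registry of small, chainable investigation steps. The sketch below is a minimal illustration of that idea; names like `Skill`, `SkillRegistry`, and the example skills are hypothetical and do not reflect Meta's actual platform API.

```python
"""Minimal sketch of a composable-skill registry (illustrative, not Meta's API)."""
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Skill:
    name: str
    run: Callable[[dict], dict]  # takes the findings so far, returns updated findings

class SkillRegistry:
    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def register(self, skill: Skill) -> None:
        self._skills[skill.name] = skill

    def compose(self, names: List[str]) -> Callable[[dict], dict]:
        """Chain skills so one step's output feeds the next, in order."""
        def pipeline(context: dict) -> dict:
            for name in names:
                context = self._skills[name].run(context)
            return context
        return pipeline

registry = SkillRegistry()
# Hypothetical skills: gather metrics, then attribute the change to a PR.
registry.register(Skill("collect_metrics",
                        lambda ctx: {**ctx, "cpu_delta_pct": 0.4}))
registry.register(Skill("root_cause",
                        lambda ctx: {**ctx, "suspect_pr": "PR-1234"
                                     if ctx["cpu_delta_pct"] > 0.1 else None}))

investigate = registry.compose(["collect_metrics", "root_cause"])
result = investigate({"service": "example-service"})
```

Because each skill only reads and extends a shared context dict, the same `root_cause` step can be reused in any pipeline that supplies the metrics it expects, which is the property that lets skills be recombined across scenarios.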
What are the offensive and defensive approaches to efficiency?
Efficiency at hyperscale requires a two-pronged strategy: offense and defense. On the offensive side, Meta proactively searches for opportunities to make existing systems more efficient through code changes. AI agents assist by analyzing vast amounts of data to identify potential optimizations, then automatically generate pull requests for review. On the defensive side, the focus is on catching regressions that slip into production. FBDetect, Meta's in-house regression detection tool, monitors resource usage and alerts teams to unexpected changes. Once a regression is found, AI agents root-cause it to a specific pull request and deploy mitigations, often automatically. This dual approach ensures that efficiency gains are continuously pursued while losses from regressions are minimized. Together, they form a self-sustaining cycle where AI handles the long tail of issues, allowing the program to grow capacity without proportionally expanding the team.
How does FBDetect work for defense?
FBDetect is Meta's in-house regression detection tool that plays a critical role in the defensive side of the Capacity Efficiency Program. It continuously monitors resource usage across the hyperscale infrastructure, looking for anomalies that indicate a performance regression. When a regression is detected—such as a sudden increase in CPU or memory consumption—FBDetect automatically flags it and works in tandem with AI agents. These agents quickly analyze the data to root-cause the issue down to a specific pull request, using encoded domain expertise. The system then deploys mitigations automatically, often within minutes, without requiring human intervention. This process catches thousands of regressions weekly. Without AI automation, each regression could take hours of manual investigation, leading to compounding wasted megawatts across the fleet. By accelerating detection and resolution, FBDetect and the AI platform help maintain efficiency at scale.
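The core of any such detector is a statistical comparison of current resource usage against a recent baseline. The sketch below shows one simple way to do that; the threshold, the statistics used, and the sample data are assumptions for illustration, not FBDetect's actual algorithm.

```python
"""Illustrative regression check in the spirit of FBDetect: flag a service when
its current resource usage rises well above a recent baseline. The 3-sigma
threshold here is an assumed heuristic, not FBDetect's real method."""
from statistics import mean, stdev

def detect_regression(baseline: list, current: list, sigma: float = 3.0) -> bool:
    """Return True if the current mean usage exceeds the baseline mean
    by more than `sigma` baseline standard deviations."""
    mu, sd = mean(baseline), stdev(baseline)
    return mean(current) > mu + sigma * sd

# Hypothetical percent-CPU samples before and after a release.
baseline_cpu = [50.1, 49.8, 50.3, 50.0, 49.9, 50.2]
current_cpu = [53.5, 53.9, 53.7]

flagged = detect_regression(baseline_cpu, current_cpu)
```

In a real fleet the comparison would run per service and per resource (CPU, memory, I/O), and a flagged result would hand the context off to a root-causing agent rather than a human.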
How does AI-assisted opportunity resolution work for offense?
On the offensive side, AI-assisted opportunity resolution involves using the unified agent platform to proactively find ways to make Meta's systems more efficient. The agents analyze performance data, code patterns, and resource usage to identify potential optimizations—such as algorithm improvements, better data structures, or configuration tweaks. They then encode these findings into a reusable skill and generate a ready-to-review pull request. This process expands to more product areas every half (six-month period), handling a growing volume of wins that human engineers would never have time to address manually. By automating the entire path from opportunity identification to pull request, the AI platform enables the Capacity Efficiency Program to deliver more megawatts of savings. The goal is to create a self-sustaining engine where AI handles the long tail of optimization tasks, freeing engineers to focus on higher-level innovation.
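One way to picture the path from detected pattern to ready-to-review change is a scanner that recognizes a known inefficiency and emits a draft fix. The example below is a deliberately tiny, hypothetical sketch: the pattern (list-literal membership tests, which are O(n), versus set literals, which are O(1)), the `PullRequestDraft` structure, and the function names are all illustrative assumptions.

```python
"""Hypothetical offense-side opportunity scanner: spot a known inefficient
pattern and emit a ready-to-review change suggestion."""
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class PullRequestDraft:
    title: str
    before: str
    after: str

# Known micro-optimization: `x in [a, b, c]` scans a list (O(n)),
# while `x in {a, b, c}` hashes into a set (O(1)).
PATTERN = re.compile(r"in \[(.+)\]")

def scan_for_opportunity(source_line: str) -> Optional[PullRequestDraft]:
    match = PATTERN.search(source_line)
    if match is None:
        return None
    rewritten = PATTERN.sub(r"in {\1}", source_line)
    return PullRequestDraft(
        title="Use set literal for O(1) membership test",
        before=source_line,
        after=rewritten,
    )

draft = scan_for_opportunity("if user_id in [1, 2, 3, 4]:")
```

A production system would of course work on parsed code rather than regexes and attach profiling evidence to the draft, but the shape is the same: detect, rewrite, package for human review.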

What impact has this program had on power savings?
The AI-driven Capacity Efficiency Program has recovered hundreds of megawatts of power—enough to power hundreds of thousands of American homes for a year. This is achieved through a combination of defensive regression mitigation and offensive optimization. On the defensive side, faster resolution of regressions means fewer wasted megawatts compound across the fleet. On the offensive side, AI-assisted opportunity delivery expands each half, handling more wins without additional headcount. The program's ability to scale megawatt delivery without proportionally growing the team is a key metric of its success. As the platform continues to evolve, the expectation is that these savings will grow further, making Meta's infrastructure increasingly efficient and sustainable.
How much time does AI save compared to manual investigation?
AI agents compress what would be approximately 10 hours of manual investigation into about 30 minutes—a 20x improvement. This dramatic time savings comes from encoding the domain expertise of senior engineers into reusable, composable skills that agents can apply instantly. For example, when FBDetect flags a regression, the AI agent automatically analyzes logs, traces, and metrics to pinpoint the root cause and suggest a mitigation, whereas a human engineer would need to manually sift through data and run experiments. Similarly, for offensive opportunities, the AI can scan thousands of code paths and generate pull requests in minutes. This frees engineers to focus on creative problem-solving and new product development, rather than repetitive debugging. The result is a more efficient use of human talent and faster overall performance improvements across Meta's infrastructure.
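The arithmetic behind the 20x figure, and a rough sense of what it implies fleet-wide, can be checked in a few lines. The weekly regression count below is an assumption (the article says only "thousands"), so the projection is a back-of-the-envelope illustration, not a reported number.

```python
"""Back-of-the-envelope check of the 20x speedup claim, with a hypothetical
fleet-wide projection. The weekly regression count is an assumed stand-in
for the article's "thousands"."""

manual_hours_per_issue = 10.0   # ~10 hours of manual investigation
agent_hours_per_issue = 0.5     # ~30 minutes with an AI agent

speedup = manual_hours_per_issue / agent_hours_per_issue  # 20x

weekly_regressions = 2_000      # assumption: "thousands" per week
engineer_hours_saved_per_week = weekly_regressions * (
    manual_hours_per_issue - agent_hours_per_issue)
```

Even at an assumed 2,000 regressions per week, the difference between 10 hours and 30 minutes per issue amounts to thousands of engineer-hours weekly, which is why the program can scale without proportional headcount growth.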
What is the future goal for this AI system?
The end goal for Meta's AI agent platform is to create a self-sustaining efficiency engine where AI handles the long tail of performance issues. This means that the system would automatically detect, investigate, and resolve the majority of regressions and optimization opportunities without human intervention. Engineers would only step in for complex or novel edge cases. The platform is designed to scale across more product areas, adapting encoded expertise to new contexts. As the AI learns from each interaction, it becomes more efficient, reducing the need for manual oversight. Ultimately, Meta aims to decouple capacity growth from headcount growth, allowing the infrastructure to expand sustainably while maintaining high performance and low power consumption.