2026-05-02
AI & Machine Learning

Docker’s AI Agent Fleet: How We Built a Virtual Team to Ship Faster

Docker's Coding Agent Sandboxes team built a Fleet of seven AI agent roles that test, triage, fix bugs, and write release notes using Claude Code skills — all running locally and in CI.

Docker's Coding Agent Sandboxes team has rebuilt its development process around a virtual team of AI agents. This Fleet of seven specialized roles autonomously tests, triages, fixes bugs, and writes release notes, running both in CI and on developers' laptops. The questions and answers below explain how the approach works.

What is the Fleet at Docker’s Coding Agent Sandboxes team?

The Fleet is a collection of seven AI agent roles built to automate the testing, triaging, and maintenance of the sbx (Coding Agent Sandboxes) CLI tool. Each agent has a distinct persona — like a build engineer or tester — and uses Claude Code skills (markdown files) to understand its responsibilities and make decisions. The Fleet runs both locally for rapid iteration and in CI across macOS, Linux, and Windows. It handles tasks like exploratory testing, issue triage, release note generation, and even bug fixing. The goal is to accelerate shipping by offloading repetitive but judgment-based tasks to agents that can adapt to failures and unexpected conditions.

Source: www.docker.com

How do Claude Code skills differ from traditional scripts?

Traditional scripts are procedural: they run step-by-step and stop if an error occurs. Claude Code skills, on the other hand, are role descriptions. Think of them as a markdown file that says “You are the build engineer. Here’s what you know, your tools, and how you make decisions.” The agent uses this persona to reason about what to do. For example, if a test fails unexpectedly, a script halts, but an agent with a skill investigates the failure, looks for root causes, and decides how to proceed. This distinction is crucial because agents need judgment, not just instructions. The same skill file works identically on a laptop and in CI — no separate versions needed.
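As a concrete illustration, a skill of this kind might look like the following sketch. The file path, persona wording, and tool list here are assumptions for illustration, not Docker's actual skill file; Claude Code skills are markdown files with a short frontmatter block followed by free-form role guidance.

```markdown
<!-- .claude/skills/build-engineer/SKILL.md (hypothetical path and contents) -->
---
name: build-engineer
description: Builds the sbx CLI and investigates build failures
---

You are the build engineer for the sbx CLI.

## What you know
- Binaries are produced by the standard project toolchain.
- A clean error message is not always the root cause; builds can fail
  for environmental reasons.

## Your tools
- The shell, the repository checkout, and the build logs.

## How you make decisions
- If a build fails, read the log and find the first real error, not the last.
- Retry once if the failure looks environmental; otherwise file a report
  summarizing your evidence.
```

The key point is that nothing here is procedural: the agent reads this as a role description and decides the concrete steps itself.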

What does the “local first, CI second” design principle mean?

Every skill in the Fleet is first developed and tested locally on the developer’s machine. For example, when building the /cli-tester skill (an exploratory tester), the team invoked it in their terminal, watched it build binaries, run CLI commands, and report issues. They tweaked the skill in seconds per iteration, avoiding the slow commit-push-wait-read-logs cycle of CI-only development. Only after the skill worked reliably locally did they wire it into GitHub workflows. This approach makes debugging fast and natural — you see the agent think in real time. CI is then just another runtime that provides the environment, checks out code, and calls the same skill file. No translation, no separate version.
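One way to picture "local first, CI second" is the repository layout it implies: one skill file per role, read unchanged by both a developer's terminal session and the nightly workflows. The directory and file names below are assumptions for illustration:

```text
repo/
├── .claude/
│   └── skills/
│       ├── cli-tester/SKILL.md      # exploratory tester persona
│       ├── triage/SKILL.md          # issue triage persona
│       └── release-notes/SKILL.md   # release notes persona
└── .github/
    └── workflows/
        └── nightly-fleet.yml        # prepares the OS, checks out code,
                                     # and invokes the same SKILL.md files
```

Because the workflow only points at the same files the developer iterates on locally, there is nothing to translate when a skill graduates from laptop to CI.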

How does the Fleet integrate with CI workflows?

The Fleet’s CI integration is designed to be minimal. Each nightly workflow simply sets up the target OS (macOS, Linux, or Windows), checks out the codebase, and invokes the same skill file used locally. For instance, the /cli-tester skill runs on all three platforms using the exact same markdown definition. There is no separate “CI version” of any skill. The workflow handles environment preparation, but the agent itself executes autonomously — it decides which commands to run, how to interpret results, and whether to file bug reports or create PRs. This simplicity reduces maintenance overhead, because changes to a skill only need to be made in one place.
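A minimal sketch of what such a nightly workflow could look like, assuming GitHub Actions with an OS matrix and Claude Code's non-interactive print mode (`claude -p`) to launch the agent. The workflow name, cron schedule, and prompt wording are illustrative assumptions, not Docker's actual configuration:

```yaml
# .github/workflows/nightly-fleet.yml (hypothetical)
name: nightly-fleet
on:
  schedule:
    - cron: "0 3 * * *"   # run nightly

jobs:
  cli-tester:
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      # The workflow only prepares the environment; the agent decides
      # which commands to run and how to interpret the results.
      - name: Run the exploratory tester
        run: claude -p "Use the cli-tester skill to build and exercise the sbx CLI, then report any issues."
```

The matrix gives the same skill three platform runs for free, which is exactly why keeping a single skill definition pays off.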


What types of tasks do the agent roles handle?

The seven agent roles cover a range of essential DevOps and quality tasks:

  • Exploratory testing — The /cli-tester skill builds binaries and exercises CLI commands across platforms, catching regressions and edge cases.
  • Release notes generation — An agent compiles changes and publishes release notes automatically after each deployment.
  • Issue triage — Agents scan the backlog, categorize bugs, prioritize them, and assign labels.
  • Bug fixing — Some agents autonomously create pull requests for trivial fixes (e.g., typos, documentation updates).
  • Sustained load testing — An agent runs long-duration tests to detect resource leaks and performance degradation.

All tasks are driven by the same skill-based philosophy: the agents use judgment, not rigid scripts, so they can adapt to unexpected failures.

Why did the team choose a role-based approach over automated scripts?

The team needed agents that could handle unexpected situations. Traditional scripts fail when an error occurs. A role-based approach, built on Claude Code skills, gives agents the ability to investigate and decide. For example, if a test fails due to a flaky network, a script would stop and wait for human intervention. A role-based agent might retry, check logs, or escalate differently. Moreover, roles make the system transparent: developers can read the skill file to understand what an agent will do. This also enables the “local first” workflow, because the same skill runs identically everywhere. The result is a fleet that is resilient, maintainable, and ships faster by offloading repetitive but nuanced tasks.
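The difference shows up in how failure handling is expressed: a script encodes one fixed path, while a skill encodes guidance the agent applies with judgment. A hypothetical excerpt from such a skill file (wording assumed for illustration) might read:

```markdown
## Handling failures
- Before reporting a failure, reproduce it once. If it does not reproduce,
  treat it as flaky: note the flake, retry, and continue testing.
- If it reproduces, inspect the logs and recent commits for a likely root cause.
- File a bug with your evidence. Only open a fix PR for trivial, low-risk
  changes such as typos or documentation updates.
```

A human reviewer can audit this file the same way they would review a teammate's runbook, which is what makes the role-based system transparent as well as resilient.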