Build Your Own Evaluation Agent with GitHub Copilot: A Step-by-Step Guide

Introduction

Are you tired of manually sifting through hundreds of thousands of lines of output to analyze how your AI coding agents perform? As an AI researcher at GitHub, I faced this exact challenge when evaluating agents on benchmarks like TerminalBench2 and SWEBench-Pro. The repetitive task of reading trajectories — JSON files that capture every thought and action an agent takes — was a perfect candidate for automation. Using GitHub Copilot, I created a tool called eval-agents that not only automated my own analysis but also enabled my entire team to build custom solutions. In this guide, I'll walk you through the same process so you can create your own agent-driven analysis tool. By the end, you'll have a reusable system that saves hours of manual toil.

Source: github.blog

What You Need

- VS Code with the GitHub Copilot extension (Chat and inline suggestions enabled)
- Python installed locally
- A set of agent trajectory files (JSON) to analyze
- A GitHub account for sharing the finished agent

Step-by-Step Guide

Step 1: Identify the Repetitive Analysis Pattern

The first step is to pinpoint the exact task you want to automate. In my case, I was analyzing agent trajectories to find common failure modes or successful strategies. Each trajectory is a JSON file with hundreds of lines — and I had dozens of such files per benchmark run. The repetitive pattern was: load a trajectory, search for specific actions or errors, and compile statistics. Write down your own repetitive loop. For example:

1. Load a trajectory file.
2. Search for specific actions or errors.
3. Compile statistics across files.

This pattern becomes the foundation for your agent.
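The loop above can be sketched in a few lines of Python. This is a minimal sketch, not the eval-agents implementation: the `steps`, `action`, and `error` field names are assumptions, since real trajectory schemas vary by harness.

```python
import json
from pathlib import Path

def manual_review(trajectory_dir):
    """The repetitive loop, automated: load each trajectory,
    search for errors, and compile statistics."""
    stats = {"files": 0, "errors": 0}
    for path in sorted(Path(trajectory_dir).glob("*.json")):
        trajectory = json.loads(path.read_text())    # 1. load a trajectory
        for step in trajectory.get("steps", []):     # 2. search for errors
            if step.get("error"):                    #    (assumed field name)
                stats["errors"] += 1
        stats["files"] += 1                          # 3. compile statistics
    return stats
```

Once the loop is written down this explicitly, each numbered step becomes a function the agent can own.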

Step 2: Use GitHub Copilot to Explore Data Patterns

Before building a full automation, explore your data with Copilot’s help. In VS Code, open a few trajectory files and use the Copilot Chat or inline suggestions to write quick exploratory scripts. For instance, ask Copilot: “Write a Python script that reads all JSON files in a folder and prints the first action of each trajectory.” This gives you a feel for the data structure and helps you discover patterns. Copilot can also suggest regex patterns for extracting specific information. Save these snippets — they’ll be the building blocks of your agent.
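For that example prompt, Copilot would likely produce something along these lines. Treat it as a hedged sketch: the `steps`/`action` layout is assumed, and your trajectories may nest the first action differently.

```python
import json
from pathlib import Path

def first_actions(trajectory_dir):
    """Read every JSON file in a folder and report the first
    action of each trajectory (assumed 'steps'/'action' fields)."""
    results = {}
    for path in sorted(Path(trajectory_dir).glob("*.json")):
        trajectory = json.loads(path.read_text())
        steps = trajectory.get("steps", [])
        first = steps[0].get("action") if steps else None
        print(f"{path.name}: {first}")
        results[path.name] = first
    return results
```

Running this on a handful of files quickly reveals how uniform (or not) your data actually is, which informs the regexes and field lookups you ask Copilot for next.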

Step 3: Design Your Evaluation Agent

Now, design the agent that will automate your pattern. An agent is simply a script that performs a series of steps autonomously. Define the inputs (trajectory files), processing logic (pattern detection), and outputs (summaries or reports). Use Copilot to brainstorm by describing your design in comments, e.g., # This agent should: 1. Load each trajectory 2. Extract tool calls 3. Count errors 4. Output a CSV. Copilot will generate code blocks that you can adapt. Ensure your agent is modular so it can be reused or extended later.
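A modular design following that four-step comment might look like the sketch below. The function boundaries are the point here; the `steps`/`tool`/`error` field names are illustrative assumptions.

```python
import csv
import json
from pathlib import Path

# This agent should: 1. Load each trajectory  2. Extract tool calls
# 3. Count errors  4. Output a CSV

def load_trajectory(path):
    return json.loads(Path(path).read_text())

def extract_tool_calls(trajectory):
    return [s["tool"] for s in trajectory.get("steps", []) if "tool" in s]

def count_errors(trajectory):
    return sum(1 for s in trajectory.get("steps", []) if s.get("error"))

def write_report(rows, out_path):
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["file", "tool_calls", "errors"])
        writer.writeheader()
        writer.writerows(rows)

def run(trajectory_dir, out_path):
    rows = []
    for path in sorted(Path(trajectory_dir).glob("*.json")):
        t = load_trajectory(path)
        rows.append({"file": path.name,
                     "tool_calls": len(extract_tool_calls(t)),
                     "errors": count_errors(t)})
    write_report(rows, out_path)
    return rows
```

Keeping load, extract, count, and report as separate functions is what makes the agent easy to extend later: a teammate can swap in a new metric by replacing one function.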

Step 4: Implement the Agent with Copilot

Start coding with Copilot by your side. Create a new Python file and begin typing your agent’s skeleton. Copilot will suggest function signatures, loops, and data processing logic. Let it generate the bulk of the code while you guide it with clear variable names and comments. For example, type:

def analyze_trajectory(filepath):
    """Extract key metrics from a single trajectory JSON."""
    # Copilot will fill in the rest

Iterate quickly: as you accept suggestions, test with a sample file. Use Copilot’s inline chat to fix errors or optimize performance. The goal is to get a working prototype in minutes.
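A filled-in version of that skeleton might look like this — one plausible completion, not the canonical one, with the metric names and JSON fields chosen for illustration.

```python
import json
from pathlib import Path

def analyze_trajectory(filepath):
    """Extract key metrics from a single trajectory JSON
    (assumed 'steps'/'tool'/'error' schema)."""
    trajectory = json.loads(Path(filepath).read_text())
    steps = trajectory.get("steps", [])
    return {
        "num_steps": len(steps),
        "num_errors": sum(1 for s in steps if s.get("error")),
        "tools_used": sorted({s["tool"] for s in steps if "tool" in s}),
    }
```

With a working single-file function, the outer loop over a whole benchmark run is trivial to add.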


Step 5: Test and Refine the Agent

Run your agent on a small set of trajectories first. Check that the output matches what you expected. If something is off, highlight the problematic code and ask Copilot to debug it — for example, “This function returns None when the file is missing; add error handling.” Refine the agent to handle edge cases like empty trajectories or malformed JSON. You can also create unit tests with Copilot’s assistance by describing the test cases. Once it works on sample data, run it on your full dataset.
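The edge cases named above — missing files, malformed JSON, empty trajectories — can each be handled with an explicit status, which also makes the behavior easy to unit-test. A sketch, with the status labels being my own convention:

```python
import json
from pathlib import Path

def analyze_trajectory(filepath):
    """Return metrics for one trajectory, degrading gracefully
    instead of returning None or raising on bad input."""
    path = Path(filepath)
    if not path.exists():
        return {"file": path.name, "status": "missing"}
    try:
        trajectory = json.loads(path.read_text())
    except json.JSONDecodeError:
        return {"file": path.name, "status": "malformed"}
    steps = trajectory.get("steps", [])
    if not steps:
        return {"file": path.name, "status": "empty"}
    return {"file": path.name, "status": "ok", "num_steps": len(steps)}
```

Describing these four cases to Copilot is usually enough for it to draft the corresponding unit tests for you.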

Step 6: Package and Share the Agent

To make your agent easy to share and use (like I did with eval-agents), create a GitHub repository. Structure the repo with a clear README, a src/ folder for the agent code, and a tests/ folder. Use Copilot to generate the README by describing the agent’s purpose and usage. Add a requirements.txt for dependencies. Consider making the agent configurable via command-line arguments or a config file. This allows teammates to run it on their own data without modifying the code. Push your repo and invite collaborators.
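A minimal command-line interface using the standard library's argparse is one way to make the agent configurable; the flag names here are hypothetical, not eval-agents' actual CLI.

```python
import argparse
from pathlib import Path

def build_parser():
    """CLI so teammates can point the agent at their own data
    without editing the code."""
    parser = argparse.ArgumentParser(prog="eval-agent")
    parser.add_argument("trajectory_dir",
                        help="folder of trajectory JSON files")
    parser.add_argument("--out", default="report.csv",
                        help="output CSV path")
    parser.add_argument("--pattern", default="*.json",
                        help="glob for which files to analyze")
    return parser

def main(argv=None):
    args = build_parser().parse_args(argv)
    files = sorted(Path(args.trajectory_dir).glob(args.pattern))
    print(f"Analyzing {len(files)} trajectories -> {args.out}")
    return args

if __name__ == "__main__":
    main()
```

With this in place, `python -m eval_agent runs/ --out summary.csv` is all a teammate needs to run it on their own benchmark output.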

Step 7: Enable the Team to Author New Agents

The real power comes when others can contribute their own agents. In your repository, create a template for new agents. Use Copilot to document the template with comments and examples. Teach your team to fork the repo, copy the template, and customize it with Copilot’s help. Encourage them to share improvements via pull requests. I found that this approach turned my team into active contributors — they built agents for specific benchmarks or custom metrics. The result: a thriving ecosystem of analysis tools.
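One way to structure such a template — a hypothetical sketch, not the eval-agents layout — is a small base class that new contributors subclass, overriding only the per-trajectory logic:

```python
# agents/template.py -- hypothetical template contributors copy and customize.
import json
from pathlib import Path

class BaseAgent:
    """Subclass this and override analyze() with your own metric."""
    name = "template"

    def analyze(self, trajectory):
        # Replace with your metric; this stub just counts steps.
        return {"num_steps": len(trajectory.get("steps", []))}

    def run(self, trajectory_dir):
        results = {}
        for path in sorted(Path(trajectory_dir).glob("*.json")):
            results[path.name] = self.analyze(json.loads(path.read_text()))
        return results

class ErrorCounter(BaseAgent):
    """Example custom agent built from the template."""
    name = "error-counter"

    def analyze(self, trajectory):
        steps = trajectory.get("steps", [])
        return {"errors": sum(1 for s in steps if s.get("error"))}
```

Because the file-walking lives in the base class, a new agent is just one short `analyze` method — exactly the kind of focused completion Copilot handles well.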

Tips for Success

- Keep each agent modular so it can be reused or extended later.
- Test on a small sample of trajectories before running on your full dataset.
- Handle edge cases like missing files, empty trajectories, and malformed JSON up front.
- Provide a documented template so teammates can author new agents quickly and share improvements via pull requests.

By following these steps, you can transform tedious analysis into an automated, collaborative process. You might even find yourself shifting from manual reviewer to tool builder — just like I did. Happy automating!
