How Automating Agent Trajectory Analysis Transformed Our Development Workflow


In the world of AI research, analyzing the performance of coding agents is both critical and time-consuming. I recently found myself caught in a repetitive cycle of reviewing thousands of agent trajectories, each a JSON file documenting an agent's decision-making steps while solving a task. Using GitHub Copilot, I could surface patterns and reduce the workload, but the process still required manual investigation. Driven by a desire to eliminate this intellectual toil, I created eval-agents, a tool that automates the analysis and enables my entire team to collaborate more effectively.

The Impetus for Automation

My primary responsibility involves evaluating coding agent performance against standardized benchmarks like TerminalBench2 and SWEBench-Pro. This requires digging through massive collections of trajectories—detailed logs that capture the agent's thoughts and actions for each task.

Source: github.blog

Analyzing Agent Trajectories

Each task in a benchmark set produces its own trajectory file, often hundreds of lines of JSON. Multiply that by dozens of tasks per benchmark, and again by the numerous runs we conduct daily, and you end up with hundreds of thousands of lines of data. Manually reading through all of it is simply impossible.
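To give a sense of the scale problem, here is a minimal sketch of the kind of first-pass aggregation involved. The schema is an assumption for illustration (a trajectory as a list of steps, each with an "action" field); the real trajectory format may differ.

```python
import json
from collections import Counter
from pathlib import Path

def summarize_trajectories(run_dir: str) -> Counter:
    """Tally action types across every trajectory file in a benchmark run."""
    actions = Counter()
    for path in Path(run_dir).glob("*.json"):
        trajectory = json.loads(path.read_text())
        # Assumed schema: {"steps": [{"action": "...", ...}, ...]}
        for step in trajectory.get("steps", []):
            actions[step.get("action", "unknown")] += 1
    return actions
```

Even a crude tally like this turns hundreds of thousands of lines into a handful of numbers worth investigating, which is the same reduction Copilot was doing for me interactively.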

The Repetitive Loop

My typical workflow involved using GitHub Copilot to identify patterns in the trajectories, then manually investigating those patterns to extract meaningful insights. While Copilot helped me reduce the lines I needed to read from hundreds of thousands to a few hundred, the loop itself remained repetitive. The engineer in me thought: I can automate this. That realization sparked the creation of eval-agents.

Building Eval-Agents

The core idea was to build a system that could automate the intellectual work of analyzing agent trajectories, making it accessible and shareable across the team.

Design Goals

I approached the project with three guiding principles:

  • Make agents easy to share and use – so that anyone on the team could leverage the automation.
  • Make it easy to author new agents – empowering peers to create custom analysis tools.
  • Make coding agents the primary vehicle for contributions – enabling a collaborative, agent-driven development workflow.

Sharing and Collaboration

These goals align closely with GitHub’s core values of collaboration and open source. My experience as an open-source maintainer for the GitHub CLI taught me the importance of making tools easy to adopt and extend. With eval-agents, I ensured that the agents could be version-controlled, shared via repositories, and run by anyone with minimal setup. Team members can now author their own agents to tackle specific analysis challenges, and the entire team benefits from a growing library of automation.


Impact and Future

The results have been transformative. Instead of spending hours on manual pattern hunting, my colleagues and I can now run agents that automatically surface insights from benchmark runs. This has not only accelerated our research but also freed up time for more creative problem-solving.

Moreover, the agent-driven development approach has opened up new possibilities. We are no longer limited by individual capacity; the team collectively builds and maintains agents that continuously improve our analysis capabilities. As we expand the agent library, we anticipate even greater efficiency gains and deeper understanding of coding agent behavior.

This journey taught me that automation isn't just about removing drudgery—it's about enabling teams to collaborate at a higher level. By leveraging tools like GitHub Copilot and building upon them with our own agents, we have created a feedback loop where automation fuels innovation.
