
We Analyzed 413,000 AI Agent Traces. Here's What Separates the Ones That Succeed.

TL;DR — Four Takeaways from 413K Agent Traces

- Test early, test often. The single strongest predictor of agent success is the fraction of early bash commands dedicated to testing. This is TDD for AI agents — and it works.
- Just like humans, agents need to concentrate. Agents that scatter edits across 3+ files early are far more likely to fail — a dose-response effect validated across all three dataset splits. The Single Responsibility Principle holds for agents.
- Agents that repeat commands are stuck. Identical bash commands in the early phase predict failure — a genuine behavioral signal, not a task-difficulty confound.
- Many human SWE “best practices” don’t transfer. View-before-edit, grep-before-edit, incremental TDD cycles — these intuitive principles are confounded or reversed for AI agents. Agents are not junior developers.

Every day, thousands of AI agents attempt to solve real software engineering tasks. They read code, run tests, edit files, and submit patches. Each attempt leaves behind a detailed trace — a complete record of every tool call, every bash command, every file read and edit. ...
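To make the first takeaway concrete, here is a minimal sketch of the kind of feature it describes: the fraction of an agent's early bash commands that invoke a test runner. The trace schema (a list of dicts with "tool" and "command" keys), the function name `early_test_fraction`, the window size `k`, and the test-runner markers are all assumptions for illustration, not the authors' actual pipeline.

```python
def early_test_fraction(trace, k=10,
                        test_markers=("pytest", "npm test", "go test", "make test")):
    """Fraction of the first `k` bash commands that invoke a test runner.

    `trace` is assumed to be a list of event dicts like
    {"tool": "bash", "command": "pytest tests/"} -- a hypothetical
    schema, not the authors' actual trace format.
    """
    # Keep only bash tool calls, truncated to the early window.
    bash_cmds = [e["command"] for e in trace if e.get("tool") == "bash"][:k]
    if not bash_cmds:
        return 0.0
    # Count commands that look like test invocations.
    tests = sum(1 for c in bash_cmds if any(m in c for m in test_markers))
    return tests / len(bash_cmds)

# Example: two of the three early bash commands run tests.
trace = [
    {"tool": "bash", "command": "ls src/"},
    {"tool": "bash", "command": "pytest tests/test_parser.py"},
    {"tool": "edit", "path": "src/parser.py"},
    {"tool": "bash", "command": "pytest tests/"},
]
print(early_test_fraction(trace))  # → 0.6666666666666666
```

A real analysis would need to handle compound shell commands and custom test scripts, but the shape of the signal is the same: a per-trace scalar that can be correlated with task success.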

March 8, 2026 · Hanchen Li, Joey Gonzalez, and Collaborators