Why agentic evaluation datasets are fundamentally different — and how to build them right
Multi-step agent evaluation requires a new approach to data design. We break down the key differences from single-turn annotation and share lessons from dozens of agentic training projects.
Single-turn vs. multi-step: a different game entirely
Single-turn annotation is well-understood: show a model a prompt, evaluate the response, done. Agentic evaluation is different because the model makes sequential decisions across many steps, with each choice affecting what comes next. A single wrong tool call five steps in can derail an entire task chain.
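To make the contrast concrete, here is a minimal sketch, in Python, of what a multi-step trajectory record can look like next to a single-turn example. The class and field names (AgentStep, AgentTrajectory, and so on) are illustrative assumptions, not a reference to any particular framework.

```python
# A minimal sketch of a multi-step trajectory record, in contrast to a single
# prompt/response pair. All names here are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class SingleTurnExample:
    prompt: str
    response: str
    label: str  # e.g. "correct" / "incorrect"


@dataclass
class AgentStep:
    tool: str          # tool the agent chose at this step
    arguments: dict    # arguments it passed
    observation: str   # what the environment returned
    ok: bool           # did this step succeed?


@dataclass
class AgentTrajectory:
    task_id: str
    steps: list[AgentStep] = field(default_factory=list)

    def first_failure(self) -> int | None:
        """Index of the first failed step; everything after it is suspect."""
        for i, step in enumerate(self.steps):
            if not step.ok:
                return i
        return None
```

The point of `first_failure` is exactly the derailment problem above: once one step goes wrong, every later step was conditioned on a broken state, so evaluation has to treat the trajectory differently from a list of independent answers.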
The challenge of partial credit
In single-turn tasks, scoring is close to binary: the response is correct or it isn't, helpful or not. In agentic tasks, a model may complete 8 of 10 sub-goals correctly. Building rubrics that award meaningful partial credit without inflating scores for fundamentally flawed trajectories is one of the hardest problems we've had to solve.
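One way to encode that trade-off is to annotate each task with sub-goals and flag some as critical, then gate partial credit on the critical ones. The sketch below assumes that structure; the gating rule (zero credit if any critical sub-goal fails) is one possible design choice, not the only one.

```python
# A minimal sketch of partial-credit scoring, assuming each task is annotated
# with sub-goals and some sub-goals are flagged as critical.
from dataclasses import dataclass


@dataclass
class SubGoal:
    name: str
    achieved: bool
    critical: bool = False  # a failure here invalidates the whole trajectory


def score_trajectory(subgoals: list[SubGoal]) -> float:
    """Fraction of sub-goals achieved, gated on critical failures."""
    if any(sg.critical and not sg.achieved for sg in subgoals):
        return 0.0  # no partial credit for fundamentally flawed runs
    achieved = sum(sg.achieved for sg in subgoals)
    return achieved / len(subgoals) if subgoals else 0.0


# Example: 8 of 10 sub-goals met, none critical -> 0.8;
# the same run with one critical failure -> 0.0.
```

The gate is what keeps partial credit meaningful: a trajectory that misses a critical sub-goal scores zero regardless of how many cosmetic sub-goals it happened to satisfy.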
Designing tasks that expose real failure modes
Most synthetic agentic benchmarks are too clean. Real-world agent tasks involve ambiguous instructions, missing context, and tools that sometimes fail. Our dataset designs deliberately introduce this messiness, because a model that only works in sterile environments isn't ready for production.
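As one illustration, here is a hedged sketch of injecting controlled tool failures into an evaluation environment: a wrapper that makes a tool fail on a fraction of calls, so the agent has to retry or recover rather than assume a clean world. The `flaky` helper, its failure rate, and the example `search` tool are hypothetical, chosen only to show the pattern.

```python
# A minimal sketch of injecting controlled "messiness" into an environment:
# a tool wrapper that fails intermittently. Names and rates are illustrative.
import random
from typing import Callable


def flaky(tool_fn: Callable[..., str], failure_rate: float = 0.1,
          seed: int | None = None) -> Callable[..., str]:
    """Wrap a tool so it raises a transient error on a fraction of calls."""
    rng = random.Random(seed)

    def wrapped(*args, **kwargs) -> str:
        if rng.random() < failure_rate:
            raise RuntimeError("tool temporarily unavailable")  # simulated outage
        return tool_fn(*args, **kwargs)

    return wrapped


# Usage: wrap a search tool so roughly 1 in 10 calls fails, forcing the agent
# to handle the error instead of assuming a sterile environment.
def search(query: str) -> str:
    return f"results for {query!r}"


flaky_search = flaky(search, failure_rate=0.1, seed=42)
```

Seeding the failure pattern matters: it keeps the messiness reproducible across runs, so two models are scored against the same imperfect environment rather than different random ones.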
Lessons from the field
After dozens of agentic annotation projects, the single biggest lesson is this: invest heavily in task design before you touch annotation. The quality of your evaluation is bounded by the quality of your task specification. Weak tasks produce misleading data no matter how skilled your annotators are.
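One practical way to enforce that discipline is to treat the task specification as an artifact that must pass checks before annotation starts. The sketch below assumes a particular set of required fields; what counts as a complete spec will differ from project to project.

```python
# A minimal sketch of a task specification that is validated before annotation
# begins. The required fields are assumptions, not a fixed standard.
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    goal: str                      # what the agent should accomplish
    available_tools: list[str]     # tools the agent may call
    success_criteria: list[str]    # observable conditions for success
    known_ambiguities: list[str] = field(default_factory=list)

    def validation_errors(self) -> list[str]:
        """Reasons this spec is not yet ready to hand to annotators."""
        errors = []
        if not self.goal.strip():
            errors.append("goal is empty")
        if not self.available_tools:
            errors.append("no tools listed")
        if not self.success_criteria:
            errors.append("no success criteria; annotators cannot score this task")
        return errors
```

A spec that fails these checks goes back to task design, not to annotators: fixing it upstream is far cheaper than discovering mid-project that the resulting labels don't mean anything.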