Agentic AI systems are redefining what's possible with large language models. From autonomous coding agents that write, run, and debug software to enterprise AI workflows that chain tool calls across multiple APIs, modern agents operate across multi-step environments where a single wrong decision compounds into cascading failure. Traditional NLP benchmarks, built around isolated prompt-response pairs, cannot detect this kind of compounding error.
Over the past two years, Diraflow has built agentic evaluation datasets for frontier AI labs developing advanced LLM agents. Here is what the work has taught us about why trajectory-level data is so different to build, and what it takes to do it right.
The single-turn illusion
Standard training datasets present a prompt and expect a response. This framing is so deeply embedded in ML infrastructure that it shapes not just what we measure, but what we think good performance looks like. An agent that produces a correct-looking output on a clean, static benchmark may fail completely when deployed in an environment where it has to take actions, observe their effects, and decide what to do next.
The failure is not usually in the final generation. It's in the steps before it — the agent's ability to navigate to the relevant context, interpret ambiguous intermediate states, recover from errors, and resist the temptation to hallucinate a shortcut to the finish line.
What most datasets measure
- One prompt, one response — no environmental feedback
- Context is pre-populated; agent never needs to navigate
- Correct output is binary — no partial credit for sound reasoning
- No recovery signal — agent can't observe and iterate
- Task is fully specified — ambiguity is treated as an error
- Measures generation quality, not problem-solving capability
What agentic datasets require
- Multi-step sequences with real environment state transitions
- Agent must locate relevant context before acting on it
- Partial credit for correct reasoning; subgoal-level scoring
- Execution feedback — agent runs code, reads output, iterates
- Underspecified tasks that require interpretation and inference
- Measures the full problem-solving process, not just the final output
What a real agentic trajectory looks like
Below is a simplified annotated trajectory from our dataset — a software engineering task where the agent must diagnose and fix a failing test. Notice how the interesting information lives in the intermediate steps, not the final answer.
A static benchmark would give this agent full marks for the final answer. A trajectory-level evaluation gives it credit for steps 1, 2, and 4 — and penalises step 3, flagging it as a regression-introducing edit. That distinction is the signal that actually improves agent behaviour during training.
The five properties that define high-quality agentic data
Environment fidelity
Low-fidelity mocks train brittle policies. Agents evaluated in simplified environments learn to exploit the simplification, not to reason. Realistic tool interfaces — real bash, real API call patterns, real file system state — are not optional.
Subgoal-level scoring
Binary success/failure destroys training signal. An agent that correctly diagnoses a bug but introduces a regression during the fix deserves a different reward signal than one that gave up in step one. Reward structures need to reflect the quality of the reasoning process, not just the terminal state.
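To make the contrast concrete, here is a minimal sketch. The `Outcome` fields and the weights are hypothetical, chosen only for illustration; they are not the scoring scheme of any real dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Hypothetical per-trajectory flags a grader might record."""
    diagnosed_bug: bool
    target_test_passes: bool
    no_regressions: bool


# Illustrative weights only (assumption); real values depend on the task type.
WEIGHTS = {"diagnosed_bug": 0.3, "target_test_passes": 0.4, "no_regressions": 0.3}


def binary_reward(outcome: Outcome) -> float:
    """Return 1.0 only when the terminal state is fully correct."""
    return 1.0 if outcome.target_test_passes and outcome.no_regressions else 0.0


def subgoal_reward(outcome: Outcome) -> float:
    """Return the weighted sum of the subgoals the trajectory achieved."""
    return float(sum(w for name, w in WEIGHTS.items() if getattr(outcome, name)))


# The two agents described above: one diagnoses and fixes the bug but
# introduces a regression, the other gives up in step one.
fixed_but_regressed = Outcome(diagnosed_bug=True, target_test_passes=True, no_regressions=False)
gave_up_early = Outcome(diagnosed_bug=False, target_test_passes=False, no_regressions=False)
```

Under `binary_reward` the two trajectories are indistinguishable, since both fail the terminal check. Under `subgoal_reward` the first keeps credit for the diagnosis and the targeted fix, which is the difference in signal the paragraph above argues for.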
Multiple valid trajectories
Two expert annotators solving the same task often take completely different paths — different tool sequences, different diagnostic strategies, different intermediate representations. Both are correct. A dataset that accepts only one path teaches agents that there is a canonical procedure. There isn't.
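One way to avoid privileging a single path is to grade the state a trajectory ends in rather than the steps it took. The sketch below uses a toy environment (a dict of file contents) and a hypothetical success predicate; all names are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]                       # toy environment: file name -> contents
Step = Tuple[str, Callable[[State], State]]  # (label, state transition)


def replay(steps: List[Step], initial: State) -> State:
    """Apply each step to a copy of the initial state and return the result."""
    state = dict(initial)
    for _label, transition in steps:
        state = transition(state)
    return state


def set_file(path: str, text: str) -> Callable[[State], State]:
    """Build a transition that writes `text` to `path` without mutating its input."""
    return lambda state: {**state, path: text}


def goal_reached(state: State) -> bool:
    """Hypothetical success predicate: the bug marker is gone from the file."""
    return "BUG" not in state["auth_service.py"]


def grade_by_path(steps: List[Step], reference: List[str]) -> bool:
    """Strict grader that accepts only the reference sequence of step labels."""
    return [label for label, _ in steps] == reference


initial = {"auth_service.py": "token = BUG"}
path_a = [("edit", set_file("auth_service.py", "token = ok"))]
path_b = [
    ("note", set_file("notes.txt", "suspect token handling")),
    ("edit", set_file("auth_service.py", "token = ok")),
]
```

Both `path_a` and `path_b` satisfy `goal_reached` after `replay`, but `grade_by_path` with the reference `["edit"]` rejects `path_b` even though its outcome is equally correct.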
Adversarial validation
Agents find shortcuts. Any reward structure that can be hacked will be hacked during training. Every environment in our dataset library is red-teamed by a separate team specifically looking for ways to achieve high scores through exploits rather than genuine task completion.
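As an illustration of what an automated exploit check can look like, the sketch below flags one common shortcut: making a failing suite pass by deleting tests. It assumes Python test files whose test functions start with `test_`. The function names are hypothetical and this is not the red-team tooling described above.

```python
import ast
from typing import Set


def collect_test_names(source: str) -> Set[str]:
    """Return the names of all functions in `source` that start with 'test_'."""
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.name.startswith("test_")
    }


def deleted_tests(before: str, after: str) -> Set[str]:
    """Return test functions present before the agent's edit but missing after it."""
    return collect_test_names(before) - collect_test_names(after)


before_src = "def test_login():\n    assert True\n\ndef test_token_expiry():\n    assert False\n"
after_src = "def test_login():\n    assert True\n"

removed = deleted_tests(before_src, after_src)
```

A non-empty `removed` set would mark the trajectory as suspect regardless of whether the remaining tests pass. A check like this covers only one exploit family; renamed, skipped, or weakened tests need their own detectors.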
Execution during evaluation
Assessing code that was never run is assessing plausibility, not correctness. Our evaluation pipelines run agents inside sandboxed execution contexts — real code, real tool calls, real failure messages — and score the full execution trajectory, not predicted outputs.
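The core loop of execution-based evaluation (run the code, capture what actually happened, hand it back to the agent or the scorer) can be sketched with the standard library. `Observation` and `execute` are hypothetical names. A plain child process is not a security sandbox, so a real pipeline would add isolation such as containers or restricted users.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Observation:
    """What the agent (or the scorer) gets to see after one execution."""
    returncode: int
    stdout: str
    stderr: str


def execute(code: str, timeout_s: float = 10.0) -> Observation:
    """Run a Python snippet in a child process and capture its real output.

    NOTE: this isolates the interpreter state only; it is NOT a sandbox.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Report a timeout as a failed observation instead of crashing the evaluator.
        return Observation(returncode=-1, stdout="", stderr="timeout")
    return Observation(proc.returncode, proc.stdout, proc.stderr)


ok = execute("print(2 + 2)")
failed = execute("raise ValueError('boom')")
```

The second call returns a non-zero exit code and a real traceback in `stderr`. That traceback is the kind of failure message a predicted-output evaluation never produces.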
How we build it: the Diraflow pipeline
We combine expert annotators — former software engineers, security researchers, domain specialists — with programmatic sandbox tracing. Annotators work inside instrumented environments where every keystroke, tool invocation, and environmental observation is recorded. We then cluster successful trajectories, identify common failure modes, and use those clusters to generate targeted synthetic variations that cover the long tail of edge cases natural collection misses.
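A very rough first pass at grouping recorded trajectories is to reduce each command to its tool name and bucket trajectories with the same tool sequence. This is a deliberately simplified stand-in for clustering, not the method described above; the function names and sample trajectories are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def tool_signature(trajectory: List[str]) -> Tuple[str, ...]:
    """Reduce each command to its tool name (the first whitespace-separated token)."""
    return tuple(cmd.split()[0] for cmd in trajectory if cmd.strip())


def group_by_signature(trajectories: List[List[str]]) -> Dict[Tuple[str, ...], List[List[str]]]:
    """Bucket trajectories that use the same sequence of tools."""
    groups: Dict[Tuple[str, ...], List[List[str]]] = defaultdict(list)
    for trajectory in trajectories:
        groups[tool_signature(trajectory)].append(trajectory)
    return dict(groups)


# Hypothetical recorded trajectories for the same task.
recorded = [
    ["pytest -v", "cat auth_service.py", "pytest"],
    ["pytest", "cat auth_service.py", "pytest -q"],
    ["grep -n token auth_service.py", "pytest"],
]

groups = group_by_signature(recorded)
```

Exact-match bucketing treats any difference in tool order as a new strategy. A real pipeline would more likely use a similarity measure over sequences so that near-identical strategies land in the same cluster.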
Trajectory annotation format (simplified):
{
  "task": "Fix the failing test in auth_service.py",
  "initial_state": "snapshot_hash_abc123",
  "expected_final_state": "all_tests_pass",
  "positive_trajectories": [
    ["pytest -v", "cat auth_service.py", "targeted_edit", "pytest"],
    ["find . -name '*.py'", "grep -n token", "git diff", "patch"]
  ],
  "reward_structure": {
    "all_tests_pass": 1.0,
    "no_regressions_introduced": 0.5,
    "edit_minimality": 0.3,
    "correct_diagnostic_step": 0.2
  },
  "anti_shortcut_checks": ["no_delete_tests", "no_hardcode_expected"]
}
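A record in this shape could be turned into a single score along the following lines. Two choices here are assumptions rather than anything the format specifies: the score is normalised by the total weight so it falls between 0 and 1, and any violated anti-shortcut check zeroes the score outright.

```python
import json
from typing import Any, Dict, Set

# Subset of the annotation record shown above.
RECORD: Dict[str, Any] = json.loads("""
{
  "reward_structure": {
    "all_tests_pass": 1.0,
    "no_regressions_introduced": 0.5,
    "edit_minimality": 0.3,
    "correct_diagnostic_step": 0.2
  },
  "anti_shortcut_checks": ["no_delete_tests", "no_hardcode_expected"]
}
""")


def score(record: Dict[str, Any], achieved: Set[str], violated: Set[str]) -> float:
    """Return a score in [0, 1] for one trajectory.

    achieved: reward keys the trajectory earned.
    violated: anti-shortcut checks the trajectory failed.
    """
    # Assumption: any detected shortcut voids the whole trajectory.
    if any(check in violated for check in record["anti_shortcut_checks"]):
        return 0.0
    weights = record["reward_structure"]
    earned = sum(w for name, w in weights.items() if name in achieved)
    # Assumption: normalise by the total available weight.
    return earned / sum(weights.values())


partial = score(RECORD, {"all_tests_pass", "correct_diagnostic_step"}, set())
hacked = score(RECORD, set(RECORD["reward_structure"]), {"no_delete_tests"})
```

With these assumptions, a trajectory that passes the tests after a correct diagnosis but earns nothing else scores about 0.6, while one that satisfies every reward key by deleting tests scores 0.0.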
"The first time we evaluated our agent on Diraflow's agentic dataset, we discovered it was blindly retrying failed API calls 11 times before giving up — a behaviour that never appeared in static unit tests. That single finding changed our entire training strategy." — ML Lead, large frontier AI lab (name withheld)
Inter-annotator agreement (IAA) in trajectory annotation is more complex than in single-turn labelling. Two annotators might produce trajectories with no step in common but identical final outcomes, and both are valid positives. We measure agreement on three dimensions independently: final state quality, subgoal completion rate, and strategy efficiency. Overall trajectory-level IAA (κ) on our software engineering (SWE) tasks sits at 0.71, lower than for factual annotation but appropriate for the inherent diversity of valid problem-solving approaches.
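For readers unfamiliar with κ, the sketch below computes Cohen's kappa for two annotators over categorical labels. It assumes the statistic is Cohen's two-rater variant, which the text does not state, and the labels are made up for illustration. In a per-dimension setup, the same function would be applied once to each dimension's labels.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: from each annotator's own label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical subgoal-completion labels from two annotators on four trajectories.
annotator_1 = ["pass", "pass", "fail", "fail"]
annotator_2 = ["pass", "fail", "fail", "fail"]

kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on three of four items (0.75 observed) against 0.5 expected by chance, giving κ = 0.5. Raw agreement alone would overstate how consistent they are.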
What this means for teams building agents
The era of single-turn benchmarks as a proxy for agent capability is ending. Leaderboard scores on static datasets tell you whether your model can produce plausible-looking outputs. They tell you almost nothing about whether it can operate in an environment, adapt to feedback, and complete tasks that require sustained multi-step reasoning.
- Invest in environment infrastructure first. The quality of your agentic evaluation is bounded by the fidelity of your environments. This is infrastructure work, not annotation work — budget for it accordingly.
- Design reward structures before collecting data. Retrofitting subgoal scoring onto existing trajectory data is painful. Define what partial success looks like for each task type before a single trajectory is collected.
- Accept trajectory diversity as signal, not noise. Multiple valid paths to the same outcome are a feature of the domain, not a quality problem. Your annotation process should capture that diversity, not flatten it.
- Red-team your environments continuously. Agents evolve. An environment that was robust against last month's model may have exploitable shortcuts against this month's. Adversarial validation needs to be an ongoing process, not a one-time audit.
- Annotate failure modes, not just successes. The trajectories where agents go wrong — particularly the ones that look plausible but introduce regressions — are some of the most valuable training examples in the entire corpus. Collect and annotate them deliberately.
Diraflow builds agentic evaluation datasets with sandboxed execution environments, multi-trajectory annotation, subgoal-level reward structures, and adversarial environment validation built in by default. If you're developing LLM agents and want evaluation data that actually predicts production performance, get in touch. We respond within one business day.