Why agentic evaluation datasets are fundamentally different

Agentic AI systems are redefining what's possible with large language models. From autonomous coding agents that write, run, and debug software to enterprise AI workflows that chain tool calls across multiple APIs, modern agents operate in multi-step environments where a single wrong decision compounds into cascading failure. Traditional NLP benchmarks, built around isolated prompt-response pairs, cannot see this kind of compounding failure.

Over the past two years, Diraflow has built agentic evaluation datasets for frontier AI labs developing advanced LLM agents. Here is what the work has taught us about why trajectory-level data is so different to build, and what it takes to do it right.

83% of agentic failures occur mid-trajectory, not at the final step

4× more agent errors when evaluation environments are cleaner than production

~3.4 valid solution paths per task, on average, in our agentic dataset library

The single-turn illusion

Standard training datasets present a prompt and expect a response. This framing is so deeply embedded in ML infrastructure that it shapes not just what we measure, but what we think good performance looks like. An agent that produces a correct-looking output on a clean, static benchmark may fail completely when deployed in an environment where it has to take actions, observe their effects, and decide what to do next.

The failure is not usually in the final generation. It's in the steps before it, where the agent has to navigate to the relevant context, interpret ambiguous intermediate states, recover from errors, and resist the temptation to hallucinate a shortcut to the finish line.

❌ Single-turn evaluation

What most datasets measure

One prompt, one response — no environmental feedback
Context is pre-populated; agent never needs to navigate
Correct output is binary — no partial credit for sound reasoning
No recovery signal — agent can't observe and iterate
Task is fully specified — ambiguity is treated as an error
Measures generation quality, not problem-solving capability

✓ Trajectory-level evaluation

What agentic datasets require

Multi-step sequences with real environment state transitions
Agent must locate relevant context before acting on it
Partial credit for correct reasoning; subgoal-level scoring
Execution feedback — agent runs code, reads output, iterates
Underspecified tasks that require interpretation and inference
Measures the full problem-solving process, not just the final output

What a real agentic trajectory looks like

Below is a simplified annotated trajectory from our dataset — a software engineering task where the agent must diagnose and fix a failing test. Notice how the interesting information lives in the intermediate steps, not the final answer.

Example trajectory — SWE task: fix failing auth test

$ pytest tests/test_auth.py -v

Agent runs the auth test file first to understand scope of failure

3 failed, 12 passed — scope identified

$ cat auth/middleware.py | grep -n "token"

Searches for token handling logic — correct diagnostic strategy

Locates validate_token() on line 47

$ sed -i 's/expire_at/expires_at/g' auth/middleware.py

Applies fix — but introduces a regression in token creation path

Partial fix — new failure introduced

$ pytest tests/ -v 2>&1 | tail -20

Agent re-runs tests, reads stack trace, identifies regression

Detects new failure in test_create_token

Targeted edit — restores expire_at in creation path only

Correctly scopes the fix after reading the full diff

All 15 tests pass ✓

A static benchmark would give this agent full marks for the final answer. A trajectory-level evaluation gives it credit for steps 1, 2, and 4 — and penalises step 3, flagging it as a regression-introducing edit. That distinction is the signal that actually improves agent behaviour during training.
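To make the contrast concrete, here is a minimal Python sketch that scores the trajectory above both ways. The `Step` class, the `introduced_regression` flag, and the +1/-1 step values are illustrative assumptions, not the scoring scheme used in our pipeline; crediting every non-regressing step equally is a simplification.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str
    # Hypothetical label: set by an annotator or by diffing test results.
    introduced_regression: bool = False


def final_answer_score(tests_pass_at_end: bool) -> float:
    """Static-benchmark view: only the terminal state counts."""
    return 1.0 if tests_pass_at_end else 0.0


def step_scores(steps: list[Step]) -> list[float]:
    """Trajectory view: credit each step, penalise regression-introducing ones."""
    return [-1.0 if s.introduced_regression else 1.0 for s in steps]


def flagged_steps(steps: list[Step]) -> list[int]:
    """1-based indices of steps that introduced a regression."""
    return [i for i, s in enumerate(steps, start=1) if s.introduced_regression]


trajectory = [
    Step("pytest tests/test_auth.py -v"),
    Step('cat auth/middleware.py | grep -n "token"'),
    Step("sed -i 's/expire_at/expires_at/g' auth/middleware.py",
         introduced_regression=True),
    Step("pytest tests/ -v 2>&1 | tail -20"),
    Step("targeted edit: restore expire_at in creation path"),
]

print(final_answer_score(tests_pass_at_end=True))  # terminal state only
print(step_scores(trajectory))                     # per-step signal
print(flagged_steps(trajectory))                   # which steps to penalise
```

The final-answer score hides the regression entirely, while the per-step view isolates it at step 3.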

The five properties that define high-quality agentic data

Environment fidelity

Low-fidelity mocks train brittle policies. Agents evaluated in simplified environments learn to exploit the simplification, not to reason. Realistic tool interfaces — real bash, real API call patterns, real file system state — are not optional.

Subgoal-level scoring

Binary success/failure destroys training signal. An agent that correctly diagnoses a bug but introduces a regression during the fix deserves a different reward signal than one that gave up in step one. Reward structures need to reflect the quality of the reasoning process, not just the terminal state.

Multiple valid trajectories

Two expert annotators solving the same task often take completely different paths — different tool sequences, different diagnostic strategies, different intermediate representations. Both are correct. A dataset that accepts only one path teaches agents that there is a canonical procedure. There isn't.
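One way to see why this matters is to compare a grader that matches a single reference path with one that checks the resulting state. The sketch below is a toy model under stated assumptions: the dictionary-based environment, the two action types, and both grader functions are hypothetical and stand in for a real environment snapshot.

```python
State = dict[str, str]      # toy environment: file name -> contents
Action = tuple[str, ...]    # ("replace", path, old, new) or ("write", path, text)


def apply(state: State, action: Action) -> State:
    """Apply one action and return the new state (the input is not mutated)."""
    kind = action[0]
    if kind == "replace":
        _, path, old, new = action
        return {**state, path: state[path].replace(old, new)}
    if kind == "write":
        _, path, text = action
        return {**state, path: text}
    raise ValueError(f"unknown action: {kind}")


def run(initial: State, actions: list[Action]) -> State:
    """Replay a trajectory from the initial state."""
    state = dict(initial)
    for action in actions:
        state = apply(state, action)
    return state


def path_grader(actions: list[Action], reference: list[Action]) -> bool:
    """Accepts only the canonical sequence of actions."""
    return actions == reference


def outcome_grader(actions: list[Action], initial: State, expected: State) -> bool:
    """Accepts any trajectory that reaches the expected final state."""
    return run(initial, actions) == expected


initial = {"config.txt": "timeout=5"}
expected = {"config.txt": "timeout=30"}

path_a = [("replace", "config.txt", "timeout=5", "timeout=30")]
path_b = [("write", "config.txt", "timeout=30")]

print(path_grader(path_b, reference=path_a))       # rejects a valid solution
print(outcome_grader(path_b, initial, expected))   # accepts it
```

A path-matching grader would mark the second annotator's solution as wrong even though it leaves the environment in exactly the required state.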

Adversarial validation

Agents find shortcuts. Any reward structure that can be hacked will be hacked during training. Every environment in our dataset library is red-teamed by a separate team specifically looking for ways to achieve high scores through exploits rather than genuine task completion.

Execution during evaluation

Assessing code that was never run is assessing plausibility, not correctness. Our evaluation pipelines run agents inside sandboxed execution contexts — real code, real tool calls, real failure messages — and score the full execution trajectory, not predicted outputs.
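The difference between plausibility and correctness shows up as soon as code is actually executed. The following is a minimal sketch, assuming a Python candidate script; the helper name `run_in_scratch_dir` is hypothetical. A temporary directory with a timeout is not a security sandbox, so a real pipeline would add container or VM isolation on top.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_scratch_dir(code: str, timeout_s: float = 10.0) -> subprocess.CompletedProcess:
    """Write `code` to a throwaway directory, execute it, and capture real output.

    NOTE: this isolates files only. It is NOT a security sandbox (assumption:
    production systems add container or VM isolation, not shown here).
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code)
        return subprocess.run(
            [sys.executable, str(script)],
            cwd=workdir,
            capture_output=True,  # collect stdout and stderr for scoring
            text=True,
            timeout=timeout_s,    # raises subprocess.TimeoutExpired on hang
        )


# Looks right at a glance, but range(1, 10) stops at 9.
plausible_but_wrong = "print(sum(range(1, 10)))  # intended: sum of 1..10"

result = run_in_scratch_dir(plausible_but_wrong)
observed = result.stdout.strip()
print("observed:", observed, "| matches expected 55:", observed == "55")

# Real failure messages are captured too, not predicted.
failing = run_in_scratch_dir("raise ValueError('boom')")
print("exit code:", failing.returncode)
```

Reading the script, a reviewer might accept it; running it exposes the off-by-one immediately.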

How we build it: the Diraflow pipeline

We combine expert annotators — former software engineers, security researchers, domain specialists — with programmatic sandbox tracing. Annotators work inside instrumented environments where every keystroke, tool invocation, and environmental observation is recorded. We then cluster successful trajectories, identify common failure modes, and use those clusters to generate targeted synthetic variations that cover the long tail of edge cases natural collection misses.

Trajectory annotation format (simplified):
{
  "task": "Fix the failing test in auth_service.py",
  "initial_state": "snapshot_hash_abc123",
  "expected_final_state": "all_tests_pass",
  "positive_trajectories": [
    ["pytest -v", "cat auth_service.py", "targeted_edit", "pytest"],
    ["find . -name '*.py'", "grep -n token", "git diff", "patch"]
  ],
  "reward_structure": {
    "all_tests_pass": 1.0,
    "no_regressions_introduced": 0.5,
    "edit_minimality": 0.3,
    "correct_diagnostic_step": 0.2
  },
  "anti_shortcut_checks": ["no_delete_tests", "no_hardcode_expected"]
}
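As a rough illustration of how a record like this could be turned into a scalar reward, the Python sketch below sums the weights of the criteria a trajectory satisfied and returns zero if any anti-shortcut check failed. Both choices (additive combination and zeroing on a failed check) are assumptions made for the example; the format above does not specify how the weights are combined.

```python
REWARD_STRUCTURE = {
    "all_tests_pass": 1.0,
    "no_regressions_introduced": 0.5,
    "edit_minimality": 0.3,
    "correct_diagnostic_step": 0.2,
}
ANTI_SHORTCUT_CHECKS = ["no_delete_tests", "no_hardcode_expected"]


def trajectory_reward(satisfied: set[str], passed_checks: set[str]) -> float:
    """Sum weights of satisfied criteria; 0.0 if any anti-shortcut check failed.

    Assumption: additive weights and a hard zero on exploits. Other schemes
    (normalisation, soft penalties) are equally plausible.
    """
    if any(check not in passed_checks for check in ANTI_SHORTCUT_CHECKS):
        return 0.0
    return sum(w for name, w in REWARD_STRUCTURE.items() if name in satisfied)


all_checks = set(ANTI_SHORTCUT_CHECKS)

# Clean fix: every criterion met.
clean = trajectory_reward(set(REWARD_STRUCTURE), all_checks)

# Correct diagnosis and passing tests, but a regression along the way
# and a larger edit than necessary.
messy = trajectory_reward({"all_tests_pass", "correct_diagnostic_step"}, all_checks)

# Tests "pass" because the agent deleted them.
exploit = trajectory_reward({"all_tests_pass"}, {"no_hardcode_expected"})

print(round(clean, 2), round(messy, 2), round(exploit, 2))
```

Under these assumptions the three trajectories all end with passing tests, yet receive clearly different rewards.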

"The first time we evaluated our agent on Diraflow's agentic dataset, we discovered it was blindly retrying failed API calls 11 times before giving up — a behaviour that never appeared in static unit tests. That single finding changed our entire training strategy." — ML Lead, large frontier AI lab (name withheld)

ℹ️ On inter-annotator agreement for trajectories

Inter-annotator agreement (IAA) in trajectory annotation is more complex than in single-turn labelling. Two annotators might produce trajectories with no step in common but identical final outcomes, and both are valid positives. We measure agreement on three dimensions independently: final state quality, subgoal completion rate, and strategy efficiency. Overall trajectory-level IAA (κ) on our SWE tasks sits at 0.71, which is lower than is typical for factual annotation but appropriate for the inherent diversity of valid problem-solving approaches.
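For readers unfamiliar with κ, the sketch below computes Cohen's kappa for a single dimension, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. Treating κ as Cohen's kappa for two annotators is an assumption, and the labels are made up for illustration; they are not drawn from our data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    p_e = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Made-up labels: did each of 10 trajectories complete a given subgoal? (1 = yes)
annotator_a = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
annotator_b = [1, 1, 0, 0, 1, 0, 1, 1, 1, 1]

print(round(cohens_kappa(annotator_a, annotator_b), 3))
```

Here the annotators agree on 8 of 10 items, but because both label most trajectories as complete, chance agreement is already 0.58, which pulls κ well below the raw 80% agreement.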

What this means for teams building agents

The era of single-turn benchmarks as a proxy for agent capability is ending. Leaderboard scores on static datasets tell you whether your model can produce plausible-looking outputs. They tell you almost nothing about whether it can operate in an environment, adapt to feedback, and complete tasks that require sustained multi-step reasoning.

Invest in environment infrastructure first. The quality of your agentic evaluation is bounded by the fidelity of your environments. This is infrastructure work, not annotation work — budget for it accordingly.
Design reward structures before collecting data. Retrofitting subgoal scoring onto existing trajectory data is painful. Define what partial success looks like for each task type before a single trajectory is collected.
Accept trajectory diversity as signal, not noise. Multiple valid paths to the same outcome are a feature of the domain, not a quality problem. Your annotation process should capture that diversity, not flatten it.
Red-team your environments continuously. Agents evolve. An environment that was robust against last month's model may have exploitable shortcuts against this month's. Adversarial validation needs to be an ongoing process, not a one-time audit.
Annotate failure modes, not just successes. The trajectories where agents go wrong — particularly the ones that look plausible but introduce regressions — are some of the most valuable training examples in the entire corpus. Collect and annotate them deliberately.

Work with us

Diraflow builds agentic evaluation datasets with sandboxed execution environments, multi-trajectory annotation, subgoal-level reward structures, and adversarial environment validation built in by default. If you're developing LLM agents and want evaluation data that actually predicts production performance, get in touch. We respond within one business day.