
The Diraflow blog

Thoughts on AI training data, agentic evaluation, and the evolving landscape of human-AI collaboration.

Why agentic evaluation datasets are fundamentally different — and how to build them right

Multi-step agent evaluation requires a new approach to data design. We break down the key differences from single-turn annotation and share lessons from dozens of agentic training projects.

Single-turn vs. multi-step: a different game entirely

Single-turn annotation is well-understood: show a model a prompt, evaluate the response, done. Agentic evaluation is different because the model makes sequential decisions across many steps, with each choice affecting what comes next. A single wrong tool call five steps in can derail an entire task chain.

The challenge of partial credit

In single-turn tasks, scoring is relatively binary — correct or not, helpful or not. In agentic tasks, a model may complete 8 of 10 sub-goals correctly. Building rubrics that meaningfully award partial credit without inflating scores for fundamentally flawed trajectories is one of the hardest problems we've had to solve.
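As a rough sketch of how that gating can work, the snippet below awards weighted partial credit but caps the score when a critical sub-goal fails. The sub-goal names, weights, and the single critical-failure gate are illustrative, not our production rubric.

```python
from dataclasses import dataclass

@dataclass
class SubGoal:
    name: str
    weight: float   # relative importance of this sub-goal
    achieved: bool  # did the agent complete it?
    critical: bool  # failing this invalidates the whole trajectory

def score_trajectory(sub_goals: list[SubGoal]) -> float:
    """Weighted partial credit, gated on critical sub-goals."""
    # A fundamentally flawed trajectory should not earn a respectable
    # score just because the easy steps were completed.
    if any(g.critical and not g.achieved for g in sub_goals):
        return 0.0
    total = sum(g.weight for g in sub_goals)
    earned = sum(g.weight for g in sub_goals if g.achieved)
    return earned / total if total else 0.0

# Hypothetical trajectory: 8 of 10 sub-goals achieved, no critical failures.
goals = [SubGoal(f"step_{i}", weight=1.0, achieved=(i < 8), critical=(i == 0))
         for i in range(10)]
print(score_trajectory(goals))  # 0.8
```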

Designing tasks that expose real failure modes

Most synthetic agentic benchmarks are too clean. Real-world agent tasks involve ambiguous instructions, missing context, and tools that sometimes fail. Our dataset design deliberately introduces this messiness because a model that only works in sterile environments isn't ready for production.

Lessons from the field

After dozens of agentic annotation projects, the single biggest lesson is this: invest heavily in task design before you touch annotation. The quality of your evaluation is bounded by the quality of your task specification. Weak tasks produce misleading data no matter how skilled your annotators are.

Red-teaming at scale: lessons from 50,000 adversarial prompts

Building effective safety datasets means understanding what makes an adversarial prompt genuinely dangerous — not merely surprising.

What red-teaming actually involves

Red-teaming is the practice of deliberately stress-testing an AI system by crafting inputs designed to expose weaknesses, bypass safety guardrails, or elicit harmful outputs. Unlike standard QA testing, it requires an adversarial mindset — thinking like someone who actively wants the model to fail.

Why scale changes everything

At small volumes, a skilled human reviewer can catch most dangerous prompts by intuition. At 50,000 prompts, intuition breaks down. Patterns that look dangerous on the surface often aren't, while genuinely harmful inputs can be deceptively mundane in phrasing. You need systematic taxonomies and calibrated reviewers, not just sharp eyes.

The key distinction: dangerous vs. surprising

Our biggest insight was separating "adversarial" from "harmful." A prompt can be creative, unexpected, and technically adversarial — without posing real-world risk. The prompts that matter are the ones that could cause concrete harm if acted upon. That distinction shapes every labelling decision we make.
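To make that distinction explicit in the data, it helps to label the two axes separately. Here is a minimal sketch of such a schema; the field names and severity categories are illustrative, not our actual label set.

```python
from dataclasses import dataclass
from enum import Enum

class HarmSeverity(Enum):
    NONE = 0      # surprising or creative, but no real-world risk
    LOW = 1
    MODERATE = 2
    SEVERE = 3

@dataclass
class RedTeamLabel:
    prompt_id: str
    adversarial: bool   # does the prompt try to bypass guardrails?
    harm: HarmSeverity  # could acting on the output cause concrete harm?
    tactic: str         # e.g. "role-play", "obfuscation", "multi-turn setup"

# A prompt can be adversarial yet harmless -- exactly the cases that
# inflate naive counts of "dangerous" prompts.
example = RedTeamLabel("p-00123", adversarial=True,
                       harm=HarmSeverity.NONE, tactic="role-play")
```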

What we learned about annotator calibration

Annotators naturally cluster around surface features — aggressive language, taboo topics, unusual formatting. Training them to evaluate downstream risk rather than surface shock value was the hardest, most important part of this project. It required multiple calibration rounds and detailed rubrics that we're still refining.

The hidden complexity of mathematical reasoning datasets

Building useful math training data requires careful attention to notation, proof style, error taxonomy, and solution diversity.

Notation is not neutral

The same mathematical concept can be expressed in dozens of notational styles — LaTeX, natural language, symbolic, step-by-step prose. A model trained predominantly on one style will struggle with others. Diverse notation isn't a nice-to-have in math datasets; it's a core requirement for robust generalisation.

Proof style diversity matters too

Formal proofs, informal proofs, and worked examples are cognitively distinct. Models trained on a healthy mixture of all three develop more flexible reasoning than those trained on a single style — even if that style is rigorous and high-quality.

Building a useful error taxonomy

Not all math errors are equal. Arithmetic slips, logical gaps, incorrect theorem applications, and sign errors each reflect different failure modes and require different correction signals. Without a principled error taxonomy, you can't tell whether a model is improving at the right things.
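A minimal sketch of what such a taxonomy might look like in practice, using the categories above (the labels and structure are illustrative):

```python
from collections import Counter
from enum import Enum

class MathError(Enum):
    ARITHMETIC_SLIP = "arithmetic_slip"  # e.g. 7 * 8 = 54
    SIGN_ERROR = "sign_error"            # dropped or flipped a minus sign
    LOGICAL_GAP = "logical_gap"          # step does not follow from the previous one
    WRONG_THEOREM = "wrong_theorem"      # theorem applied outside its hypotheses

def error_profile(labels: list[MathError]) -> dict[str, float]:
    """Share of each error type across a batch of annotated solutions."""
    counts = Counter(labels)
    total = len(labels) or 1
    return {e.value: counts.get(e, 0) / total for e in MathError}

# Hypothetical batch of errors annotated on a set of model solutions.
batch = [MathError.LOGICAL_GAP, MathError.LOGICAL_GAP,
         MathError.ARITHMETIC_SLIP, MathError.SIGN_ERROR]
print(error_profile(batch))
```

Tracked over time, a profile like this shows whether corrections are fixing the right failure modes rather than just nudging an aggregate error rate.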

Why solution diversity is undervalued

Many benchmark problems have a single canonical solution path. But real mathematical competence includes knowing multiple routes to the same answer. When we include diverse correct solutions in training data, models become measurably better at novel problem-solving — not just pattern-matching to memorised paths.

Inter-annotator agreement: what it really tells you about data quality

IAA scores are widely cited but poorly understood. We explain what Cohen's κ measures — and what it misses in complex annotation tasks.

What Cohen's κ actually measures

Cohen's kappa corrects for chance agreement between two annotators — which is more meaningful than raw percentage agreement. If two annotators agree 80% of the time but the task only has two labels, that agreement could largely be random. Kappa tells you how much real signal exists above chance.
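The computation itself is small. A from-scratch sketch for two annotators (toy labels, no particular library assumed) shows how 80% raw agreement shrinks once chance is accounted for:

```python
def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is observed
    agreement and p_e is the agreement expected by chance given each
    annotator's own label distribution."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Two annotators agree on 8 of 10 items, but both say "helpful" 80% of
# the time -- so much of that agreement is expected by chance.
a = ["helpful"] * 8 + ["unhelpful"] * 2
b = ["helpful"] * 7 + ["unhelpful", "unhelpful", "helpful"]
print(cohen_kappa(a, b))  # ~0.375, despite 80% raw agreement
```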

Where kappa breaks down

Kappa assumes all disagreements are equally bad, which is rarely true in complex annotation. A disagreement between "helpful" and "slightly unhelpful" is not the same as a disagreement between "helpful" and "harmful." For nuanced tasks, weighted kappa or task-specific rubrics are far more informative than a single κ score.

The multi-annotator problem

Most IAA metrics were designed for two annotators. Scale to five or ten and you need Fleiss' kappa or Krippendorff's alpha — but these aggregate signal in ways that can mask systematic disagreements between annotator subgroups, which are often the most important disagreements to understand.
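One practical complement is to look at pairwise agreement directly rather than only at the aggregate statistic: clusters of annotators who agree with each other but not with everyone else stand out immediately. A small sketch, with hypothetical annotator IDs and labels:

```python
from itertools import combinations

def pairwise_agreement(labels: dict[str, list[str]]) -> dict[tuple[str, str], float]:
    """Raw agreement for every pair of annotators over the same items."""
    return {
        (a, b): sum(x == y for x, y in zip(labels[a], labels[b])) / len(labels[a])
        for a, b in combinations(sorted(labels), 2)
    }

# Hypothetical: ann1/ann2 form one subgroup, ann3/ann4 another.
labels = {
    "ann1": ["A", "A", "B", "A"],
    "ann2": ["A", "A", "B", "B"],
    "ann3": ["B", "B", "A", "A"],
    "ann4": ["B", "B", "A", "B"],
}
for pair, agreement in sorted(pairwise_agreement(labels).items()):
    print(pair, agreement)  # within-subgroup pairs ~0.75, cross-subgroup pairs <= 0.25
```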

What we use instead — and why

At Diraflow, we treat IAA as a diagnostic tool rather than a quality gate. Low agreement tells us to investigate — is the rubric unclear? Is the task inherently ambiguous? Are certain annotators outliers? The number itself matters far less than the pattern of disagreements it reveals.

What makes a great preference annotator? Insights from 1,000 calibration tasks

We analysed calibration data from over a thousand annotators to identify what predicts high-quality RLHF feedback.

Why preference annotation is harder than it looks

Choosing between two AI responses sounds simple. In practice, annotators must weigh factual accuracy, tone, helpfulness, safety, and often domain-specific nuance — simultaneously, at speed, across hundreds of tasks. The cognitive load is significant, and annotator fatigue is a real and underreported source of data degradation.

What our calibration data revealed

The strongest predictor of annotation quality wasn't domain expertise or educational background — it was the ability to articulate disagreement. Annotators who could explain why they found one response preferable produced dramatically more consistent and useful feedback than those who relied on gut instinct alone.

The consistency-confidence trap

Some annotators are highly consistent but systematically biased — preferring longer responses, or responses that sound confident, regardless of correctness. High internal consistency can look great on IAA metrics while still producing misleading training signals. We now screen specifically for these bias patterns during onboarding.
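One of the simplest screens is to check how often an annotator's picks simply track response length on a calibration set where length and quality are decorrelated. The function and field names below are illustrative, not our onboarding pipeline:

```python
def length_preference_rate(comparisons: list[dict]) -> float:
    """Fraction of comparisons where the annotator chose the longer response.

    On a calibration set where length and quality are decorrelated,
    values far above 0.5 suggest a length bias rather than a quality judgement.
    """
    relevant = [c for c in comparisons if len(c["chosen"]) != len(c["rejected"])]
    if not relevant:
        return 0.0
    longer_picked = sum(len(c["chosen"]) > len(c["rejected"]) for c in relevant)
    return longer_picked / len(relevant)

# Hypothetical calibration comparisons for a single annotator.
comparisons = [
    {"chosen": "A long, padded answer. " * 10, "rejected": "A short, correct answer."},
    {"chosen": "Another verbose response. " * 8, "rejected": "Also short and right."},
    {"chosen": "Brief and correct.", "rejected": "A rambling answer that never lands. " * 5},
]
print(length_preference_rate(comparisons))  # 2/3 of picks favoured the longer response
```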

Building better annotator pipelines

Invest in your calibration phase. A well-designed calibration set — with known-correct answers and clear edge cases — will tell you more about an annotator's suitability in 50 tasks than weeks of production annotation. It also gives annotators a chance to learn your rubrics before they affect real data.

The data flywheel: how training data quality compounds over model generations

The quality of data you build today shapes the capabilities of models trained years from now. We explore the long-term flywheel.

What the flywheel actually is

The data flywheel describes a compounding loop: better training data produces better models, better models generate better synthetic data and identify better edge cases, which in turn produces even better training data. Each revolution of the wheel compounds the gains of the previous one — but it also compounds the errors.

How quality degrades across generations

When synthetic data is used to train models that then generate more synthetic data, small quality gaps amplify with each cycle. A dataset that is 95% accurate today might anchor a model that produces 91% accurate data, which trains a model producing 86% accurate data. The slope looks gentle early on but turns catastrophic later.
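The arithmetic behind that trajectory is easy to run yourself. The per-generation loss below is purely illustrative, chosen to roughly match the 95% to 91% to 86% example above:

```python
def quality_over_generations(start: float, loss_per_gen: float, generations: int) -> list[float]:
    """Dataset accuracy after each generation, assuming a fixed multiplicative loss."""
    quality = [start]
    for _ in range(generations):
        quality.append(quality[-1] * (1 - loss_per_gen))
    return quality

# ~4.5% relative loss per generation: gentle at first, then it bites.
for gen, q in enumerate(quality_over_generations(0.95, 0.045, 5)):
    print(f"generation {gen}: {q:.1%}")
# 95.0%, 90.7%, 86.6%, 82.7%, 79.0%, 75.5%
```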

The compounding advantage of early investment

The inverse is also true. An organisation that invests heavily in human-verified, high-diversity data today will benefit from that investment across multiple model generations. The quality premium doesn't just persist — it multiplies. This is why the best AI labs treat training data as a long-term strategic asset, not a short-term cost.

What this means for data strategy

The decisions you make about data quality, coverage, and diversity in the next 12 months will shape model capabilities in 2028 and beyond. Cutting corners now is not saving money — it is borrowing against future model performance at a very unfavourable rate.

Building coding task environments that actually test agent capability

Most code evaluation environments are too clean. Real software engineering tasks are messy and ambiguous. Here's how we design for that.

The sterile environment problem

Most code benchmarks present isolated functions, clear specifications, and passing test suites. Real software engineering is nothing like this. Codebases are large, dependencies break, requirements shift mid-task, and documentation is outdated. An agent that aces HumanEval can still fail completely in a real repo environment.

How we introduce controlled messiness

Our task design process deliberately injects ambiguity: underspecified requirements that require clarifying questions, intentionally broken dependencies, conflicting comments in existing code, and test suites with known gaps. The goal isn't to confuse agents — it's to ensure our evaluation surfaces the same failure modes they'll encounter in production.
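As a sketch of what this looks like operationally, a task-generation config can expose each source of messiness as an explicit, reviewable knob. The field names and values below are hypothetical, not our production schema:

```python
# Hypothetical perturbation config for a single coding task.
TASK_PERTURBATIONS = {
    "underspecified_requirements": True,        # the agent should ask clarifying questions
    "broken_dependencies": ["leftpad==0.0.0"],  # pinned to a version that cannot resolve
    "conflicting_comments": 2,                  # comments that contradict the code they describe
    "test_suite_gaps": ["no coverage for empty input", "flaky integration test"],
}

def describe(config: dict) -> str:
    """Human-readable summary, useful when reviewers audit task difficulty."""
    active = [key for key, value in config.items() if value]
    return ", ".join(active) or "clean task"

print(describe(TASK_PERTURBATIONS))
```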

Scoring beyond pass/fail

Pass/fail test outcomes are a floor, not a ceiling, for code evaluation. We also score for code clarity, error handling, appropriate use of existing utilities, and whether the agent asks good clarifying questions before diving into implementation. These dimensions predict real-world engineering quality far better than unit test outcomes alone.

The role of human evaluation

For complex multi-file tasks, automated scoring misses too much. We use a hybrid approach: automated tests catch regressions and obvious failures, while experienced engineers score the harder dimensions. The combination is more expensive but produces evaluation data that is meaningfully better than either approach alone.

Get our latest writing

New posts on AI data, evaluation, and the industry — delivered to your inbox, no more than twice a month. No spam, ever.

Work with us

Ready to build better AI data?

Our team is ready to scope your project and get started fast.

✦ Book a session

Talk to an expert

Tell us about your project and we'll match you with the right team member for a 1-on-1 session.