The Diraflow Blog

Thinking on AI training data, agentic evaluation, and the evolving landscape of human-AI collaboration.

Why agentic evaluation datasets are fundamentally different — and how to build them right

Multi-step agent evaluation requires a new approach to data design. We break down the key differences from single-turn annotation and share lessons from dozens of agentic training projects.

Red-teaming at scale: lessons from 5,000 adversarial prompts

Building effective safety datasets means understanding what makes an adversarial prompt genuinely dangerous — not merely surprising.

Lawrence J.

6 min

Mar 28, 2026

STEM Data

The hidden complexity of mathematical reasoning datasets

Building useful math training data requires careful attention to notation, proof style, error taxonomy, and solution diversity.

Francis M.

5 min

Mar 19, 2026

Operations

Inter-annotator agreement: what it really tells you about data quality

IAA scores are widely cited but poorly understood. We explain what Cohen's κ measures — and what it misses in complex annotation tasks.

Mark K.

7 min

Mar 10, 2026

RLHF

What makes a great preference annotator? Insights from 1,000 calibration tasks

We analysed calibration data from over a thousand annotators to identify what predicts high-quality RLHF feedback.

Eustace E.

9 min

Feb 26, 2026

Industry

The data flywheel: how training data quality compounds over model generations

The quality of data you build today shapes the capabilities of models trained years from now. We explore the long-term flywheel.

Dominic G.

6 min

Feb 12, 2026

Engineering

Building coding task environments that actually test agent capability

Most code evaluation environments are too clean. Real software engineering tasks are messy and ambiguous. Here's how we design for that.

Martin O.

8 min
