Diraflow

The Diraflow Blog

Thinking on AI training data, agentic evaluation, and the evolving landscape of human-AI collaboration.

✦ Featured
Deep Dive

Why agentic evaluation datasets are fundamentally different — and how to build them right

Multi-step agent evaluation requires a new approach to data design. We break down the key differences from single-turn annotation and share lessons from dozens of agentic training projects.

Apr 7, 2026 · Safety

Red-teaming at scale: lessons from 5,000 adversarial prompts

Building effective safety datasets means understanding what makes an adversarial prompt genuinely dangerous — not merely surprising.

Lawrence J.
6 min
Mar 28, 2026 · STEM Data

The hidden complexity of mathematical reasoning datasets

Building useful math training data requires careful attention to notation, proof style, error taxonomy, and solution diversity.

Francis M
5 min
Mar 19, 2026 · Operations

Inter-annotator agreement: what it really tells you about data quality

IAA scores are widely cited but poorly understood. We explain what Cohen's κ measures — and what it misses in complex annotation tasks.

Mark K
7 min
Mar 10, 2026 · RLHF

What makes a great preference annotator? Insights from 1,000 calibration tasks

We analysed calibration data from over a thousand annotators to identify what predicts high-quality RLHF feedback.

Eustace E.
9 min
Feb 26, 2026 · Industry

The data flywheel: how training data quality compounds over model generations

The quality of data you build today shapes the capabilities of models trained years from now. We explore the long-term flywheel.

Dominic G.
6 min
Feb 12, 2026 · Engineering

Building coding task environments that actually test agent capability

Most code evaluation environments are too clean. Real software engineering tasks are messy and ambiguous. Here's how we design for that.

Martin O.
8 min