The data flywheel: how training data quality compounds over model generations

There is a phrase in the data industry that gets repeated often enough to feel like received wisdom: garbage in, garbage out. It's true. But it understates the actual problem. Garbage in doesn't just produce garbage out — it produces garbage that then gets used to build the next generation of models, which produces more garbage, which compounds. The failure is not a single step. It is a flywheel turning in the wrong direction.

This piece is about the other direction. About what happens when data quality is treated as a compounding asset rather than a one-time cost. And about why the decisions made now — before the models that matter most are trained — are harder to reverse than most people appreciate.

3–4×: downstream capability gap between clean and noisy seed data after three generations
67%: share of production model errors that trace back to systematic data collection choices rather than architecture
~18 months: average lag before a data quality problem surfaces visibly in deployed behaviour

How the flywheel actually works

The mechanism is straightforward once you see it. A model trained on high-quality data produces better outputs. Better outputs get used — as assistant responses, as generated examples, as fine-tuning seeds — in the training pipeline for the next model. The next model therefore starts from a higher baseline. If that model is also trained with care, the advantage compounds again.

The inverse is equally true. A model trained on noisy, inconsistently labelled, or systematically biased data produces outputs that reflect those deficiencies. When those outputs flow into downstream training, the deficiencies are not diluted — they are amplified, because the model has now learned to generate content that matches the noise pattern of its training set.

This is not theoretical. We have analysed several multi-generation model lineages and the pattern is consistent: quality differences in generation one produce capability gaps in generation two that are disproportionately large relative to the original delta. A 10% improvement in training data quality at generation one does not produce a 10% improvement at generation three. It produces something closer to 30–40%, because each generation inherits the previous advantage and multiplies it again.
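The arithmetic behind those figures can be made explicit. The sketch below assumes, purely for illustration, that each generation multiplies the inherited gap by a constant factor; that constant-multiplier model is our simplification, not something fitted to the lineages described above, and the function names are hypothetical. Under that assumption, going from a 10% gap at generation one to 30–40% at generation three implies a multiplier of roughly 1.7 to 2 per generation.

```python
def gap_after(initial_gap: float, multiplier: float, generation: int) -> float:
    """Gap at a given generation under a constant per-generation multiplier.

    generation=1 returns the initial gap unchanged.
    """
    if generation < 1:
        raise ValueError("generation must be >= 1")
    return initial_gap * multiplier ** (generation - 1)


def implied_multiplier(initial_gap: float, final_gap: float, generation: int) -> float:
    """Constant multiplier that turns initial_gap into final_gap by `generation`."""
    if generation < 2:
        raise ValueError("need at least two generations to infer a multiplier")
    return (final_gap / initial_gap) ** (1 / (generation - 1))


# Figures taken from the text: 10% at generation one, 30-40% at generation three.
low = implied_multiplier(0.10, 0.30, 3)   # about 1.73 (square root of 3)
high = implied_multiplier(0.10, 0.40, 3)  # about 2.0
print(f"implied per-generation multiplier: {low:.2f} to {high:.2f}")
```

The point of the exercise is only that a modest per-step multiplier is enough to produce the gap described; real lineages will not follow a clean geometric curve.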

The problem with treating data as a line item

Most organisations that build AI models treat data collection as a cost to be minimised. This is rational in the short term — data is expensive, annotation takes time, quality checks slow things down. But it's the wrong frame for decisions that have compounding effects.

A line-item cost is evaluated against the model it directly produces. A compounding investment is evaluated against all the models that will be derived from it, fine-tuned from it, distilled from it, or trained to imitate it. When you view the data decision through that lens, the economics look completely different.

"The teams that treat data quality as a competitive moat — not a production cost — are the ones whose model lineages look meaningfully different five years later. The advantage is invisible in year one. It becomes obvious by year three." — Internal analysis, Diraflow Research, 2026

What "compounding" looks like in practice

Consider three common scenarios where the flywheel effect is most visible.

Reward model training. In RLHF (reinforcement learning from human feedback) pipelines, the reward model is trained on human preference data. If that preference data is inconsistently labelled, because annotators disagree on what "better" means or because rationales don't reflect the actual labels, the reward model learns a noisy signal. Every subsequent PPO (proximal policy optimisation) step amplifies that noise, because the policy is optimised against a flawed measure. By the time the degradation is visible in model outputs, it has been baked in across dozens of training iterations.
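One way to keep the noisiest labels out of reward model training is to check agreement per comparison before anything is trained. The sketch below assumes each preference pair was labelled by several annotators with a vote of "A" or "B"; the data layout, the function names and the 0.75 threshold are all hypothetical choices for illustration, not a recommended standard.

```python
from collections import Counter


def majority_agreement(votes):
    """Return (majority_label, share of annotators who chose it)."""
    if not votes:
        raise ValueError("each pair needs at least one vote")
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


def split_by_agreement(pairs, min_agreement=0.75):
    """Split preference pairs into usable labels and pairs needing review.

    pairs: dict mapping pair_id -> list of votes ("A" or "B").
    min_agreement: hypothetical threshold; tune it for your own rubric.
    """
    kept, needs_review = {}, []
    for pair_id, votes in pairs.items():
        label, share = majority_agreement(votes)
        if share >= min_agreement:
            kept[pair_id] = label
        else:
            needs_review.append(pair_id)
    return kept, needs_review


pairs = {
    "p1": ["A", "A", "A", "B"],  # 75% agree: kept
    "p2": ["A", "B", "B", "A"],  # split vote: sent to review
    "p3": ["B", "B", "B"],       # unanimous: kept
}
kept, needs_review = split_by_agreement(pairs)
print(kept, needs_review)
```

Pairs that land in the review list are often more useful as rubric feedback than as training data, since they show where "better" is underspecified.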

Synthetic data generation. Many organisations now use their own models to generate training data for the next generation. This is efficient. It is also a direct amplification loop: the quality ceiling of the generated data is the quality ceiling of the generating model. If the generating model has systematic weaknesses — in reasoning, in safety, in instruction-following — those weaknesses are reproduced at scale and then trained back in. The loop tightens with every turn.

Fine-tuning for deployment. When a model is fine-tuned for a specific application — customer support, code generation, medical QA — the fine-tuning data shapes behaviour in that domain for potentially years. Teams that invest in high-quality, carefully curated fine-tuning sets tend to spend less time on post-deployment remediation, because the failure modes are narrower and better understood from the start.

ℹ️ A note on synthetic data

Synthetic data is not inherently bad. Used carefully — to augment coverage of underrepresented cases, to generate structural variety, to scale up specific task types — it is genuinely useful. The compounding risk arises when synthetic data is used as a substitute for human-generated signal on tasks where quality and nuance are the entire point. The generating model cannot teach the trained model anything it doesn't already know.
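One simple guard against synthetic data quietly becoming a substitute for human signal is a hard cap on its share of the training mix. The sketch below is a minimal illustration under our own assumptions: the function names and the 20% default are hypothetical, and a real pipeline would likely apply the cap per task type rather than globally.

```python
import random


def max_synthetic_count(n_human: int, max_share: float) -> int:
    """Largest synthetic count s such that s / (n_human + s) <= max_share.

    Uses a floor, so float rounding can only make the result more conservative
    in practice.
    """
    if not 0 <= max_share < 1:
        raise ValueError("max_share must be in [0, 1)")
    return int(max_share * n_human / (1 - max_share))


def mix_with_cap(human, synthetic, max_share=0.2, seed=0):
    """Return all human examples plus a capped random sample of synthetic ones."""
    limit = min(len(synthetic), max_synthetic_count(len(human), max_share))
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    return list(human) + rng.sample(list(synthetic), limit)


human = ["h1", "h2", "h3"]
synthetic = [f"s{i}" for i in range(10)]
mixed = mix_with_cap(human, synthetic, max_share=0.5)
print(len(mixed))
```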

Where quality compounds fastest

Not all data types compound equally. Some categories show much stronger flywheel effects than others.

Data type	Compounding speed	Why it compounds at this rate
RLHF preference labels	Very high	Directly shapes reward model; every training step amplifies signal or noise
Reasoning chain / rationale data	High	Models learn to imitate the structure of reasoning, not just conclusions
Safety evaluation data	High	Gaps in coverage compound into blind spots that persist across generations
Domain factual data	Moderate	Errors are learnable but also correctable with targeted fine-tuning later
Format / instruction-following	Lower	Surface-level behaviour; more amenable to later correction

The practical implication is that budget and quality attention should be unevenly distributed. Spending more on RLHF preference quality than on format data is not just reasonable — it's the correct allocation given the compounding dynamics at each layer.

The invisible lag problem

One of the most dangerous features of the flywheel is that quality problems are not immediately visible. A model trained on subtly biased or inconsistent data may perform well on benchmarks and pass internal evaluations. The degradation only becomes apparent at deployment, at scale, or in the next training cycle when the data problems propagate forward.

We've observed an average lag of roughly 18 months between a systematic data collection choice and visible degradation in deployed model behaviour. That lag means the people who made the data decision are often not the people experiencing the consequences — and the consequences are hard to trace back to their origin.

The attribution problem

When a model behaves poorly in production, the natural response is to look at the model architecture, the fine-tuning choices, the deployment context. Rarely does the investigation run all the way back to the annotation decisions made eighteen months earlier. This is a structural problem in how AI organisations are built — data teams are upstream and often invisible by the time downstream failures appear.

What a well-designed flywheel looks like

The organisations that get this right share a few characteristics that are worth naming explicitly.

They treat data quality as an auditable property, not a subjective one. They track inter-annotator agreement, within-annotator consistency, rationale quality, and coverage across task types — and they use those metrics to make explicit decisions about what goes into training and what doesn't.
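Inter-annotator agreement is the easiest of these metrics to compute. The sketch below implements Cohen's kappa for two annotators from its standard definition (observed agreement corrected for agreement expected by chance); the example labels are made up, and the choice of kappa over other agreement statistics is ours rather than something the analysis above prescribes.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Share of items where the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["A", "A", "B", "B", "A", "B", "A", "A"]
annotator_2 = ["A", "B", "B", "B", "A", "B", "A", "B"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Raw agreement in this example is 75%, but kappa is noticeably lower because half of that agreement would be expected by chance. That difference is the reason to track a chance-corrected statistic rather than a plain match rate.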

They maintain version-controlled data lineages. When a model behaves unexpectedly, they can trace the contributing data back to its source. This is harder to build than it sounds and almost nobody does it well by default.
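A lineage system does not need to start large. The sketch below shows one minimal shape it could take: a content hash per example plus a record of where it came from and what it was derived from. Every field name here (source, annotator_id, rubric_version) is a hypothetical choice for illustration; a production system would also need storage, access control and dataset-level versioning.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LineageRecord:
    content_sha256: str
    source: str                          # hypothetical: vendor or pipeline name
    annotator_id: str
    rubric_version: str
    parent_sha256: Optional[str] = None  # set when derived from another example


def fingerprint(example: dict) -> str:
    """Stable content hash: key order does not change the result."""
    canonical = json.dumps(example, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def trace(record: LineageRecord, index: dict) -> list:
    """Follow parent links back to the earliest known ancestor."""
    chain, seen = [record], {record.content_sha256}
    while chain[-1].parent_sha256 is not None:
        parent = index.get(chain[-1].parent_sha256)
        if parent is None or parent.content_sha256 in seen:
            break  # unknown parent or a cycle: stop rather than loop forever
        chain.append(parent)
        seen.add(parent.content_sha256)
    return chain


raw = {"prompt": "Summarise the report", "chosen": "A"}
derived = {"prompt": "Summarise the report briefly", "chosen": "A"}
raw_rec = LineageRecord(fingerprint(raw), "human-annotation", "ann-17", "v3")
derived_rec = LineageRecord(
    fingerprint(derived), "synthetic-rewrite", "n/a", "v3", fingerprint(raw)
)
index = {r.content_sha256: r for r in (raw_rec, derived_rec)}
print([r.source for r in trace(derived_rec, index)])
```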

They invest in calibration as an ongoing activity, not a one-time event. The quality of preference labels, in particular, drifts over time as annotators develop heuristics that diverge from the intended rubric. Catching that drift early — before it contaminates production data — requires continuous monitoring, not a calibration pass at the start of a project.
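Continuous monitoring can be as simple as tracking each annotator's accuracy on items with known answers over a rolling window. The sketch below assumes gold-standard items are already mixed into the annotation queue; the class name, the window size and the 0.85 threshold are hypothetical defaults, not calibrated values.

```python
from collections import deque


class GoldDriftMonitor:
    """Rolling accuracy on gold-standard items for one annotator."""

    def __init__(self, window: int = 50, min_accuracy: float = 0.85):
        self.window = window
        self.min_accuracy = min_accuracy
        self._results = deque(maxlen=window)  # old results fall off automatically

    def record(self, annotator_label, gold_label) -> bool:
        """Store one gold-item result and return True if drift is flagged."""
        self._results.append(annotator_label == gold_label)
        return self.drifting()

    def accuracy(self):
        """Accuracy over the current window, or None before any results."""
        if not self._results:
            return None
        return sum(self._results) / len(self._results)

    def drifting(self) -> bool:
        # Only flag once the window is full, to avoid noisy early alerts.
        return len(self._results) == self.window and self.accuracy() < self.min_accuracy


monitor = GoldDriftMonitor(window=4, min_accuracy=0.75)
flags = [monitor.record(label, "A") for label in ["A", "A", "A", "A", "B", "B"]]
print(flags, monitor.accuracy())
```

In this small run the flag stays off while accuracy is at or above the threshold and trips on the sixth item, when the last four results contain two misses. A per-annotator monitor like this catches individual drift; rubric-level drift would need the same check aggregated across annotators.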

And they take seriously the difference between data that performs well on benchmarks and data that actually teaches the model something generalisable. Those two things can look identical in the short run and diverge significantly over time.

❌ Flywheel turning backwards

How poor data choices compound

Noisy preference labels → reward model learns inconsistent signal
Policy optimised against that reward model is used to generate synthetic data at scale
Synthetic data bakes in the original noise pattern
Next generation model trained on amplified noise
Fine-tuning reproduces systematic weaknesses at deployment
Problems surface 12–18 months later with no clear origin

✓ Flywheel turning forwards

How quality compounds into advantage

High-quality preference labels with rationale capture and rolling calibration
Reward model learns a clean, generalisable signal
Synthetic augmentation used selectively, not as a substitute
Next generation starts from a higher capability baseline
Fine-tuning narrows failure modes because the seed data is well-understood
Capability advantage compounds 3–4× by generation three

The decisions that matter now

The reason this is worth writing about in early 2026 is that the models being trained in the next 12–24 months will themselves become the training infrastructure for the generation after. The data going into those models is being collected now, by teams making decisions under time pressure and cost constraints. Those decisions will echo forward in ways that are genuinely difficult to predict and expensive to reverse.

This is not an argument for infinite patience or unlimited budget. It is an argument for being deliberate about which quality investments have compounding returns and which don't — and for making those investments before the window closes rather than after the consequences arrive.

Audit your highest-compounding data types first. RLHF preference labels and reasoning chain data have the strongest flywheel effects. That's where quality investment pays back most.
Build data lineage from day one. Retroactive attribution is nearly impossible. If you don't track where your training data came from when it's collected, you won't be able to diagnose problems when they appear.
Calibrate continuously, not once. Annotator quality drifts. The calibration score at project start is not the calibration score six weeks in. Rolling gold-standard seeding, where items with known answers are mixed into the live annotation queue, is the most direct way to catch drift before it compounds.
Be explicit about what synthetic data can and cannot do. Use it to cover distribution gaps and scale structural variety. Don't use it as a substitute for high-quality human signal on tasks where nuance is the entire point.
Build in a 12–18 month review cadence. Given the lag between data decisions and visible consequences, most teams end up judging data choices against models that are too recent to show the full effect. Scheduling explicit retrospectives at 12–18 months, matched to that lag, is what closes the loop.

The flywheel is already turning. The question is which direction.

Work with us

Diraflow builds training datasets designed for compounding returns — with rolling calibration, rationale quality standards, and data lineage documentation built in from the start. Get in touch to discuss your next data project. We respond within one business day.