The hidden complexity of mathematical reasoning datasets

At first glance, building datasets for mathematical reasoning seems straightforward: collect math problems and their solutions. But as soon as you try to train a model that can truly reason, rather than pattern-match against a memorised solution bank, you hit a wall of hidden complexity. In the past two years, Diraflow has constructed more than 5,000 annotated math examples spanning algebra, number theory, calculus, and proof-based contest problems. Here is what we learned about the messy reality behind "math data."

5k+: Annotated math examples built across algebra, calculus, number theory, and proofs
89% → 61%: Accuracy drop when a model fine-tuned on Leibniz notation faced identical problems in prime notation
−42%: Reduction in algebraic sign errors after three weeks of fine-tuning on error-annotated data

Notation is not neutral

A single mathematical concept can be expressed in dozens of notationally distinct ways. A derivative may appear as dy/dx, f′(x), or ẏ; a binomial coefficient as C(n,k), nCk, or the binomial coefficient symbol. Models trained on one notation family often fail to generalise to others, even though the underlying mathematics is identical. This is not a minor quirk. It is a systematic failure mode that appears in production.

In one experiment, a model fine-tuned on Leibniz notation for derivatives scored 89% accuracy on test questions using the same notation style. On identical problems rewritten in prime notation, accuracy dropped to 61%. That gap is not a reasoning failure — it is overfitting to typography, and it is entirely preventable if notation diversity is built into the data collection pipeline from the start.

Our approach to notation coverage

We built a systematic notation augmentation pipeline. Each seed problem is rewritten in three to five equivalent notational variants by expert mathematicians, then verified for semantic equivalence by a second reviewer. This increases dataset size and substantially improves cross-notation generalisation — at the cost of roughly 2.5× the annotation time per problem. It is worth it.
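One way to keep track of that two-person workflow is a small record per seed problem. The sketch below is illustrative only: the class names, the style labels, and the rule that a variant counts once a second person has reviewed it are assumptions made for the example, not a description of Diraflow's internal tooling.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class NotationVariant:
    style: str                      # hypothetical label, e.g. "leibniz" or "prime"
    text: str                       # the problem rewritten in that notation
    author: str                     # mathematician who wrote the variant
    reviewer: Optional[str] = None  # second person who checked equivalence

    def is_verified(self) -> bool:
        # A variant counts only when someone other than its author reviewed it.
        return self.reviewer is not None and self.reviewer != self.author


@dataclass
class SeedProblem:
    problem_id: str
    variants: list = field(default_factory=list)

    def verified_styles(self) -> set:
        return {v.style for v in self.variants if v.is_verified()}

    def coverage_ok(self, min_variants: int = 3) -> bool:
        # The article targets three to five variants; only the lower bound is checked.
        return len(self.verified_styles()) >= min_variants


seed = SeedProblem("deriv-001")
seed.variants += [
    NotationVariant("leibniz", "Find dy/dx for y = x^2.", "ann_a", "ann_b"),
    NotationVariant("prime", "Find f'(x) for f(x) = x^2.", "ann_a", "ann_b"),
    NotationVariant("newton_dot", "Find ẏ for y = t^2.", "ann_a"),  # not reviewed yet
]
print(seed.verified_styles(), seed.coverage_ok())
```

Here the seed problem is held back because only two of its three variants have passed the second review.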

Proof granularity: the layer most datasets skip

Most math datasets are question–answer pairs: the model outputs a final numeric answer or a short symbolic expression. But real mathematical reasoning is about chains of justification. A model that outputs "42" might have guessed. It might have followed a correct procedure and made an arithmetic slip. It might have applied a valid theorem to the wrong object. A final answer alone cannot distinguish between these, which means a dataset built around final answers cannot train a model to reliably produce correct reasoning — only to produce correct-looking outputs.

To train genuine reasoning capability, we need step-level annotation. We developed a proof-step taxonomy that labels each inference line as one of: definition application, algebraic manipulation, substitution, factorisation, inductive hypothesis, case analysis, contradiction, or bound-tightening. Using this taxonomy, we build datasets where the training target is not just the answer but a structured proof sketch — the reasoning trace that produces it.

Example: step-annotated proof (simplified JSON)
{
  "problem": "Prove that the sum of two even integers is even.",
  "proof_steps": [
    {
      "text": "Let a = 2k and b = 2m for integers k, m",
      "type": "definition_application"
    },
    {
      "text": "Then a + b = 2k + 2m",
      "type": "substitution"
    },
    {
      "text": "= 2(k + m)",
      "type": "factorisation"
    },
    {
      "text": "Since k + m is an integer, a + b is divisible by 2",
      "type": "definition_application"
    }
  ],
  "final_answer": "a + b is even. □"
}

Models trained on this step-annotated data show substantially stronger performance on out-of-distribution proof problems. More practically, they can articulate their reasoning when asked — which is the behaviour that makes a math AI actually useful rather than just occasionally correct.
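Records in the shape of the JSON example above are easy to check mechanically before they reach training. The following is a minimal sketch: the field names come from that example, while the snake_case spellings of the taxonomy labels not shown there, and the function name `validate_proof`, are assumptions.

```python
# The eight step types named in the taxonomy, written in the snake_case style
# of the JSON example (spellings beyond the three shown there are assumed).
STEP_TYPES = {
    "definition_application",
    "algebraic_manipulation",
    "substitution",
    "factorisation",
    "inductive_hypothesis",
    "case_analysis",
    "contradiction",
    "bound_tightening",
}


def validate_proof(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means the record passes."""
    steps = record.get("proof_steps")
    if not isinstance(steps, list) or not steps:
        return ["proof_steps must be a non-empty list"]
    errors = []
    for i, step in enumerate(steps):
        if not str(step.get("text", "")).strip():
            errors.append(f"step {i}: empty text")
        if step.get("type") not in STEP_TYPES:
            errors.append(f"step {i}: unknown type {step.get('type')!r}")
    return errors


record = {
    "problem": "Prove that the sum of two even integers is even.",
    "proof_steps": [
        {"text": "Let a = 2k and b = 2m for integers k, m", "type": "definition_application"},
        {"text": "Then a + b = 2k + 2m", "type": "substitution"},
        {"text": "= 2(k + m)", "type": "factorisation"},
    ],
    "final_answer": "a + b is even.",
}
print(validate_proof(record))
```

A check like this catches misspelled labels and empty steps only; whether a step is mathematically valid still needs an expert reviewer.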

Error taxonomies: why wrong answers are your most valuable data

Negative examples are among the most overlooked components of a math dataset, but not all errors are equally informative. A model (or a student) might make an arithmetic slip, confuse a sign, misapply a theorem to a case where its conditions aren't satisfied, or commit a logical non-sequitur that happens to produce the right answer for the wrong reason. Each of these error types has a different cause and requires a different correction signal.

We built an error taxonomy with 18 categories and had annotators label not only whether a model's output was correct, but specifically why it was wrong when it failed. This turned out to be critical for reward model training: we could give partial credit to correct reasoning that contained an arithmetic slip, and strongly penalise confident logical leaps — rather than treating all wrong answers identically, which destroys the training signal that would distinguish them.
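To make the partial-credit idea concrete, here is a rough sketch of how error labels could be turned into a scalar reward. It uses only the four error types mentioned above rather than the full 18-category taxonomy, and the label names, the reward values, and the "worst error wins" rule are all illustrative assumptions, not the scheme used in any actual reward model.

```python
# Hypothetical reward per error label, on a scale from -1.0 to 1.0.
# Slips in otherwise sound reasoning keep most of the credit;
# unjustified logical leaps are penalised hardest.
REWARD_FOR_ERROR = {
    "arithmetic_slip": 0.6,
    "sign_error": 0.4,
    "theorem_misapplication": -0.5,
    "logical_non_sequitur": -1.0,
}


def shaped_reward(error_labels: list) -> float:
    """Map annotator error labels for one solution to a single reward.

    An empty list means the reasoning was judged fully correct.
    With several labels, the most severe one determines the reward.
    """
    if not error_labels:
        return 1.0
    unknown = [label for label in error_labels if label not in REWARD_FOR_ERROR]
    if unknown:
        raise ValueError(f"unknown error labels: {unknown}")
    return min(REWARD_FOR_ERROR[label] for label in error_labels)


print(shaped_reward([]))
print(shaped_reward(["arithmetic_slip"]))
print(shaped_reward(["arithmetic_slip", "logical_non_sequitur"]))
```

Note that the reward depends on the labelled reasoning errors, not on whether the final answer happened to be right, which is what lets a lucky non-sequitur score below an honest arithmetic slip.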

"Using Diraflow's error-annotated dataset, we reduced the rate of algebraic sign errors in our model by 42% in three weeks of fine-tuning, while keeping final-answer accuracy stable across the rest of the problem distribution." — Head of Research, AI math tutoring startup

The diversity of solution strategies

A common mistake in math dataset collection is assuming there is one canonical solution per problem. Many problems can be solved in multiple ways — analytic versus combinatorial, forward versus backward induction, using different theorems that each provide a valid path to the same conclusion. A dataset built around single solution paths trains models to treat those paths as rigid templates. When a model encounters a novel problem that would benefit from an alternative approach, it applies its memorised template and gets stuck.

We intentionally collect multiple solution strategies for at least 30% of our problems, typically by assigning the same problem to different expert annotators with different mathematical backgrounds and then reconciling the results. The additional cost is significant. The payoff — models that can select between strategies rather than blindly applying one — has been clearly visible in downstream evaluations.

❌ Single-path datasets

The standard approach and its failure modes

One canonical solution per problem, usually the shortest
Model learns a fixed template for each problem type
Novel problems that require a different approach cause failures
No signal for when to switch strategy mid-problem
Proof problems with multiple valid structures are flattened to one

✓ Multi-strategy datasets

What we build toward at Diraflow

Multiple valid solution paths captured per problem
Model learns that strategy selection is part of reasoning
Notational variants included to prevent typography overfitting
Step-level annotation captures the reasoning structure, not just output
Error taxonomy enables partial-credit reward signal during fine-tuning

The data leakage problem in math datasets

Most widely used math problem sets, including AMC, AIME, GSM8K, and MATH, are substantially present in the pre-training corpora of major LLMs. This creates an evaluation problem that is easy to overlook: a model that answers competition problems correctly may be recalling solutions from pre-training rather than reasoning through them. High benchmark scores on contaminated data are not evidence of generalisation.

We verify originality using n-gram overlap checks against known pre-training corpora, as far as those corpora can be reconstructed, and we create original problem variants when contamination risk is high. Original variants require more effort to produce, because they need a mathematician who can construct a problem with equivalent difficulty and structure but novel surface form. They are nonetheless the only way to build evaluation sets that actually measure generalisation rather than memorisation.
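A word-level n-gram overlap check of the kind described above can be sketched in a few lines. The tokenisation, the default window of eight words, and the function names are assumptions chosen for illustration; a production check would run against an indexed corpus rather than an in-memory list.

```python
import re


def ngrams(text: str, n: int = 8) -> set:
    """Lower-cased word n-grams; texts shorter than n words yield an empty set."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(candidate: str, reference_texts: list, n: int = 8) -> float:
    """Fraction of the candidate's n-grams that also occur in any reference text."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    seen = set()
    for ref in reference_texts:
        seen |= ngrams(ref, n)
    return len(cand & seen) / len(cand)


# Toy data, invented for this example.
reference = "If a train travels 60 miles in 2 hours, what is its average speed?"
copied = "if a train travels 60 miles in 2 hours what is its average speed"
novel = "A cyclist covers 45 kilometres in 3 hours; find the mean velocity."

print(overlap_ratio(copied, [reference], n=3))
print(overlap_ratio(novel, [reference], n=3))
```

A high ratio flags a problem for replacement with an original variant. A low ratio is weaker evidence, since paraphrased or translated copies slip past surface-level matching.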

ℹ️ On LaTeX normalisation

Before any comparison or evaluation step, canonicalise your LaTeX. Normalise spacing, convert equivalent forms to a single representation (e.g. \frac{a}{b} and a/b when both are present), and unify delimiter styles. Without this step, semantically identical expressions fail string-match comparisons and introduce false negatives into your evaluation pipeline. It is unglamorous infrastructure work that has an outsized effect on the reliability of your quality metrics.
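As a rough illustration of what such a canonicalisation pass might look like, the sketch below handles spacing, \left and \right delimiters, and fractions whose arguments contain no unrelated braces. The function name and the choice of a/b as the target form are assumptions, and the output is intended as a comparison key, not as LaTeX to render.

```python
import re

_FRAC = re.compile(r"\\frac\{([^{}]*)\}\{([^{}]*)\}")
_ATOM = re.compile(r"^[A-Za-z0-9]+$")


def _wrap(expr: str) -> str:
    # Parenthesise anything that is not a single token, so precedence survives.
    return expr if _ATOM.match(expr) else f"({expr})"


def canonicalise(latex: str) -> str:
    s = latex.replace("\\dfrac", "\\frac").replace("\\tfrac", "\\frac")
    # Drop \left and \right sizing, but leave \leftarrow, \rightarrow, etc. alone.
    s = re.sub(r"\\(?:left|right)(?![A-Za-z])", "", s)
    # Drop explicit spacing commands such as \, and \;
    s = re.sub(r"\\[,;:!]", "", s)
    # Keep one space where it separates a control word from a following letter.
    s = re.sub(r"(\\[A-Za-z]+)\s+(?=[A-Za-z])", lambda m: m.group(1) + "\x00", s)
    s = re.sub(r"\s+", "", s).replace("\x00", " ")
    # Rewrite fractions innermost-first until nothing changes.
    previous = None
    while previous != s:
        previous = s
        s = _FRAC.sub(lambda m: f"{_wrap(m.group(1))}/{_wrap(m.group(2))}", s)
    return s


print(canonicalise(r"\frac{a}{b}") == canonicalise("a / b"))
print(canonicalise(r"\left( \dfrac{x+1}{2} \right)"))
```

String-level rules like these only remove surface differences. Deciding that two algebraically different forms are equal, such as 2(k + m) and 2k + 2m, needs a symbolic comparison on top.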

Annotator requirements are non-negotiable

Mathematical annotation is one of the few annotation task types where domain expertise is not just helpful but strictly required. A generalist annotator cannot reliably judge whether a proof step is valid, identify the error class of an incorrect solution, or recognise that two different proofs of the same theorem are both correct. We only hire annotators with at minimum a bachelor's degree in mathematics or a closely related quantitative field, and we run weekly calibration sessions using gold-standard problems with known solution structures.

This is expensive. It is also the only approach that produces data worth training on. The signal-to-noise ratio in math annotation done by non-specialists is low enough that scaling the dataset does not compensate — you get more data, but the quality ceiling prevents it from teaching the model anything it doesn't already know from pattern-matching.

Operational recommendations

Build notation augmentation in from the start. Retrofitting notation diversity into an existing dataset is painful. Design the collection pipeline to capture three to five notational variants per problem before anything enters production.
Annotate proof steps, not just answers. Final-answer-only datasets cannot teach reasoning structure. Even partial step annotation — covering the key inference transitions — substantially improves the quality of the training signal.
Build and use an error taxonomy. Classifying why an answer is wrong is more valuable for reward model training than simply labelling it incorrect. An 18-category taxonomy sounds like overhead; in practice it is the difference between a reward model that improves reasoning and one that just rewards confident-sounding outputs.
Collect multiple solution strategies deliberately. Assign the same problem to annotators with different mathematical backgrounds. Reconcile the results. The additional cost pays back in model robustness on novel problem types.
Test for notation overfitting explicitly. Include cross-notation variants in your evaluation set from the beginning, not as an afterthought. A model that cannot transfer across equivalent notations is not reasoning mathematically — it is doing a sophisticated form of symbol matching.
Treat non-numeric answers as a first-class task type. Proofs, derivations, "no solution exists" responses, and open-ended constructions are systematically underrepresented in most math datasets. They require separate evaluation rubrics and separate annotator calibration, but they represent a large fraction of real mathematical work.
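The cross-notation test recommended above comes down to reporting accuracy per notation family and watching the spread between them. A minimal sketch, assuming each evaluation result is tagged with a notation label (the labels and the toy results below are invented for illustration):

```python
from collections import defaultdict


def accuracy_by_notation(results) -> dict:
    """results: iterable of (notation_label, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for notation, ok in results:
        totals[notation] += 1
        correct[notation] += int(ok)
    return {name: correct[name] / totals[name] for name in totals}


def notation_gap(accuracy: dict) -> float:
    """Difference between the best and worst notation family."""
    return max(accuracy.values()) - min(accuracy.values())


# Toy results: the same four problems posed in two notations.
results = [
    ("leibniz", True), ("leibniz", True), ("leibniz", True), ("leibniz", False),
    ("prime", True), ("prime", False), ("prime", False), ("prime", False),
]
acc = accuracy_by_notation(results)
print(acc, notation_gap(acc))
```

Tracking the gap as its own metric makes notation overfitting visible even when the pooled accuracy looks healthy.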

The hidden complexity of mathematical reasoning datasets is not a bug in the domain. It is a reflection of how rich mathematical thinking actually is — the variety of valid approaches, the importance of reasoning structure, the way notation shapes understanding. Embracing that complexity, rather than flattening it for convenience, is the only path to datasets that train models capable of doing real mathematics rather than mimicking its surface.
