arXiv 2026

Cliff Tokens

Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning

Jaeyong Ko1, Pilsung Kang1, Yukyung Lee2,†

1Seoul National University    2Boston University

† Corresponding author

Overview

Existing analyses often localize a failure only after it has already emerged in a reasoning trace. Cliff tokens instead target the trigger that precedes the visible failure: the token at which the trace begins to shift toward failure.

Cliff token identification overview
Overview of cliff tokens. A reasoning trace can remain recoverable until a single token causes token-wise potential collapse. Cliff tokens identify the token where the trace shifts toward failure.

This work makes three contributions:
  1. We formalize cliff tokens with a z-test-based adaptive threshold that separates statistically significant reasoning failures from sampling noise, showing that these tokens act as failure triggers.
  2. We introduce a cliff taxonomy of deterministic, uncertain, and sampled-off cliffs, demonstrating that each category has distinct probabilistic characteristics.
  3. We validate single-token supervision at cliff positions through Cliff-DPO post-training to improve reasoning performance, with effectiveness varying across cliff types.

What Is a Cliff Token?

Token-wise potential is the probability that a reasoning process reaches the correct answer, conditioned on the partial trace generated up to a token position. Empirically, we estimate it by rolling out continuations from each prefix and measuring the success rate.
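
The rollout-based estimate can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `rollout_is_correct` is a hypothetical callback that samples one continuation from a prefix and reports whether it reaches the correct answer, and the function name `estimate_potentials` is likewise an assumption.

```python
from typing import Callable, List, Sequence


def estimate_potentials(
    trace: Sequence[str],
    rollout_is_correct: Callable[[Sequence[str]], bool],
    n_rollouts: int = 64,
) -> List[float]:
    """Estimate the token-wise potential of every prefix of `trace`.

    Entry t is the empirical success rate of `n_rollouts` continuations
    sampled from the prefix trace[:t], for t = 0 .. len(trace).
    `rollout_is_correct` is a hypothetical hook: it should sample one
    continuation from the model and check the final answer.
    """
    potentials = []
    for t in range(len(trace) + 1):
        prefix = trace[:t]
        successes = sum(bool(rollout_is_correct(prefix)) for _ in range(n_rollouts))
        potentials.append(successes / n_rollouts)
    return potentials
```

In practice the callback would wrap a model sampler and an answer checker; the cost is `n_rollouts` continuations per token position, which is why token-level resolution is expensive compared with segment-level analysis.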

A cliff token is a token where token-wise potential drops significantly from the previous prefix. Before this token, the trace may remain recoverable; after the token is fixed, continuations are more likely to end in failure.

To distinguish statistically significant shifts from rollout noise, the detection criterion uses an adaptive threshold based on a one-sided two-proportion z-test. We use N = 64 rollouts per token position, extending potential-based analysis from coarse trace segments to token-level resolution.
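
A minimal sketch of such a test, assuming equal rollout counts at both positions and a significance level of `alpha = 0.05` (the significance level is our assumption; this summary does not state it). The function name `is_cliff` is illustrative.

```python
import math
from statistics import NormalDist


def is_cliff(successes_prev: int, successes_curr: int,
             n: int = 64, alpha: float = 0.05) -> bool:
    """One-sided two-proportion z-test for a drop in potential.

    Compares the success count at the previous prefix with the count
    after the candidate token is appended, each out of `n` rollouts.
    `alpha` is an assumed significance level, not a value from the source.
    """
    p_prev = successes_prev / n
    p_curr = successes_curr / n
    pooled = (successes_prev + successes_curr) / (2 * n)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (2.0 / n))
    if std_err == 0.0:
        # Both rates are 0 or both are 1: no drop to test.
        return False
    z = (p_prev - p_curr) / std_err
    return z > NormalDist().inv_cdf(1.0 - alpha)
```

The threshold is adaptive in the sense that the drop required for significance depends on the pooled success rate. Under these assumptions, a drop of 6 successes from 64/64 to 58/64 is flagged, while the same drop from 35/64 to 29/64 is not, because the binomial variance is largest near 0.5.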

RQ1: Do Cliff Tokens Trigger Reasoning Failures?

Cliff tokens occur more often in incorrect traces, and deleting the first cliff token can restore reasoning. Across seven instruction-tuned models evaluated on GSM1K, MATH500, and AIME 2025, incorrect traces contain at least one cliff token more often than correct traces and, in most model settings, have a higher average cliff-token count.

Cliff token occurrence statistics across models
Cliff-token occurrence. Proportion of traces containing at least one cliff token and average cliff-token counts per trace, aggregated across the three mathematical reasoning benchmarks.
Pass@k comparison between Cliff-del and Cliff-keep on incorrect traces
Cliff-del vs. Cliff-keep. Cliff-del, which deletes the first cliff token, reaches pass@64 = 1.00 across all evaluated panels, while Cliff-keep, which retains it, stays between 0.71 and 1.00 in panels with cliff tokens. The gap indicates that the first cliff token acts as a failure trigger.
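
This summary does not state how pass@k is computed. A common choice, shown here purely as an assumption, is the standard unbiased estimator 1 - C(n - c, k) / C(n, k) for n sampled continuations of which c are correct.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct ones.

    This is the widely used combinatorial estimator; whether the paper
    uses exactly this form is an assumption.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when fewer than k incorrect samples exist,
    # so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = k = 64, the estimate is 1.0 as soon as a single continuation is correct and 0.0 otherwise, so a panel-level pass@64 of 1.00 would mean every evaluated prefix had at least one successful continuation.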

RQ2: What Probabilistic Patterns Characterize Cliff Tokens?

Cliff tokens exhibit distinct probabilistic structure. Token entropy and token greediness separate them into three failure modes: confident bias, competitive uncertainty, and stochastic sampling noise.

Deterministic cliff

A greedy token with low entropy. The model samples the cliff token with near-absolute certainty.

Uncertain cliff

A greedy token with high entropy. The greedy cliff token is sampled despite high uncertainty.

Sampled-off cliff

A non-greedy token with high entropy. The non-greedy cliff token is sampled stochastically.
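
The three definitions can be read as a small decision rule over the next-token distribution at the cliff position. The sketch below is illustrative only: the entropy cutoff of 1.0 nat is a placeholder we chose, and the `"unclassified"` label for non-greedy tokens with low entropy is our addition, since that combination is not described here.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def classify_cliff(probs: Sequence[float], sampled_index: int,
                   entropy_threshold: float = 1.0) -> str:
    """Assign a cliff token to a taxonomy category.

    `probs` is the model's next-token distribution at the cliff position
    and `sampled_index` is the index of the token that was generated.
    `entropy_threshold` is an assumed cutoff, not a value from the source.
    """
    greedy_index = max(range(len(probs)), key=lambda i: probs[i])
    is_greedy = sampled_index == greedy_index
    high_entropy = token_entropy(probs) >= entropy_threshold
    if is_greedy:
        return "uncertain" if high_entropy else "deterministic"
    # Non-greedy with low entropy is not covered by the three categories.
    return "sampled-off" if high_entropy else "unclassified"
```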

Distribution and enrichment table for deterministic, uncertain, and sampled-off cliffs
Distribution and enrichment analysis of the cliff taxonomy. Sampled-off cliffs are strongly enriched across all seven evaluated models, while deterministic cliffs are less frequent than their baseline token occurrence.
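
One way to read "enrichment" is as the ratio between a category's share among cliff tokens and its share among all tokens. The sketch below follows that reading, which is our assumption; the exact statistic used in the table is not given in this summary.

```python
from collections import Counter
from typing import Dict, Sequence


def enrichment(cliff_labels: Sequence[str],
               all_labels: Sequence[str]) -> Dict[str, float]:
    """Ratio of each category's share among cliff tokens to its baseline share.

    Values above 1 mean the category is over-represented among cliff
    tokens; values below 1 mean it is under-represented. Defining
    enrichment as this ratio is an assumption for illustration.
    """
    if not cliff_labels or not all_labels:
        raise ValueError("both label lists must be non-empty")
    cliff_counts = Counter(cliff_labels)
    base_counts = Counter(all_labels)
    ratios = {}
    for category, base_count in base_counts.items():
        base_share = base_count / len(all_labels)
        cliff_share = cliff_counts[category] / len(cliff_labels)
        ratios[category] = cliff_share / base_share
    return ratios
```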

RQ3: How Does the Cliff Taxonomy Vary Across Families and Scales?

The cliff taxonomy changes with model family and scale. Deterministic cliffs are largely scale-invariant, uncertain cliffs reflect model-specific gaps, and sampled-off cliffs show a scale asymmetry.

Cross-scale transfer of cliff probability mass between Qwen3-0.6B and Qwen3-8B
Cross-scale transfer of cliff probability mass. Deterministic cliffs stay near zero shift across Qwen3-0.6B and Qwen3-8B, uncertain cliffs lose probability mass in both directions, and sampled-off cliffs show an asymmetric shift across scale directions.

Cliff-DPO

Cliff positions can also provide targeted supervision. Cliff-DPO applies preference optimization at the token position where the reasoning trace diverges into failure.
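
A minimal sketch of what a preference loss restricted to one token position could look like, using the standard DPO objective. The pairing is our assumption: the cliff token is treated as the rejected token and some alternative token at the same position as the chosen one. How Cliff-DPO actually constructs its pairs is not described in this summary, and `beta = 0.1` is a placeholder.

```python
import math


def single_token_dpo_loss(logp_chosen: float, logp_rejected: float,
                          ref_logp_chosen: float, ref_logp_rejected: float,
                          beta: float = 0.1) -> float:
    """Standard DPO loss evaluated at a single token position.

    All arguments are log-probabilities of one token given the shared
    prefix, under the policy (logp_*) and a frozen reference model
    (ref_logp_*). Treating the cliff token as `rejected` is an assumption.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls below log 2 once the policy shifts probability from the rejected token toward the chosen one relative to the reference.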

Cliff-DPO result table
Cliff-DPO results. Training on uncertain + sampled-off cliffs gives the strongest Cliff-DPO variant, matching or exceeding cDPO while using approximately 177x fewer loss-contributing token positions.

Citation

@article{ko2026clifftoken,
  title={Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning},
  author={Ko, Jaeyong and Kang, Pilsung and Lee, Yukyung},
  journal={arXiv preprint arXiv:2606.25524},
  year={2026},
  eprint={2606.25524},
  archivePrefix={arXiv}
}