TL;DR: We show MDM samplers admit proper likelihood evaluation—exact (not a bound) and under the test-time distribution—revealing MDMs are much better than the ELBO suggests, closing up to 82% of the gap with autoregressive models.
Abstract
Masked diffusion models (MDMs) generate text by iteratively selecting positions to unmask and then predicting tokens at those positions. Yet MDMs lack proper likelihood evaluation: the evidence lower bound (ELBO) is not only a loose bound on log-likelihood, but, as we show, is also computed under the training distribution rather than the test-time distribution. We resolve this within our DUEL framework, which unifies leading MDM sampling strategies that employ deterministic position selection. We prove that DUEL samplers admit exact likelihood computation under the test-time distribution—giving MDMs proper likelihood, and hence proper perplexity, for the first time. This proper perplexity is the natural analogue of autoregressive perplexity and lets us revisit key questions about MDMs. MDMs are substantially better than previously thought: the MDM–autoregressive perplexity gap shrinks by up to 32% on in-domain data and 82% on zero-shot benchmarks. DUEL enables the first principled comparison of fast, parallel samplers across compute budgets—an analysis impossible with the ELBO and unreliable with generative perplexity—identifying a strong default method. Finally, oracle search over position orderings reveals MDMs can far surpass autoregressive models—achieving 36.47 vs. 52.11 perplexity on AG News—demonstrating the ceiling of MDM performance has not yet been reached.
How MDMs Generate Text
Masked diffusion models (MDMs) generate text by starting with a fully masked sequence and iteratively revealing tokens [Sahoo et al., 2024]. At each step, two things happen:
Position selection: A policy $\pi$ outputs a distribution over masked positions; one is chosen to reveal.
Token prediction: A denoising network $x_\theta$ outputs logits; we define the denoising distribution $p_\theta(v \mid \mathbf{z}) = P_\ell[v]$ where $\mathbf{P} = \mathrm{softmax}(x_\theta(\mathbf{z}))$ is the token probability matrix. A token is sampled from this distribution for the selected position.
This repeats until all positions are revealed. The figure below shows one step:
Figure: One step of MDM generation. Position selection chooses among masked positions (here, equal probability for positions 2 and 4). Token prediction samples a token from the vocabulary distribution at the selected position.
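The two-step loop above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the `denoiser`, the mask id, and the vocabulary size are all illustrative stand-ins.

```python
# Toy sketch of one MDM generation step; `denoiser`, MASK, and the
# vocabulary size are illustrative stand-ins, not the paper's code.
import numpy as np

MASK = -1  # assumed mask token id

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mdm_step(z, denoiser, rng):
    """Reveal one masked position of z in place and return z."""
    masked = [i for i, tok in enumerate(z) if tok == MASK]
    P = softmax(denoiser(z))                 # token probability matrix
    pos = masked[rng.integers(len(masked))]  # position selection (uniform here)
    z[pos] = int(rng.choice(P.shape[1], p=P[pos]))  # token prediction
    return z

rng = np.random.default_rng(0)
z = mdm_step([MASK, 3, MASK, 1], lambda s: np.zeros((len(s), 8)), rng)
```

Here position selection is uniform over masked positions; the deterministic rules discussed next replace that single line.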
Deterministic Position Selection
The position selection policy $\pi$ can be random or deterministic. Nearly all leading MDM sampling strategies use deterministic selection—a fixed rule that maps the current sequence state to a set of positions, with no randomness involved. This includes the sampling strategies used by most fast samplers and state-of-the-art MDMs (e.g. LLaDA).
We call such a rule an unmasking rule $F$. Given a partially-revealed sequence $\mathbf{z}$ with masked positions $\mathcal{M}(\mathbf{z})$, the rule returns a non-empty subset of masked positions to unmask: $F(\mathbf{z}) \subseteq \mathcal{M}(\mathbf{z})$ with $F(\mathbf{z}) \neq \emptyset$.
Since the token probabilities $\mathbf{P} = \mathrm{softmax}(x_\theta(\mathbf{z}))$ are deterministic functions of $\mathbf{z}$, rules depending on $\mathbf{P}$ (like greedy confidence) are valid.
Common Unmasking Rules — Instantiations from the Literature
Let $\mathcal{M}(\mathbf{z})$ denote masked positions. $P_\ell^{(1)}$ and $P_\ell^{(2)}$ are the top-two token probabilities at position $\ell$.
Left-to-Right: Select the $k$ leftmost masked positions. Recovers autoregressive generation when $k=1$.
Greedy Confidence [Nie et al., 2025]: Select the $k$ positions with highest $P_\ell^{(1)}$. Used by LLaDA.
Probability Margin [Kim et al., 2025]: Select the $k$ positions with the highest gap $P_\ell^{(1)} - P_\ell^{(2)}$.
Confidence Threshold [Wu et al., 2025]: Select all positions where $P_\ell^{(1)} \geq \mu$. Enables adaptive parallelism.
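These four rules are easy to state in code. A minimal sketch, assuming `P` is the $(L, V)$ token-probability matrix and `masked` the list of masked position indices; the function names are illustrative, not from the paper's codebase.

```python
# Illustrative implementations of the four unmasking rules; P is the
# (L, V) token-probability matrix, `masked` the masked position indices.
import numpy as np

def left_to_right(P, masked, k=1):
    # The k leftmost masked positions (k=1 recovers autoregression).
    return sorted(masked)[:k]

def greedy_confidence(P, masked, k=1):
    # k positions with the highest top-1 probability P_l^(1).
    return sorted(masked, key=lambda l: -P[l].max())[:k]

def probability_margin(P, masked, k=1):
    # k positions with the largest top-1 minus top-2 gap.
    def margin(l):
        top2 = np.sort(P[l])[-2:]
        return top2[1] - top2[0]
    return sorted(masked, key=lambda l: -margin(l))[:k]

def confidence_threshold(P, masked, mu=0.9):
    # All positions whose top-1 probability clears the threshold mu.
    return [l for l in masked if P[l].max() >= mu]
```

Note that all four depend on $\mathbf{z}$ only through $\mathbf{P}$ and the masked set, so each is a deterministic function of the current sequence state.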
The DUEL Framework
A DUEL sampler combines a pretrained denoising network $x_\theta$ with an unmasking rule $F$. In practice, this is just a pretrained MDM together with a choice of how to sample from it.
Definition (DUEL Sampler)
A DUEL sampler is a pair $(\color{#7c3aed}{x_\theta}, \color{#2563eb}{F})$ consisting of:
Denoising network $x_\theta$: predicts token distributions at masked positions
Deterministic unmasking rule $F$: selects which positions to reveal
The rule $F$ induces a deterministic unmasking policy $\pi^F$—it places all probability mass on the positions $F$ selects, with no randomness in position selection. Together, $(x_\theta, F)$ defines a complete generative procedure.
The sampling procedure is straightforward: start fully masked, repeatedly select positions via $F$ and sample tokens at those positions, until all positions are revealed.
Implicit Distribution: Algorithm 1 defines an implicit generative procedure—we can draw samples without writing down the induced distribution in closed form. Different unmasking rules $F$ yield different distributions, even with the same network $x_\theta$.
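Putting the pieces together, a DUEL sampler is a short loop. A minimal sketch, assuming a toy `denoiser` and a greedy-confidence rule with $k=1$; all names are illustrative, not the paper's API.

```python
# Minimal DUEL sampler sketch: a denoiser plus a deterministic
# unmasking rule F. Names are illustrative stand-ins.
import numpy as np

MASK = -1

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def duel_sample(denoiser, F, length, rng):
    z = [MASK] * length
    while MASK in z:
        P = softmax(denoiser(z))
        masked = [i for i, t in enumerate(z) if t == MASK]
        for pos in F(P, masked):  # F is deterministic given z
            z[pos] = int(rng.choice(P.shape[1], p=P[pos]))
    return z

# Assumed greedy-confidence rule with k = 1.
F = lambda P, masked: [max(masked, key=lambda l: P[l].max())]
z = duel_sample(lambda s: np.zeros((len(s), 4)), F, length=3,
                rng=np.random.default_rng(0))
```

The only randomness is in the token draws; given the same token outcomes, the positions revealed at each step are fixed by $F$.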
Any-Order Autoregressive Formulation
A DUEL sampler generates samples—but what distribution does it sample from? To make this explicit, we build on the any-order autoregressive model (AO-ARM) interpretation of MDMs. This formulation decomposes generation into position-selection and token-prediction steps, allowing us to write down the joint probability of generating a sequence via any ordered partition.
Ordered Partitions
To formalize the sequence of unmasking steps during generation, we introduce ordered partitions. An ordered partition $\sigma = (\sigma_1, \ldots, \sigma_T)$ is a tuple of non-empty subsets of $\{1, \ldots, L\}$ satisfying:
Pairwise disjoint: $\sigma_s \cap \sigma_t = \emptyset$ for all $s \neq t$
Cover: $\sigma_1 \cup \cdots \cup \sigma_T = \{1, \ldots, L\}$
We write $\mathbf{x}_{<\sigma_t}$ for the partial sequence where positions in $\sigma_1 \cup \cdots \cup \sigma_{t-1}$ are revealed and all others remain masked.
Figure: Ordered partitions for a length-4 sequence. Sequential unmasking reveals one position at a time; parallel unmasking reveals multiple positions per step.
The Induced Distribution
The joint probability $p_\theta(\mathbf{x}, \sigma)$ measures the probability that sequence $\mathbf{x}$ was generated via a specific unmasking trajectory $\sigma$. To obtain the data likelihood, we marginalize (sum) over all possible ordered partitions:

$$p_\theta(\mathbf{x}) = \sum_{\sigma} p_\theta(\mathbf{x}, \sigma)$$
The sum ranges over all ordered partitions $\sigma$, each corresponding to a different generation trajectory. This sum has super-exponentially many terms ($\geq L!$ for sequential unmasking alone).
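To see the growth concretely, ordered partitions can be counted by choosing the first block and recursing on the remainder; these counts are the Fubini (ordered Bell) numbers, of which the $L!$ sequential orderings are a subset.

```python
# Count ordered partitions of an n-element set (the Fubini numbers)
# by choosing the first non-empty block and recursing on the rest.
from math import comb, factorial

def count_ordered_partitions(n):
    if n == 0:
        return 1
    return sum(comb(n, k) * count_ordered_partitions(n - k)
               for k in range(1, n + 1))
```

Already for $L = 4$ there are 75 ordered partitions, of which only $4! = 24$ are purely sequential.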
Proper Likelihood Evaluation
DUEL delivers proper likelihood evaluation for MDMs: the likelihood is exact (not a lower bound) and computed under the test-time distribution (not the training distribution). Here we show why this is possible.
Computing the induced distribution directly appears intractable—the summation is over all unmasking orders $\sigma$ (with $ \geq L!$ terms) where each requires a separate neural network evaluation. However, a deterministic rule $F$ collapses the sum to a single term:
Given any partial sequence, $F$ outputs a unique set of positions—there is no randomness
Any ordered partition that deviates from what $F$ would select receives zero probability under $\pi^F$
Exactly one ordered partition $\sigma^*$ is consistent with the policy
Theorem (DUEL Exact Likelihood)
For a DUEL sampler $(x_\theta, F)$, the log-likelihood is exact (not a bound) and computed under the test-time distribution $p_\theta^{\pi^F}$:

$$\log p_\theta^{\pi^F}(\mathbf{x}) = \sum_{t=1}^{T} \sum_{\ell \in \sigma^*_t} \log p_\theta(x_\ell \mid \mathbf{x}_{<\sigma^*_t}),$$

where $\sigma^* = (\sigma^*_1, \ldots, \sigma^*_T)$ is the unique ordered partition satisfying $\sigma^*_t = F(\mathbf{x}_{<\sigma^*_t})$ at each step—the partition the sampler would produce when generating $\mathbf{x}$.
Likelihood follows generation: The algorithm mirrors sampling—simulate the unmasking process, but instead of sampling tokens, reveal the true tokens from $\mathbf{x}$ and accumulate their log-probabilities. This is the crux of DUEL: the likelihood computation follows the same path as generation.
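A minimal sketch of this replay, under the same toy setup as before (illustrative `denoiser`, rule `F`, and mask id; not the paper's implementation):

```python
# Sketch of DUEL likelihood: replay the unmasking path on the true
# sequence x, scoring revealed tokens instead of sampling them.
import numpy as np

MASK = -1

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def duel_log_likelihood(x, denoiser, F):
    z = [MASK] * len(x)
    logp = 0.0
    while MASK in z:
        P = softmax(denoiser(z))
        masked = [i for i, t in enumerate(z) if t == MASK]
        for pos in F(P, masked):        # traces the unique partition sigma*
            logp += float(np.log(P[pos][x[pos]]))
            z[pos] = x[pos]             # reveal the true token, not a sample
    return logp
```

With a uniform toy denoiser over a size-4 vocabulary, any rule yields $3\log(1/4)$ for a length-3 sequence, as expected.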
Standard MDM metrics fail to provide proper likelihood evaluation. The ELBO is a loose bound (not exact) computed under the wrong distribution (uniform, not test-time). Generative perplexity avoids both issues but introduces external model bias.
Evidence Lower Bound (ELBO)
Lower bound, not exact. The gap between ELBO and true likelihood (the variational gap) can be large, underestimating model quality.
Wrong distribution. ELBO measures likelihood under uniform random position selection $\pi^{\mathrm{unif}}$ (all orderings equally likely). At test time, deterministic policies $\pi^F$ avoid bad orderings that uniform selection includes—but ELBO doesn't reflect this.
Generative Perplexity
Reference model bias. Samples are scored by GPT-2, whose preferences may not reflect true quality.
Ignores diversity. A model repeating one good phrase scores well despite mode collapse.
DUEL resolves both: It computes exact likelihood (not a bound) under the test-time distribution $p_\theta^{\pi^F}$ (the actual distribution sampled from with deterministic policy $\pi^F$, not uniform $\pi^{\mathrm{unif}}$), using only the MDM itself (no external reference model).
Experiment 1: Reassessing the Perplexity Gap
MDMs have consistently lagged behind ARMs in perplexity benchmarks. But how much of this gap reflects true model quality, and how much is an artifact of using the ELBO instead of exact likelihood? We evaluate multiple MDM architectures—including SEDD, MDLM, and BD3-LM—against a comparable autoregressive baseline.
For each model, we compute both the ELBO (the standard metric) and DUEL exact likelihood using greedy confidence unmasking. We report perplexity (lower is better) and the percentage of the ARM-MDM gap that DUEL closes:

$$\mathrm{Gap\ Closed} = \frac{\mathrm{PPL}_{\mathrm{ELBO}} - \mathrm{PPL}_{\mathrm{DUEL}}}{\Delta} \times 100\%,$$

where $\Delta = \mathrm{PPL}_{\mathrm{ELBO}} - \mathrm{PPL}_{\mathrm{ARM}}$ measures how far the MDM (as scored by the ELBO) lags behind the ARM.
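As a sanity check on the metric, the SEDD/OpenWebText numbers reported below (ARM 17.54, ELBO $\leq$24.10, DUEL 22.58) recover the tabulated figure:

```python
def gap_closed(ppl_arm, ppl_elbo, ppl_duel):
    # Fraction of the ELBO-vs-ARM perplexity gap recovered by DUEL.
    return 100 * (ppl_elbo - ppl_duel) / (ppl_elbo - ppl_arm)

print(round(gap_closed(17.54, 24.10, 22.58), 1))  # → 23.2
```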
In-Domain Results
DUEL consistently improves over the ELBO, closing 20-32% of the gap across all models. These gains come entirely from proper evaluation—the underlying models are unchanged. The improvement is expected: the ELBO averages over all orderings equally, including poor ones, while deterministic policies avoid such orderings.
OpenWebText (ARM: 17.54)

| Model | ELBO | DUEL | Gap Closed |
|---|---|---|---|
| SEDD | $\leq$24.10 | 22.58 | 23.2% |
| MDLM | $\leq$22.98 | 21.86 | 20.6% |
| BD3-LM ($L'$=4) | $\leq$20.73 | 19.73 | 31.3% |
| BD3-LM ($L'$=8) | $\leq$21.68 | 20.37 | 31.6% |
| BD3-LM ($L'$=16) | $\leq$22.27 | 20.76 | 31.9% |
LM1B (ARM: 26.73)

| Model | ELBO | DUEL | Gap Closed |
|---|---|---|---|
| SEDD | $\leq$33.79 | 32.48 | 18.6% |
| MDLM | $\leq$32.76 | 31.02 | 28.9% |
| BD3-LM ($L'$=4) | $\leq$30.51 | 29.40 | 29.4% |
| BD3-LM ($L'$=8) | $\leq$31.02 | 29.69 | 31.0% |
| BD3-LM ($L'$=16) | $\leq$31.26 | 29.86 | 30.8% |
Zero-Shot Transfer & Large-Scale Results
Zero-shot evaluation on held-out datasets shows even larger improvements. BD3-LM closes up to 82% of the gap on Penn Treebank, with average gap closure of 30-49% across models. At 8B parameters, DUEL consistently reduces LLaDA perplexity compared to the ELBO across all benchmarks.
Zero-Shot Transfer (Gap Closed %)

| Dataset | SEDD | MDLM | BD3-LM |
|---|---|---|---|
| PTB | 31.3% | 34.3% | 81.8% |
| Wikitext | 40.8% | 28.5% | 31.4% |
| AG News | 25.7% | 27.8% | 51.7% |
| Average | 30.0% | 29.9% | 48.7% |
Large-Scale (8B Parameters)

| Model | Method | Wiki | Lamb. | AG |
|---|---|---|---|---|
| Llama3 | Exact | 7.94 | 32.40 | 41.29 |
| LLaDA | ELBO | $\leq$15.3 | $\leq$39.0 | $\leq$85.2 |
| LLaDA | DUEL | 14.50 | 36.00 | 78.91 |
Takeaway: These improvements come purely from proper likelihood evaluation—exact likelihood under the test-time distribution, rather than a loose bound under the training distribution. The models themselves are unchanged. MDMs are closer to ARMs than previously believed.
Experiment 2: Comparing Sampling Strategies
DUEL enables principled comparison of unmasking rules by fixing the denoiser $x_\theta$ and varying only the rule $F$. This comparison is impossible with the ELBO (which ignores the unmasking policy entirely) and unreliable with generative perplexity (which depends on a biased reference model).
Comparing Fast Samplers
Fast samplers differ in their unmasking rule $F$: aggressive rules unmask many positions per step for speed, while conservative rules unmask fewer for quality. DUEL enables a principled comparison by measuring each rule's exact likelihood under its own test-time distribution, at every compute budget.
DUEL perplexity by unmasking rule on OWT. Model: BD3-LM ($L'$=16). ELBO: $\leq$23.52.
| Unmask Rule $F$ | NFE=128 | NFE=256 | NFE=512 | NFE=1024 |
|---|---|---|---|---|
| Left-to-Right | 240.27 | 109.71 | 45.99 | 21.46 |
| Greedy Confidence | 164.94 | 66.64 | 34.74 | 22.03 |
| Probability Margin | 140.38 | 57.48 | 32.24 | 22.05 |
| Confidence Threshold* | 226.97 | 116.83 | 43.48 | 22.05 |
*Adaptive NFE; thresholds chosen to match target step counts.
Takeaway: Probability margin performs best at low compute. DUEL gives consistent rankings across budgets; generative perplexity rankings cross repeatedly—left-to-right scores best at 128 NFE despite worst DUEL perplexity (degenerate text that GPT-2 rewards).
Why Other Metrics Fail
We compare DUEL perplexity against three sample-based metrics—generative perplexity, entropy, and MAUVE—and find each unreliable:
Generative perplexity rankings are inconsistent across NFE budgets. Strikingly, left-to-right achieves the lowest generative perplexity at 128 NFE despite having the worst DUEL perplexity. The explanation: left-to-right produces low-entropy, repetitive text at low NFE, and autoregressive evaluators like GPT-2 reward such degenerate output with low perplexity. Generative perplexity also distorts magnitude, showing 30–40× degradation from 1024 to 128 NFE versus 6–11× under DUEL.
Entropy confirms the degeneracy: left-to-right drops to ~4.2 bits at low NFE (repetitive text), while probability margin and greedy confidence maintain ~4.8–5.0 bits (diverse output). However, entropy alone cannot rank samplers—it measures diversity, not quality.
MAUVE saturates near zero at low NFE for all rules, providing almost no discriminative signal. It is most useful for distinguishing high-quality regimes, not for comparing degraded generation.
Experiment 3: Oracle Orderings
Since different rules yield different likelihoods, we ask: what is the best possible perplexity achievable over all orderings? The oracle perplexity answers this by exhaustively searching all permutations and selecting, for each sequence, the one that maximizes likelihood. For BD3-LM ($L'=4$), we search all $4!=24$ permutations per block. This represents an upper bound on MDM performance—the best a model could achieve if it always chose the optimal unmasking order.
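The oracle search itself is a brute-force loop over permutations. A sketch under the same toy scoring setup as earlier (illustrative `denoiser`; the paper applies this per 4-token block of BD3-LM):

```python
# Brute-force oracle: score every sequential unmasking order of x and
# keep the best. Toy setup with illustrative names, not the paper's code.
from itertools import permutations
import numpy as np

MASK = -1

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def order_log_likelihood(x, denoiser, order):
    z = [MASK] * len(x)
    logp = 0.0
    for pos in order:                    # reveal true tokens in this order
        P = softmax(denoiser(z))
        logp += float(np.log(P[pos][x[pos]]))
        z[pos] = x[pos]
    return logp

def oracle_log_likelihood(x, denoiser):
    return max(order_log_likelihood(x, denoiser, order)
               for order in permutations(range(len(x))))
```

The search is $L!$ per block, which is why the paper restricts it to short ($L' = 4$) blocks.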
| Model | Method | Unmask Rule | Perplexity |
|---|---|---|---|
| ARM | Exact | — | 52.11 |
| BD3-LM | ELBO | — | $\leq$61.67 |
| BD3-LM | DUEL | Left-to-Right | 54.94 |
| BD3-LM | DUEL | Greedy Conf. | 56.73 |
| BD3-LM | DUEL | Prob. Margin | 57.80 |
| BD3-LM | DUEL | Oracle | 36.47 |
Takeaway: The oracle ordering (36.47) surpasses the ARM baseline (52.11)—a degree of freedom ARMs lack. This suggests potential for test-time compute methods that find better orderings without ground-truth access.
BibTeX
@article{turok2026duel,
title={DUEL: Exact Likelihood for Masked Diffusion via Deterministic Unmasking},
author={Turok, Gilad and De Sa, Chris and Kuleshov, Volodymyr},
year={2026}
}