GitHub: https://github.com/zhijie-group/LoPA

We introduce the Lookahead Parallel Decoding (LoPA) algorithm for diffusion large language model (dLLM) inference. LoPA enables up to 10.1 tokens per forward pass (TPF) for state-of-the-art dLLMs without compromising predictive performance, a degree of parallelism unmatched by previous dLLM decoding methods. Under multi-device deployment, our specialized system LoPA-Dist achieves a single-sample throughput of 1073.9 tokens per second.

Figure 1. Throughput performance of LoPA under guaranteed inference speed. LoPA raises the single-sample throughput of D2F-Dream to up to 1073.9 and 774.1 tokens/s on MBPP and GSM8K, respectively, significantly outperforming baselines.

Background

dLLMs show significant potential for high-speed inference, yet current confidence-driven decoding strategies are constrained by limited parallelism, typically achieving only 1-3 TPF on math and coding tasks [2, 3]. Our investigation identifies a key insight: during dLLM inference, the degree of parallelism fluctuates sharply with the prediction confidence, which is heavily influenced by the Token Filling Order (TFO). Consequently, standard strategies that greedily prioritize currently high-confidence positions may lead to suboptimal trajectories. To address this, we propose Lookahead Parallel Decoding (LoPA), a training-free, plug-and-play algorithm designed to actively explore superior TFOs and unlock higher parallelism.

Methodology

This section first explains the foundational Confidence-Driven Sampling used in regular dLLM inference [2, 3, 4] and then elaborates on LoPA.

Figure 2. Scaling analysis of LoPA on D2F-Dream with varying branch counts. The results illustrate that LoPA effectively scales the TPF of D2F to a peak exceeding 10, thereby significantly reducing the total number of decoding steps.

Preliminary: Confidence-Driven Sampling for dLLMs

Confidence-driven sampling is a prevalent paradigm for current dLLMs to boost parallelism, adopted in models such as Fast-dLLM [2], D2F [3], and SDAR [4]. Specifically, given a sequence $x_t$ with a set of masked positions $M_t$, the dLLM model $p_{\theta}$ outputs a predictive distribution $p_{\theta}(\cdot \mid x_t)$. A candidate sequence $\hat{x}_0 \sim p_{\theta}(\cdot \mid x_t)$ is sampled, and a confidence function, $\text{Conf}(\cdot)$, assigns a score to each position $i \in M_t$. The set of positions to fill, $I_{fill}$, is then determined as:

$$I_{fill} = \begin{cases} \{i \in M_t \mid \text{Conf}(i) > \tau\} & \text{if } \{i \in M_t \mid \text{Conf}(i) > \tau\} \neq \emptyset \\ \{\arg\max_{i \in M_t} \text{Conf}(i)\} & \text{otherwise} \end{cases}$$

The algorithm then accepts the predictions according to $I_{fill}$ and moves to the next iteration.
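
For illustration, the selection rule above can be written in a few lines of PyTorch-style code. This is a minimal sketch under our own naming (`select_fill_positions`, `tau`), not the released implementation; it assumes the per-position confidence (e.g., the maximum softmax probability of the sampled candidate) has already been computed.

```python
import torch

def select_fill_positions(confidence: torch.Tensor, masked: torch.Tensor, tau: float = 0.9) -> torch.Tensor:
    """Return indices of masked positions to fill in this iteration.

    confidence: per-position confidence Conf(i), shape [seq_len]
    masked:     boolean mask of still-masked positions M_t, shape [seq_len]
    tau:        acceptance threshold
    """
    # Masked positions whose confidence clears the threshold.
    above = masked & (confidence > tau)
    if above.any():
        return above.nonzero(as_tuple=True)[0]
    # Fallback: fill only the single most confident masked position,
    # so at least one token is committed per step.
    fallback = confidence.masked_fill(~masked, float("-inf")).argmax()
    return fallback.unsqueeze(0)
```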

LoPA

Figure 3. Overview of Lookahead Parallel Decoding (LoPA). In each iteration, LoPA generates an anchor branch alongside multiple lookahead branches (e.g., $B_1, \dots, B_k$) by independently sampling high-confidence positions from the anchor branch's unfilled set. A branch confidence verification mechanism then evaluates all branches in parallel within a single forward pass, selecting the optimal path to maximize future parallelism.

As shown in Figure 3, in every decoding iteration, LoPA looks ahead at multiple TFOs, yielding multiple sampling branches, and then identifies the branch with superior future parallel decoding potential.

Looking Ahead at Multiple TFOs in Parallel

LoPA operates by generating multiple branches. First, it constructs an Anchor Branch ($B_0$) using the standard confidence-driven strategy (filling positions in $I_{fill}$).

LoPA is designed to explore one step further than this anchor branch. To ensure effective and reliable exploration, we prioritize sampling tokens with higher confidence, a strategy shown in Fast-dLLM [2] to yield more stable predictions. Specifically, in addition to $B_0$, we generate $k$ competitive Lookahead Branches. We identify the top-$k$ positions from the anchor branch's unfilled set $M_{B_0}$ that possess the highest confidence scores. For each identified position, we sample it independently to create a distinct branch. This results in a set of $k$ new branches $\{B_1, \dots, B_k\}$, each with its own sequence $x_{B_j}$ and unfilled set $M_{B_j}$.
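
Building on the sketch above (it reuses the hypothetical `select_fill_positions` helper), branch construction might look roughly as follows. The function and argument names (`build_branches`, `x0_hat`, `k`) are our own illustrative choices, not the paper's code.

```python
def build_branches(x_t, x0_hat, confidence, masked, tau=0.9, k=4):
    """Build the anchor branch B0 plus up to k lookahead branches B1..Bk.

    x_t:        current partially decoded sequence, shape [seq_len]
    x0_hat:     sampled candidate sequence x_hat_0 ~ p_theta(. | x_t)
    confidence: per-position confidence Conf(i) of the candidate tokens
    masked:     boolean mask of currently unfilled positions
    Returns a list of (tokens, unfilled_mask) pairs, B0 first.
    """
    # Anchor branch B0: standard confidence-driven filling.
    fill = select_fill_positions(confidence, masked, tau)
    anchor = x_t.clone()
    anchor[fill] = x0_hat[fill]
    anchor_unfilled = masked.clone()
    anchor_unfilled[fill] = False
    branches = [(anchor, anchor_unfilled)]

    # Lookahead branches: each additionally commits one of the top-k most
    # confident positions that B0 still leaves unfilled.
    remaining = confidence.masked_fill(~anchor_unfilled, float("-inf"))
    k_eff = min(k, int(anchor_unfilled.sum()))
    for pos in remaining.topk(k_eff).indices:
        tokens = anchor.clone()
        tokens[pos] = x0_hat[pos]
        unfilled = anchor_unfilled.clone()
        unfilled[pos] = False
        branches.append((tokens, unfilled))
    return branches
```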

Branch Confidence-based Verification

Inspired by DeepConf [5], we design a branch confidence metric to guide the selection among candidate decoding paths. Formally, the confidence of a branch $B_j$ is defined as the average prediction confidence over its remaining unfilled positions $M_{B_j}$:

$$C(B_j) = \frac{1}{|M_{B_j}|} \sum_{i \in M_{B_j}} \text{Conf}(i)$$

A higher branch confidence indicates that more unfilled positions are likely to be accepted in the very next decoding step. This directly increases the number of tokens filled per iteration, thereby enhancing the overall parallelism. Beyond this mean confidence, branch confidence can also be quantified by other methods [5], such as applying a sliding window to assess local quality or averaging confidence over the least confident segment to identify weak links.
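
The mean metric, together with the two alternatives just mentioned, could be sketched as below; the mode names, window size, and bottom fraction are illustrative assumptions of ours rather than values from the paper.

```python
def branch_confidence(conf, unfilled, mode="mean", window=32, bottom_frac=0.1):
    """Score a branch B_j from the confidences of its unfilled set M_Bj.

    conf:     per-position confidence under the branch's sequence, shape [seq_len]
    unfilled: boolean mask of the branch's unfilled positions
    """
    vals = conf[unfilled]
    if vals.numel() == 0:
        return conf.new_tensor(float("inf"))   # branch is fully decoded
    if mode == "mean":                          # C(B_j) as defined above
        return vals.mean()
    if mode == "window":                        # worst sliding-window average (local quality)
        w = min(window, vals.numel())
        return vals.unfold(0, w, 1).mean(dim=1).min()
    if mode == "bottom":                        # average over the least confident segment
        n = max(1, int(bottom_frac * vals.numel()))
        return vals.topk(n, largest=False).values.mean()
    raise ValueError(f"unknown mode: {mode}")
```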

This verification mechanism offers distinct advantages. First, all candidate branches (Anchor + Lookahead) can be packed and verified within a single forward pass, with custom attention masks ensuring independent computation for each branch. Second, the logits computed during branch evaluation are directly reused in the next decoding step, eliminating the need for additional forward passes.
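
A simplified sketch of this verification step is given below, reusing the hypothetical `branch_confidence` helper. For clarity we stack the branches along the batch dimension, which trivially keeps them independent; the actual system instead packs them along the sequence axis with a custom attention mask so that they fit in a single forward pass. The `model` callable mapping token ids of shape [B, L] to logits of shape [B, L, V] is an assumption of this sketch.

```python
import torch

def verify_and_select(model, branches):
    """Score all candidate branches and pick the winner (simplified sketch).

    branches: list of (tokens, unfilled_mask) pairs of equal length, B0 first.
    """
    tokens = torch.stack([t for t, _ in branches])        # [k+1, L]
    logits = model(tokens)                                 # assumed shape [k+1, L, V]
    conf = logits.softmax(dim=-1).max(dim=-1).values       # max-prob confidence per position
    scores = torch.stack([
        branch_confidence(conf[j], unfilled)               # C(B_j) from the sketch above
        for j, (_, unfilled) in enumerate(branches)
    ])
    best = int(scores.argmax())
    # The winning branch's logits double as the next step's predictions,
    # so verification costs no additional forward pass.
    return best, logits[best]
```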

Application: integration with D2F

LoPA integrates seamlessly with D2F [3], the first open-source diffusion language model whose inference throughput surpasses that of autoregressive (AR) models. Our application of LoPA to D2F incorporates two key enhancements:

  • Parallel Exploration in a Decoding Window: We treat all active blocks in D2F's pipeline as a single window within which LoPA's branch exploration and lookahead verification operate. Replacing the original block-level causal attention with full attention within this window reduces implementation complexity and improves computational performance (a toy illustration of the two masking schemes follows this list).
  • System Integration and Performance: On the D2F-Dream model, LoPA achieves a TPF of up to 10.1. To leverage this parallelism, we developed a specialized multi-device inference system where LoPA achieves a throughput of 1073.86 tokens per second.
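
To make the first point concrete, the following toy snippet contrasts block-level causal attention (as in vanilla D2F) with full attention within the active decoding window (as in our integration). The block size, window layout, and Boolean convention (True means the position may be attended to) are illustrative assumptions, not details of the D2F implementation.

```python
import torch

def window_attention_mask(num_blocks: int, block_size: int, full_window: bool) -> torch.Tensor:
    """Boolean attention mask over the active decoding window (True = may attend).

    full_window=False reproduces block-level causal masking, where a block
    attends only to itself and earlier blocks; full_window=True lets every
    position in the window attend to every other position.
    """
    n = num_blocks * block_size
    if full_window:
        return torch.ones(n, n, dtype=torch.bool)
    block_id = torch.arange(n) // block_size
    # mask[q, k] is True iff key k lies in the same block as query q or an earlier one.
    return block_id.unsqueeze(0) <= block_id.unsqueeze(1)
```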

Results

Scaling Analysis of Branch Count

We analyzed the impact of the competitive branch count $k$ on TPF and quality using D2F models fine-tuned on Dream [6] and DiffuCoder [7]. Results show that TPF consistently improves with $k$; however, an excessively large $k$ introduces fluctuations, which we attribute to the model prioritizing future confidence over local optimality. These results point to an optimal trade-off, where a carefully chosen $k$ can maximize TPF while preserving quality.

Figure 4. Scaling Curves of LoPA. LoPA scales the TPF for D2F-Dream and D2F-DiffuCoder to up to 10.1 and 8.3 on GSM8K and HumanEval+ respectively, with comparable performance.

As shown in Figure 4, on GSM8K, LoPA scales the TPF of D2F-Dream to 10.1 while maintaining a score (73.8) superior to the Dream baseline (72.6). On HumanEval+, LoPA scales the TPF of D2F-DiffuCoder to 8.3 with marginal performance degradation, demonstrating a clear speed-accuracy trade-off.

Tables 1 and 2 below confirm this efficacy across multiple benchmarks.

Table 1. Accuracy-preserving parallelism scaling of Dream on multiple benchmarks across multiple branches. TPF denotes Tokens Per Forward pass. LoPA significantly scales the TPF of D2F-Dream while maintaining or exceeding baseline scores.
| Model | Decoding algo | MBPP 3-shot TPF | MBPP 3-shot Score | Math 4-shot TPF | Math 4-shot Score | HumanEval 0-shot TPF | HumanEval 0-shot Score | GSM8K 4-shot TPF | GSM8K 4-shot Score |
|---|---|---|---|---|---|---|---|---|---|
| Dream | vanilla | 1 | 56.2 | 1 | 33.7 | 1 | 55.5 | 1 | 72.6 |
| Dream | Fast-dLLM | 1.9 | 55.6 | 1.9 | 37.6 | 1.8 | 55.5 | 2.1 | 72.6 |
| Dream | LoPA | 3.3 | 54.8 | 3.4 | 37.0 | 2.9 | 53.0 | 3.1 | 73.3 |
| D2F-Dream | vanilla | 2.3 | 53.8 | 2.6 | 36.8 | 2.5 | 56.1 | 3.1 | 78.5 |
| D2F-Dream | LoPA | 5.4 | 56.0 | 8.0 | 35.2 | 6.3 | 56.1 | 10.1 | 73.8 |

Table 2. Accuracy-preserving parallelism scaling of DiffuCoder on MBPP+ and HumanEval+ benchmarks. LoPA boosts TPF by nearly 4× compared to the vanilla D2F baseline with minimal impact on generation quality.
| Model | Decoding algo | MBPP+ 0-shot TPF | MBPP+ 0-shot Score | HumanEval+ 0-shot TPF | HumanEval+ 0-shot Score |
|---|---|---|---|---|---|
| DiffuCoder | vanilla | 1 | 61.9 | 1 | 65.2 |
| D2F-DiffuCoder | vanilla | 2.2 | 61.9 | 2.2 | 65.9 |
| D2F-DiffuCoder | LoPA | 6.7 | 61.6 | 8.3 | 64.0 |

Additional Results
Figure 5. Overview of LoPA Branch Parallel Distributed Inference System Design. A key distinction lies in the KV cache management protocol tailored for different backends: LoPA-Dist-NV utilizes a robust two-phase update mechanism to ensure consistency, whereas LoPA-Dist-Ascend adopts a streamlined single-phase update strategy for optimized serving efficiency.

System Throughput and Scalability

To fully exploit LoPA’s parallelism, we designed LoPA-Dist, a distributed inference system utilizing Branch Parallelism (BP).

The system distributes candidate branches across multiple GPUs for concurrent processing. We provide two specialized implementations:

  • LoPA-Dist-NV (CUDA): Optimized for low latency using static KV cache and a two-phase update protocol (Pre-Write and Commit-Winner-Cache) to ensure consistency.
  • LoPA-Dist-Ascend (Ascend 910C): Optimized for high throughput using hybrid parallelism and graph compilation to fuse element-wise operations.
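
A conceptual sketch of one branch-parallel decoding step is given below, using `torch.distributed` as a stand-in for the actual communication layer; it assumes a process group has already been initialized (e.g., via torchrun) and reuses the hypothetical `branch_confidence` helper from the methodology section. The KV cache protocols (two-phase Pre-Write / Commit-Winner-Cache on NV, single-phase on Ascend) are deliberately omitted, so this should be read as an outline of Branch Parallelism rather than the LoPA-Dist implementation.

```python
import torch
import torch.distributed as dist

def branch_parallel_step(model, branches):
    """One decoding step under Branch Parallelism (conceptual sketch).

    Each rank scores a disjoint slice of the candidate branches locally, then
    all ranks agree on the globally best branch via an all-gather of
    (score, index) pairs. KV cache handling is omitted.
    """
    rank, world = dist.get_rank(), dist.get_world_size()
    local = branches[rank::world]                          # round-robin branch assignment

    best_score, best_idx = float("-inf"), -1
    for j, (tokens, unfilled) in enumerate(local):
        conf = model(tokens.unsqueeze(0)).softmax(-1).max(-1).values[0]
        score = branch_confidence(conf, unfilled).item()   # hypothetical helper from above
        if score > best_score:
            best_score, best_idx = score, rank + j * world  # global index of this branch

    # Share every rank's best candidate and pick the overall winner.
    packed = torch.tensor([best_score, float(best_idx)])
    gathered = [torch.zeros_like(packed) for _ in range(world)]
    dist.all_gather(gathered, packed)
    winner = max(gathered, key=lambda t: t[0].item())
    return int(winner[1].item())
```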

As shown in Table 3, this design achieves near-linear scalability. On the Ascend platform, LoPA-Dist achieves an average throughput of up to 1073.86 tokens/s.

Table 3. System performance of D2F-Dream under guaranteed inference speed. The results demonstrate that our system efficiently translates algorithmic parallelism (high TPF) into significant wall-clock acceleration, achieving high Parallelism Utilization (PU) and average throughputs exceeding 1000 tokens/s on the specialized LoPA-Dist-Ascend engine.
| Model | Platform | MBPP Avg TPS | MBPP Max TPS | MBPP TPF | MBPP Latency | GSM8K Avg TPS | GSM8K Max TPS | GSM8K TPF | GSM8K Latency |
|---|---|---|---|---|---|---|---|---|---|
| D2F-Dream-Base | LoPA-Dist-NV | 708.48 | 1470.95 | 15.55 | 0.74 | 619.33 | 1299.25 | 13.16 | 0.85 |
| D2F-Dream-Base | LoPA-Dist-Ascend | 1073.86 | 2400.12 | 11.92 | 0.78 | 856.46 | 2751.61 | 9.34 | 0.75 |
| D2F-Dream-Instruct | LoPA-Dist-NV | 636.55 | 1811.71 | 9.52 | 0.14 | 609.90 | 1407.56 | 11.42 | 0.26 |
| D2F-Dream-Instruct | LoPA-Dist-Ascend | 896.21 | 2586.73 | 8.64 | 0.11 | 897.10 | 1868.16 | 9.30 | 0.21 |

Table 4. Performance ablation study of D2F-Dream models on different platforms, corresponding to settings S1-S18.
| Model | Sys. Arch. | Setting | MBPP 3-shot Avg TPS | MBPP 3-shot Max TPS | MBPP 3-shot Top-10 TPS | MBPP 3-shot Score | GSM8K 4-shot Avg TPS | GSM8K 4-shot Max TPS | GSM8K 4-shot Top-10 TPS | GSM8K 4-shot Score |
|---|---|---|---|---|---|---|---|---|---|---|
| D2F-Dream-Base | LoPA-Dist-NV | S1 | 415.19 | 813.04 | 720.35 | 53.00 | 345.52 | 959.05 | 704.39 | 75.97 |
| D2F-Dream-Base | LoPA-Dist-NV | S2 | 500.33 | 1185.77 | 874.87 | 53.40 | 402.52 | 913.12 | 842.83 | 73.54 |
| D2F-Dream-Base | LoPA-Dist-NV | S3 | 550.37 | 1472.41 | 929.72 | 51.20 | 436.22 | 994.82 | 885.27 | 71.19 |
| D2F-Dream-Base | LoPA-Dist-NV | S4 | 589.22 | 1576.93 | 1006.57 | 47.20 | 475.58 | 1203.61 | 1028.15 | 68.16 |
| D2F-Dream-Base | LoPA-Dist-NV | S5 | 633.16 | 1408.40 | 963.67 | 46.80 | 516.85 | 1212.65 | 1055.08 | 66.79 |
| D2F-Dream-Base | LoPA-Dist-NV | S6 | 678.26 | 1615.30 | 1150.65 | 41.80 | 546.72 | 1225.21 | 1121.57 | 64.14 |
| D2F-Dream-Base | LoPA-Dist-NV | S7 | 466.27 | 784.33 | 764.52 | 51.80 | 416.91 | 909.82 | 841.95 | 71.27 |
| D2F-Dream-Base | LoPA-Dist-NV | S8 | 545.90 | 1497.22 | 927.67 | 51.40 | 486.94 | 1176.14 | 959.37 | 68.39 |
| D2F-Dream-Base | LoPA-Dist-NV | S9 | 588.00 | 1584.28 | 983.09 | 48.60 | 520.70 | 1250.67 | 1056.01 | 68.01 |
| D2F-Dream-Base | LoPA-Dist-NV | S10 | 637.38 | 1552.56 | 1028.97 | 47.00 | 558.01 | 1115.26 | 1071.66 | 65.05 |
| D2F-Dream-Base | LoPA-Dist-NV | S11 | 655.45 | 1535.10 | 1059.72 | 43.80 | 592.94 | 1315.93 | 1155.11 | 64.44 |
| D2F-Dream-Base | LoPA-Dist-NV | S12 | 708.48 | 1470.95 | 1132.78 | 39.80 | 619.33 | 1299.25 | 1201.18 | 60.88 |
| D2F-Dream-Base | LoPA-Dist-Ascend | S13 | 615.74 | 2173.7 | 1253.07 | 50.20 | 492.94 | 1337.60 | 1158.18 | 75.06 |
| D2F-Dream-Base | LoPA-Dist-Ascend | S14 | 753.78 | 2115.55 | 1397.85 | 50.20 | 589.77 | 1532.99 | 1342.79 | 72.86 |
| D2F-Dream-Base | LoPA-Dist-Ascend | S15 | 842.97 | 2470.79 | 1538.16 | 50.00 | 644.34 | 1723.19 | 1476.24 | 70.58 |
| D2F-Dream-Base | LoPA-Dist-Ascend | S16 | 923.35 | 2647.12 | 1513.54 | 45.60 | 700.14 | 1756.58 | 1601.93 | 68.69 |
| D2F-Dream-Base | LoPA-Dist-Ascend | S17 | 994.88 | 2740.54 | 1739.85 | 43.00 | 754.75 | 2583.76 | 1848.82 | 64.29 |
| D2F-Dream-Base | LoPA-Dist-Ascend | S18 | 1073.86 | 2400.12 | 1939.22 | 41.80 | 856.46 | 2751.61 | 2098.72 | 62.55 |
| D2F-Dream-Instruct | LoPA-Dist-NV | S1 | 305.74 | 959.00 | 695.88 | 52.80 | 330.62 | 758.34 | 674.53 | 78.17 |
| D2F-Dream-Instruct | LoPA-Dist-NV | S2 | 373.23 | 1302.99 | 877.12 | 51.40 | 402.63 | 961.29 | 804.31 | 74.22 |
| D2F-Dream-Instruct | LoPA-Dist-NV | S3 | 451.62 | 1419.09 | 1143.30 | 53.00 | 444.73 | 943.22 | 870.85 | 73.39 |
| D2F-Dream-Instruct | LoPA-Dist-NV | S4 | 503.71 | 1779.60 | 1226.72 | 46.60 | 495.93 | 1131.64 | 941.23 | 72.48 |
| D2F-Dream-Instruct | LoPA-Dist-NV | S5 | 568.65 | 1660.89 | 1317.38 | 42.00 | 540.76 | 1185.14 | 1033.60 | 68.99 |
| D2F-Dream-Instruct | LoPA-Dist-NV | S6 | 615.95 | 1951.86 | 1542.82 | 37.60 | 568.75 | 1352.22 | 1139.06 | 65.88 |
| D2F-Dream-Instruct | LoPA-Dist-NV | S7 | 325.15 | 697.49 | 620.42 | 50.80 | 379.42 | 839.65 | 710.10 | 75.28 |
| D2F-Dream-Instruct | LoPA-Dist-NV | S8 | 408.37 | 1182.69 | 866.90 | 51.00 | 449.56 | 934.55 | 838.35 | 75.13 |
| D2F-Dream-Instruct | LoPA-Dist-NV | S9 | 465.55 | 1097.40 | 1016.91 | 50.60 | 497.47 | 1172.31 | 946.98 | 74.75 |
| D2F-Dream-Instruct | LoPA-Dist-NV | S10 | 544.72 | 1542.99 | 1145.55 | 46.80 | 539.28 | 1147.95 | 1021.96 | 71.34 |
| D2F-Dream-Instruct | LoPA-Dist-NV | S11 | 591.57 | 1578.00 | 1204.05 | 42.20 | 580.04 | 1292.18 | 1132.19 | 66.94 |
| D2F-Dream-Instruct | LoPA-Dist-NV | S12 | 636.55 | 1811.71 | 1500.59 | 36.00 | 609.90 | 1407.56 | 1159.28 | 65.50 |
| D2F-Dream-Instruct | LoPA-Dist-Ascend | S13 | 412.90 | 911.73 | 911.73 | 50.80 | 515.01 | 1235.84 | 1090.45 | 76.12 |
| D2F-Dream-Instruct | LoPA-Dist-Ascend | S14 | 525.66 | 1546.34 | 1143.37 | 48.40 | 619.58 | 1424.32 | 1310.35 | 75.36 |
| D2F-Dream-Instruct | LoPA-Dist-Ascend | S15 | 625.53 | 1729.78 | 1435.06 | 46.20 | 689.89 | 1644.74 | 1356.36 | 72.63 |
| D2F-Dream-Instruct | LoPA-Dist-Ascend | S16 | 716.19 | 1780.41 | 1558.00 | 43.80 | 770.78 | 1589.69 | 1480.56 | 71.49 |
| D2F-Dream-Instruct | LoPA-Dist-Ascend | S17 | 796.65 | 1798.14 | 1687.69 | 39.80 | 837.21 | 1782.80 | 1517.90 | 67.78 |
| D2F-Dream-Instruct | LoPA-Dist-Ascend | S18 | 896.21 | 2586.73 | 2086.04 | 36.40 | 897.10 | 1868.16 | 1642.72 | 66.87 |

The results illustrate the trade-off between inference throughput and generation quality across varying branch configurations and system backends.

Future Works

We are working on Diffulex, a new inference framework for dLLMs that is flexible and easy to extend. Diffulex supports multiple decoding strategies, including D2F, BlockDiffusion, and Fast-dLLM-v2, and will be released soon. You can find the code here.

We will explore adapting LoPA to SDAR and other confidence-driven diffusion language models to further demonstrate its generalizability and effectiveness across diverse model architectures.

References

[1] Nie, Shen, et al. "Large language diffusion models." arXiv preprint arXiv:2502.09992 (2025).

[2] Wu, Chengyue, et al. "Fast-dLLM: Training-free acceleration of diffusion LLM by enabling KV cache and parallel decoding." arXiv preprint arXiv:2505.22618 (2025).

[3] Wang, Xu, et al. "Diffusion LLMs can do faster-than-AR inference via discrete diffusion forcing." arXiv preprint arXiv:2508.09192 (2025).

[4] Cheng, Shuang, et al. "SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation." arXiv preprint arXiv:2510.06303 (2025).

[5] Fu, Yichao, et al. "Deep think with confidence." arXiv preprint arXiv:2508.15260 (2025).

[6] Ye, Jiacheng, et al. "Dream 7B: Diffusion large language models." arXiv preprint arXiv:2508.15487 (2025).

[7] Gong, Shansan, et al. "DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation." arXiv preprint arXiv:2506.20639 (2025).

BibTeX

@misc{xu2025lopascalingdllminference,
      title={LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding}, 
      author={Chenkai Xu and Yijie Jin and Jiajun Li and Yi Tu and Guoping Long and Dandan Tu and Tianqi Hou and Junchi Yan and Zhijie Deng},
      year={2025},
      eprint={2512.16229},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.16229}, 
}