
MaskKV: Fine-Grained Cache Eviction for Efficient dLLM Inference

Paper: MaskKV on arXiv — https://arxiv.org/abs/2510.09309

Updates

  • (10/2025) Initial release of MaskKV (official implementation).
  • Supported Models: LLaDA-8B, Dream-7B.
  • Supported Methods: MaskKV, SnapKV, PyramidKV, SqueezeAttention, adaKV.

To-Do List

Planned tasks and improvements for MaskKV:

  • Refactor the codebase for better modularity and readability
  • Release the official code

Overview

MaskKV is a training-free framework that enables efficient long-context inference for Diffusion Large Language Models (dLLMs), sharply reducing memory usage and accelerating decoding while preserving accuracy.

While dLLMs offer powerful parallel decoding, they are bottlenecked by the massive memory costs of caching the entire sequence at every step. MaskKV tackles this challenge with a fine-grained KV cache pruning strategy that leverages signals unique to the diffusion process, moving beyond the limitations of heuristics designed for autoregressive models.
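
To see why the full-sequence cache is the bottleneck, a quick back-of-envelope calculation helps. The architecture numbers below (layers, heads, head dimension) are illustrative assumptions for an 8B-scale model, not values taken from the paper:

```python
# Rough KV-cache size for a full-sequence cache at a 32k prompt length.
# The model shape below is an assumption for illustration only.
layers, kv_heads, head_dim = 32, 32, 128      # assumed 8B-scale architecture
seq_len, bytes_per_elem = 32_768, 2           # 32k tokens, fp16
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem  # K and V
print(f"KV cache: {kv_bytes / 2**30:.1f} GiB per sequence")             # ~16 GiB
```

A dLLM must keep a cache of this size live at every denoising step, which is exactly the cost MaskKV's eviction targets.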


Core Techniques

🟩 Mask-Voting

  • Leverages the attention patterns from [MASK] tokens, which act as strong indicators of token importance.
  • Selectively retains the most critical prompt tokens in the KV cache (see the sketch below).
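
A minimal PyTorch sketch of the mask-voting idea. The tensor names and shapes here are assumptions for illustration, not the repository's actual API:

```python
import torch

def mask_voting_scores(attn: torch.Tensor) -> torch.Tensor:
    """Score prompt tokens by the attention that [MASK] tokens pay to them.

    attn: attention weights from [MASK] query positions to prompt key
    positions for one layer, shape [heads, n_mask, prompt_len] (assumed).
    """
    # Each [MASK] token "votes" with its attention distribution over the
    # prompt; averaging the votes gives a per-head importance score.
    return attn.mean(dim=1)  # -> [heads, prompt_len]

def evict_kv(scores: torch.Tensor, keys: torch.Tensor,
             values: torch.Tensor, budget: int):
    """Keep only the top-`budget` prompt tokens per head, preserving order.

    keys/values: [heads, prompt_len, head_dim] (assumed layout).
    """
    idx = scores.topk(budget, dim=-1).indices.sort(dim=-1).values
    gather_idx = idx.unsqueeze(-1).expand(-1, -1, keys.size(-1))
    return keys.gather(1, gather_idx), values.gather(1, gather_idx)
```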

🧩 Adaptive Budget Allocation

  • Hierarchically allocates the cache budget across layers and attention heads (sketched after this list).
  • Prioritizes:
    • Boundary layers (with the strongest contextual impact)
    • High-Prompt-Preference heads (most sensitive to prompt semantics)
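
A hedged sketch of what such a two-level split might look like; the proportional weighting below is an illustrative choice, not the paper's exact allocation formula:

```python
import torch

def allocate_budgets(layer_weight: torch.Tensor,  # [L] importance per layer
                     head_pref: torch.Tensor,     # [L, H] prompt-preference per head
                     total_budget: int,           # global number of KV slots
                     floor: int = 8) -> torch.Tensor:
    """Split a global KV budget across layers, then across heads (illustrative)."""
    L, H = head_pref.shape
    spare = total_budget - floor * L * H                # slots left after the floor
    # Layer level: boundary layers (given larger weights) get a bigger share.
    per_layer = spare * layer_weight / layer_weight.sum()        # [L]
    # Head level: prompt-preferring heads within a layer get more slots.
    head_share = head_pref / head_pref.sum(dim=1, keepdim=True)  # [L, H]
    return floor + (per_layer.unsqueeze(1) * head_share).long()  # [L, H] budgets
```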

Key Results

Experiments on LLaDA-8B and Dream-7B show that MaskKV consistently preserves accuracy while improving decoding speed and memory efficiency.

| Metric | Result |
| --- | --- |
| Accuracy | Retains up to 98.7% (Dream-7B) and 94.3% (LLaDA-8B) of full-cache performance on LongBench with the cache compressed to just 256 KV pairs (<5% of tokens). |
| Speed | Up to 31× faster decoding at a 32k prompt length. |
| Memory | Reduces peak memory by 65%, enabling 8× longer prompts on a single GPU. |

Acknowledgement

This work is developed based on the open-source project dLLM-Cache, which provides an adaptive caching framework for diffusion large language models.
We also incorporate ideas and implementation details inspired by the open-source implementations of the baseline methods we support: SnapKV, PyramidKV, SqueezeAttention, and adaKV.

Furthermore, our experiments leverage the LLaDA and Dream diffusion language models.

We sincerely thank the authors of these works for their open-source contributions, which greatly facilitated the development of MaskKV.

Citation

@misc{huang2025masktokensprophetfinegrained,
      title={Mask Tokens as Prophet: Fine-Grained Cache Eviction for Efficient dLLM Inference}, 
      author={Jianuo Huang and Yaojie Zhang and Yicun Yang and Benhao Huang and Biqing Qi and Dongrui Liu and Linfeng Zhang},
      year={2025},
      eprint={2510.09309},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.09309}, 
}

