
MaskKV: Fine-Grained Cache Eviction for Efficient dLLM Inference

Paper: MaskKV on arXiv — https://arxiv.org/abs/2510.09309

Updates

  • (10/2025) Initial release of MaskKV (official implementation).
  • Supported Models: LLaDA-8B, Dream-7B.
  • Supported Methods: MaskKV, SnapKV, PyramidKV, SqueezeAttention, adaKV.

To-Do List

Planned tasks and improvements for MaskKV:

  • Refactor the codebase for better modularity and readability
  • Release the official code

Overview

MaskKV is a training-free framework that enables efficient long-context inference for Diffusion Large Language Models (dLLMs), sharply reducing memory usage and accelerating decoding while preserving accuracy.

While dLLMs offer powerful parallel decoding, they are bottlenecked by the massive memory costs of caching the entire sequence at every step. MaskKV tackles this challenge with a fine-grained KV cache pruning strategy that leverages signals unique to the diffusion process, moving beyond the limitations of heuristics designed for autoregressive models.
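
To see why the full-sequence cache is the bottleneck, a quick back-of-envelope calculation helps. The architecture numbers below (layers, heads, head dimension) are illustrative assumptions for an 8B-scale model, not values taken from the paper:

```python
# Rough KV-cache size for a full-sequence cache at a 32k prompt length.
# The model shape below is an assumption for illustration only.
layers, kv_heads, head_dim = 32, 32, 128      # assumed 8B-scale architecture
seq_len, bytes_per_elem = 32_768, 2           # 32k tokens, fp16
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem  # K and V
print(f"KV cache: {kv_bytes / 2**30:.1f} GiB per sequence")             # ~16 GiB
```

A dLLM must keep a cache of this size live at every denoising step, which is exactly the cost MaskKV's eviction targets.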


Core Techniques

🟩 Mask-Voting

  • Leverages the attention patterns from [MASK] tokens, which act as strong indicators of token importance.
  • Selectively retains the most critical prompt tokens in the KV cache (see the sketch below).
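
A minimal PyTorch sketch of the mask-voting idea. The tensor names and shapes here are assumptions for illustration, not the repository's actual API:

```python
import torch

def mask_voting_scores(attn: torch.Tensor) -> torch.Tensor:
    """Score prompt tokens by the attention that [MASK] tokens pay to them.

    attn: attention weights from [MASK] query positions to prompt key
    positions for one layer, shape [heads, n_mask, prompt_len] (assumed).
    """
    # Each [MASK] token "votes" with its attention distribution over the
    # prompt; averaging the votes gives a per-head importance score.
    return attn.mean(dim=1)  # -> [heads, prompt_len]

def evict_kv(scores: torch.Tensor, keys: torch.Tensor,
             values: torch.Tensor, budget: int):
    """Keep only the top-`budget` prompt tokens per head, preserving order.

    keys/values: [heads, prompt_len, head_dim] (assumed layout).
    """
    idx = scores.topk(budget, dim=-1).indices.sort(dim=-1).values
    gather_idx = idx.unsqueeze(-1).expand(-1, -1, keys.size(-1))
    return keys.gather(1, gather_idx), values.gather(1, gather_idx)
```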

🧩 Adaptive Budget Allocation

  • Hierarchically allocates the cache budget across layers and attention heads (sketched after this list).
  • Prioritizes:
    • Boundary layers (with the strongest contextual impact)
    • High-Prompt-Preference heads (most sensitive to prompt semantics)
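
A hedged sketch of what such a two-level split might look like; the proportional weighting below is an illustrative choice, not the paper's exact allocation formula:

```python
import torch

def allocate_budgets(layer_weight: torch.Tensor,  # [L] importance per layer
                     head_pref: torch.Tensor,     # [L, H] prompt-preference per head
                     total_budget: int,           # global number of KV slots
                     floor: int = 8) -> torch.Tensor:
    """Split a global KV budget across layers, then across heads (illustrative)."""
    L, H = head_pref.shape
    spare = total_budget - floor * L * H                # slots left after the floor
    # Layer level: boundary layers (given larger weights) get a bigger share.
    per_layer = spare * layer_weight / layer_weight.sum()        # [L]
    # Head level: prompt-preferring heads within a layer get more slots.
    head_share = head_pref / head_pref.sum(dim=1, keepdim=True)  # [L, H]
    return floor + (per_layer.unsqueeze(1) * head_share).long()  # [L, H] budgets
```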

Key Results

Experiments on LLaDA-8B and Dream-7B show that MaskKV consistently preserves accuracy while improving decoding speed and memory efficiency.

| Metric | Result |
| --- | --- |
| Accuracy | Retains up to 98.7% (Dream-7B) and 94.3% (LLaDA-8B) of full-cache performance on LongBench with the cache compressed to just 256 KV pairs (<5% of tokens). |
| Speed | Up to 31× faster decoding at a 32k prompt length. |
| Memory | Reduces peak memory by 65%, enabling 8× longer prompts on a single GPU. |

Acknowledgement

This work is developed based on the open-source project dLLM-Cache, which provides an adaptive caching framework for diffusion large language models.
We also incorporate ideas and implementation details inspired by the open-source implementations of the baseline methods we support: SnapKV, PyramidKV, SqueezeAttention, and adaKV.

Furthermore, our experiments leverage the LLaDA and Dream diffusion language models.

We sincerely thank the authors of these works for their open-source contributions, which greatly facilitated the development of MaskKV.

Citation

@misc{huang2025masktokensprophetfinegrained,
      title={Mask Tokens as Prophet: Fine-Grained Cache Eviction for Efficient dLLM Inference}, 
      author={Jianuo Huang and Yaojie Zhang and Yicun Yang and Benhao Huang and Biqing Qi and Dongrui Liu and Linfeng Zhang},
      year={2025},
      eprint={2510.09309},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.09309}, 
}

