
fix: vLLM 0.17.0 collector compat (DSA, MLA module, MoE)#718

Merged
Arsene12358 merged 4 commits into main from
simonec/vllm-0.17.0-collector-v2
Apr 10, 2026
Conversation

@simone-chen
Contributor

@simone-chen simone-chen commented Apr 10, 2026

Overview:

Fix vLLM 0.17.0 collector compatibility for DSA module, MLA module, and MoE MXFP4 benchmarks on B200. Uses version-routed v2 collector files to isolate 0.17.0 changes from existing collectors.

Details:

DSA module collector (collect_mla_module_v2.py):

  • Deterministic weight/tensor init — vLLM 0.17.0's FlashInfer sparse MLA backend (vllm#33451) and DSA CUDA graph support (vllm#34457) leave CUDA graph RNG offset tracking active after DeepseekV2MLAAttention construction. Any subsequent RNG operation crashes with "Offset increment outside graph capture".
    • enforce_eager and manual_seed() do not clear the state — the corruption originates inside module construction
    • Replace all post-construction RNG (normal_, uniform_, randn, randint) with deterministic fill_()/torch.full()
    • Safe for benchmarking: kernel latency depends on shapes/dtypes, not values; dummy weights are overwritten by process_weights_after_loading() anyway
    • Filed upstream: vllm#39371
  • KV cache scale buffers — vLLM registers k_scale/v_scale as buffers, not parameters. The init loop missed them, leaving sentinel values that fail process_weights_after_loading() (k_scale > 0.0 assertion).
  • auto_map stripping — DeepSeek-V3's config.json has auto_map pointing to configuration_deepseek.py. HuggingFace's AutoConfig.from_pretrained() (called by vLLM's ModelConfig) unconditionally tries to import it from the temp directory where it doesn't exist. Strip it; vLLM natively supports the architecture.
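The RNG-free initialization pattern described above can be sketched with plain torch ops — deterministic `fill_()` in place of `normal_()`/`uniform_()` so no RNG offset is ever consumed. This is a minimal illustration, not the collector's actual code; the branch values (0.5 for scales, 0.01 otherwise) follow the PR's description, and the real collector also special-cases FP8 weight dtypes:

```python
import torch

def init_deterministic(module: torch.nn.Module) -> None:
    """Fill every parameter and buffer without touching the RNG.

    Iterating named_buffers() as well as named_parameters() is the point:
    vLLM registers k_scale/v_scale as buffers, so a parameters-only loop
    would miss them and leave sentinel values behind.
    """
    tensors = list(module.named_parameters()) + list(module.named_buffers())
    for name, tensor in tensors:
        if "scale" in name:
            tensor.data.fill_(0.5)   # keep process_weights_after_loading() assertions happy
        else:
            tensor.data.fill_(0.01)  # kernel latency depends on shapes/dtypes, not values
```

The values are benchmark-safe for the reason stated in the PR: dummy weights are overwritten by `process_weights_after_loading()` anyway.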
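The auto_map stripping amounts to a small JSON edit on the copied config. A minimal sketch, assuming config.json has already been copied into a writable temp directory (the `strip_auto_map` helper name is illustrative, not the collector's):

```python
import json
from pathlib import Path

def strip_auto_map(config_path: Path) -> None:
    """Drop the auto_map entry so AutoConfig.from_pretrained() doesn't try
    to import configuration_deepseek.py from a directory where it doesn't
    exist. vLLM supports the architecture natively, so nothing is lost."""
    config = json.loads(config_path.read_text())
    if config.pop("auto_map", None) is not None:
        config_path.write_text(json.dumps(config, indent=2))
```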

MoE MXFP4 collector (collect_moe_v2.py):

  • Forward context — vLLM 0.17.0's MoERunner abstraction (vllm#32344) routes FusedMoE.forward() through get_forward_context() → get_layer_from_name(), requiring the module to be registered in static_forward_context. Share the same VllmConfig between FusedMoE.__init__ and the benchmark's set_forward_context() so the registration is visible.
  • pcp_size — vLLM 0.17.0 added prefill context parallel to FusedMoE (vllm#32344). Pass pcp_size=1 to avoid get_pcp_group() which requires distributed init.
  • is_gated_activation — pass is_gated_activation=True to prepare_static_weights_for_trtllm_fp4_moe() (GPT-OSS uses SwiGLU).
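The forward-context fix above hinges on object identity: FusedMoE registers itself in the static forward context of whatever VllmConfig it is constructed with, and the later lookup only succeeds if set_forward_context() received that same object. The mechanism can be illustrated without vLLM — names below are modeled on vLLM's but simplified, not the real API:

```python
class VllmConfigSketch:
    """Stand-in for VllmConfig: owns the layer-name registry."""
    def __init__(self):
        self.static_forward_context = {}

class FusedMoESketch:
    """Stand-in for FusedMoE: registers itself at construction time."""
    def __init__(self, name, vllm_config):
        self.name = name
        vllm_config.static_forward_context[name] = self

_current_config = None

def set_forward_context(vllm_config):
    global _current_config
    _current_config = vllm_config

def get_layer_from_name(name):
    # Lookup succeeds only if the same config object was used for both
    # construction and set_forward_context().
    return _current_config.static_forward_context[name]

config = VllmConfigSketch()
moe = FusedMoESketch("model.layers.0.mlp", config)  # registration happens here
set_forward_context(config)                         # same object shared
assert get_layer_from_name("model.layers.0.mlp") is moe
```

Constructing the module with one config and entering the context with a fresh one — the failure mode the fix addresses — would raise a KeyError at lookup time.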

Version routing (registry.py):

  • moe, mla_*_module, dsa_*_module ops use VersionRoute to route to v2 files on vLLM >= 0.17.0, falling back to originals otherwise
  • Existing collector files are untouched — no backward compat risk
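The routing above amounts to a version-gated module lookup. A minimal sketch, assuming a VersionRoute pairs a minimum vLLM version with a module name (the real registry.py may structure this differently):

```python
from dataclasses import dataclass

@dataclass
class VersionRoute:
    min_version: tuple  # e.g. (0, 17, 0)
    module: str         # e.g. "collect_moe_v2"

def parse_version(version: str) -> tuple:
    """Parse 'major.minor.patch' into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split(".")[:3])

def resolve_module(routes, fallback: str, vllm_version: str) -> str:
    """Pick the highest-versioned route satisfied by vllm_version,
    falling back to the original collector module otherwise."""
    version = parse_version(vllm_version)
    best = None
    for route in sorted(routes, key=lambda r: r.min_version):
        if version >= route.min_version:
            best = route.module
    return best if best is not None else fallback
```

With a single route at (0, 17, 0), vLLM 0.17.0 and later resolve to the v2 module while earlier versions keep the original — which is why the existing collector files carry no backward-compat risk.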

Data — clean collection from job 295500035 (0 DSA/MLA module errors). Adds previously missing mla_context_module_perf.txt and mla_generation_module_perf.txt.

Known limitations:

  • 42 MoE MXFP4 weight_scale_vec_size errors — FlashInfer TRTLLM FP4 kernel rejects the weight format; likely needs FlashInfer-side fix
  • 6 MoE MXFP4 test cases with tp_size > 1 fail at FusedMoE.__init__ — requires distributed init not available in standalone collector
  • MLA kernel-level collector (collect_mla.py) fix deferred — vLLM 0.17.0 changed the FlashInferMLAImpl forward API

Where should the reviewer start?

collector/vllm/registry.py, then collector/vllm/collect_mla_module_v2.py

Summary by CodeRabbit

  • New Features

    • Added benchmarking support for vLLM 0.17.0 MLA/DSA attention modules with configurable test cases across sequence lengths, batch sizes, and quantization modes.
    • Added Mixture-of-Experts (MoE) performance benchmarking with multiple quantization backend support.
  • Improvements

    • Enabled runtime module selection based on vLLM version compatibility.
    • Updated performance baseline data for B200 SXM systems.

Create version-specific collector files for vLLM >= 0.17.0, isolating
framework version compat from the existing collectors (which continue
to serve vLLM < 0.17.0 unchanged).

New files:
- collect_mla_module_v2.py: deterministic no-RNG init to avoid CUDA
  graph RNG corruption from DSA modules (vllm#39371), auto_map
  stripping, KV cache scale buffer init
- collect_moe_v2.py: shared VllmConfig + set_forward_context for
  MoERunner compat (vllm#32344), pcp_size=1, is_gated_activation

Registry changes:
- moe, mla_*_module, dsa_*_module ops now use VersionRoute to route
  to v2 files on vLLM >= 0.17.0, falling back to originals otherwise

Signed-off-by: Simone Chen <simonec@nvidia.com>
Collected with v2 collector files (0 DSA/MLA module errors):
- dsa_context_module_perf.txt: 9297 lines
- dsa_generation_module_perf.txt: 14905 lines
- mla_context_module_perf.txt: 5425 lines (new)
- mla_generation_module_perf.txt: 5665 lines (new)
- moe_perf.txt: 38152 lines

Signed-off-by: Simone Chen <simonec@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Apr 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Contributor

github-actions bot commented Apr 10, 2026

Sanity Check Chart Generation Report

📥 Download all sanity charts from workflow artifacts

New perf data files were detected in this PR. Please use the link above to
download sanity check charts for the new perf data to compare the collected
perf data vs SOL (theoretical max performance).

Below is a report of whether the chart generation was successful for each op.
It doesn't validate whether the perf data itself is sane.

Chart Generation Report for system: b200_sxm, backend: vllm, backend_version: 0.17.0

  • moe
  • dsa_module
  • dsa_module
  • CLI smoke test ✅

Chart Generation Report for system: b200_sxm, backend: vllm, backend_version: 0.19.0

  • gemm
  • moe
  • CLI smoke test ❌
command / stdout / stderr
command:
aiconfigurator cli default --backend vllm --backend-version 0.19.0 --system b200_sxm --model Qwen/Qwen3-32B --total-gpus 16

stdout:
06:35:16 [aiconfigurator] [I] [main.py:1464] Loading Dynamo AIConfigurator version: 0.8.0
06:35:16 [aiconfigurator] [I] [main.py:1465] Number of top configurations to output: 5 (change with --top-n)
06:35:16 [aiconfigurator] [I] [utils.py:795] Quant inference result: quant_algo=None, kv_cache_quant_algo=None, quant_dynamic=None
06:35:16 [aiconfigurator] [I] [utils.py:894] Loaded model config for Qwen/Qwen3-32B: architecture=Qwen3ForCausalLM, layers=64, n=64, n_kv=8, d=128, hidden_size=5120, inter_size=25600, vocab=151936, context=40960, topk=0, num_experts=0, moe_inter_size=25600, extra_params={'architecture': 'Qwen3ForCausalLM', 'use_qk_norm': True}
06:35:16 [aiconfigurator] [I] [perf_database.py:272] Loading database for system='b200_sxm', backend='vllm', version='0.19.0'
06:35:18 [aiconfigurator] [W] [perf_database.py:3051] Skipping interpolation for z=51200 as it does not exist in both y_left=51200 and y_right=65536
06:35:18 [aiconfigurator] [W] [perf_database.py:3120] Skipping interpolation for z=51200 as it does not exist in both x_left=16384 and x_right=32768 for y=57344
06:35:18 [aiconfigurator] [W] [perf_database.py:3120] Skipping interpolation for z=51200 as it does not exist in both x_left=16384 and x_right=32768 for y=65536
06:35:18 [aiconfigurator] [W] [perf_database.py:3120] Skipping interpolation for z=51200 as it does not exist in both x_left=16384 and x_right=32768 for y=131072
06:35:18 [aiconfigurator] [W] [perf_database.py:3120] Skipping interpolation for z=51200 as it does not exist in both x_left=16384 and x_right=32768 for y=262144
06:35:18 [aiconfigurator] [W] [perf_database.py:3120] Skipping interpolation for z=51200 as it does not exist in both x_left=16384 and x_right=32768 for y=57344
06:35:18 [aiconfigurator] [W] [perf_database.py:3120] Skipping interpolation for z=51200 as it does not exist in both x_left=16384 and x_right=32768 for y=65536
06:35:18 [aiconfigurator] [W] [perf_database.py:3120] Skipping interpolation for z=51200 as it does not exist in both x_left=16384 and x_right=32768 for y=131072
06:35:18 [aiconfigurator] [W] [perf_database.py:3120] Skipping interpolation for z=51200 as it does not exist in both x_left=16384 and x_right=32768 for y=262144
06:35:18 [aiconfigurator] [W] [perf_database.py:3120] Skipping interpolation for z=51200 as it does not exist in both x_left=16384 and x_right=32768 for y=57344
06:35:18 [aiconfigurator] [W] [perf_database.py:3120] Skipping interpolation for z=51200 as it does not exist in both x_left=16384 and x_right=32768 for y=65536
06:35:18 [aiconfigurator] [W] [perf_database.py:3120] Skipping interpolation for z=51200 as it does not exist in both x_left=16384 and x_right=32768 for y=131072
06:35:18 [aiconfigurator] [W] [perf_database.py:3120] Skipping interpolation for z=51200 as it does not exist in both x_left=16384 and x_right=32768 for y=262144
06:35:18 [aiconfigurator] [W] [perf_database.py:3120] Skipping interpolation for z=51200 as it does not exist in both x_left=16384 and x_right=32768 for y=57344
06:35:18 [aiconfigurator] [W] [perf_database.py:3120] Skipping interpolation for z=51200 as it does not exist in both x_left=16384 and x_right=32768 for y=65536
06:35:18 [aiconfigurator] [W] [perf_database.py:3120] Skipping interpolation for z=51200 as it does not exist in both x_left=16384 and x_right=32768 for y=131072
06:35:18 [aiconfigurator] [W] [perf_database.py:3120] Skipping interpolation for z=51200 as it does not exist in both x_left=16384 and x_right=32768 for y=262144
06:35:19 [aiconfigurator] [I] [models.py:149] Resolved quant modes for Agg worker: gemm=GEMMQuantMode.float16 moe=MoEQuantMode.float16 kvcache=KVCacheQuantMode.float16 fmha=FMHAQuantMode.float16 comm=CommQuantMode.half
06:35:19 [aiconfigurator] [I] [models.py:149] Resolved quant modes for Prefill worker: gemm=GEMMQuantMode.float16 moe=MoEQuantMode.float16 kvcache=KVCacheQuantMode.float16 fmha=FMHAQuantMode.float16 comm=CommQuantMode.half
06:35:19 [aiconfigurator] [I] [models.py... (truncated)

@coderabbitai

coderabbitai bot commented Apr 10, 2026

Walkthrough

The pull request adds two new vLLM benchmarking collector scripts for MLA (multi-head latent attention) and MoE (mixture-of-experts) modules with multi-backend quantization support, updates the registry to enable version-aware module selection, and refreshes performance baseline data via Git LFS.

Changes

Cohort / File(s) Summary
New MLA Benchmarking Collector
collector/vllm/collect_mla_module_v2.py
Added comprehensive benchmarking script for DeepseekV2 MLA and DSA attention variants. Generates test cases across sequence lengths, batch sizes, KV cache dtypes, and GEMM quantization modes (bfloat16, fp8_block, nvfp4). Resolves pre-cached HF configs from symlinked temporary directories, constructs attention modules with dummy weights, applies FP8 quantization post-load, and benchmarks end-to-end forward passes with power measurement. Includes CLI with filtering flags and quick-run mode.
New MoE Benchmarking Collector
collector/vllm/collect_moe_v2.py
Added standalone MoE performance testing script supporting multiple quantization backends (MXFP4, NVFP4, FP8, FP8-block, float16). Dynamically selects supported backends based on GPU SM version and optional imports. Generates synthetic expert weights, configures routing distributions (power-law or balanced), and benchmarks via three execution paths per backend. Includes expert sharding, tensor parallelism constraints, and routing iteration handling.
Registry Versioning
collector/vllm/registry.py
Updated OpEntry declarations for moe, mla_context_module, mla_generation_module, dsa_context_module, and dsa_generation_module to use versions tuple with VersionRoute entries instead of static module fields. Enables runtime module selection based on vLLM version (v2 modules selected for version ≥0.17.0). Updated docstring to explain version resolution logic.
Performance Data (Git LFS)
src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/*_perf.txt
Updated/added Git LFS pointers for benchmark results: expanded dsa_context_module_perf.txt and dsa_generation_module_perf.txt; added new mla_context_module_perf.txt and mla_generation_module_perf.txt; refreshed moe_perf.txt metadata.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes

Poem

🐰 Hop along with quantized beams so bright,
MoE and MLA dance through GPU night,
FP8 and FP4 in versions aligned,
Benchmarks and baselines, data-refined!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly identifies the main change: vLLM 0.17.0 collector compatibility fixes for DSA, MLA module, and MoE, which aligns with the detailed changes across multiple new v2 collector files and version routing.
Description check ✅ Passed The PR description follows the template structure with Overview, Details, Known limitations, and Where to start sections. It comprehensively covers all major changes, provides upstream issue references, and explains technical context for each fix.
Docstring Coverage ✅ Passed Docstring coverage is 80.77% which is sufficient. The required threshold is 80.00%.





@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (3)
collector/vllm/collect_mla_module_v2.py (2)

88-121: Temp directories are not cleaned up.

The temp directories created by mkdtemp() are cached in _local_config_cache but never removed. For a collector that runs many test cases in a single process, this is likely acceptable (OS cleans /tmp periodically). However, if this concern is raised:

💡 Optional: Register cleanup with atexit
import atexit
import shutil

def _cleanup_temp_dirs():
    for tmp_dir in _local_config_cache.values():
        try:
            shutil.rmtree(tmp_dir, ignore_errors=True)
        except Exception:
            pass

atexit.register(_cleanup_temp_dirs)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@collector/vllm/collect_mla_module_v2.py` around lines 88 - 121, _temp
directories created by _resolve_model_path via tempfile.mkdtemp are cached in
_local_config_cache but never cleaned up; add a cleanup routine and register it
with atexit to remove those temp dirs on process exit (use shutil.rmtree with
ignore_errors=True) and ensure the routine iterates over _local_config_cache
values; implement a helper function (e.g. _cleanup_temp_dirs) and call
atexit.register(_cleanup_temp_dirs) so mkdtemp-created dirs are removed when the
process ends.

415-438: Minor comment/code mismatch on scale initialization.

Comment on line 417 says "Scale params → 1.0" but line 435 uses fill_(0.5). The 0.5 value works fine (avoids NaN during processing), but the comment is slightly misleading.

📝 Suggested fix
     # Initialize with random weights.
     # FP8 weights → zero (safe dummy value).
-    # Scale params → 1.0 (avoid NaN during process_weights_after_loading).
+    # Scale params → 0.5 (avoid NaN during process_weights_after_loading).
     # Everything else → small constant.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@collector/vllm/collect_mla_module_v2.py` around lines 415 - 438, The comment
and code disagree about the initial value for "scale" params: the comment says
"Scale params → 1.0" but the loop in attn_module initialization sets scale
tensors with tensor.data.fill_(0.5); update the comment to state "Scale params →
0.5" (or change the fill_ call to 1.0 if you prefer that behavior) so the
documentation matches the implementation; locate the loop that iterates over
attn_module.named_parameters()/named_buffers() and the branch that checks
tensor.dtype == torch.float32 and "scale" in name to make the change.
collector/vllm/collect_moe_v2.py (1)

248-251: Consider using deterministic initialization for bias tensors.

The PR objectives note that vLLM 0.17.0 has CUDA graph RNG offset tracking issues. While collect_mla_module_v2.py uses fill_() to avoid RNG calls, this code uses normal_() for bias initialization. If the MXFP4 path is used after DSA module collection in the same process, this could trigger RNG offset errors.

Since bias values don't affect kernel latency, consider using fill_() for consistency:

🛡️ Suggested safer initialization
             if hasattr(moe_module, "w13_bias"):
-                moe_module.w13_bias.data.normal_()
+                moe_module.w13_bias.data.fill_(0.01)
             if hasattr(moe_module, "w2_bias"):
-                moe_module.w2_bias.data.normal_()
+                moe_module.w2_bias.data.fill_(0.01)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@collector/vllm/collect_moe_v2.py` around lines 248 - 251, The bias
initialization in collect_moe_v2.py uses nondeterministic normal_() on
moe_module.w13_bias and moe_module.w2_bias which can break CUDA graph RNG offset
tracking; change these to deterministic in-place fills (e.g., use .data.fill_(0)
or another fixed constant) inside the same attribute checks for
moe_module.w13_bias and moe_module.w2_bias so no RNG is invoked during module
collection.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 46e56a3a-7d6a-4c36-b281-2960f156d8ac

📥 Commits

Reviewing files that changed from the base of the PR and between db7d6ee and 873edbc.

📒 Files selected for processing (8)
  • collector/vllm/collect_mla_module_v2.py
  • collector/vllm/collect_moe_v2.py
  • collector/vllm/registry.py
  • src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/dsa_context_module_perf.txt
  • src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/dsa_generation_module_perf.txt
  • src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/mla_context_module_perf.txt
  • src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/mla_generation_module_perf.txt
  • src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/moe_perf.txt

Combined gemm and moe performance data from two pipeline runs.
Attention/MLA/DSA collection had errors and is not included.

Signed-off-by: Simone Chen <simonec@nvidia.com>
Rename collect_moe.py -> collect_moe_v1.py and
collect_mla_module.py -> collect_mla_module_v1.py to satisfy the
test_versioned_modules_use_vn_suffix registry integrity check.

Signed-off-by: Simone Chen <simonec@nvidia.com>
@Arsene12358 Arsene12358 merged commit 22d40fd into main Apr 10, 2026
8 checks passed
@Arsene12358 Arsene12358 deleted the simonec/vllm-0.17.0-collector-v2 branch April 10, 2026 08:28
SCP24317628 pushed a commit to SCP24317628/aiconfigurator that referenced this pull request Apr 15, 2026
* feat: vLLM 0.17.0 collector v2 files with version routing

Create version-specific collector files for vLLM >= 0.17.0, isolating
framework version compat from the existing collectors (which continue
to serve vLLM < 0.17.0 unchanged).

New files:
- collect_mla_module_v2.py: deterministic no-RNG init to avoid CUDA
  graph RNG corruption from DSA modules (vllm#39371), auto_map
  stripping, KV cache scale buffer init
- collect_moe_v2.py: shared VllmConfig + set_forward_context for
  MoERunner compat (vllm#32344), pcp_size=1, is_gated_activation

Registry changes:
- moe, mla_*_module, dsa_*_module ops now use VersionRoute to route
  to v2 files on vLLM >= 0.17.0, falling back to originals otherwise

Signed-off-by: Simone Chen <simonec@nvidia.com>

* data: add clean vLLM 0.17.0 perf data from v2 collector (job 295500035)

Collected with v2 collector files (0 DSA/MLA module errors):
- dsa_context_module_perf.txt: 9297 lines
- dsa_generation_module_perf.txt: 14905 lines
- mla_context_module_perf.txt: 5425 lines (new)
- mla_generation_module_perf.txt: 5665 lines (new)
- moe_perf.txt: 38152 lines

Signed-off-by: Simone Chen <simonec@nvidia.com>

* data: add vLLM 0.19.0 perf data for b200_sxm (gemm + moe)

Combined gemm and moe performance data from two pipeline runs.
Attention/MLA/DSA collection had errors and is not included.

Signed-off-by: Simone Chen <simonec@nvidia.com>

* fix: rename vllm collector modules to follow _vN suffix convention

Rename collect_moe.py -> collect_moe_v1.py and
collect_mla_module.py -> collect_mla_module_v1.py to satisfy the
test_versioned_modules_use_vn_suffix registry integrity check.

Signed-off-by: Simone Chen <simonec@nvidia.com>

---------

Signed-off-by: Simone Chen <simonec@nvidia.com>