fix: Fix fp8 after vllm v0.11.2 bump #1660
Conversation
📝 Walkthrough

Multiple new LLM performance configuration files were added for GRPO variants across different cluster scales, one existing config was simplified by removing sequence-packing settings, and the FP8 weight post-processing logic in the generation module was significantly refactored to support DeepGEMM optimization and conditional MoE backend handling.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 2
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron-fp8-e2e.yaml (0 hunks)
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml (1 hunks)
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml (1 hunks)
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml (1 hunks)
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml (1 hunks)
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml (1 hunks)
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml (1 hunks)
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml (1 hunks)
- nemo_rl/models/generation/fp8.py (2 hunks)
💤 Files with no reviewable changes (1)
- examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron-fp8-e2e.yaml
🧰 Additional context used
📓 Path-based instructions (5)
examples/configs/recipes/**/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)
Files:
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
!(**/tests/**|**/test_*.py|**/test_*.sh)
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year
Files:
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
- nemo_rl/models/generation/fp8.py
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code
Files:
nemo_rl/models/generation/fp8.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
For any source file under nemo_rl/*.py that defines a class or function decorated with @ray.remote, add a coverage pragma (# pragma: no cover) because these run in separate Ray processes
Files:
nemo_rl/models/generation/fp8.py
**/*.{py,sh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)
Files:
nemo_rl/models/generation/fp8.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes
Applied to files:
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
📚 Learning: 2025-09-18T14:20:36.297Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-8b-base-2n8g-fsdp2tp2.v1.yaml:113-120
Timestamp: 2025-09-18T14:20:36.297Z
Learning: In distillation workflows, the teacher policy does not perform generation - it only does inference/logprob computation on sequences generated by the student policy. Therefore, teacher generation configuration mismatches (like vLLM tensor parallelism settings) and colocation concerns are not relevant.
Applied to files:
examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/vlm/*.yaml : Recipe YAML files should follow the naming pattern: vlm_<algo>-<model>-<nodes>n<gpus>g-<strategy>[-modifiers][.vN].yaml for VLM recipes
Applied to files:
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/**/*.yaml : When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)
Applied to files:
examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Lint check
- GitHub Check: Post submodule check comment / Comment on PR
- GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (14)
examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml (4)
1-3: ✓ Configuration inheritance and checkpoint naming are consistent.

The file correctly references the base 16n8g configuration and names the checkpoint directory to match the variant.
7-10: Verify pipeline layer distribution for uniform load balancing.

The configuration specifies num_layers_in_first_pipeline_stage: 23 and num_layers_in_last_pipeline_stage: 23, but does not specify the distribution for intermediate stages. With pipeline_model_parallel_size: 4, there are two intermediate stages whose layer counts are undefined. This could result in uneven computation distribution across pipeline stages. Verify that the layer counts across all four pipeline stages sum correctly and are appropriately balanced for the 235B model.
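As a quick sanity check, the split can be computed directly. This sketch assumes 94 total transformer layers (the published depth of Qwen3-235B-A22B; an assumption, since the exact model depth is not stated in the review):

```python
# Hypothetical balance check for the 4-stage pipeline split discussed above.
# total_layers = 94 is an assumption; adjust for the actual model.
total_layers = 94
pp_size = 4
first_stage, last_stage = 23, 23  # values from the config

remaining = total_layers - first_stage - last_stage
intermediate_stages = pp_size - 2
assert remaining % intermediate_stages == 0, "intermediate stages would be uneven"

per_stage = [first_stage] + [remaining // intermediate_stages] * intermediate_stages + [last_stage]
print(per_stage)  # -> [23, 24, 24, 23]
```

Under that assumption the implicit intermediate stages each get 24 layers, which is close to balanced; the slightly lighter first/last stages are a common choice since they also hold the embedding and output layers.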
11-13: Clarify FP8 configuration alignment with PR purpose.

The PR description states the goal is to "Fix FP8 patches after vllm bump to v0.11.2," but this configuration file does not include FP8-specific quantization settings (e.g., weights_dtype, quantization_type, or use_fp8_kv_cache). Verify whether FP8 configuration is inherited from the base config (grpo-qwen3-235b-16n8g.yaml), applied globally via defaults, or should be explicitly set here for the vllm v0.11.2 compatibility fix.
14-20: ✓ Logger and cluster configuration are consistent and well-organized.

The naming convention is consistently applied across checkpoint, log, and wandb directories, and the cluster definition (16 nodes, 4 GPUs per node) correctly represents the target infrastructure.
examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml (3)
1-20: LGTM! Well-structured FP8 configuration recipe.

The file follows the correct naming convention and is properly located. The configuration cleanly extends the base async recipe with FP8-specific overrides for both training (Megatron) and generation (vLLM) components, with consistent naming across checkpoint, log, and wandb paths.
1-1: Base configuration file exists and is correctly referenced.

The file ./grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml exists in the same directory and is properly referenced by the relative path in defaults.
6-12: FP8 configuration is correct.

The blockwise FP8 recipe with e4m3 format is well-documented in NVIDIA Transformer Engine and appropriate for training. The NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1 environment variable relaxes the default constraint that scales be powers of 2, which is a standard setting. Float8BlockScaling uses block-wise scaling for FP8 tensors where values within each block share a common scaling factor, and blockwise is a valid fp8_recipe choice in training configurations.
examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml (1)
1-20: LGTM!

Configuration follows the naming conventions and the parallelism math is consistent: 128 GPUs (32×4) with TP=8 and PP=4 yields 4 data-parallel replicas. Note that TP=8 with 4 GPUs per node implies tensor parallelism spans across 2 nodes, which is valid for large models but relies on high-speed inter-node interconnects. Based on learnings, the recipe YAML naming pattern is correctly applied.
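The reviewer's parallelism arithmetic can be checked in a few lines (a sketch of the math, not project code):

```python
# Verify the data-parallel replica count for the 32n4g recipe.
nodes, gpus_per_node = 32, 4
tp, pp = 8, 4  # tensor- and pipeline-parallel sizes from the config

total_gpus = nodes * gpus_per_node       # 128
assert total_gpus % (tp * pp) == 0, "world size must be divisible by TP*PP"
dp = total_gpus // (tp * pp)             # data-parallel replicas

# TP=8 with only 4 GPUs per node means each TP group spans multiple nodes.
nodes_per_tp_group = -(-tp // gpus_per_node)  # ceiling division
print(dp, nodes_per_tp_group)  # -> 4 2
```

The two-node TP groups are what makes the interconnect caveat in the comment above matter: all-reduce traffic inside a TP group crosses the node boundary on every layer.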
nemo_rl/models/generation/fp8.py (3)
592-600: Correct use of in-place copy for weight scale updates.

Using copy_() preserves the existing torch.nn.Parameter object and its weight_loader attribute, which is required for model refit functionality. This is consistent with the approach used in maybe_post_process_fp8_weight_block.
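The mechanism is easy to illustrate without torch: mutating an object in place keeps any attributes attached to it, while rebinding the name to a fresh object loses them. This toy analogue stands in for param.data.copy_() versus re-creating a Parameter; FakeParam and weight_loader are illustrative stand-ins, not project code:

```python
class FakeParam(list):
    """Toy stand-in for torch.nn.Parameter that can carry extra attributes."""

param = FakeParam([1.0, 2.0])
param.weight_loader = lambda w: None  # the attribute that refit relies on

# In-place update, analogous to param.data.copy_(new_weights):
param[:] = [3.0, 4.0]
assert hasattr(param, "weight_loader")  # metadata survives the update

# Rebinding, analogous to layer.weight = torch.nn.Parameter(new_weights),
# produces a fresh object with no weight_loader attached:
rebound = FakeParam([3.0, 4.0])
assert not hasattr(rebound, "weight_loader")
```

This is why the review flags copy_() as the right tool: any code that later looks up weight_loader on the layer's parameter keeps working after the weights change.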
626-659: LGTM! Clean unified weight post-processing logic.

The refactored logic correctly handles all code paths:
- Flashinfer backend (swap w13 to w31) vs. default (use original)
- DeepGEMM enabled (apply post-processing) vs. disabled (skip)
- Always copy processed weights back using copy_() to preserve Parameters

The unconditional final copy ensures both modified and unmodified weight paths correctly update the layer's parameters while maintaining the weight_loader attribute for refit.
610-621: Verify AITER enablement approach for vLLM v0.11.2 compatibility.

The reference to rocm_aiter_ops.is_fused_moe_enabled() appears inconsistent with vLLM v0.11.2's AITER API design. vLLM uses environment variable switches (VLLM_ROCM_USE_AITER as the master switch and VLLM_ROCM_USE_AITER_MOE for specific features) to control AITER kernel selection rather than runtime method calls. Confirm whether the code should query environment variables or whether the import path and method signature are correct for the target vLLM version.

examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml (1)
12-13: LGTM: Pipeline parallelism made explicit.

The addition explicitly sets pipeline parallelism to 1, making the configuration clearer and more maintainable.
examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml (1)
1-21: LGTM: Clean configuration with consistent naming.

The configuration properly scales down to 32 nodes × 4 GPUs with appropriate parallelism settings. The log directory and WandB names correctly match the cluster configuration.
examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml (1)
6-17: Both FP8 settings are correctly configured and compatible with this project's vLLM integration.The FP8 configuration in the file correctly implements end-to-end FP8 for both training and generation. Verification confirms:
- vllm_cfg.precision: "fp8" is correctly named and actively used in the codebase. The value is validated in nemo_rl/models/generation/fp8.py (line 243), which checks if vllm_cfg.get("precision") == "fp8" to enable FP8 weight quantization.
- vllm_cfg.use_deep_gemm: true is correctly spelled and implemented. It is used in nemo_rl/models/generation/fp8.py (line 271), which sets the environment variables VLLM_USE_DEEP_GEMM="1" and VLLM_USE_DEEP_GEMM_E8M0="0" to enable Deep GEMM optimization for FP8 operations.

Both parameters are used consistently across multiple FP8 configuration files in the repository and are validated against the project's implementation, not the upstream vLLM v0.11.2 API directly (these are custom vllm_cfg parameters specific to NeMo RL's integration layer).
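Based on the description above, the gating logic amounts to something like the following sketch. The function name is hypothetical; the env-var names and the precision/use_deep_gemm keys are taken from the review comment:

```python
import os

def maybe_enable_deep_gemm(vllm_cfg: dict) -> None:
    """Sketch: set the Deep GEMM env switches when FP8 generation requests it.

    Mirrors the behavior the review attributes to fp8.py: only when precision
    is "fp8" and use_deep_gemm is truthy are the vLLM switches exported, and
    the E8M0 scale format is explicitly disabled.
    """
    if vllm_cfg.get("precision") == "fp8" and vllm_cfg.get("use_deep_gemm"):
        os.environ["VLLM_USE_DEEP_GEMM"] = "1"
        os.environ["VLLM_USE_DEEP_GEMM_E8M0"] = "0"

# Example with the keys from grpo-deepseek-v3-64n8g-fp8-async-1off.yaml:
maybe_enable_deep_gemm({"precision": "fp8", "use_deep_gemm": True})
print(os.environ["VLLM_USE_DEEP_GEMM"])  # -> 1
```

Because these are process-wide environment variables, they must be set before the vLLM engine is constructed, which is why the project applies them in its integration layer rather than per request.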
Resolved review comments (outdated):
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
Draft for now because I messed up with rebase, will update
Signed-off-by: Guyue Huang <guyueh@nvidia.com>

fix fp8 for vllm v0.5
Signed-off-by: Guyue Huang <guyueh@nvidia.com>

Revert "Perf recipe for v0.5"
This reverts commit c35d0f4.

Fix Fp8 sequence padding for PP>1 case
Signed-off-by: root <root@pool0-00514.cm.cluster>

fix patching for MP>1
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Force-pushed from b9b3bad to 8d82ada
@terrykong this is ready but still no coverage on the FP8 code part
Signed-off-by: Guyue Huang <guyueh@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
What does this PR do?
Fix FP8 patches after vllm bump to v0.11.2
Usage

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
New Features
Improvements