fix: Fix fp8 after vllm v0.11.2 bump #1660
Conversation
📝 Walkthrough

Multiple new LLM performance configuration files were added for GRPO variants across different cluster scales, one existing config was simplified by removing sequence-packing settings, and the FP8 weight post-processing logic in the generation module was significantly refactored to support DeepGEMM optimization and conditional MoE backend handling.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 2
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron-fp8-e2e.yaml (0 hunks)
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml (1 hunks)
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml (1 hunks)
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml (1 hunks)
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml (1 hunks)
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml (1 hunks)
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml (1 hunks)
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml (1 hunks)
- nemo_rl/models/generation/fp8.py (2 hunks)
💤 Files with no reviewable changes (1)
- examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron-fp8-e2e.yaml
🧰 Additional context used
📓 Path-based instructions (5)
examples/configs/recipes/**/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)
Files:
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
!(**/tests/**|**/test_*.py|**/test_*.sh)
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year
Files:
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
- nemo_rl/models/generation/fp8.py
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code
Files:
nemo_rl/models/generation/fp8.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
For any source file under nemo_rl/*.py that defines a class or function decorated with @ray.remote, add a coverage pragma (# pragma: no cover) because these run in separate Ray processes
Files:
nemo_rl/models/generation/fp8.py
**/*.{py,sh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)
Files:
nemo_rl/models/generation/fp8.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes
Applied to files:
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
📚 Learning: 2025-09-18T14:20:36.297Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-8b-base-2n8g-fsdp2tp2.v1.yaml:113-120
Timestamp: 2025-09-18T14:20:36.297Z
Learning: In distillation workflows, the teacher policy does not perform generation - it only does inference/logprob computation on sequences generated by the student policy. Therefore, teacher generation configuration mismatches (like vLLM tensor parallelism settings) and colocation concerns are not relevant.
Applied to files:
examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/vlm/*.yaml : Recipe YAML files should follow the naming pattern: vlm_<algo>-<model>-<nodes>n<gpus>g-<strategy>[-modifiers][.vN].yaml for VLM recipes
Applied to files:
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/**/*.yaml : When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)
Applied to files:
examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Lint check
- GitHub Check: Post submodule check comment / Comment on PR
- GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (14)
examples/configs/recipes/llm/performance/grpo-qwen3-235b-16n4g.yaml (4)
1-3: ✓ Configuration inheritance and checkpoint naming are consistent.

The file correctly references the base 16n8g configuration and names the checkpoint directory to match the variant.
7-10: Verify pipeline layer distribution for uniform load balancing.

The configuration specifies num_layers_in_first_pipeline_stage: 23 and num_layers_in_last_pipeline_stage: 23, but does not specify the distribution for intermediate stages. With pipeline_model_parallel_size: 4, there are two intermediate stages whose layer counts are undefined. This could result in uneven computation distribution across pipeline stages. Verify that the layer counts across all four pipeline stages sum correctly and are appropriately balanced for the 235B model.
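As a quick sanity check, the split can be computed directly. This sketch assumes 94 total transformer layers (the published depth of Qwen3-235B-A22B; an assumption, since the exact model depth is not stated in the review):

```python
# Hypothetical balance check for the 4-stage pipeline split discussed above.
# total_layers = 94 is an assumption; adjust for the actual model.
total_layers = 94
pp_size = 4
first_stage, last_stage = 23, 23  # values from the config

remaining = total_layers - first_stage - last_stage
intermediate_stages = pp_size - 2
assert remaining % intermediate_stages == 0, "intermediate stages would be uneven"

per_stage = [first_stage] + [remaining // intermediate_stages] * intermediate_stages + [last_stage]
print(per_stage)  # -> [23, 24, 24, 23]
```

Under that assumption the implicit intermediate stages each get 24 layers, which is close to balanced; the slightly lighter first/last stages are a common choice since they also hold the embedding and output layers.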
11-13: Clarify FP8 configuration alignment with PR purpose.

The PR description states the goal is to "Fix FP8 patches after vllm bump to v0.11.2," but this configuration file does not include FP8-specific quantization settings (e.g., weights_dtype, quantization_type, or use_fp8_kv_cache). Verify whether FP8 configuration is inherited from the base config (grpo-qwen3-235b-16n8g.yaml), applied globally via defaults, or should be explicitly set here for the vllm v0.11.2 compatibility fix.
14-20: ✓ Logger and cluster configuration are consistent and well-organized.

The naming convention is consistently applied across checkpoint, log, and wandb directories, and the cluster definition (16 nodes, 4 GPUs per node) correctly represents the target infrastructure.
examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml (3)
1-20: LGTM! Well-structured FP8 configuration recipe.

The file follows the correct naming convention and is properly located. The configuration cleanly extends the base async recipe with FP8-specific overrides for both training (Megatron) and generation (vLLM) components, with consistent naming across checkpoint, log, and wandb paths.
1-1: Base configuration file exists and is correctly referenced.

The file ./grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml exists in the same directory and is properly referenced by the relative path in defaults.
6-12: FP8 configuration is correct.

The blockwise FP8 recipe with e4m3 format is well-documented in NVIDIA Transformer Engine and appropriate for training. The NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1 environment variable relaxes the default constraint that scales be powers of 2, which is a standard setting. Float8BlockScaling uses block-wise scaling for FP8 tensors where values within each block share a common scaling factor, and blockwise is a valid fp8_recipe choice in training configurations.
examples/configs/recipes/llm/performance/grpo-qwen3-235b-32n4g-async-1off.yaml (1)
1-20: LGTM!

Configuration follows the naming conventions and the parallelism math is consistent: 128 GPUs (32×4) with TP=8 and PP=4 yields 4 data-parallel replicas. Note that TP=8 with 4 GPUs per node implies tensor parallelism spans across 2 nodes, which is valid for large models but relies on high-speed inter-node interconnects. Based on learnings, the recipe YAML naming pattern is correctly applied.
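The reviewer's parallelism arithmetic can be checked in a few lines (a sketch of the math, not project code):

```python
# Verify the data-parallel replica count for the 32n4g recipe.
nodes, gpus_per_node = 32, 4
tp, pp = 8, 4  # tensor- and pipeline-parallel sizes from the config

total_gpus = nodes * gpus_per_node       # 128
assert total_gpus % (tp * pp) == 0, "world size must be divisible by TP*PP"
dp = total_gpus // (tp * pp)             # data-parallel replicas

# TP=8 with only 4 GPUs per node means each TP group spans multiple nodes.
nodes_per_tp_group = -(-tp // gpus_per_node)  # ceiling division
print(dp, nodes_per_tp_group)  # -> 4 2
```

The two-node TP groups are what makes the interconnect caveat in the comment above matter: all-reduce traffic inside a TP group crosses the node boundary on every layer.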
nemo_rl/models/generation/fp8.py (3)
592-600: Correct use of in-place copy for weight scale updates.

Using copy_() preserves the existing torch.nn.Parameter object and its weight_loader attribute, which is required for model refit functionality. This is consistent with the approach used in maybe_post_process_fp8_weight_block.
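The mechanism is easy to illustrate without torch: mutating an object in place keeps any attributes attached to it, while rebinding the name to a fresh object loses them. This toy analogue stands in for param.data.copy_() versus re-creating a Parameter; FakeParam and weight_loader are illustrative stand-ins, not project code:

```python
class FakeParam(list):
    """Toy stand-in for torch.nn.Parameter that can carry extra attributes."""

param = FakeParam([1.0, 2.0])
param.weight_loader = lambda w: None  # the attribute that refit relies on

# In-place update, analogous to param.data.copy_(new_weights):
param[:] = [3.0, 4.0]
assert hasattr(param, "weight_loader")  # metadata survives the update

# Rebinding, analogous to layer.weight = torch.nn.Parameter(new_weights),
# produces a fresh object with no weight_loader attached:
rebound = FakeParam([3.0, 4.0])
assert not hasattr(rebound, "weight_loader")
```

This is why the review flags copy_() as the right tool: any code that later looks up weight_loader on the layer's parameter keeps working after the weights change.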
626-659: LGTM! Clean unified weight post-processing logic.

The refactored logic correctly handles all code paths:
- Flashinfer backend (swap w13 to w31) vs. default (use original)
- DeepGEMM enabled (apply post-processing) vs. disabled (skip)
- Always copy processed weights back using copy_() to preserve Parameters

The unconditional final copy ensures both modified and unmodified weight paths correctly update the layer's parameters while maintaining the weight_loader attribute for refit.
610-621: Verify AITER enablement approach for vLLM v0.11.2 compatibility.

The reference to rocm_aiter_ops.is_fused_moe_enabled() appears inconsistent with vLLM v0.11.2's AITER API design. vLLM uses environment variable switches (VLLM_ROCM_USE_AITER as the master switch and VLLM_ROCM_USE_AITER_MOE for specific features) to control AITER kernel selection rather than runtime method calls. Confirm whether the code should query environment variables or whether the import path and method signature are correct for the target vLLM version.

examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-async-1off.yaml (1)
12-13: LGTM: Pipeline parallelism made explicit.

The addition explicitly sets pipeline parallelism to 1, making the configuration clearer and more maintainable.
examples/configs/recipes/llm/performance/grpo-deepseek-v3-32n4g.yaml (1)
1-21: LGTM: Clean configuration with consistent naming.

The configuration properly scales down to 32 nodes × 4 GPUs with appropriate parallelism settings. The log directory and WandB names correctly match the cluster configuration.
examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.yaml (1)
6-17: Both FP8 settings are correctly configured and compatible with this project's vLLM integration.The FP8 configuration in the file correctly implements end-to-end FP8 for both training and generation. Verification confirms:
- vllm_cfg.precision: "fp8" is correctly named and actively used in the codebase. The value is validated in nemo_rl/models/generation/fp8.py (line 243), which checks if vllm_cfg.get("precision") == "fp8" to enable FP8 weight quantization.
- vllm_cfg.use_deep_gemm: true is correctly spelled and implemented. It is used in nemo_rl/models/generation/fp8.py (line 271), which sets the environment variables VLLM_USE_DEEP_GEMM="1" and VLLM_USE_DEEP_GEMM_E8M0="0" to enable Deep GEMM optimization for FP8 operations.

Both parameters are used consistently across multiple FP8 configuration files in the repository and are validated against the project's implementation, not the upstream vLLM v0.11.2 API directly (these are custom vllm_cfg parameters specific to NeMo RL's integration layer).
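Based on the description above, the gating logic amounts to something like the following sketch. The function name is hypothetical; the env-var names and the precision/use_deep_gemm keys are taken from the review comment:

```python
import os

def maybe_enable_deep_gemm(vllm_cfg: dict) -> None:
    """Sketch: set the Deep GEMM env switches when FP8 generation requests it.

    Mirrors the behavior the review attributes to fp8.py: only when precision
    is "fp8" and use_deep_gemm is truthy are the vLLM switches exported, and
    the E8M0 scale format is explicitly disabled.
    """
    if vllm_cfg.get("precision") == "fp8" and vllm_cfg.get("use_deep_gemm"):
        os.environ["VLLM_USE_DEEP_GEMM"] = "1"
        os.environ["VLLM_USE_DEEP_GEMM_E8M0"] = "0"

# Example with the keys from grpo-deepseek-v3-64n8g-fp8-async-1off.yaml:
maybe_enable_deep_gemm({"precision": "fp8", "use_deep_gemm": True})
print(os.environ["VLLM_USE_DEEP_GEMM"])  # -> 1
```

Because these are process-wide environment variables, they must be set before the vLLM engine is constructed, which is why the project applies them in its integration layer rather than per request.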
Resolved review comments (outdated):
- examples/configs/recipes/llm/performance/grpo-deepseek-v3-64n4g-async-1off.yaml
- examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml
Draft for now because I messed up with rebase, will update
Signed-off-by: Guyue Huang <guyueh@nvidia.com>

fix fp8 for vllm v0.5
Signed-off-by: Guyue Huang <guyueh@nvidia.com>

Revert "Perf recipe for v0.5"
This reverts commit c35d0f4.

Fix Fp8 sequence padding for PP>1 case
Signed-off-by: root <root@pool0-00514.cm.cluster>

fix patching for MP>1
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Force-pushed from b9b3bad to 8d82ada
@terrykong this is ready but still no coverage on the FP8 code part
Signed-off-by: Guyue Huang <guyueh@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
What does this PR do?
Fix FP8 patches after vllm bump to v0.11.2
Usage

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
New Features
Improvements