[None][feat] AutoDeploy: Perf improvement for small batch size #9163
Conversation
Signed-off-by: Chenghao Zhang <[email protected]>
📝 Walkthrough
Two files in the auto_deploy module are modified: quant.py adds runtime CUDA capability detection and a conditional enable_cuda_core flag based on SM 8.9 or 12.0 support, narrowing the CUDA-core path to specific hardware; nemotron_h.py introduces a new optimized forward method for NemotronHTopkRouter using fused CUDA kernel-based top-k routing, registered via module patching.
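As a rough illustration of the quant.py side of the change, a runtime capability check along the following lines could drive the flag. This is a minimal sketch: the helper name and the call site are assumptions for illustration; only the SM 8.9 / 12.0 gating comes from the change described above.

```python
import torch


def _cuda_core_path_supported() -> bool:
    """Illustrative helper: allow the CUDA-core GEMM path only on SM 8.9 or SM 12.0.

    The function name and where the resulting flag is consumed are assumptions;
    the SM gating itself is taken from the walkthrough above.
    """
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) in ((8, 9), (12, 0))


# The flag would then be forwarded to the quantized GEMM path used for small batches.
enable_cuda_core = _cuda_core_path_supported()
```

Checking at runtime rather than at build time keeps one binary working across architectures while still restricting the fast path to hardware where it is expected to pay off.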
Sequence Diagram

```mermaid
sequenceDiagram
    participant caller as Caller
    participant forward as _nemotron_h_topk_router_forward
    participant reshape as Reshape Input
    participant linear as Router Linear
    participant kernel as noaux_tc_op Kernel
    participant return as Return Results

    caller->>forward: hidden_states
    forward->>reshape: reshape(hidden_states)
    reshape->>linear: reshaped_input
    linear->>linear: compute router logits
    linear->>kernel: logits
    rect rgba(100, 150, 200, 0.2)
        note over kernel: CUDA kernel execution<br/>(top-k selection)
    end
    kernel->>kernel: extract top-k weights & indices
    kernel->>return: weights, indices
    return->>caller: (indices, weights)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 1
🧹 Nitpick comments (1)
tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py (1)
91-117: Fused router forward looks consistent with existing MOE interface; consider `reshape` instead of `view`.

The new `_nemotron_h_topk_router_forward` keeps the contract that `self.gate(hidden_states)` returns `(topk_indices, topk_weights)` in the shape expected by `torch_ops.auto_deploy.torch_moe`, without extra reshaping, which is aligned with the NemotronH MOE usage pattern. Based on learnings.

One minor robustness tweak: `hidden_states = hidden_states.view(-1, self.config.hidden_size)` assumes that `hidden_states` is contiguous. To avoid surprises if a non-contiguous tensor ever reaches this router, using `reshape` (or `.contiguous().view(...)`) would be safer:

```diff
- hidden_states = hidden_states.view(-1, self.config.hidden_size)
+ hidden_states = hidden_states.reshape(-1, self.config.hidden_size)
```

Behavior otherwise looks correct: logits are computed in fp32, and `noaux_tc_op` receives logits along with `e_score_correction_bias`, `n_group`, `topk_group`, `top_k`, and `routed_scaling_factor` in a sensible order.
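To make the discussion above concrete, here is a minimal sketch of what the patched forward could look like with the suggested `reshape`. The attribute names on `self` and the binding of `noaux_tc_op` are assumptions inferred from this review, not the actual module code.

```python
import torch
import torch.nn.functional as F

# `noaux_tc_op` is assumed to be the fused top-k routing kernel exposed by
# TensorRT-LLM; its exact import path is not shown here and is an assumption.


def _nemotron_h_topk_router_forward(self, hidden_states: torch.Tensor):
    # Flatten tokens; reshape() also handles non-contiguous inputs, unlike view().
    hidden_states = hidden_states.reshape(-1, self.config.hidden_size)

    # Router logits are computed in fp32, as noted in the review.
    router_logits = F.linear(hidden_states.float(), self.weight.float())

    # Fused CUDA kernel: grouped top-k selection plus scaling in a single call.
    # The argument order below follows the review's description of the op.
    topk_weights, topk_indices = noaux_tc_op(
        router_logits,
        self.e_score_correction_bias,
        self.config.n_group,
        self.config.topk_group,
        self.top_k,
        self.config.routed_scaling_factor,
    )

    # Returned in the shape torch_ops.auto_deploy.torch_moe expects.
    return topk_indices, topk_weights
```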
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py (2 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.
Applied to files:
tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.
Applied to files:
tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py
📚 Learning: 2025-10-20T17:07:18.745Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.
Applied to files:
tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py (1)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (3)
n_group (241-241), topk_group (243-243), top_k (240-240)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py (1)
169-169: Patch registration for `NemotronHTopkRouter` forward looks correct.

Binding `_nemotron_h_topk_router_forward` via `CUSTOM_MODULE_PATCHES["NemotronHTopkRouter"]` is consistent with the other NemotronH patches and should seamlessly swap in the fused router implementation at load time.
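For illustration, the registration amounts to something like the sketch below. Whether the registry stores the bare function or a small patch descriptor is an assumption; only the key name and the function name come from the review above.

```python
# Assumed registration pattern: map the HF module class name to the replacement
# forward so the patch is applied when the model is loaded.
CUSTOM_MODULE_PATCHES["NemotronHTopkRouter"] = _nemotron_h_topk_router_forward
```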
Could you please also post accuracy numbers for tp1, tp2?
Signed-off-by: Chenghao Zhang <[email protected]>
Signed-off-by: Chenghao Zhang <[email protected]>
Had to use the updated model to get a good accuracy number. The accuracy I got is: MMLU: 73.879, gsm8k: 86.884
/bot run
PR_Github #24797 [ run ] triggered by Bot. Commit:
PR_Github #24797 [ run ] completed with state
/bot run
PR_Github #24808 [ run ] triggered by Bot. Commit:
PR_Github #24808 [ run ] completed with state
/bot run
PR_Github #24902 [ run ] triggered by Bot. Commit:
PR_Github #24902 [ run ] completed with state
/bot run
PR_Github #24934 [ run ] triggered by Bot. Commit:
PR_Github #24934 [ run ] completed with state
…A#9163)
Signed-off-by: Chenghao Zhang <[email protected]>
Co-authored-by: Suyog Gupta <[email protected]>
Signed-off-by: lkomali <[email protected]>
Summary by CodeRabbit
Release Notes
For nemotron MOE: