Conversation
Summary of Changes (Gemini Code Assist): This pull request refactors the management of input buffers for CUDA graphs, separating buffer definitions and population logic into distinct, phase-specific dataclasses.
Activity
/tag-and-rerun-ci
Code Review
This pull request effectively refactors the CUDA graph input buffer management, improving code organization and extensibility. The introduction of the `ForwardInputBuffers` base class with a generic buffer pooling mechanism is a solid design choice. The creation of specialized `DecodeInputBuffers` and `PrefillInputBuffers` subclasses successfully encapsulates phase-specific logic, making the code cleaner and more maintainable. The changes across `CudaGraphRunner`, `PiecewiseCudaGraphRunner`, and `ModelRunner` are consistent with this new design. I have one minor suggestion to make an assertion message clearer.
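The phase-specific encapsulation can be illustrated with a minimal, torch-free sketch (the class and method names come from this PR; the field set, the `max_bs` parameter, and the list-based buffers are simplifying assumptions). The point of the pattern is that CUDA graph replay requires the captured input buffers to be updated in place, so the dataclass copies each forward batch into fixed preallocated storage instead of allocating new tensors per step:

```python
from dataclasses import dataclass


@dataclass
class ForwardBatch:
    """Hypothetical, simplified forward batch: just the two fields we need."""
    input_ids: list
    seq_lens_cpu: list


class DecodeInputBuffers:
    def __init__(self, max_bs):
        # Preallocated to the maximum batch size captured in the graph.
        self.input_ids = [0] * max_bs
        self.seq_lens_cpu = [0] * max_bs

    def populate_from_forward_batch(self, batch):
        bs = len(batch.input_ids)
        # In-place copy into the preallocated buffers (analogous to
        # tensor.copy_ in the real implementation). Nothing is returned;
        # callers slice the buffer directly, e.g. buffers.seq_lens_cpu[:bs].
        self.input_ids[:bs] = batch.input_ids
        self.seq_lens_cpu[:bs] = batch.seq_lens_cpu


buffers = DecodeInputBuffers(max_bs=8)
batch = ForwardBatch(input_ids=[5, 7, 9], seq_lens_cpu=[12, 3, 44])
buffers.populate_from_forward_batch(batch)
print(buffers.seq_lens_cpu[:3])  # [12, 3, 44]
```

Because the storage is fixed, the same buffer objects can be baked into a captured graph and refreshed before every replay.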
```python
assert isinstance(
    buffer, torch.Tensor
), f"Field {name} is expected to be a torch.Tensor or a dict of torch.Tensor, but got {type(buffer)}."
```
The assertion message here is slightly misleading. Since the `elif isinstance(buffer, dict):` block handles dictionaries, the `else` block will only be reached by types other than `dict` and `None`. The error message should reflect that it expects a `torch.Tensor` at this point, not a `torch.Tensor` or a `dict`.
```diff
 assert isinstance(
     buffer, torch.Tensor
-), f"Field {name} is expected to be a torch.Tensor or a dict of torch.Tensor, but got {type(buffer)}."
+), f"Field {name} is expected to be a torch.Tensor, but got {type(buffer)}."
```
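To make the branch structure concrete, here is a hedged, torch-free sketch of such a validation loop (the function name `check_buffer` and the `Tensor` stand-in are hypothetical, not taken from the PR): `None` is allowed for optional buffers and dicts are recursed into, so only other types can ever reach the final `assert`.

```python
class Tensor:
    """Stand-in for torch.Tensor so the sketch runs without torch."""


def check_buffer(name, buffer):
    if buffer is None:
        return  # optional buffers may be absent
    elif isinstance(buffer, dict):
        # Dict-valued fields are recursed into here, so the else branch
        # below can only ever see plain (non-dict, non-None) values.
        for sub_name, sub_buffer in buffer.items():
            check_buffer(f"{name}.{sub_name}", sub_buffer)
    else:
        assert isinstance(buffer, Tensor), (
            f"Field {name} is expected to be a torch.Tensor, but got {type(buffer)}."
        )


check_buffer("hidden_states", Tensor())      # passes
check_buffer("dp_buffers", {"a": Tensor()})  # passes via the dict branch
try:
    check_buffer("positions", [1, 2, 3])     # a list reaches the else branch
except AssertionError as e:
    print(e)  # Field positions is expected to be a torch.Tensor, but got <class 'list'>.
```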
da9e79d to 70bac89 (compare)
/rerun-stage stage-c-test-4-gpu-b200
✅ Triggered

/rerun-stage stage-c-test-4-gpu-b200
✅ Triggered
704ea92 to f6a106d (compare)
f6a106d to a4a295e (compare)
This reverts commit 84c67c8.
Motivation
The CUDA graph input buffer management was concentrated in a monolithic `GraphInputBuffers` class that mixed decode, prefill, and speculative decoding concerns. Additionally, each CUDA graph runner (decode, prefill, eagle draft, eagle draft-extend) allocated its own independent tensor buffers, missing opportunities to share memory across runners that use buffers with the same name, dtype, and device.

This PR refactors the buffer management into:
- A `ForwardInputBuffers` base class with a shared buffer pool (`_forward_input_buffer_pool`) that automatically reuses the largest existing allocation for each named buffer.
- Phase-specific subclasses (`DecodeInputBuffers`, `PrefillInputBuffers`, `EagleDraftInputBuffers`, etc.) that declare their fields and are pooled via `share_buffers()`.

Modifications
- `input_buffers.py`: `ForwardInputBuffers` base class
  - Replaces `GraphInputBuffers` with a lightweight base dataclass.
  - `share_buffers()` iterates over all tensor fields and registers them in a module-level `_forward_input_buffer_pool`. If a buffer with the same name already exists and is larger, the larger one is reused via `as_strided`.
  - Checks `torch.is_inference_mode_enabled()` to prevent inference-mode tensors from contaminating the pool (they cannot be updated in-place outside `InferenceMode`, which would break CUDA graph capture).
- `cuda_graph_runner.py`: `DecodeInputBuffers`
  - Moves the decode buffer fields, `create()`, and `populate_from_forward_batch()` into a new `DecodeInputBuffers(ForwardInputBuffers)` dataclass.
  - `populate_from_forward_batch` no longer returns `seq_lens_cpu`; callers access `buffers.seq_lens_cpu[:bs]` directly.
- `piecewise_cuda_graph_runner.py`: `PrefillInputBuffers`
  - Adds `PrefillInputBuffers(ForwardInputBuffers)` for prefill-phase buffers.
  - Consolidates the separate `self.<field>` attributes into a single `self.buffers` object; all access updated to `self.buffers.<field>`.
- `eagle_draft_cuda_graph_runner.py`: `EagleDraftInputBuffers`
  - New `EagleDraftInputBuffers(ForwardInputBuffers)` with fields `input_ids`, `req_pool_indices`, `out_cache_loc`, `positions`, `mrope_positions`, `seq_lens`, `seq_lens_cpu`, `extend_seq_lens`, `topk_p`, `topk_index`, `hidden_states`, and optional DP gather buffers.
  - `self.<field>` buffer accesses in `EAGLEDraftCudaGraphRunner` replaced with `self.buffers.<field>`.
- `eagle_draft_extend_cuda_graph_runner.py`: `EagleDraftExtendInputBuffers`
  - New `EagleDraftExtendInputBuffers(ForwardInputBuffers)` with fields `input_ids`, `req_pool_indices`, `out_cache_loc`, `positions`, `mrope_positions`, `hidden_states`, `seq_lens`, `seq_lens_cpu`, `extend_seq_lens`, `accept_length`, `next_token_logits_buffer`, and optional DP gather buffers.
  - `self.<field>` buffer accesses in `EAGLEDraftExtendCudaGraphRunner` replaced with `self.buffers.<field>`.
- `multi_layer_eagle_draft_extend_cuda_graph_runner.py`: `MultiLayerEagleDraftExtendInputBuffers`
  - New `MultiLayerEagleDraftExtendInputBuffers(ForwardInputBuffers)` for multi-layer EAGLE.
  - Buffer accesses updated to `self.buffers.<field>` and `self.next_cuda_graph_runner.buffers.<field>`.
- `model_runner.py`
  - Imports `DecodeInputBuffers` from `cuda_graph_runner` instead of the removed `GraphInputBuffers` from `input_buffers`.

Checklist