feat: preparing TRTLLM MoE backend to support more kernels #2794

rosenrodt wants to merge 2 commits into flashinfer-ai:main from
Conversation
📝 Walkthrough

This PR enhances MoE benchmarking with actual routed-token metrics and CUDA profiler support, refactors batched GEMM dimension handling with validation improvements, introduces a generic launcher selector mechanism, adjusts dtype parameter ordering in MoE runners, and adds logging infrastructure.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the flexibility and robustness of the TRTLLM Mixture-of-Experts (MoE) backend. By moving from hardcoded assumptions to dynamic determination of matrix transposition and introducing a mechanism to filter out problematic kernels, it lays crucial groundwork for supporting a broader array of future kernels. Additionally, the changes provide more precise performance metrics in benchmarks, offering deeper insights into MoE computation efficiency.
Code Review
This pull request introduces significant refactoring to the TRTLLM MoE backend, preparing it to support a wider variety of kernels, including non-transposed ones. The changes are well-structured and align with the goal of making the backend more robust and extensible. Key improvements include generalizing the TrtllmGenBatchedGemmRunner to handle kernel-specific transposeMmaOutput settings, refactoring problem dimension setup into a helper function, and adding a skipQuirks function to avoid problematic kernel configurations. The benchmark scripts have also been enhanced with bandwidth calculation and more accurate TFLOPS metrics. I've identified a couple of minor issues with duplicated parameters in log messages and have provided suggestions for them. Overall, this is a solid contribution.
```cpp
FLASHINFER_LOG("NumBatches", numBatches, ", MaxNumCgasInBatchDim", maxNumCtasInBatchDim,
               ", MaxNumCtasInBatchDim", maxNumCtasInBatchDim, ", ShapeMNK",
```

```cpp
", maxNumCtasInBatchDim: ", maxNumCtasInBatchDim,
", maxNumCtasInBatchDim: ", maxNumCtasInBatchDim, ")");
```
Actionable comments posted: 5
🧹 Nitpick comments (2)
include/flashinfer/exception.h (1)

115-128: Log class is consistent with Warning class. The implementation correctly mirrors the `Warning` class pattern. Both use `std::cerr` for output, which is thread-safe in C++11+. Consider whether log-level filtering (controlled by an environment variable) would be useful for production debugging. The existing Python-side logging uses the `FLASHINFER_LOGLEVEL` environment variable per the learnings. A similar mechanism for C++ logs could provide parity.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@include/flashinfer/exception.h` around lines 115 - 128, The Log class should respect a FLASHINFER_LOGLEVEL environment variable like the Python logger and the existing Warning class; modify Log (constructor or emit method) to read FLASHINFER_LOGLEVEL (e.g., parse into levels like ERROR/WARN/INFO/DEBUG), store or compare a numeric threshold, and only write to std::cerr when the message level meets the configured threshold; ensure the same parsing/semantics as Warning so both Log and Warning behave consistently.

benchmarks/bench_moe_deepseek.py (1)
43-52: Gate profiler calls behind a flag to match project patterns.

`cuda_profiler_range()` unconditionally invokes `cudaProfilerStart/Stop()` on every benchmark run. While these calls are safe no-ops when no profiling tool is attached, other benchmarks in the codebase (e.g., `benchmarks/routines/moe_comm.py` with `nvtx_range(enabled=...)` and `bench_rope_quantize_fp8.py` with conditional `mode_ncu` checks) gate profiler control behind explicit flags. Consider adding a command-line flag (e.g., `--enable-profiler`) to match this pattern and make profiling opt-in.

Also applies to: 375-384, 483-492, 610-619
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/bench_moe_deepseek.py` around lines 43 - 52, The cuda_profiler_range context manager currently always calls cudaProfilerStart/Stop; make profiler control opt-in by adding a command-line flag (e.g., --enable-profiler) and only invoke CUDA profiler/NVTX when the flag is true: add a global or parsed argument (referenced from argparse in this file) and update cuda_profiler_range to check that flag before calling torch.cuda.cudart().cudaProfilerStart(), torch.cuda.nvtx.range_push(), range_pop(), and cudaProfilerStop(); likewise wrap or gate the other NVTX/profiler usages noted around lines 375-384, 483-492, and 610-619 to consult the same flag so profiling becomes explicitly enabled rather than unconditional.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@benchmarks/bench_moe_deepseek.py`:
- Around line 106-108: Rename the ambiguous single-letter variable I to a
descriptive name (e.g., intermediate_size) wherever it's used: replace I with
intermediate_size in the block that assigns H = CFG.hidden_size and I =
CFG.intermediate_size and in the subsequent FLOPS computation flops =
local_tokens * (2 * H * 2 * I + 2 * I * H), and also update the same rename in
the later helper usages around the second occurrence (lines 119-123) so all
references (intermediate_size, H, local_tokens, flops) remain consistent.
- Around line 99-109: The calc_tflops function (and the other throughput helper
in the same file that computes TFLOPS/throughput) must guard against
non-positive latency: clamp ms to a small positive epsilon (e.g. ms = max(ms,
1e-6)) or return 0.0 immediately when ms <= 0 to avoid divide-by-zero or
negative throughput, and apply the same fix to the corresponding helper at lines
112-133 so both functions consistently handle ms <= 0.
In `@csrc/trtllm_batched_gemm_runner.cu`:
- Around line 299-301: The error message concatenation prints
maxNumCtasInBatchDim twice; update the string built around transposeMmaOutput,
configIndex, and maxNumCtasInBatchDim so the duplicate is removed and the
intended field is logged (either remove the second ", maxNumCtasInBatchDim: "
segment or replace it with the correct variable name if another field was
meant); look for the concatenation that references transposeMmaOutput,
configIndex, and maxNumCtasInBatchDim in trtllm_batched_gemm_runner.cu and
correct the duplicated token accordingly.
- Around line 279-286: The debug log call to FLASHINFER_LOG prints two fields
"MaxNumCgasInBatchDim" and "MaxNumCtasInBatchDim" but passes
maxNumCtasInBatchDim for both; update the arguments so the
"MaxNumCgasInBatchDim" label is paired with the correct variable
(maxNumCgasInBatchDim) and "MaxNumCtasInBatchDim" remains paired with
maxNumCtasInBatchDim in the FLASHINFER_LOG call.
In `@csrc/trtllm_fused_moe_kernel_launcher.cu`:
- Around line 109-128: In get_launcher, the if (it == launchers_map.end()) block
contains a redundant FLASHINFER_CHECK(it != launchers_map.end(), ...); remove
the surrounding if and replace it with a single direct check: call
FLASHINFER_CHECK(it != launchers_map.end(), op_name, "...", tile_N,
"(selected_tile_count=", selected_tile_nums.size(), ")") after computing it, so
the error path is clear; reference symbols: get_launcher, launchers_map,
selected_tile_nums, tile_N, and FLASHINFER_CHECK.
---
Nitpick comments:
In `@benchmarks/bench_moe_deepseek.py`:
- Around line 43-52: The cuda_profiler_range context manager currently always
calls cudaProfilerStart/Stop; make profiler control opt-in by adding a
command-line flag (e.g., --enable-profiler) and only invoke CUDA profiler/NVTX
when the flag is true: add a global or parsed argument (referenced from argparse
in this file) and update cuda_profiler_range to check that flag before calling
torch.cuda.cudart().cudaProfilerStart(), torch.cuda.nvtx.range_push(),
range_pop(), and cudaProfilerStop(); likewise wrap or gate the other
NVTX/profiler usages noted around lines 375-384, 483-492, and 610-619 to consult
the same flag so profiling becomes explicitly enabled rather than unconditional.
In `@include/flashinfer/exception.h`:
- Around line 115-128: The Log class should respect a FLASHINFER_LOGLEVEL
environment variable like the Python logger and the existing Warning class;
modify Log (constructor or emit method) to read FLASHINFER_LOGLEVEL (e.g., parse
into levels like ERROR/WARN/INFO/DEBUG), store or compare a numeric threshold,
and only write to std::cerr when the message level meets the configured
threshold; ensure the same parsing/semantics as Warning so both Log and Warning
behave consistently.
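The threshold semantics the nitpick suggests for the C++ `Log` class can be sketched in Python (the benchmark's language). The level ordering below is an assumption for illustration; the real `FLASHINFER_LOGLEVEL` parsing lives in flashinfer's Python logger and may differ.

```python
import os

# Assumed ordering: lower number = more severe. Illustrative only.
_LEVELS = {"ERROR": 0, "WARNING": 1, "INFO": 2, "DEBUG": 3}


def should_log(message_level, env=None):
    """Return True if a message at `message_level` passes the configured threshold.

    Reads FLASHINFER_LOGLEVEL from `env` (defaulting to os.environ), falling
    back to WARNING when unset, mirroring the parity suggested for C++.
    """
    env = os.environ if env is None else env
    threshold = env.get("FLASHINFER_LOGLEVEL", "WARNING").upper()
    return _LEVELS.get(message_level.upper(), 0) <= _LEVELS.get(threshold, 1)
```

A C++ `Log::emit` could apply the same comparison before writing to `std::cerr`, so C++ and Python logging honor one environment variable consistently.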
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 42ea45b7-3c89-4025-bf3e-5475757042bc
📒 Files selected for processing (7)
- benchmarks/bench_moe_deepseek.py
- csrc/trtllm_batched_gemm_runner.cu
- csrc/trtllm_fused_moe_kernel_launcher.cu
- csrc/trtllm_fused_moe_runner.cu
- include/flashinfer/exception.h
- include/flashinfer/trtllm/batched_gemm/KernelRunner.h
- include/flashinfer/trtllm/fused_moe/runner.h
```python
def calc_tflops(local_tokens, ms):
    """Calculate TFLOPS using actual routed token count.

    With EP, only tokens routed to local experts are computed.
    Assumes uniform routing distribution across experts.
    FC1: [M, H] x [H, 2I]
    FC2: [M, I] x [I, H]
    FLOPs = 2 * local_tokens * (H*2I + I*H) = 6 * local_tokens * H * I
    """
    if num_local_experts is None:
        num_local_experts = CFG.num_experts
    H = CFG.hidden_size
    I = CFG.intermediate_size
    flops = local_tokens * (2 * H * 2 * I + 2 * I * H)
    return flops / (ms * 1e-3) / 1e12
```
Guard throughput computations against non-positive latency.
If ms <= 0, both helpers can raise/diverge during division. Add a small guard to keep benchmarking output resilient.
Proposed fix

```diff
 def calc_tflops(local_tokens, ms):
@@
-    H = CFG.hidden_size
-    I = CFG.intermediate_size
-    flops = local_tokens * (2 * H * 2 * I + 2 * I * H)
+    if ms <= 0:
+        return float("nan")
+    H = CFG.hidden_size
+    I = CFG.intermediate_size
+    flops = local_tokens * (2 * H * 2 * I + 2 * I * H)
     return flops / (ms * 1e-3) / 1e12


 def calc_bw(local_tokens, active_experts, ms):
@@
-    H = CFG.hidden_size
-    I = CFG.intermediate_size
+    if ms <= 0:
+        return float("nan")
+    H = CFG.hidden_size
+    I = CFG.intermediate_size
```

Also applies to: 112-133
🧰 Tools
🪛 Ruff (0.15.5)
[error] 107-107: Ambiguous variable name: I
(E741)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/bench_moe_deepseek.py` around lines 99 - 109, The calc_tflops
function (and the other throughput helper in the same file that computes
TFLOPS/throughput) must guard against non-positive latency: clamp ms to a small
positive epsilon (e.g. ms = max(ms, 1e-6)) or return 0.0 immediately when ms <=
0 to avoid divide-by-zero or negative throughput, and apply the same fix to the
corresponding helper at lines 112-133 so both functions consistently handle ms
<= 0.
```python
H = CFG.hidden_size
I = CFG.intermediate_size
flops = local_tokens * (2 * H * 2 * I + 2 * I * H)
```
Rename single-letter `I` to avoid Ruff E741 lint errors.

`I` is flagged as ambiguous (E741). Rename it to a descriptive identifier (e.g., `intermediate_size`) in both helpers.
Proposed fix

```diff
-    H = CFG.hidden_size
-    I = CFG.intermediate_size
-    flops = local_tokens * (2 * H * 2 * I + 2 * I * H)
+    H = CFG.hidden_size
+    intermediate_size = CFG.intermediate_size
+    flops = local_tokens * (
+        2 * H * 2 * intermediate_size + 2 * intermediate_size * H
+    )
@@
-    H = CFG.hidden_size
-    I = CFG.intermediate_size
+    H = CFG.hidden_size
+    intermediate_size = CFG.intermediate_size
-    weight_bytes = active_experts * (H * 2 * I + I * H) * NVFP4_BYTES
+    weight_bytes = (
+        active_experts * (H * 2 * intermediate_size + intermediate_size * H) * NVFP4_BYTES
+    )
```

Also applies to: 119-123
🧰 Tools
🪛 Ruff (0.15.5)
[error] 107-107: Ambiguous variable name: I
(E741)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/bench_moe_deepseek.py` around lines 106 - 108, Rename the
ambiguous single-letter variable I to a descriptive name (e.g.,
intermediate_size) wherever it's used: replace I with intermediate_size in the
block that assigns H = CFG.hidden_size and I = CFG.intermediate_size and in the
subsequent FLOPS computation flops = local_tokens * (2 * H * 2 * I + 2 * I * H),
and also update the same rename in the later helper usages around the second
occurrence (lines 119-123) so all references (intermediate_size, H,
local_tokens, flops) remain consistent.
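As a sanity check on the docstring's claim that the expression reduces to 6 * local_tokens * H * I, the FLOP count from the hunk can be isolated and verified; the function name `moe_flops` is introduced here only for illustration.

```python
def moe_flops(local_tokens, hidden_size, intermediate_size):
    """Per-layer FLOPs exactly as written in the reviewed hunk.

    2*H*(2I) multiply-adds for FC1 plus 2*I*H for FC2, per token;
    algebraically equal to 6 * local_tokens * H * I.
    """
    h, i = hidden_size, intermediate_size
    return local_tokens * (2 * h * 2 * i + 2 * i * h)
```

Since 2·H·2I + 2·I·H = 4·H·I + 2·H·I = 6·H·I, the rename to `intermediate_size` leaves the arithmetic untouched.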
```cpp
if (getBoolEnv("TRTLLM_BATCHED_GEMM_PRINT_NAME")) {
  FLASHINFER_LOG("NumBatches", numBatches, ", MaxNumCgasInBatchDim", maxNumCtasInBatchDim,
                 ", MaxNumCtasInBatchDim", maxNumCtasInBatchDim, ", ShapeMNK",
                 gemmData.mProblemDimensions.mM, gemmData.mProblemDimensions.mN,
                 gemmData.mProblemDimensions.mK, ", ValidShapeMNK",
                 gemmData.mProblemDimensions.mValidM, gemmData.mProblemDimensions.mValidN,
                 gemmData.mProblemDimensions.mValidK, ", Kernel", config.mFunctionName);
}
```
Duplicate field in debug logging.
Line 280-281 logs MaxNumCgasInBatchDim followed by MaxNumCtasInBatchDim with the same value (maxNumCtasInBatchDim). This appears to be a copy-paste error.
🔧 Proposed fix

```diff
 if (getBoolEnv("TRTLLM_BATCHED_GEMM_PRINT_NAME")) {
-  FLASHINFER_LOG("NumBatches", numBatches, ", MaxNumCgasInBatchDim", maxNumCtasInBatchDim,
-                 ", MaxNumCtasInBatchDim", maxNumCtasInBatchDim, ", ShapeMNK",
+  FLASHINFER_LOG("NumBatches", numBatches, ", MaxNumCtasInBatchDim", maxNumCtasInBatchDim,
+                 ", ShapeMNK",
                  gemmData.mProblemDimensions.mM, gemmData.mProblemDimensions.mN,
                  gemmData.mProblemDimensions.mK, ", ValidShapeMNK",
                  gemmData.mProblemDimensions.mValidM, gemmData.mProblemDimensions.mValidN,
                  gemmData.mProblemDimensions.mValidK, ", Kernel", config.mFunctionName);
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@csrc/trtllm_batched_gemm_runner.cu` around lines 279 - 286, The debug log
call to FLASHINFER_LOG prints two fields "MaxNumCgasInBatchDim" and
"MaxNumCtasInBatchDim" but passes maxNumCtasInBatchDim for both; update the
arguments so the "MaxNumCgasInBatchDim" label is paired with the correct
variable (maxNumCgasInBatchDim) and "MaxNumCtasInBatchDim" remains paired with
maxNumCtasInBatchDim in the FLASHINFER_LOG call.
```cpp
", transposeMmaOutput: ", transposeMmaOutput, ", configIndex: ", configIndex,
", maxNumCtasInBatchDim: ", maxNumCtasInBatchDim,
", maxNumCtasInBatchDim: ", maxNumCtasInBatchDim, ")");
```
Duplicate field in error message.

`maxNumCtasInBatchDim` is printed twice in the error message.
🔧 Proposed fix

```diff
 FLASHINFER_CHECK(err == 0,
                  "Error occurred when running GEMM!"
                  " (numBatches: ",
                  numBatches, ", GemmMNK: ", m, " ", n, " ", k, ", Kernel: ", config.mFunctionName,
-                 ", transposeMmaOutput: ", transposeMmaOutput, ", configIndex: ", configIndex,
-                 ", maxNumCtasInBatchDim: ", maxNumCtasInBatchDim,
-                 ", maxNumCtasInBatchDim: ", maxNumCtasInBatchDim, ")");
+                 ", transposeMmaOutput: ", transposeMmaOutput, ", configIndex: ", configIndex,
+                 ", maxNumCtasInBatchDim: ", maxNumCtasInBatchDim, ")");
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```cpp
", transposeMmaOutput: ", transposeMmaOutput, ", configIndex: ", configIndex,
", maxNumCtasInBatchDim: ", maxNumCtasInBatchDim, ")");
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@csrc/trtllm_batched_gemm_runner.cu` around lines 299 - 301, The error message
concatenation prints maxNumCtasInBatchDim twice; update the string built around
transposeMmaOutput, configIndex, and maxNumCtasInBatchDim so the duplicate is
removed and the intended field is logged (either remove the second ",
maxNumCtasInBatchDim: " segment or replace it with the correct variable name if
another field was meant); look for the concatenation that references
transposeMmaOutput, configIndex, and maxNumCtasInBatchDim in
trtllm_batched_gemm_runner.cu and correct the duplicated token accordingly.
```cpp
template <typename LauncherType>
// Select a launcher
LauncherType& get_launcher(
    std::unordered_map<int32_t, std::unique_ptr<LauncherType>>& launchers_map,
    std::set<int32_t> const& selected_tile_nums, int64_t& tile_N, int64_t& config,
    char const* op_name) {
  FLASHINFER_CHECK(!selected_tile_nums.empty(), op_name, ": no available tile_N candidates");

  if (tile_N == -1) {
    tile_N = *selected_tile_nums.begin();
  }

  auto it = launchers_map.find(static_cast<int32_t>(tile_N));
  if (it == launchers_map.end()) {
    FLASHINFER_CHECK(it != launchers_map.end(), op_name, ": failed to select launcher for tile_N ",
                     tile_N, " (selected_tile_count=", selected_tile_nums.size(), ")");
  }

  return *(it->second);
}
```
Redundant condition check in get_launcher.

The `FLASHINFER_CHECK` on lines 123-124 is inside an `if (it == launchers_map.end())` block, so its condition `it != launchers_map.end()` is always false when reached. The outer `if` makes the inner check redundant; the check alone suffices.
🔧 Proposed fix

```diff
   auto it = launchers_map.find(static_cast<int32_t>(tile_N));
-  if (it == launchers_map.end()) {
-    FLASHINFER_CHECK(it != launchers_map.end(), op_name, ": failed to select launcher for tile_N ",
-                     tile_N, " (selected_tile_count=", selected_tile_nums.size(), ")");
-  }
+  FLASHINFER_CHECK(it != launchers_map.end(), op_name, ": failed to select launcher for tile_N ",
+                   tile_N, " (selected_tile_count=", selected_tile_nums.size(), ")");
```

📝 Committable suggestion
+ tile_N, " (selected_tile_count=", selected_tile_nums.size(), ")");📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
template <typename LauncherType>
// Select a launcher
LauncherType& get_launcher(
    std::unordered_map<int32_t, std::unique_ptr<LauncherType>>& launchers_map,
    std::set<int32_t> const& selected_tile_nums, int64_t& tile_N, int64_t& config,
    char const* op_name) {
  FLASHINFER_CHECK(!selected_tile_nums.empty(), op_name, ": no available tile_N candidates");
  if (tile_N == -1) {
    tile_N = *selected_tile_nums.begin();
  }
  auto it = launchers_map.find(static_cast<int32_t>(tile_N));
  FLASHINFER_CHECK(it != launchers_map.end(), op_name, ": failed to select launcher for tile_N ",
                   tile_N, " (selected_tile_count=", selected_tile_nums.size(), ")");
  return *(it->second);
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@csrc/trtllm_fused_moe_kernel_launcher.cu` around lines 109 - 128, In
get_launcher, the if (it == launchers_map.end()) block contains a redundant
FLASHINFER_CHECK(it != launchers_map.end(), ...); remove the surrounding if and
replace it with a single direct check: call FLASHINFER_CHECK(it !=
launchers_map.end(), op_name, "...", tile_N, "(selected_tile_count=",
selected_tile_nums.size(), ")") after computing it, so the error path is clear;
reference symbols: get_launcher, launchers_map, selected_tile_nums, tile_N, and
FLASHINFER_CHECK.
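The selection logic after the suggested fix can be illustrated with a small Python analogue (hypothetical names; the real code is the C++ template above). `min(...)` plays the role of `*selected_tile_nums.begin()` on a `std::set`, which yields the smallest candidate, and a single direct check replaces the redundant outer branch.

```python
def get_launcher(launchers_map, selected_tile_nums, tile_n):
    """Analogue of the fixed get_launcher: default selection + one lookup check.

    launchers_map maps tile_N -> launcher object; selected_tile_nums is a
    set of candidate tile sizes; tile_n == -1 means "pick the default".
    """
    if not selected_tile_nums:
        raise ValueError("no available tile_N candidates")
    if tile_n == -1:
        tile_n = min(selected_tile_nums)  # smallest candidate, like *set.begin()
    launcher = launchers_map.get(tile_n)
    if launcher is None:  # single direct check, no redundant outer branch
        raise ValueError(f"failed to select launcher for tile_N {tile_n}")
    return launcher
```

Collapsing the branch keeps the error path identical while removing the dead inner condition the review flagged.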
📌 Description
This supersedes #2741. Contrary to #2741, this PR assumes that TRTLLM MoE kernels interpret the indices of the routing table at CGA granularity. As a result, the routing kernel can stay unchanged.
This is for preparing to support non-SwapAb/... kernels in TRTLLM MoE. Cubins will be prepared separately.
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- Installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- Ran `pre-commit install`.
- Ran `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests

- Unit tests passed (`unittest`, etc.).

Reviewer Notes
Summary by CodeRabbit
New Features
Improvements
Bug Fixes