perf: Speed up fp4 quantization for small batch with swizzling for cutlass MoE #2025
Walkthrough

Grid sizing for several quantization kernels now computes `effectiveRows` for swizzled SF layouts and clamps it to SM capacity; the per-row quantization loop was refactored to early-exit on padding-only rows (skipping data work) while zeroing SF outputs in both the padding and data paths.
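As a rough illustration of that refactor, here is a minimal CUDA sketch of a grid-strided row loop with the padding fast path; the kernel name and the commented-out helpers are hypothetical, not FlashInfer's actual code:

```cuda
// Illustrative sketch only. numRows = real data rows; numPaddedRows = rows
// after swizzled-SF-layout padding (a multiple of the layout's row tile).
__global__ void quantizeRowsSketch(int numRows, int numPaddedRows) {
  for (int rowIdx = blockIdx.x; rowIdx < numPaddedRows; rowIdx += gridDim.x) {
    if (rowIdx >= numRows) {
      // Padding-only row: zero the scale-factor (SF) outputs and skip the
      // input loads and quantization math entirely.
      // zeroScaleFactors(rowIdx);  // hypothetical helper
      continue;
    }
    // Data row: load inputs, quantize, write outputs, and zero SF entries
    // that fall in the layout's column padding.
    // quantizeRow(rowIdx);         // hypothetical helper
  }
}
```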
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Host
    participant KernelLauncher
    participant GPUKernel
    note right of KernelLauncher: computeEffectiveRows(m, layout, SMs, blocksPerSM)
    Host->>KernelLauncher: request quantization (m, layout, ...)
    KernelLauncher->>GPUKernel: launch kernel with grid.x = effectiveRows
    GPUKernel->>GPUKernel: compute rowIdx
    alt rowIdx is padding
        GPUKernel->>GPUKernel: zero SF outputs for padding columns
        GPUKernel-->>GPUKernel: skip input load & quantize
    else data row
        GPUKernel->>GPUKernel: load input vector
        GPUKernel->>GPUKernel: perform quantization & write outputs
        GPUKernel->>GPUKernel: zero SF outputs for padding columns (if any)
    end
    GPUKernel->>Host: kernel completes
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Summary of Changes

Hello @bkryu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request focuses on improving the performance of FP4 quantization, particularly for small batch sizes within Cutlass Mixture of Experts (MoE) applications. The improvements come from adjusting CUDA kernel launch parameters to fully utilize the hardware for swizzled data layouts, and from refining the quantization kernel's logic to process padded data efficiently, minimizing redundant computation.
/bot run
Code Review
This pull request introduces performance optimizations for FP4 quantization, particularly for small batch sizes with swizzled layouts. The changes involve adjusting the CUDA grid dimensions to account for padded rows and refactoring the quantization kernel to handle padding rows more efficiently.
My review focuses on improving code maintainability by addressing code duplication. I've identified two areas where logic is repeated and have suggested creating helper functions or restructuring the code to eliminate this duplication. These changes should make the code cleaner and easier to maintain without affecting the performance improvements.
Actionable comments posted: 2
🧹 Nitpick comments (1)
csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh (1)
808-862: Refactor duplicated SF output pointer retrieval.

The SF output pointer retrieval code (lines 816-818) is duplicated from the padding path (lines 798-800). This duplication increases maintenance burden.
Consider hoisting the SF pointer retrieval outside the `if (isRowPadding)` branch to eliminate the duplication:

```diff
+  for (int batchIdx = 0; batchIdx < numbatches; batchIdx++) {
+    for (int colIdx = threadIdx.x; colIdx < numColThreadsForSf; colIdx += blockDim.x) {
+      std::optional<int> optionalBatchIdx = batchIdx;
+      std::optional<int> optionalNumRows = numRows;
+
+      // The SF output pointer (retrieved once for both paths).
+      auto sf_out = cvt_quant_get_sf_out_offset<uint32_t, CVT_NUM_THREADS_PER_SF>(
+          optionalBatchIdx, rowIdx, colIdx, optionalNumRows, numPaddedCols / SF_VEC_SIZE, SFout,
+          layout);
+
       if (isRowPadding) {
-        // Fast path: This row is entirely padding, only zero out scale factors
-        for (int batchIdx = 0; batchIdx < numbatches; batchIdx++) {
-          for (int colIdx = threadIdx.x; colIdx < numColThreadsForSf; colIdx += blockDim.x) {
-            std::optional<int> optionalBatchIdx = batchIdx;
-            std::optional<int> optionalNumRows = numRows;
-
-            // The SF output pointer.
-            auto sf_out = cvt_quant_get_sf_out_offset<uint32_t, CVT_NUM_THREADS_PER_SF>(
-                optionalBatchIdx, rowIdx, colIdx, optionalNumRows, numPaddedCols / SF_VEC_SIZE, SFout,
-                layout);
-
-            // Set the SF padding to 0.
-            if (sf_out != nullptr) {
-              sf_out[0] = 0x00;
-            }
-          }
-        }
+        // Fast path: zero SF only
+        if (sf_out != nullptr) {
+          sf_out[0] = 0x00;
+        }
       } else {
-        // Normal path: This row contains actual data
-        for (int batchIdx = 0; batchIdx < numbatches; batchIdx++) {
-          for (int colIdx = threadIdx.x; colIdx < numColThreadsForSf; colIdx += blockDim.x) {
-            std::optional<int> optionalBatchIdx = batchIdx;
-            std::optional<int> optionalNumRows = numRows;
-
-            // The SF output pointer.
-            auto sf_out = cvt_quant_get_sf_out_offset<uint32_t, CVT_NUM_THREADS_PER_SF>(
-                optionalBatchIdx, rowIdx, colIdx, optionalNumRows, numPaddedCols / SF_VEC_SIZE, SFout,
-                layout);
-
+        // Normal path: process data
         // ... rest of data processing ...
+      }
+    }
+  }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- csrc/nv_internal/cpp/kernels/quantization.cu (3 hunks)
- csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
csrc/nv_internal/cpp/kernels/quantization.cu (1)
csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/tile_interleaved_layout.h (1)
`layout` (29-47)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
🔇 Additional comments (2)
csrc/nv_internal/cpp/kernels/quantization.cu (1)
88-102: Verify grid dimension calculation is correct.

The grid dimension calculation appears to apply `std::min` twice with the same upper bound `multiProcessorCount * numBlocksPerSM`, which is redundant. Line 102 applies `std::min(effectiveRows, multiProcessorCount * numBlocksPerSM)`, but `effectiveRows` is already capped at `multiProcessorCount * numBlocksPerSM` on line 99, so the second `std::min` is redundant.

Apply this diff to simplify:

```diff
- dim3 grid(std::min(effectiveRows, multiProcessorCount * numBlocksPerSM));
+ dim3 grid(effectiveRows);
```

The same issue exists at lines 208 and 242 for the other two functions.
Likely an incorrect or invalid review comment.
csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh (1)
781-788: Grid configuration correctly ensures padding rows are assigned to blocks.

The optimization has been properly implemented. For swizzled layouts, `effectiveRows` is set to `std::min(PadUpFn(numRows, rowTile), SM_limit)` in the host code, which expands the grid to include padded rows. Since both the host and the kernel use the same `PadUpFn` macro (defined as `((X + Y - 1) / (Y) * (Y))`), the kernel's `numPaddedRowsForSf` computation matches the grid sizing. This ensures `gridDim.x >= numRows` when m is not divisible by `rowTile`, allowing blocks to reach indices where `rowIdx >= numRows` and trigger the padding-only optimization path. The original concern about blocks not reaching padding rows has been addressed.
/bot stop
The GitLab CI pipeline #37820809 has been cancelled.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- csrc/nv_internal/cpp/kernels/quantization.cu (4 hunks)
- csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
csrc/nv_internal/cpp/kernels/quantization.cu (1)
csrc/trtllm_gemm_runner.cu (8)
`m` (111-126), `m` (111-111), `m` (128-179), `m` (128-130), `m` (181-236), `m` (181-181), `m` (238-250), `m` (238-238)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
Actionable comments posted: 1
🧹 Nitpick comments (1)
csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh (1)
783-784: Clarify the optimization description.

The comment states "Iterate over actual rows first (hot path), then padding rows (cold path)", but the loop at line 785 iterates sequentially from `0` to `numPaddedRowsForSf`. The optimization is actually an early-exit fast path for padding rows (via the `isRowPadding` check), not a reordering of iteration.

Consider revising the comment to:
```diff
- // Optimization: Iterate over actual rows first (hot path), then padding rows (cold path)
- // This improves performance for small batch sizes with swizzled layout
+ // Optimization: Fast-path early exit for padding rows to skip input loading and quantization
+ // This improves performance for small batches with swizzled layout
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
🔇 Additional comments (1)
csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh (1)
829-861: Column padding logic is correct.

The nested conditionals correctly handle three ranges:

- `colIdx < numColThreads`: process actual data
- `numColThreads <= colIdx < numPaddedColThreads`: zero both the quantized output and the SF
- `colIdx >= numPaddedColThreads`: zero the SF only (extra padding for the swizzled SF layout)

Memory accesses are correctly bounded: `outOffset` is only used when `colIdx < numPaddedColThreads`.
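Those three ranges can be restated as a standalone sketch; the enum and function below are illustrative, not part of the kernel:

```cpp
enum class ColRegion { Data, OutputPadding, SfOnlyPadding };

// Mirrors the bounds from the review comment above.
ColRegion classifyColumn(int colIdx, int numColThreads, int numPaddedColThreads) {
  if (colIdx < numColThreads) return ColRegion::Data;  // quantize real input
  if (colIdx < numPaddedColThreads) return ColRegion::OutputPadding;  // zero output + SF
  return ColRegion::SfOnlyPadding;  // zero SF only (swizzled-layout extra padding)
}
```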
/bot run
Actionable comments posted: 0
🧹 Nitpick comments (1)
csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh (1)
829-861: Consider restructuring column handling for clarity and a minor efficiency gain.

The current logic is correct but executes lines 829-837 and 840-844 for overlapping column ranges. Threads with `colIdx` in `[numColThreads, numPaddedColThreads)` zero both the quantized output (lines 829-837) and the SF output (lines 842-843), while threads in `[numPaddedColThreads, numColThreadsForSf)` only zero the SF output.

Consider restructuring as non-overlapping branches:
```diff
-      // Set the values to 0 of those are padded columns.
-      if (colIdx >= numColThreads && colIdx < numPaddedColThreads) {
-        // Dispatch the quantization kernel.
-        if constexpr (quantization_type == BlockScaleQuantizationType::FP16_TO_FP4) {
-          reinterpret_cast<uint32_t*>(out)[outOffset] = 0u;
-        } else if constexpr (quantization_type == BlockScaleQuantizationType::FP8_TO_FP4 ||
-                             quantization_type == BlockScaleQuantizationType::FP16_TO_MXFP8) {
-          reinterpret_cast<uint64_t*>(out)[outOffset] = 0ull;
-        }
-      }
-
-      // Process actual data or padding
-      if (colIdx >= numColThreads) {
-        // Column padding: Set the SF padding to 0.
+      if (colIdx >= numPaddedColThreads) {
+        // SF-only padding region: zero SF output only
         if (sf_out != nullptr) {
           sf_out[0] = 0x00;
         }
+      } else if (colIdx >= numColThreads) {
+        // Quantized output padding region: zero both quantized output and SF
+        if constexpr (quantization_type == BlockScaleQuantizationType::FP16_TO_FP4) {
+          reinterpret_cast<uint32_t*>(out)[outOffset] = 0u;
+        } else if constexpr (quantization_type == BlockScaleQuantizationType::FP8_TO_FP4 ||
+                             quantization_type == BlockScaleQuantizationType::FP16_TO_MXFP8) {
+          reinterpret_cast<uint64_t*>(out)[outOffset] = 0ull;
+        }
+        if (sf_out != nullptr) {
+          sf_out[0] = 0x00;
+        }
       } else {
-        // Load the input vector.
+        // Actual data region: load input and quantize
         PackedVec in_vec = reinterpret_cast<PackedVec const*>(in)[inOffset];
         // Dispatch the quantization kernel.
```

This makes the three column regions explicit and avoids redundant condition checks.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
🔇 Additional comments (2)
csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh (2)
790-809: Padding row handling looks correct.

The fast path correctly skips quantized output writes for padding rows (which don't exist in the output tensor) and only zeros the scale factor buffer. The SF offset calculation at line 801 correctly uses `numColsForSf / SF_VEC_SIZE`, ensuring proper bounds for swizzled layouts.
810-837: Offset calculations and column padding zeroing are correct.

The data path properly computes input/output offsets using the appropriate column counts (`numColThreads` for input, `numPaddedColThreads` for output), and correctly zeros the quantized output for column padding. The SF offset calculation at line 819 matches the padding path in using `numColsForSf / SF_VEC_SIZE`.
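A hedged sketch of that offset arithmetic, using the names from the review (row-major strides assumed; this is not the exact kernel code):

```cpp
#include <cstdint>

// The input tensor has no column padding, so its rows are numColThreads
// vectors wide; the quantized output includes column padding, so its rows
// are numPaddedColThreads vectors wide.
int64_t inputOffset(int rowIdx, int colIdx, int numColThreads) {
  return static_cast<int64_t>(rowIdx) * numColThreads + colIdx;
}
int64_t outputOffset(int rowIdx, int colIdx, int numPaddedColThreads) {
  return static_cast<int64_t>(rowIdx) * numPaddedColThreads + colIdx;
}
```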
[FAILED] Pipeline #37823595: 12/17 passed
yzh119 left a comment:
Impressive speedup and the separation of hot path and cold path looks reasonable to me, thanks for this effort!
The failed gb200 unit test is not relevant.
/bot run
/bot stop
The GitLab CI pipeline #37898618 has been cancelled.
/bot run
[SUCCESS] Pipeline #37898689: 13/17 passed
📌 Description
Performance optimization for the `fp4_quantize()` function. The performance issue was raised in issues #1734 and #2021.

The observed behavior was slow performance when `is_sf_swizzled_layout=True` (as opposed to `False`). The root cause: `SWIZZLED_128x4` pads to multiples of 128 rows and `SWIZZLED_8x4` pads to multiples of 8 rows, so:

- For `batch_size=1` with `SWIZZLED_128x4`: 127 out of 128 rows are padding (99.2% wasted work)
- For `batch_size=1`: only 1 block was launched

The fix:

1. Kernel-level early-exit fast path (`quantization.cuh`): added a branch-divergence optimization with separate handling for padding rows vs. data rows.
2. Host-level parallel grid launch (`quantization.cu`): modified the grid calculation to launch blocks proportional to padded rows instead of actual rows; see the sketch after this list.

`fp4_quantize()` performance before and after the fix was measured with the `bench_fp4_quantize.py` script (adopted from #1734).
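As referenced above, a host-side sketch of the grid-sizing idea, with illustrative names (`padUp`, `rowTile`, `smCount`, `blocksPerSM`) rather than the exact source:

```cpp
#include <algorithm>

constexpr int padUp(int x, int y) { return (x + y - 1) / y * y; }

// Launch one block per *padded* row so padding rows get their scale factors
// zeroed in parallel, clamped to how many blocks the device can keep resident.
int gridSizeForSwizzled(int numRows, int rowTile, int smCount, int blocksPerSM) {
  int effectiveRows = padUp(numRows, rowTile);  // e.g. 1 -> 128 for SWIZZLED_128x4
  return std::min(effectiveRows, smCount * blocksPerSM);
}
```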
#1734
#2021
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- [ ] I have installed pre-commit by running `pip install pre-commit` (or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests
- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

Reviewer Notes
Summary by CodeRabbit