
fix: DeepSeek activation uninitialized data #2128

Merged
yzh119 merged 1 commit into flashinfer-ai:main from nekorobov:nkorobov/ds-activation-fix on Nov 22, 2025

Conversation

@nekorobov (Collaborator) commented Nov 21, 2025

📌 Description

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

Bug Fixes

  • Initialized per-token accumulator state in the DeepSeek activation kernel so stale register data cannot leak into inference results.

Performance

  • Centralized the thread-block configuration behind a named constant, keeping the kernel launch and the block reduction consistent.


@coderabbitai bot (Contributor) commented Nov 21, 2025

Walkthrough

Modified the DeepSeek activation kernel in TensorRT-LLM's fused MOE implementation by extracting a constant for thread configuration and adding per-token state initialization loops to ensure clean array states before accumulation across tokens.

Changes

Cohort / File(s) Summary
DeepSeek Activation Kernel Refactoring
csrc/trtllm_fused_moe_dev_kernel.cu
Introduced public constant DEEP_SEEK_ACTIVATION_NUM_THREADS_PER_CTA (128) to replace hard-coded value; updated kernel launch to use this constant; added per-token-CTA zero-initialization loops for local arrays (scale1Arr, scale2Arr, dataX1Arr, dataX2Arr, outArr, absOutArr) to guarantee clean state before per-token accumulation.
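The constant-extraction half of the change can be illustrated with a host-side C++ sketch. This is a simplified analog, not the kernel's actual code: kNumThreadsPerCta and blockReduceSum are illustrative stand-ins for DEEP_SEEK_ACTIVATION_NUM_THREADS_PER_CTA and cub::BlockReduce. The point is that one constexpr feeds both the reduction's template parameter and the launch size, so the two cannot drift apart the way two separately hard-coded 128s can.

```cpp
#include <array>

// Illustrative stand-in for DEEP_SEEK_ACTIVATION_NUM_THREADS_PER_CTA: one
// constexpr is shared by the reduction template parameter and the launch
// configuration, so the two cannot silently diverge.
constexpr int kNumThreadsPerCta = 128;

// Toy analog of cub::BlockReduce<float, NumThreads>: the thread count is a
// template parameter, fixed at compile time.
template <int NumThreads>
float blockReduceSum(const std::array<float, NumThreads>& perThread) {
  float sum = 0.0f;
  for (float v : perThread) sum += v;
  return sum;
}

// A kernel launch site would reuse the same constant, e.g.
//   kernel<<<numBlocks, kNumThreadsPerCta>>>(...);
// so the BlockReduce instantiation and the grid configuration always agree.
```

Summing an array of 128 ones through blockReduceSum<kNumThreadsPerCta> yields 128.0; editing the constant in one place updates the template instantiation and the launch together.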

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Areas requiring attention:
    • Correctness of per-token initialization: verify all accumulator arrays are properly cleared before each token's processing
    • CUDA kernel synchronization semantics: confirm initialization loops don't introduce unintended performance bottlenecks or race conditions
    • Thread block configuration: validate that DEEP_SEEK_ACTIVATION_NUM_THREADS_PER_CTA (128) is correctly applied to BlockReduce template instantiation

Poem

🐰 A kernel once lost in the weeds,
Now zeros its slates for each token's deed,
With constants extracted so clean,
The deepest of seeks now stays serene,
State initialized—no ghosts remain! ✨

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Description check ⚠️ Warning — The PR description uses the correct template structure but leaves the Description and Related Issues sections empty, providing no actual context about the changes. Resolution: fill in the Description section with details about the uninitialized data bug and why it needed fixing, and link any related issues in the Related Issues section.
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
  • Title check ✅ Passed — The title clearly summarizes the main fix addressing uninitialized data in DeepSeek activation, directly matching the code changes.

📜 Recent review details


📥 Commits

Reviewing files that changed from the base of the PR and between 0aee7af and d436b46.

📒 Files selected for processing (1)
  • csrc/trtllm_fused_moe_dev_kernel.cu (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Deploy Docs
🔇 Additional comments (4)
csrc/trtllm_fused_moe_dev_kernel.cu (4)

199-199: LGTM: Good constant extraction for maintainability.

Extracting the thread count as a named constant improves code clarity and ensures consistency across the kernel definition and launch site.


208-208: LGTM: Consistent use of thread count constant.

Using the constant here ensures the BlockReduce configuration matches the kernel launch parameters, preventing subtle bugs from mismatched thread counts.


240-248: Critical fix: Ensures arrays are initialized before use.

This initialization loop addresses the uninitialized data bug described in the PR title. Previously, when permutedIdx == -1 (line 259) or when tokenIdx >= params.numTokens (line 252), array elements would remain uninitialized but still be used in subsequent computations (lines 278-285) and reductions (line 288). Initializing to 0.0f provides defined behavior and ensures these elements don't contribute garbage values to the final result.
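The hazard described here can be mirrored in a small host-side C++ sketch. This is a simplified analog, not the kernel itself; activationRowSum, kNumTokensPerCta, and the shapes involved are illustrative. The essential structure is the same: a guarded write may skip a slot, and the reduction that follows reads every slot, so each slot must be zeroed first to have defined contents.

```cpp
#include <vector>

// Sketch of the fixed control flow: zero every per-token slot, then do the
// guarded writes, then reduce over all slots. Without the zeroing pass,
// slots whose guard fails (the permutedIdx == -1 case) hold indeterminate
// values that flow straight into the reduction.
float activationRowSum(const std::vector<int>& permutedIdx,
                       const std::vector<float>& data) {
  constexpr int kNumTokensPerCta = 4;
  float outArr[kNumTokensPerCta];

  // The fix: establish a defined state before any conditional accumulation.
  for (int i = 0; i < kNumTokensPerCta; ++i) outArr[i] = 0.0f;

  // Guarded writes: skipped tokens leave their slot untouched.
  for (int i = 0; i < kNumTokensPerCta; ++i) {
    if (permutedIdx[i] != -1) outArr[i] = data[permutedIdx[i]];
  }

  // The reduction reads every slot, including the skipped ones.
  float sum = 0.0f;
  for (float v : outArr) sum += v;
  return sum;
}
```

With permutedIdx = {0, -1, 1, -1} and data = {2, 3}, the skipped slots contribute 0 and the sum is 5; without the zeroing loop, reading the skipped slots would be undefined behavior.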


368-369: LGTM: Consistent kernel launch configuration.

Using the constant here ensures the kernel is launched with the thread count that matches the BlockReduce template configuration (line 208), maintaining consistency across the implementation.


@gemini-code-assist bot (Contributor) commented:

Summary of Changes

Hello @nekorobov, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a bug in the DeepSeek activation kernel where certain local arrays were used without prior initialization, which could result in erroneous calculations. The fix involves explicitly initializing these arrays to zero. Additionally, a minor refactoring was performed to define the kernel's thread block size using a named constant for better code readability and consistency.

Highlights

  • Fix Uninitialized Data: Addressed a critical bug where local arrays within the activationDeepSeekKernel were not explicitly initialized, leading to potential undefined behavior and incorrect computations. These arrays are now set to 0.0f before use.
  • Refactor Thread Count Constant: Introduced a constexpr variable, DEEP_SEEK_ACTIVATION_NUM_THREADS_PER_CTA, to centralize the definition of the number of threads per CTA for the DeepSeek activation kernel, improving code clarity and maintainability.

@gemini-code-assist bot left a comment
Code Review

This pull request fixes a critical bug where register arrays in the activationDeepSeekKernel were used without being initialized, leading to incorrect computations. The fix correctly initializes these arrays to zero at the beginning of each hiddenIdx loop iteration. The changes also improve code maintainability by replacing a magic number for the number of threads with a named constexpr constant. The fix is correct and necessary. I've added one suggestion to further optimize the initialization by removing redundant assignments.

Comment on lines +241 to +248
        for (int tokenInCtaIdx = 0; tokenInCtaIdx < NumTokensPerCta; tokenInCtaIdx++) {
          scale1Arr[tokenInCtaIdx] = 0.0f;
          scale2Arr[tokenInCtaIdx] = 0.0f;
          dataX1Arr[tokenInCtaIdx] = 0.0f;
          dataX2Arr[tokenInCtaIdx] = 0.0f;
          outArr[tokenInCtaIdx] = 0.0f;
          absOutArr[tokenInCtaIdx] = 0.0f;
        }
gemini-code-assist bot commented (severity: medium):

The arrays outArr and absOutArr are unconditionally written to in the subsequent loop (lines 278-285) before being read. Therefore, initializing them to zero here is redundant and can be removed for a minor performance improvement.

        for (int tokenInCtaIdx = 0; tokenInCtaIdx < NumTokensPerCta; tokenInCtaIdx++) {
          scale1Arr[tokenInCtaIdx] = 0.0f;
          scale2Arr[tokenInCtaIdx] = 0.0f;
          dataX1Arr[tokenInCtaIdx] = 0.0f;
          dataX2Arr[tokenInCtaIdx] = 0.0f;
        }
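The distinction behind this suggestion, write-before-read versus conditional write, can be sketched in a host-side C++ toy (the names and shapes below are illustrative, not the kernel's code): a variable that is unconditionally assigned before any read needs no prior zeroing, while a variable that is only conditionally assigned genuinely does.

```cpp
// Toy analog: 'out' is written unconditionally before it is read, so
// zero-initializing it first would be a dead store (which the compiler may
// elide anyway); 'scale' is written only under a guard, so it genuinely
// needs the up-front zero.
float process(float x, bool hasScale, float s) {
  float scale = 0.0f;        // required: the write below is conditional
  if (hasScale) scale = s;   // guarded write, may be skipped

  float out;                 // no init needed...
  out = x * (1.0f + scale);  // ...because this write always happens
  return out;                // first read occurs only after the write
}
```

For example, process(2.0f, false, 9.0f) returns 2.0 and process(2.0f, true, 1.0f) returns 4.0, with fully defined behavior in both paths.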

A collaborator replied:
@nekorobov do you think gemini's suggestion is reasonable?

@yzh119 (Collaborator) commented Nov 22, 2025:

/bot run

@flashinfer-bot (Collaborator) commented:
GitLab MR !157 has been created, and the CI pipeline #38984466 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot (Collaborator) commented:
[FAILED] Pipeline #38984466: 14/18 passed

@yzh119 yzh119 merged commit cf2df82 into flashinfer-ai:main Nov 22, 2025
4 checks passed
BingooYang pushed a commit to BingooYang/flashinfer that referenced this pull request Mar 13, 2026

3 participants