
fix: DeepSeek activation uninitialized data #2128

Merged
yzh119 merged 1 commit into flashinfer-ai:main from nekorobov:nkorobov/ds-activation-fix on Nov 22, 2025

Conversation

@nekorobov (Collaborator) commented Nov 21, 2025

📌 Description

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

Bug Fixes

  • Initialized per-token accumulator state in the DeepSeek activation kernel so stale register data cannot leak into inference results.

Performance

  • Centralized the thread-block configuration behind a named constant, keeping the kernel launch and the block reduction consistent.


@coderabbitai bot (Contributor) commented Nov 21, 2025

Walkthrough

Modified the DeepSeek activation kernel in TensorRT-LLM's fused MOE implementation by extracting a constant for thread configuration and adding per-token state initialization loops to ensure clean array states before accumulation across tokens.

Changes

Cohort / File(s) Summary
DeepSeek Activation Kernel Refactoring
csrc/trtllm_fused_moe_dev_kernel.cu
Introduced public constant DEEP_SEEK_ACTIVATION_NUM_THREADS_PER_CTA (128) to replace hard-coded value; updated kernel launch to use this constant; added per-token-CTA zero-initialization loops for local arrays (scale1Arr, scale2Arr, dataX1Arr, dataX2Arr, outArr, absOutArr) to guarantee clean state before per-token accumulation.
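The constant-extraction half of the change can be illustrated with a host-side C++ sketch. This is a simplified analog, not the kernel's actual code: kNumThreadsPerCta and blockReduceSum are illustrative stand-ins for DEEP_SEEK_ACTIVATION_NUM_THREADS_PER_CTA and cub::BlockReduce. The point is that one constexpr feeds both the reduction's template parameter and the launch size, so the two cannot drift apart the way two separately hard-coded 128s can.

```cpp
#include <array>

// Illustrative stand-in for DEEP_SEEK_ACTIVATION_NUM_THREADS_PER_CTA: one
// constexpr is shared by the reduction template parameter and the launch
// configuration, so the two cannot silently diverge.
constexpr int kNumThreadsPerCta = 128;

// Toy analog of cub::BlockReduce<float, NumThreads>: the thread count is a
// template parameter, fixed at compile time.
template <int NumThreads>
float blockReduceSum(const std::array<float, NumThreads>& perThread) {
  float sum = 0.0f;
  for (float v : perThread) sum += v;
  return sum;
}

// A kernel launch site would reuse the same constant, e.g.
//   kernel<<<numBlocks, kNumThreadsPerCta>>>(...);
// so the BlockReduce instantiation and the grid configuration always agree.
```

Summing an array of 128 ones through blockReduceSum<kNumThreadsPerCta> yields 128.0; editing the constant in one place updates the template instantiation and the launch together.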

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Areas requiring attention:
    • Correctness of per-token initialization: verify all accumulator arrays are properly cleared before each token's processing
    • CUDA kernel synchronization semantics: confirm initialization loops don't introduce unintended performance bottlenecks or race conditions
    • Thread block configuration: validate that DEEP_SEEK_ACTIVATION_NUM_THREADS_PER_CTA (128) is correctly applied to BlockReduce template instantiation

Poem

🐰 A kernel once lost in the weeds,
Now zeros its slates for each token's deed,
With constants extracted so clean,
The deepest of seeks now stays serene,
State initialized—no ghosts remain! ✨

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Description check ⚠️ Warning — The PR description uses the correct template structure but leaves the Description and Related Issues sections empty, providing no actual context about the changes. Resolution: fill in the Description section with details about the uninitialized data bug and why it needed fixing, and link any related issues in the Related Issues section.
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
  • Title check ✅ Passed — The title clearly summarizes the main fix addressing uninitialized data in DeepSeek activation, directly matching the code changes.

📜 Recent review details


📥 Commits

Reviewing files that changed from the base of the PR and between 0aee7af and d436b46.

📒 Files selected for processing (1)
  • csrc/trtllm_fused_moe_dev_kernel.cu (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Deploy Docs
🔇 Additional comments (4)
csrc/trtllm_fused_moe_dev_kernel.cu (4)

199-199: LGTM: Good constant extraction for maintainability.

Extracting the thread count as a named constant improves code clarity and ensures consistency across the kernel definition and launch site.


208-208: LGTM: Consistent use of thread count constant.

Using the constant here ensures the BlockReduce configuration matches the kernel launch parameters, preventing subtle bugs from mismatched thread counts.


240-248: Critical fix: Ensures arrays are initialized before use.

This initialization loop addresses the uninitialized data bug described in the PR title. Previously, when permutedIdx == -1 (line 259) or when tokenIdx >= params.numTokens (line 252), array elements would remain uninitialized but still be used in subsequent computations (lines 278-285) and reductions (line 288). Initializing to 0.0f provides defined behavior and ensures these elements don't contribute garbage values to the final result.
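The hazard described here can be mirrored in a small host-side C++ sketch. This is a simplified analog, not the kernel itself; activationRowSum, kNumTokensPerCta, and the shapes involved are illustrative. The essential structure is the same: a guarded write may skip a slot, and the reduction that follows reads every slot, so each slot must be zeroed first to have defined contents.

```cpp
#include <vector>

// Sketch of the fixed control flow: zero every per-token slot, then do the
// guarded writes, then reduce over all slots. Without the zeroing pass,
// slots whose guard fails (the permutedIdx == -1 case) hold indeterminate
// values that flow straight into the reduction.
float activationRowSum(const std::vector<int>& permutedIdx,
                       const std::vector<float>& data) {
  constexpr int kNumTokensPerCta = 4;
  float outArr[kNumTokensPerCta];

  // The fix: establish a defined state before any conditional accumulation.
  for (int i = 0; i < kNumTokensPerCta; ++i) outArr[i] = 0.0f;

  // Guarded writes: skipped tokens leave their slot untouched.
  for (int i = 0; i < kNumTokensPerCta; ++i) {
    if (permutedIdx[i] != -1) outArr[i] = data[permutedIdx[i]];
  }

  // The reduction reads every slot, including the skipped ones.
  float sum = 0.0f;
  for (float v : outArr) sum += v;
  return sum;
}
```

With permutedIdx = {0, -1, 1, -1} and data = {2, 3}, the skipped slots contribute 0 and the sum is 5; without the zeroing loop, reading the skipped slots would be undefined behavior.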


368-369: LGTM: Consistent kernel launch configuration.

Using the constant here ensures the kernel is launched with the thread count that matches the BlockReduce template configuration (line 208), maintaining consistency across the implementation.


@gemini-code-assist bot (Contributor) commented:

Summary of Changes

Hello @nekorobov, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a bug in the DeepSeek activation kernel where certain local arrays were used without prior initialization, which could result in erroneous calculations. The fix involves explicitly initializing these arrays to zero. Additionally, a minor refactoring was performed to define the kernel's thread block size using a named constant for better code readability and consistency.

Highlights

  • Fix Uninitialized Data: Addressed a critical bug where local arrays within the activationDeepSeekKernel were not explicitly initialized, leading to potential undefined behavior and incorrect computations. These arrays are now set to 0.0f before use.
  • Refactor Thread Count Constant: Introduced a constexpr variable, DEEP_SEEK_ACTIVATION_NUM_THREADS_PER_CTA, to centralize the definition of the number of threads per CTA for the DeepSeek activation kernel, improving code clarity and maintainability.

@gemini-code-assist bot left a comment
Code Review

This pull request fixes a critical bug where register arrays in the activationDeepSeekKernel were used without being initialized, leading to incorrect computations. The fix correctly initializes these arrays to zero at the beginning of each hiddenIdx loop iteration. The changes also improve code maintainability by replacing a magic number for the number of threads with a named constexpr constant. The fix is correct and necessary. I've added one suggestion to further optimize the initialization by removing redundant assignments.

Comment on lines +241 to +248
        for (int tokenInCtaIdx = 0; tokenInCtaIdx < NumTokensPerCta; tokenInCtaIdx++) {
          scale1Arr[tokenInCtaIdx] = 0.0f;
          scale2Arr[tokenInCtaIdx] = 0.0f;
          dataX1Arr[tokenInCtaIdx] = 0.0f;
          dataX2Arr[tokenInCtaIdx] = 0.0f;
          outArr[tokenInCtaIdx] = 0.0f;
          absOutArr[tokenInCtaIdx] = 0.0f;
        }
gemini-code-assist bot commented (severity: medium):

The arrays outArr and absOutArr are unconditionally written to in the subsequent loop (lines 278-285) before being read. Therefore, initializing them to zero here is redundant and can be removed for a minor performance improvement.

        for (int tokenInCtaIdx = 0; tokenInCtaIdx < NumTokensPerCta; tokenInCtaIdx++) {
          scale1Arr[tokenInCtaIdx] = 0.0f;
          scale2Arr[tokenInCtaIdx] = 0.0f;
          dataX1Arr[tokenInCtaIdx] = 0.0f;
          dataX2Arr[tokenInCtaIdx] = 0.0f;
        }
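The distinction behind this suggestion, write-before-read versus conditional write, can be sketched in a host-side C++ toy (the names and shapes below are illustrative, not the kernel's code): a variable that is unconditionally assigned before any read needs no prior zeroing, while a variable that is only conditionally assigned genuinely does.

```cpp
// Toy analog: 'out' is written unconditionally before it is read, so
// zero-initializing it first would be a dead store (which the compiler may
// elide anyway); 'scale' is written only under a guard, so it genuinely
// needs the up-front zero.
float process(float x, bool hasScale, float s) {
  float scale = 0.0f;        // required: the write below is conditional
  if (hasScale) scale = s;   // guarded write, may be skipped

  float out;                 // no init needed...
  out = x * (1.0f + scale);  // ...because this write always happens
  return out;                // first read occurs only after the write
}
```

For example, process(2.0f, false, 9.0f) returns 2.0 and process(2.0f, true, 1.0f) returns 4.0, with fully defined behavior in both paths.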

A collaborator replied:
@nekorobov do you think gemini's suggestion is reasonable?

@yzh119 (Collaborator) commented Nov 22, 2025:

/bot run

@flashinfer-bot (Collaborator) commented:
GitLab MR !157 has been created, and the CI pipeline #38984466 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot (Collaborator) commented:
[FAILED] Pipeline #38984466: 14/18 passed

@yzh119 yzh119 merged commit cf2df82 into flashinfer-ai:main Nov 22, 2025
4 checks passed
BingooYang pushed a commit to BingooYang/flashinfer that referenced this pull request Mar 13, 2026

3 participants