
kimi k2 recipe intro #2097

Merged
malay-nagda merged 12 commits into main from malay/kimi_k2_init
Feb 11, 2026

Conversation

@malay-nagda
Contributor

@malay-nagda malay-nagda commented Jan 28, 2026

What does this PR do ?

Add Kimi-K2 performance recipes.

Also adds updates to:

  • accommodate a different optimizer
  • PP layout when VP=None
  • PP layout for user-provided PP-VP overrides

Changelog

__all__ = [
    "KIMI_K2_PRETRAIN_CONFIG_B200_BF16",
    "KIMI_K2_PRETRAIN_CONFIG_B200_FP8_CS",
    "KIMI_K2_PRETRAIN_CONFIG_B200_FP8_MX",
    "KIMI_K2_PRETRAIN_CONFIG_GB200_BF16",
    "KIMI_K2_PRETRAIN_CONFIG_GB200_FP8_CS",
    "KIMI_K2_PRETRAIN_CONFIG_GB200_FP8_MX",
    "KIMI_K2_PRETRAIN_CONFIG_GB300_BF16",
    "KIMI_K2_PRETRAIN_CONFIG_GB300_FP8_CS",
    "KIMI_K2_PRETRAIN_CONFIG_GB300_FP8_MX",
    "KIMI_K2_PRETRAIN_CONFIG_GB300_NVFP4",
    "KIMI_K2_PRETRAIN_CONFIG_H100_BF16",
    "KIMI_K2_PRETRAIN_CONFIG_H100_FP8_CS",
    "KIMI_K2_PRETRAIN_CONFIG_H100_FP8_SC",
]
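The exported names follow a `KIMI_K2_PRETRAIN_CONFIG_<GPU>_<PRECISION>` pattern, so a caller can resolve a preset dynamically by building the name from a GPU and precision string. A minimal sketch of that lookup convention; the `PRESETS` registry and its dict values below are illustrative placeholders, not the real `WorkloadBaseConfig` objects exported by the module:

```python
# Sketch: resolving a preset by its conventional name.
# The registry values here are placeholders standing in for the
# WorkloadBaseConfig objects the real module exports.
PRESETS = {
    "KIMI_K2_PRETRAIN_CONFIG_GB200_BF16": {"gpu": "gb200", "dtype": "bf16"},
    "KIMI_K2_PRETRAIN_CONFIG_H100_FP8_CS": {"gpu": "h100", "dtype": "fp8_cs"},
}


def resolve_preset(gpu: str, precision: str) -> dict:
    """Build the conventional preset name and look it up."""
    name = f"KIMI_K2_PRETRAIN_CONFIG_{gpu.upper()}_{precision.upper()}"
    try:
        return PRESETS[name]
    except KeyError:
        raise ValueError(f"No preset for gpu={gpu!r}, precision={precision!r}")
```

Note that the H100 export ends in `_FP8_SC` rather than `_FP8_CS`; a name-based lookup like this would surface that inconsistency immediately (it is flagged in the review below).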

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • New Features

    • Added comprehensive pretraining configuration support for the KIMI-K2 model across GB300, GB200, B200, and H100 GPU types
    • Support for multiple precision formats including BF16, FP8_CS, FP8_MX, and NVFP4
  • Bug Fixes

    • Improved pipeline layout computation and error handling for specific configurations
    • Refined optimizer precision handling for specific optimizer and precision combinations

Signed-off-by: Malay Nagda <malayn@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Jan 28, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Signed-off-by: Malay Nagda <malayn@nvidia.com>
Signed-off-by: malay-nagda <malayn@nvidia.com>
@dingqingy-nv dingqingy-nv added the r0.3.0 label (Cherry-pick label for r0.3.0 release branch) Feb 3, 2026
@malay-nagda malay-nagda changed the base branch from main to r0.3.0 February 5, 2026 10:00
@malay-nagda malay-nagda changed the base branch from r0.3.0 to main February 5, 2026 10:01
Signed-off-by: Malay Nagda <malayn@nvidia.com>
Signed-off-by: malay-nagda <malayn@nvidia.com>
Signed-off-by: Malay Nagda <malayn@nvidia.com>
Signed-off-by: Malay Nagda <malayn@nvidia.com>
@coderabbitai
Contributor

coderabbitai bot commented Feb 9, 2026

📝 Walkthrough

Walkthrough

Introduces Kimi-K2 pretraining configuration infrastructure across multiple modules: base workload configuration presets for GB300, GB200, B200, and H100 GPUs with multiple precision variants, factory functions to assemble GPU-specific pretrain configurations, conditional exports through the package interface, and updates to override and pipeline layout handling logic.

Changes

  • Kimi-K2 Workload Configuration Presets (scripts/performance/configs/kimi/kimi_workload_base_configs.py): Defines BASE_KIMI_K2_CONFIG and four GPU-variant base configs (GB300, GB200, B200, H100) with multiple precision aliases (BF16, FP8_CS, FP8_MX, NVFP4). Exports all presets via the __all__ list for public consumption.
  • Kimi-K2 Pretrain Factory Functions (scripts/performance/configs/kimi/kimi_llm_pretrain.py): Adds a set_kimi_k2_common_configs() helper and four GPU-specific factory functions (kimi_k2_pretrain_config_gb300, _gb200, _b200, _h100) that assemble base configs, apply common settings, configure precision, enable communication overlap, and set dataset parameters.
  • Package Initialization & Exports (scripts/performance/configs/kimi/__init__.py): Conditionally imports megatron.bridge; defines __all__ to export preset constants unconditionally and augments it with factory function aliases when megatron.bridge is available.
  • Override & Layout Handling (scripts/performance/utils/overrides.py, src/megatron/bridge/recipes/kimi/kimi_k2.py): Updates override logic to compute the Kimi-K2 pipeline layout with error handling, restricts bf16 optimizer precision to "adam" only, and guards comm overlap settings against the "dist_muon" optimizer. Normalizes vp_size=None to 1 in the layout lookup.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Run CICD

Suggested reviewers

  • erhoo82
🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed

❌ Failed checks (1 warning, 1 inconclusive)

  • Test Results For Major Changes (⚠️ Warning): PR introduces 366 lines of new Kimi K2 recipe code with core override logic changes but lacks concrete testing evidence, validation results, and performance benchmarks in the description. Resolution: complete the PR description with test results, regression tests, and performance benchmarks, and address the identified review issues (invalid PP/VP combinations, missing headers) with full test suite evidence.
  • Title check (❓ Inconclusive): The title is vague and generic, using non-descriptive terms that don't clearly convey the scope of changes across multiple files and new configuration systems. Resolution: consider a more descriptive title such as "Add KIMI K2 pretraining configurations and recipe support" to better reflect the multiple new modules, workload configs, and pipeline layout logic being introduced.

✅ Passed checks (2 passed)

  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@scripts/performance/configs/kimi/__init__.py`:
- Around line 1-6: Add the required NVIDIA copyright header to the top of this
Python module (before any imports or code). Update the file that defines
HAVE_MEGATRON_BRIDGE and imports megatron.bridge so the header precedes the
try/except block and remains intact; do not remove or modify the existing logic
that sets HAVE_MEGATRON_BRIDGE or the import of megatron.bridge.

In `@scripts/performance/configs/kimi/kimi_workload_base_configs.py`:
- Line 91: The constant name KIMI_K2_PRETRAIN_CONFIG_H100_FP8_SC appears
inconsistent with the FP8_CS suffix used for other GPUs (e.g.,
GB300/GB200/B200); if "SC" is a typo, rename KIMI_K2_PRETRAIN_CONFIG_H100_FP8_SC
to KIMI_K2_PRETRAIN_CONFIG_H100_FP8_CS to match the convention and update any
references; if "SC" is an intentional different precision variant, add a
clarifying comment next to the definition of KIMI_K2_PRETRAIN_CONFIG_H100_FP8_SC
explaining how SC differs from CS and ensure any related tests/config uses the
correct symbol.
- Around line 78-91: KIMI_K2_PRETRAIN_CONFIG_H100 sets
pipeline_model_parallel_size=16 and virtual_pipeline_model_parallel_size=2 which
has no corresponding entry in _get_kimi_k2_pipeline_layout (called during
kimi_k2_pretrain_config_h100 construction) and will raise ValueError; fix by
either adding a (16, 2) layout entry to the layout map inside
_get_kimi_k2_pipeline_layout in kimi_k2.py or change
KIMI_K2_PRETRAIN_CONFIG_H100 to use a valid VP (e.g., set
virtual_pipeline_model_parallel_size=1 to match the existing (16,1) layout) and
update the related aliases (KIMI_K2_PRETRAIN_CONFIG_H100_BF16 / _FP8_CS /
_FP8_SC) if needed.

In `@scripts/performance/utils/overrides.py`:
- Around line 398-409: When model_recipe_name == "kimi_k2" we must avoid passing
sentinel pp_size (None) or vp_size (-1) into _get_kimi_k2_pipeline_layout;
instead derive effective sizes from the recipe's current layout: read
current_layout = recipe.model.pipeline_model_parallel_layout and compute
effective_pp = pp_size if pp_size is not None else (current_layout[0] if
current_layout else 1) and effective_vp = vp_size if vp_size != -1 else
(current_layout[1] if current_layout else 1), then call
_get_kimi_k2_pipeline_layout(effective_pp, effective_vp) and assign
recipe.model.pipeline_model_parallel_layout accordingly (keeping the existing
exception handling and preserving any explicit pipeline_model_parallel_layout
override afterwards).
🧹 Nitpick comments (10)
src/megatron/bridge/recipes/kimi/kimi_k2.py (3)

25-25: Type hint does not reflect that vp_size can be None.

Line 37 explicitly handles vp_size is None, but the signature declares vp_size: int. Update the type hint per coding guidelines (int | None instead of Optional[int]). As per coding guidelines, "Use 'T | None' for nullable types instead of 'Optional[T]'".

Proposed fix
-def _get_kimi_k2_pipeline_layout(pp_size: int, vp_size: int):
+def _get_kimi_k2_pipeline_layout(pp_size: int, vp_size: int | None):

44-45: Redundant outer list() call.

[list(x) for x in layout] already produces a list; wrapping it in list(...) is unnecessary.

Proposed fix
-        layout = list([list(x) for x in layout])
+        layout = [list(x) for x in layout]

88-88: Line likely exceeds 119-character limit.

The inline comment on cfg.dataset.blend makes this line very long. Consider moving the usage example into the docstring or a standalone comment above the assignment. As per coding guidelines, "Maximum line length is 119 characters (matching ruff configuration)".

scripts/performance/utils/overrides.py (1)

22-22: Importing a private (_-prefixed) function across package boundaries.

_get_kimi_k2_pipeline_layout is conventionally private to its module. If it's intended to be used by the performance scripts, consider either making it public (removing the _ prefix) or providing a public wrapper. This isn't blocking, just a convention note.

scripts/performance/configs/kimi/kimi_llm_pretrain.py (4)

15-23: Import ordering violates coding guidelines.

Per guidelines, first-party imports (megatron.bridge.*) should come before local folder imports (utils.*). Swap the groups.

Proposed fix
 import logging
 
-from utils.overrides import set_workload_base_configs
-from utils.precision import get_precision_config
-from utils.utils import get_workload_base_config
-
 from megatron.bridge.recipes.kimi.kimi_k2 import _get_kimi_k2_pipeline_layout
 from megatron.bridge.recipes.kimi.kimi_k2 import kimi_k2_pretrain_config as pretrain_config
 from megatron.bridge.training.config import ConfigContainer
+
+from utils.overrides import set_workload_base_configs
+from utils.precision import get_precision_config
+from utils.utils import get_workload_base_config

As per coding guidelines, "Organize imports in order: future imports, standard library, third-party (including megatron.core, torch, transformers), first-party (megatron.bridge.*), local folder imports, separated by blank lines".


45-84: Unused mock parameter in all four factory functions.

The mock parameter is declared but never used (confirmed by static analysis ARG001). Either remove it or use it to conditionally configure mock data. If it's a placeholder for future use, add a brief comment or prefix with _.

Proposed fix (if not needed)
 def kimi_k2_pretrain_config_gb300(
-    precision: str = "bf16", mock: bool = True, config_variant: str = "v1"
+    precision: str = "bf16", config_variant: str = "v1"
 ) -> ConfigContainer:

Also applies to: 87-126, 129-163, 166-201


45-201: Significant code duplication across four GPU config factories.

The four functions are nearly identical — only the GPU name, overlap_grad_reduce, and num_workers/pin_memory differ. Consider extracting a common helper:

Sketch of a refactored approach
def _kimi_k2_pretrain_config(
    gpu: str,
    precision: str = "bf16",
    config_variant: str = "v1",
    overlap_grad_reduce: bool = True,
    num_workers: int | None = None,
    pin_memory: bool | None = None,
) -> ConfigContainer:
    base_cfg = get_workload_base_config(
        model_family_name="kimi",
        model_recipe_name="kimi_k2",
        gpu=gpu,
        compute_dtype=precision.upper(),
        task="pretrain",
        config_variant=config_variant,
    )
    cfg = pretrain_config()
    cfg.mixed_precision = get_precision_config(precision)

    if base_cfg.moe_flex_dispatcher_backend is not None:
        cfg.model.moe_flex_dispatcher_backend = base_cfg.moe_flex_dispatcher_backend

    if base_cfg.pp_layout:
        cfg.model.pipeline_model_parallel_layout = base_cfg.pp_layout
    else:
        pp_size = base_cfg.pipeline_model_parallel_size
        vp_size = base_cfg.virtual_pipeline_model_parallel_size
        cfg.model.pipeline_model_parallel_layout = _get_kimi_k2_pipeline_layout(pp_size, vp_size)

    set_kimi_k2_common_configs(cfg)
    set_workload_base_configs(cfg, base_cfg)

    cfg.comm_overlap.overlap_grad_reduce = overlap_grad_reduce
    if num_workers is not None:
        cfg.dataset.num_workers = num_workers
    if pin_memory is not None:
        cfg.dataset.pin_memory = pin_memory

    return cfg

29-43: set_kimi_k2_common_configs has no return value despite modifying cfg in place.

The function signature declares -> None and mutates cfg, which is fine. However, some callers (like the factory functions) don't document that common configs override the recipe's grad_reduce_in_fp32=True to False (lines 38-39). This is a meaningful behavioral change from the recipe default. A brief docstring note would help future maintainers understand the intent.

scripts/performance/configs/kimi/kimi_workload_base_configs.py (1)

40-43: Precision aliases are reference aliases, not distinct configs.

All _BF16, _FP8_CS, etc. variants point to the exact same WorkloadBaseConfig object (e.g., KIMI_K2_PRETRAIN_CONFIG_GB300_BF16 = KIMI_K2_PRETRAIN_CONFIG_GB300). Mutating one would mutate all. Since WorkloadBaseConfig is a frozen dataclass (uses replace), this is likely safe, but consider adding a brief comment explaining that precision is applied at a different layer.

Also applies to: 59-61, 73-75, 89-91
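The aliasing behavior described in this nitpick can be demonstrated with a minimal frozen dataclass standing in for WorkloadBaseConfig (the field name is illustrative):

```python
from dataclasses import dataclass, replace


# Minimal stand-in for WorkloadBaseConfig to show why reference aliases
# are safe when the dataclass is frozen.
@dataclass(frozen=True)
class WorkloadBaseConfig:
    pipeline_model_parallel_size: int = 8


BASE = WorkloadBaseConfig()
ALIAS_BF16 = BASE  # same object, not a copy
ALIAS_FP8 = BASE

# replace() returns a *new* instance; the shared aliases are untouched,
# and frozen=True forbids in-place mutation that would leak across aliases.
modified = replace(BASE, pipeline_model_parallel_size=16)
```

Because `frozen=True` raises `FrozenInstanceError` on any attribute assignment, the only way to "change" a preset is `replace()`, which cannot affect the other aliases; that is why applying precision at a different layer is safe here.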

scripts/performance/configs/kimi/__init__.py (1)

8-14: Add # noqa: F401 to suppress false-positive "imported but unused" warnings on re-exports.

Flake8 flags these imports as unused (F401), but they are intentional re-exports for the package's public API. Adding # noqa: F401 will suppress the false positives.

Proposed fix
 if HAVE_MEGATRON_BRIDGE:
-    from .kimi_llm_pretrain import (
-        kimi_k2_pretrain_config_b200,
-        kimi_k2_pretrain_config_gb200,
-        kimi_k2_pretrain_config_gb300,
-        kimi_k2_pretrain_config_h100,
-    )
+    from .kimi_llm_pretrain import (  # noqa: F401
+        kimi_k2_pretrain_config_b200,
+        kimi_k2_pretrain_config_gb200,
+        kimi_k2_pretrain_config_gb300,
+        kimi_k2_pretrain_config_h100,
+    )

Signed-off-by: Malay Nagda <malayn@nvidia.com>
Signed-off-by: Malay Nagda <malayn@nvidia.com>
Signed-off-by: Malay Nagda <malayn@nvidia.com>

Labels

r0.3.0 Cherry-pick label for r0.3.0 release branch


3 participants