
kimi k2 recipe intro #2097

Merged
malay-nagda merged 12 commits into main from malay/kimi_k2_init
Feb 11, 2026

Conversation

@malay-nagda
Contributor

@malay-nagda malay-nagda commented Jan 28, 2026

What does this PR do ?

Add Kimi-K2 performance recipes.

Also adds updates to:

  • accommodate a different optimizer
  • PP layout when VP=None
  • PP layout for user-provided PP-VP overrides

Changelog

__all__ = [
    "KIMI_K2_PRETRAIN_CONFIG_B200_BF16",
    "KIMI_K2_PRETRAIN_CONFIG_B200_FP8_CS",
    "KIMI_K2_PRETRAIN_CONFIG_B200_FP8_MX",
    "KIMI_K2_PRETRAIN_CONFIG_GB200_BF16",
    "KIMI_K2_PRETRAIN_CONFIG_GB200_FP8_CS",
    "KIMI_K2_PRETRAIN_CONFIG_GB200_FP8_MX",
    "KIMI_K2_PRETRAIN_CONFIG_GB300_BF16",
    "KIMI_K2_PRETRAIN_CONFIG_GB300_FP8_CS",
    "KIMI_K2_PRETRAIN_CONFIG_GB300_FP8_MX",
    "KIMI_K2_PRETRAIN_CONFIG_GB300_NVFP4",
    "KIMI_K2_PRETRAIN_CONFIG_H100_BF16",
    "KIMI_K2_PRETRAIN_CONFIG_H100_FP8_CS",
    "KIMI_K2_PRETRAIN_CONFIG_H100_FP8_SC",
]
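The exported names follow a `KIMI_K2_PRETRAIN_CONFIG_<GPU>_<PRECISION>` pattern, so a caller can resolve a preset dynamically by building the name from a GPU and precision string. A minimal sketch of that lookup convention; the `PRESETS` registry and its dict values below are illustrative placeholders, not the real `WorkloadBaseConfig` objects exported by the module:

```python
# Sketch: resolving a preset by its conventional name.
# The registry values here are placeholders standing in for the
# WorkloadBaseConfig objects the real module exports.
PRESETS = {
    "KIMI_K2_PRETRAIN_CONFIG_GB200_BF16": {"gpu": "gb200", "dtype": "bf16"},
    "KIMI_K2_PRETRAIN_CONFIG_H100_FP8_CS": {"gpu": "h100", "dtype": "fp8_cs"},
}


def resolve_preset(gpu: str, precision: str) -> dict:
    """Build the conventional preset name and look it up."""
    name = f"KIMI_K2_PRETRAIN_CONFIG_{gpu.upper()}_{precision.upper()}"
    try:
        return PRESETS[name]
    except KeyError:
        raise ValueError(f"No preset for gpu={gpu!r}, precision={precision!r}")
```

Note that the H100 export ends in `_FP8_SC` rather than `_FP8_CS`; a name-based lookup like this would surface that inconsistency immediately (it is flagged in the review below).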

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • New Features

    • Added comprehensive pretraining configuration support for the KIMI-K2 model across GB300, GB200, B200, and H100 GPU types
    • Support for multiple precision formats including BF16, FP8_CS, FP8_MX, and NVFP4
  • Bug Fixes

    • Improved pipeline layout computation and error handling for specific configurations
    • Refined optimizer precision handling for specific optimizer and precision combinations

Signed-off-by: Malay Nagda <malayn@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Jan 28, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Signed-off-by: Malay Nagda <malayn@nvidia.com>
Signed-off-by: malay-nagda <malayn@nvidia.com>
@dingqingy-nv dingqingy-nv added the r0.3.0 label (Cherry-pick label for r0.3.0 release branch) Feb 3, 2026
@malay-nagda malay-nagda changed the base branch from main to r0.3.0 February 5, 2026 10:00
@malay-nagda malay-nagda changed the base branch from r0.3.0 to main February 5, 2026 10:01
Signed-off-by: Malay Nagda <malayn@nvidia.com>
Signed-off-by: malay-nagda <malayn@nvidia.com>
Signed-off-by: Malay Nagda <malayn@nvidia.com>
Signed-off-by: Malay Nagda <malayn@nvidia.com>
@coderabbitai
Contributor

coderabbitai bot commented Feb 9, 2026

📝 Walkthrough

Walkthrough

Introduces Kimi-K2 pretraining configuration infrastructure across multiple modules: base workload configuration presets for GB300, GB200, B200, and H100 GPUs with multiple precision variants, factory functions to assemble GPU-specific pretrain configurations, conditional exports through the package interface, and updates to override and pipeline layout handling logic.

Changes

  • Kimi-K2 Workload Configuration Presets (scripts/performance/configs/kimi/kimi_workload_base_configs.py): Defines BASE_KIMI_K2_CONFIG and four GPU-variant base configs (GB300, GB200, B200, H100) with multiple precision aliases (BF16, FP8_CS, FP8_MX, NVFP4). Exports all presets via the __all__ list for public consumption.
  • Kimi-K2 Pretrain Factory Functions (scripts/performance/configs/kimi/kimi_llm_pretrain.py): Adds a set_kimi_k2_common_configs() helper and four GPU-specific factory functions (kimi_k2_pretrain_config_gb300, _gb200, _b200, _h100) that assemble base configs, apply common settings, configure precision, enable communication overlap, and set dataset parameters.
  • Package Initialization & Exports (scripts/performance/configs/kimi/__init__.py): Conditionally imports megatron.bridge; defines __all__ to export preset constants unconditionally and augments it with factory function aliases when megatron.bridge is available.
  • Override & Layout Handling (scripts/performance/utils/overrides.py, src/megatron/bridge/recipes/kimi/kimi_k2.py): Updates override logic to compute the Kimi-K2 pipeline layout with error handling, restricts bf16 optimizer precision to "adam" only, and guards comm overlap settings against the "dist_muon" optimizer. Normalizes vp_size=None to 1 in the layout lookup.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Run CICD

Suggested reviewers

  • erhoo82
🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed

❌ Failed checks (1 warning, 1 inconclusive)

  • Test Results For Major Changes (⚠️ Warning): PR introduces 366 lines of new Kimi K2 recipe code with core override logic changes but lacks concrete testing evidence, validation results, and performance benchmarks in the description. Resolution: complete the PR description with test results, regression tests, and performance benchmarks, and address the identified review issues (invalid PP/VP combinations, missing headers) with full test suite evidence.
  • Title check (❓ Inconclusive): The title is vague and generic, using non-descriptive terms that don't clearly convey the scope of changes across multiple files and new configuration systems. Resolution: consider a more descriptive title such as "Add KIMI K2 pretraining configurations and recipe support" to better reflect the multiple new modules, workload configs, and pipeline layout logic being introduced.

✅ Passed checks (2 passed)

  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@scripts/performance/configs/kimi/__init__.py`:
- Around line 1-6: Add the required NVIDIA copyright header to the top of this
Python module (before any imports or code). Update the file that defines
HAVE_MEGATRON_BRIDGE and imports megatron.bridge so the header precedes the
try/except block and remains intact; do not remove or modify the existing logic
that sets HAVE_MEGATRON_BRIDGE or the import of megatron.bridge.

In `@scripts/performance/configs/kimi/kimi_workload_base_configs.py`:
- Line 91: The constant name KIMI_K2_PRETRAIN_CONFIG_H100_FP8_SC appears
inconsistent with the FP8_CS suffix used for other GPUs (e.g.,
GB300/GB200/B200); if "SC" is a typo, rename KIMI_K2_PRETRAIN_CONFIG_H100_FP8_SC
to KIMI_K2_PRETRAIN_CONFIG_H100_FP8_CS to match the convention and update any
references; if "SC" is an intentional different precision variant, add a
clarifying comment next to the definition of KIMI_K2_PRETRAIN_CONFIG_H100_FP8_SC
explaining how SC differs from CS and ensure any related tests/config uses the
correct symbol.
- Around line 78-91: KIMI_K2_PRETRAIN_CONFIG_H100 sets
pipeline_model_parallel_size=16 and virtual_pipeline_model_parallel_size=2 which
has no corresponding entry in _get_kimi_k2_pipeline_layout (called during
kimi_k2_pretrain_config_h100 construction) and will raise ValueError; fix by
either adding a (16, 2) layout entry to the layout map inside
_get_kimi_k2_pipeline_layout in kimi_k2.py or change
KIMI_K2_PRETRAIN_CONFIG_H100 to use a valid VP (e.g., set
virtual_pipeline_model_parallel_size=1 to match the existing (16,1) layout) and
update the related aliases (KIMI_K2_PRETRAIN_CONFIG_H100_BF16 / _FP8_CS /
_FP8_SC) if needed.

In `@scripts/performance/utils/overrides.py`:
- Around line 398-409: When model_recipe_name == "kimi_k2" we must avoid passing
sentinel pp_size (None) or vp_size (-1) into _get_kimi_k2_pipeline_layout;
instead derive effective sizes from the recipe's current layout: read
current_layout = recipe.model.pipeline_model_parallel_layout and compute
effective_pp = pp_size if pp_size is not None else (current_layout[0] if
current_layout else 1) and effective_vp = vp_size if vp_size != -1 else
(current_layout[1] if current_layout else 1), then call
_get_kimi_k2_pipeline_layout(effective_pp, effective_vp) and assign
recipe.model.pipeline_model_parallel_layout accordingly (keeping the existing
exception handling and preserving any explicit pipeline_model_parallel_layout
override afterwards).
🧹 Nitpick comments (10)
src/megatron/bridge/recipes/kimi/kimi_k2.py (3)

25-25: Type hint does not reflect that vp_size can be None.

Line 37 explicitly handles vp_size is None, but the signature declares vp_size: int. Update the type hint per coding guidelines (int | None instead of Optional[int]). As per coding guidelines, "Use 'T | None' for nullable types instead of 'Optional[T]'".

Proposed fix
-def _get_kimi_k2_pipeline_layout(pp_size: int, vp_size: int):
+def _get_kimi_k2_pipeline_layout(pp_size: int, vp_size: int | None):

44-45: Redundant outer list() call.

[list(x) for x in layout] already produces a list; wrapping it in list(...) is unnecessary.

Proposed fix
-        layout = list([list(x) for x in layout])
+        layout = [list(x) for x in layout]

88-88: Line likely exceeds 119-character limit.

The inline comment on cfg.dataset.blend makes this line very long. Consider moving the usage example into the docstring or a standalone comment above the assignment. As per coding guidelines, "Maximum line length is 119 characters (matching ruff configuration)".

scripts/performance/utils/overrides.py (1)

22-22: Importing a private (_-prefixed) function across package boundaries.

_get_kimi_k2_pipeline_layout is conventionally private to its module. If it's intended to be used by the performance scripts, consider either making it public (removing the _ prefix) or providing a public wrapper. This isn't blocking, just a convention note.

scripts/performance/configs/kimi/kimi_llm_pretrain.py (4)

15-23: Import ordering violates coding guidelines.

Per guidelines, first-party imports (megatron.bridge.*) should come before local folder imports (utils.*). Swap the groups.

Proposed fix
 import logging
 
-from utils.overrides import set_workload_base_configs
-from utils.precision import get_precision_config
-from utils.utils import get_workload_base_config
-
 from megatron.bridge.recipes.kimi.kimi_k2 import _get_kimi_k2_pipeline_layout
 from megatron.bridge.recipes.kimi.kimi_k2 import kimi_k2_pretrain_config as pretrain_config
 from megatron.bridge.training.config import ConfigContainer
+
+from utils.overrides import set_workload_base_configs
+from utils.precision import get_precision_config
+from utils.utils import get_workload_base_config

As per coding guidelines, "Organize imports in order: future imports, standard library, third-party (including megatron.core, torch, transformers), first-party (megatron.bridge.*), local folder imports, separated by blank lines".


45-84: Unused mock parameter in all four factory functions.

The mock parameter is declared but never used (confirmed by static analysis ARG001). Either remove it or use it to conditionally configure mock data. If it's a placeholder for future use, add a brief comment or prefix with _.

Proposed fix (if not needed)
 def kimi_k2_pretrain_config_gb300(
-    precision: str = "bf16", mock: bool = True, config_variant: str = "v1"
+    precision: str = "bf16", config_variant: str = "v1"
 ) -> ConfigContainer:

Also applies to: 87-126, 129-163, 166-201


45-201: Significant code duplication across four GPU config factories.

The four functions are nearly identical — only the GPU name, overlap_grad_reduce, and num_workers/pin_memory differ. Consider extracting a common helper:

Sketch of a refactored approach
def _kimi_k2_pretrain_config(
    gpu: str,
    precision: str = "bf16",
    config_variant: str = "v1",
    overlap_grad_reduce: bool = True,
    num_workers: int | None = None,
    pin_memory: bool | None = None,
) -> ConfigContainer:
    base_cfg = get_workload_base_config(
        model_family_name="kimi",
        model_recipe_name="kimi_k2",
        gpu=gpu,
        compute_dtype=precision.upper(),
        task="pretrain",
        config_variant=config_variant,
    )
    cfg = pretrain_config()
    cfg.mixed_precision = get_precision_config(precision)

    if base_cfg.moe_flex_dispatcher_backend is not None:
        cfg.model.moe_flex_dispatcher_backend = base_cfg.moe_flex_dispatcher_backend

    if base_cfg.pp_layout:
        cfg.model.pipeline_model_parallel_layout = base_cfg.pp_layout
    else:
        pp_size = base_cfg.pipeline_model_parallel_size
        vp_size = base_cfg.virtual_pipeline_model_parallel_size
        cfg.model.pipeline_model_parallel_layout = _get_kimi_k2_pipeline_layout(pp_size, vp_size)

    set_kimi_k2_common_configs(cfg)
    set_workload_base_configs(cfg, base_cfg)

    cfg.comm_overlap.overlap_grad_reduce = overlap_grad_reduce
    if num_workers is not None:
        cfg.dataset.num_workers = num_workers
    if pin_memory is not None:
        cfg.dataset.pin_memory = pin_memory

    return cfg

29-43: set_kimi_k2_common_configs has no return value despite modifying cfg in place.

The function signature declares -> None and mutates cfg, which is fine. However, some callers (like the factory functions) don't document that common configs override the recipe's grad_reduce_in_fp32=True to False (lines 38-39). This is a meaningful behavioral change from the recipe default. A brief docstring note would help future maintainers understand the intent.

scripts/performance/configs/kimi/kimi_workload_base_configs.py (1)

40-43: Precision aliases are reference aliases, not distinct configs.

All _BF16, _FP8_CS, etc. variants point to the exact same WorkloadBaseConfig object (e.g., KIMI_K2_PRETRAIN_CONFIG_GB300_BF16 = KIMI_K2_PRETRAIN_CONFIG_GB300). Mutating one would mutate all. Since WorkloadBaseConfig is a frozen dataclass (uses replace), this is likely safe, but consider adding a brief comment explaining that precision is applied at a different layer.

Also applies to: 59-61, 73-75, 89-91
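The aliasing behavior described in this nitpick can be demonstrated with a minimal frozen dataclass standing in for WorkloadBaseConfig (the field name is illustrative):

```python
from dataclasses import dataclass, replace


# Minimal stand-in for WorkloadBaseConfig to show why reference aliases
# are safe when the dataclass is frozen.
@dataclass(frozen=True)
class WorkloadBaseConfig:
    pipeline_model_parallel_size: int = 8


BASE = WorkloadBaseConfig()
ALIAS_BF16 = BASE  # same object, not a copy
ALIAS_FP8 = BASE

# replace() returns a *new* instance; the shared aliases are untouched,
# and frozen=True forbids in-place mutation that would leak across aliases.
modified = replace(BASE, pipeline_model_parallel_size=16)
```

Because `frozen=True` raises `FrozenInstanceError` on any attribute assignment, the only way to "change" a preset is `replace()`, which cannot affect the other aliases; that is why applying precision at a different layer is safe here.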

scripts/performance/configs/kimi/__init__.py (1)

8-14: Add # noqa: F401 to suppress false-positive "imported but unused" warnings on re-exports.

Flake8 flags these imports as unused (F401), but they are intentional re-exports for the package's public API. Adding # noqa: F401 will suppress the false positives.

Proposed fix
 if HAVE_MEGATRON_BRIDGE:
-    from .kimi_llm_pretrain import (
-        kimi_k2_pretrain_config_b200,
-        kimi_k2_pretrain_config_gb200,
-        kimi_k2_pretrain_config_gb300,
-        kimi_k2_pretrain_config_h100,
-    )
+    from .kimi_llm_pretrain import (  # noqa: F401
+        kimi_k2_pretrain_config_b200,
+        kimi_k2_pretrain_config_gb200,
+        kimi_k2_pretrain_config_gb300,
+        kimi_k2_pretrain_config_h100,
+    )

Signed-off-by: Malay Nagda <malayn@nvidia.com>
Signed-off-by: Malay Nagda <malayn@nvidia.com>
Signed-off-by: Malay Nagda <malayn@nvidia.com>

Labels

r0.3.0 Cherry-pick label for r0.3.0 release branch


3 participants