ci(fix): PerfPlugin for llama by ko3n1g · Pull Request #2060 · NVIDIA-NeMo/Megatron-Bridge

ko3n1g · 2026-01-25T17:17:05Z

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Changelog

Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

Related to # (issue)

Summary by CodeRabbit

Refactor
- Improved model identification and environment variable configuration logic during the pretraining phase, so that Llama model variants of different sizes and configurations are handled consistently.
- Standardized hardware-specific performance optimizations across multiple systems (h100, gb200, gb300) by unifying the model classification logic that decides when they are applied.

Signed-off-by: oliver könig <okoenig@nvidia.com>

coderabbitai · 2026-01-25T17:19:39Z

📝 Walkthrough

The change refines model-specific environment variable gating conditions in the performance plugins module. It updates checks for Llama-based models by changing model_family_name comparisons from specific subversions (llama31, llama3) to a unified "llama" check, affecting when hardware-specific NCCL and cuDNN optimizations are applied during pretraining.
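The gating change described above can be pictured with a minimal sketch. This is not the code from `scripts/performance/perf_plugins.py`: the function names `old_gate` and `new_gate`, the `TUNED_GPUS` set, and the `task` parameter are hypothetical, and the sketch assumes the caller passes the family name as the plain string `"llama"`.

```python
# Illustrative sketch only: every name here is hypothetical and does not
# reproduce the real perf_plugins.py implementation.
TUNED_GPUS = {"h100", "gb200", "gb300"}


def old_gate(model_family_name: str, gpu: str, task: str) -> bool:
    """Version-specific check, as the walkthrough describes the earlier behavior."""
    return (
        task == "pretrain"
        and gpu in TUNED_GPUS
        and model_family_name in ("llama3", "llama31")
    )


def new_gate(model_family_name: str, gpu: str, task: str) -> bool:
    """Unified check: any model whose family name is exactly "llama" qualifies."""
    return task == "pretrain" and gpu in TUNED_GPUS and model_family_name == "llama"


if __name__ == "__main__":
    for family in ("llama", "llama3", "llama31"):
        print(family, old_gate(family, "h100", "pretrain"), new_gate(family, "h100", "pretrain"))
```

Under that assumption, a version-specific comparison never matches a family name of `"llama"`, so the hardware-specific settings would be skipped; the unified comparison lets them apply.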

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| Environment variable gating refinement `scripts/performance/perf_plugins.py` | Updated model-specific condition checks to use unified model_family_name == "llama" instead of version-specific checks (llama31, llama3\*), affecting applicability of gb200, h100, and gb300 hardware optimizations (NCCL_CTA_POLICY, del_cudnn_ln) during pretraining |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Test Results For Major Changes | ⚠️ Warning | PR alters performance-related gating logic for llama-based models without before/after benchmarks or testing details to demonstrate no regression. | Add detailed performance results comparing previous and updated behavior with hardware configuration, model, dataset, and measurement methodology. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'ci(fix): PerfPlugin for llama' is directly related to the changeset, which refines llama model-specific environment variable gating in the PerfPlugin by updating model checks from 'llama31' to 'llama' family patterns. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

ci(fix): PerfPlugin for llama

a032f39

Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g requested a review from malay-nagda January 25, 2026 17:17

copy-pr-bot bot had a problem deploying to nemo-ci January 25, 2026 17:17 Error

copy-pr-bot bot had a problem deploying to test January 25, 2026 17:17 Error

Merge branch 'main' into ko3n1g/ci/fix-perf-plugin

87f830f

copy-pr-bot bot temporarily deployed to nemo-ci January 25, 2026 17:18 Inactive

copy-pr-bot bot temporarily deployed to test January 25, 2026 17:18 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci January 25, 2026 17:22 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci January 25, 2026 17:29 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci January 25, 2026 17:39 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci January 25, 2026 17:39 Failure

copy-pr-bot bot temporarily deployed to nemo-ci January 25, 2026 17:39 Inactive

malay-nagda approved these changes Jan 26, 2026

ko3n1g merged commit 93d2395 into main Jan 26, 2026
45 of 47 checks passed

ko3n1g deleted the ko3n1g/ci/fix-perf-plugin branch January 26, 2026 08:16

nv-mollys pushed a commit that referenced this pull request Jan 27, 2026

ci(fix): PerfPlugin for llama (#2060)

6e95de0

Signed-off-by: oliver könig <okoenig@nvidia.com> Signed-off-by: mollys <mollys@mollys.nvidia.com>

aroshanghias-nvd pushed a commit to aroshanghias-nvd/Megatron-Bridge that referenced this pull request Jan 29, 2026

ci(fix): PerfPlugin for llama (NVIDIA-NeMo#2060)

17e107e

Signed-off-by: oliver könig <okoenig@nvidia.com> Signed-off-by: Ali Roshan Ghias <aroshanghias@nvidia.com>

aroshanghias-nvd pushed a commit to aroshanghias-nvd/Megatron-Bridge that referenced this pull request Jan 29, 2026

ci(fix): PerfPlugin for llama (NVIDIA-NeMo#2060)

241e9a4

Signed-off-by: oliver könig <okoenig@nvidia.com> Signed-off-by: Ali Roshan Ghias <aroshanghias@nvidia.com>

This was referenced Feb 3, 2026

cp: Dsv3 Recipe Update (2152) into r0.3.0 #2186

Merged

Revert #2152 and 2209 #2271

Merged

This was referenced Feb 10, 2026

DeepSeek-V3 recipes for H100 #2197

Merged

DeepSeek-V3 recipes for H100 #2312

Merged

ko3n1g merged 2 commits into main from ko3n1g/ci/fix-perf-plugin
