feat: Deepseek migration to Megatron-Bridge + CP support #1059
Conversation
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Since we're moving Megatron ahead, can you run all the Megatron nightly tests? I can share the baselines from the bridge PR for comparison.
logprob_mb_tokens should be an optional seq packing parameter (not needed in SFT); first attempt to test run times; more timeout fixes; remove train_mv_tokens/logprob_mb_tokens since it's not needed in the init and causes crashes; fix gemma27b parallel config; revert steps and time since that hides regression; revert other configs that hide regression
Signed-off-by: Terry Kong <terryk@nvidia.com>
… into yifu/mbridge_dsv3_mcoretot
✅ Submodule Fast-Forward Check Results: check based on commit d83ba06 (PR #1059). Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward). All submodule changes look good! ✨
✅ Submodule Fast-Forward Check Results: check based on commit f2e21fc (PR #1059). Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward). All submodule changes look good! ✨
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (8)
nemo_rl/models/megatron/common.py (6)
36-44: Return annotation does not match the actual return (5-tuple). The function returns five values but the type hints specify four; the second element's type is also missing.
Apply:
```diff
-def _pack_sequences_for_megatron(
-    input_ids: torch.Tensor,
-    seq_lengths: torch.Tensor,
-    pad_individual_seqs_to_multiple_of: int = 1,
-    pad_packed_seq_to: Optional[int] = None,
-    cp_rank: int = 0,
-    cp_size: int = 1,
-) -> tuple[torch.Tensor, PackedSeqParams, torch.Tensor, Optional[torch.Tensor]]:
+def _pack_sequences_for_megatron(
+    input_ids: torch.Tensor,
+    seq_lengths: torch.Tensor,
+    pad_individual_seqs_to_multiple_of: int = 1,
+    pad_packed_seq_to: Optional[int] = None,
+    cp_rank: int = 0,
+    cp_size: int = 1,
+) -> tuple[torch.Tensor, torch.Tensor, PackedSeqParams, torch.Tensor, Optional[torch.Tensor]]:
```
53-60: Docstring "Returns" order is wrong. The docstring says the first element is packed_input_ids, but the code returns all_input_ids first. Fix to prevent misuse.
Apply:
```diff
-    Tuple of:
-        - packed_input_ids: Packed input tensor [1, T]
-        - input_ids_cp_sharded: Sharded input tensor [cp_size, T // cp_size]
+    Tuple of:
+        - all_input_ids: Concatenated (unsharded) input tensor [1, T_total]
+        - input_ids_cp_sharded: CP-sharded packed input tensor for this rank [1, T_shard]
         - packed_seq_params: PackedSeqParams object
         - cu_seqlens: Cumulative sequence lengths
         - cu_seqlens_padded: Padded cumulative sequence lengths
```
72-91: CP padding invariant not enforced (can break CP semantics). When cp_size > 1, each sequence should be padded to a multiple of cp_size*2 (per your Notes), but pad_factor is driven solely by pad_individual_seqs_to_multiple_of. This can cause misalignment between pack/unpack and TE kernels.
Apply:
```diff
-    pad_factor = pad_individual_seqs_to_multiple_of
+    pad_factor = pad_individual_seqs_to_multiple_of
+    if cp_size > 1 and pad_factor <= 1:
+        # Ensure per-seq padding meets CP requirements for causal attention load balancing
+        pad_factor = cp_size * 2
```
And consider documenting that callers can override via pad_individual_seqs_to_multiple_of.
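For intuition, the proposed invariant can be sketched as a standalone helper (`per_seq_padded_len` is a hypothetical name for illustration, not part of the module):

```python
def per_seq_padded_len(seq_len, pad_individual_seqs_to_multiple_of=1, cp_size=1):
    """Round a sequence length up to the effective pad factor (sketch)."""
    pad_factor = pad_individual_seqs_to_multiple_of
    if cp_size > 1 and pad_factor <= 1:
        # CP causal load balancing splits each sequence into 2 chunks per rank,
        # so the length must be divisible by cp_size * 2
        pad_factor = cp_size * 2
    return -(-seq_len // pad_factor) * pad_factor  # ceil to a multiple

print(per_seq_padded_len(10))             # -> 10 (no padding needed)
print(per_seq_padded_len(10, cp_size=4))  # -> 16 (multiple of 8)
```

An explicit pad_individual_seqs_to_multiple_of still wins when set, which is what the "callers can override" note above relies on.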
121-129: Last-sequence padding can go negative. When pad_packed_seq_to < running_seq_len, padded_seq_len becomes negative and corrupts running_seq_len. Add a check.
Apply:
```diff
     if b == batch_size - 1 and pad_packed_seq_to is not None:
-        padded_seq_len = pad_packed_seq_to - running_seq_len
+        extra = pad_packed_seq_to - running_seq_len
+        if extra < 0:
+            raise ValueError(
+                f"pad_packed_seq_to ({pad_packed_seq_to}) smaller than accumulated length ({running_seq_len})"
+            )
+        padded_seq_len = extra
```
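The guarded final-padding computation can be exercised in isolation (`final_pad_len` is a hypothetical stand-in for the inline logic, not an existing function):

```python
def final_pad_len(running_seq_len, pad_packed_seq_to):
    """Padding appended after the last sequence; rejects an undersized target."""
    if pad_packed_seq_to is None:
        return 0  # no fixed packed length requested
    extra = pad_packed_seq_to - running_seq_len
    if extra < 0:
        raise ValueError(
            f"pad_packed_seq_to ({pad_packed_seq_to}) smaller than "
            f"accumulated length ({running_seq_len})"
        )
    return extra

print(final_pad_len(100, 128))  # -> 28
print(final_pad_len(100, None)) # -> 0
```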
355-361: Guard against temperature=0. Division by zero is possible if the config sets temperature: 0.
Apply:
```diff
-    output_tensor.div_(policy_cfg["generation"]["temperature"])
+    temp = policy_cfg["generation"].get("temperature", 1.0)
+    if temp and temp != 1.0:
+        output_tensor.div_(temp)
```
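A pure-Python sketch of the guarded scaling, on plain lists instead of tensors (`apply_temperature` is a hypothetical name):

```python
def apply_temperature(logits, temperature=1.0):
    """Divide logits by temperature, treating 0/None/1.0 as a no-op."""
    if not temperature or temperature == 1.0:
        # Guard: avoids ZeroDivisionError for temperature 0 and skips the
        # pointless division when temperature is 1.0
        return list(logits)
    return [x / temperature for x in logits]

print(apply_temperature([2.0, 4.0], 2.0))  # -> [1.0, 2.0]
print(apply_temperature([2.0, 4.0], 0))    # -> [2.0, 4.0] (guarded)
```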
422-446: broadcast_tensor uses global ranks with group operations. dist.broadcast_object_list and dist.broadcast expect src to be the rank within the given group. Here, src_rank is documented and used as a global rank, which can target the wrong process.
Minimal fix: convert global src to group src before broadcasts, or change the API to accept src_group_rank. Example conversion:
```diff
-    if rank == src_rank:
+    group_rank = dist.get_rank(group)
+    # Convert global to group rank if supported; otherwise, require group-local rank
+    try:
+        src_group_rank = dist.get_group_rank(group, src_rank)  # PyTorch 2.3+
+    except Exception:
+        src_group_rank = src_rank  # Fallback: assume caller passed group-local rank
+
+    if group_rank == src_group_rank:
         if tensor is None:
             raise ValueError(f"Rank {rank} is source ({src_rank}) but tensor is None.")
@@
-    dist.broadcast_object_list(object_list, src=src_rank, group=group)
+    dist.broadcast_object_list(object_list, src=src_group_rank, group=group)
@@
-    dist.broadcast(tensor, src=src_rank, group=group)
+    dist.broadcast(tensor, src=src_group_rank, group=group)
```
Also update the docstring to clarify src is group-local.
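The global-to-group conversion is just an index lookup over the group's ordered list of global ranks; a minimal pure-Python analogue of what torch.distributed.get_group_rank does (`to_group_rank` is a hypothetical name):

```python
def to_group_rank(global_rank, group_global_ranks):
    """Return the position of global_rank within the group's ordered rank list."""
    try:
        return group_global_ranks.index(global_rank)
    except ValueError:
        raise ValueError(
            f"global rank {global_rank} is not part of this group"
        ) from None

# A group built from global ranks [2, 5, 7]: global rank 5 is group rank 1
print(to_group_rank(5, [2, 5, 7]))  # -> 1
```

This is why passing a global rank as src silently targets the wrong process when the group is a strict subset of the world.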
tools/refit_verifier.py (1)
357-363: Fix tokenizer call: padding_side is not a valid encode kwarg. Set tokenizer.padding_side before calling, instead of passing it as an argument.
Apply:
```diff
-    tokenized = tokenizer(
-        [prompt],
-        padding=True,
-        truncation=True,
-        return_tensors="pt",
-        padding_side="right",
-    )
+    tokenizer.padding_side = "right"
+    tokenized = tokenizer(
+        [prompt],
+        padding=True,
+        truncation=True,
+        return_tensors="pt",
+    )
```
nemo_rl/models/policy/megatron_policy_worker.py (1)
772-779: Gate trust_remote_code for tokenizer and AutoBridge. Same concern: make this configurable rather than always True.
Apply:
```diff
-        self.megatron_tokenizer = build_tokenizer(
+        self.megatron_tokenizer = build_tokenizer(
             tokenizer_config,
             make_vocab_size_divisible_by=self.megatron_cfg.model.make_vocab_size_divisible_by
             // self.cfg["megatron_cfg"]["tensor_model_parallel_size"],
             tensor_model_parallel_size=self.cfg["megatron_cfg"][
                 "tensor_model_parallel_size"
             ],
-            trust_remote_code=True,
+            trust_remote_code=self.cfg.get("trust_remote_code", False),
         )
@@
-        self.megatron_bridge = AutoBridge.from_hf_pretrained(
-            hf_model_name, trust_remote_code=True
-        )
+        self.megatron_bridge = AutoBridge.from_hf_pretrained(
+            hf_model_name, trust_remote_code=self.cfg.get("trust_remote_code", False)
+        )
```
🧹 Nitpick comments (14)
examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-megatrontp2pp2-quick.yaml (2)
65-66: Nit: normalize boolean casing. This file mixes True/False and true/false. Consider standardizing (e.g., all lower-case) for consistency across configs.
48-76: Optional: centralize repeated MoE defaults. Given this flag is added across many configs, consider a shared partial (e.g., examples/configs/_partials/megatron_moe_defaults.yaml) and reference it via defaults to reduce drift.
examples/configs/grpo_math_1B_megatron.yaml (2)
90-95: Optional: clarify comment scope and doc. The "~20% perf speedup with sequence packing" comment sits above apply_rope_fusion. If that claim doesn't pertain to moe_permute_fusion, consider a brief note for moe_permute_fusion (e.g., when to enable, known hardware/TE constraints).
74-136: Optional: deduplicate Megatron MoE block across recipes. Same suggestion as in the other file: extract shared megatron_cfg MoE defaults to a partial to keep one source of truth (freeze_moe_router, router dtype/type, moe_permute_fusion, etc.).
3rdparty/Megatron-LM-workspace/Megatron-LM (1)
1-1: Prefer pinning to a stable tag or documenting a lockfile mapping; add a compat note. The current pin is a raw SHA from a feature-branch lineage. Consider:
- Pinning to an annotated tag (if available) or
- Keeping the SHA but recording it in a third_party.lock with a short “compat matrix” (Megatron-LM SHA ↔ Megatron-Bridge SHA ↔ TE version) to reduce drift and ease future rebases.
examples/configs/dpo.yaml (1)
112-112: Add a short doc comment and keep style consistent for booleans.
- Please add a one-liner above to explain when to enable moe_permute_fusion (HW/TE/Megatron-Bridge requirements, known limitations).
- Minor: booleans in this block mix True/true/false. Consider normalizing within megatron_cfg for readability.
Proposed tweak:
```diff
-  moe_permute_fusion: false
+  # Enables fused token<->expert permutation kernels for MoE (requires recent Megatron-Bridge/TE).
+  moe_permute_fusion: false
```
examples/configs/recipes/llm/grpo-qwen3-30ba3b-8n8g-megatron.yaml (1)
76-76: Confirm interaction with freeze_moe_router=true and EP=4. With expert_model_parallel_size: 4 and freeze_moe_router: true, fused permute should be safe but offers limited benefit if routing is static. Keep false by default unless nightly perf shows gains; add a note inline for when to flip.
Suggested comment:
```diff
-  moe_permute_fusion: false
+  # Set true if kernels are available; verify with GRPO nightly (EP=4, TP=4, PP=4).
+  moe_permute_fusion: false
```
examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8.yaml (1)
74-74: FP8 path: gate fused permute on TE 2.6+ only. This recipe targets FP8 vLLM training. Please ensure fused permute kernels are compatible with your TE version; otherwise leaving false is correct. Add a guard note to avoid accidental enablement.
```diff
-  moe_permute_fusion: false
+  # FP8: enable only with TE >= 2.6 and validated kernels.
+  moe_permute_fusion: false
```
examples/configs/grpo_math_1B.yaml (1)
77-77: Consistency + docs: add a brief note and align naming in docs/tools.
- Add a brief inline comment on the flag purpose.
- Ensure tools/refit_verifier.py and README mention moe_permute_fusion to keep configs, tooling, and docs in sync.
```diff
-  moe_permute_fusion: false
+  # Toggle fused MoE permutation kernels (perf). Default off for safety.
+  moe_permute_fusion: false
```
nemo_rl/models/megatron/common.py (1)
425-475: Device handling assumes CUDA; add a CPU-safe fallback. torch.cuda.current_device() will fail on CPU-only runs. Use the source tensor's device from metadata, or default to CPU when CUDA is unavailable.
Apply:
```diff
-    device = torch.cuda.current_device()
+    device = (
+        torch.device("cuda", torch.cuda.current_device())
+        if torch.cuda.is_available()
+        else torch.device("cpu")
+    )
```
And prefer allocating tensors on the received device encoded in metadata.
tools/refit_verifier.py (3)
158-161: Align Megatron vs vLLM max_new_tokens. The Megatron config sets max_new_tokens to args.max_sequence_length while vLLM uses args.max_new_tokens. This can skew comparisons. Recommend using the same source.
Apply:
```diff
-        "max_new_tokens": args.max_sequence_length,
+        "max_new_tokens": args.max_new_tokens,
```
279-281: Use a boolean for enforce_eager. Minor: the value is a string; prefer a boolean.
Apply:
```diff
-        "enforce_eager": "False",
+        "enforce_eager": False,
```
424-431: Avoid dumping large structures to stdout. The raw print of megatron_input_data and vllm_logprobs_data can be huge. Consider gating it under a --verbose flag.
Also applies to: 425-431
nemo_rl/models/policy/megatron_policy_worker.py (1)
1169-1177: Prefer tokenizer ids over hard-coded zeros. Pass the actual pad/eos token IDs to get_ltor_masks_and_position_ids for correctness across models.
Apply:
```diff
-    attention_mask, _, position_ids = get_ltor_masks_and_position_ids(
-        data=input_ids,
-        eod_token=0,  # used for loss_mask, which we don't use
-        pad_token=0,  # used for loss_mask, which we don't use
+    attention_mask, _, position_ids = get_ltor_masks_and_position_ids(
+        data=input_ids,
+        eod_token=self.tokenizer.eos_token_id or 0,
+        pad_token=self.tokenizer.pad_token_id or 0,
         reset_position_ids=False,
         reset_attention_mask=False,
         eod_mask_loss=False,
         pad_mask_loss=False,
     )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (24)
- .gitmodules (1 hunks)
- 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge (1 hunks)
- 3rdparty/Megatron-LM-workspace/Megatron-LM (1 hunks)
- examples/configs/dpo.yaml (1 hunks)
- examples/configs/grpo_math_1B.yaml (1 hunks)
- examples/configs/grpo_math_1B_megatron.yaml (1 hunks)
- examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.yaml (1 hunks)
- examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-megatrontp2pp2-quick.yaml (1 hunks)
- examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8.yaml (1 hunks)
- examples/configs/recipes/llm/grpo-llama3.2-1b-instruct-1n8g-megatron.yaml (1 hunks)
- examples/configs/recipes/llm/grpo-math-qwen3-30ba3b-megatron-tp4-32k.yaml (1 hunks)
- examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron.yaml (1 hunks)
- examples/configs/recipes/llm/grpo-qwen2.5-7b-instruct-4n8g-megatron.yaml (1 hunks)
- examples/configs/recipes/llm/grpo-qwen3-30ba3b-8n8g-megatron.yaml (1 hunks)
- examples/configs/recipes/llm/sft-llama3.1-70b-8n8g-tp4pp2-long-megatron.yaml (1 hunks)
- examples/configs/recipes/llm/sft-llama3.1-8b-1n8g-megatron-seqpack.yaml (1 hunks)
- examples/configs/recipes/llm/sft-llama3.1-8b-1n8g-megatron.yaml (1 hunks)
- examples/configs/sft.yaml (1 hunks)
- examples/configs/sft_openmathinstruct2_megatron.yaml (1 hunks)
- nemo_rl/models/megatron/common.py (1 hunks)
- nemo_rl/models/policy/megatron_policy_worker.py (5 hunks)
- tests/unit/models/generation/test_vllm_generation.py (1 hunks)
- tests/unit/models/policy/test_megatron_worker.py (1 hunks)
- tools/refit_verifier.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Lint check
🔇 Additional comments (21)
examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-megatrontp2pp2-quick.yaml (1)
65-65: Confirm dense-path compatibility
Wiring of moe_permute_fusion is implemented in nemo_rl/models/policy/megatron_policy_worker.py (line 566). Please run E2E tests to ensure dense (non-MoE) code paths tolerate this flag without side effects.
examples/configs/grpo_math_1B_megatron.yaml (1)
92-92: Resolve moe_permute_fusion flag presence
The moe_permute_fusion: false line is already present (line 92), matching other configs and placed correctly after moe_router_bias_update_rate. No further changes needed.
examples/configs/recipes/llm/sft-llama3.1-8b-1n8g-megatron.yaml (1)
55-55: Verify moe_permute_fusion plumbing with sequence_parallel=False and pipeline_model_parallel_size=2. Cannot confirm the flag is actually consumed by the worker in this topology—please ensure it's passed through and that the corresponding kernels exist; if not, keep it false and document.
3rdparty/Megatron-Bridge-workspace/Megatron-Bridge (1)
1-1: Megatron-Bridge submodule bump verified—next steps
- Verified that Megatron-Bridge is pinned to b99e1ca4e45d648fd8e2bffd63f5a41ede04c3da and Megatron-LM to 383d1144c3b3f77096c63b7308402a0ea6ba47dd, and .gitmodules correctly tracks the yifu/nemo-rl-use-chunkpatch-ds branch as shallow.
- Manually confirm that this Megatron-Bridge commit remains API-compatible with the Megatron-LM bump.
- Once PR 440 merges upstream, re-point .gitmodules to a stable branch/tag and pin to a released commit for reproducibility.
- Ensure CI runs git submodule sync && git submodule update --init --recursive so nightlies and CP jobs use the exact pointers.
- Add a brief note to RELEASE_NOTES or README explaining the temporary branch tracking and the plan for moving to an upstream tag.
examples/configs/recipes/llm/grpo-llama3.2-1b-instruct-1n8g-megatron.yaml (1)
59-59: Adds moe_permute_fusion flag — OK. Flag wiring looks consistent with other configs and worker code. No issues.
examples/configs/sft.yaml (1)
89-89: Adds moe_permute_fusion flag — OK. Matches the repo-wide pattern; safe even with megatron_cfg.enabled: false.
examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.yaml (1)
65-65: Adds moe_permute_fusion flag — OK. Placement and default align with other configs.
examples/configs/recipes/llm/grpo-math-qwen3-30ba3b-megatron-tp4-32k.yaml (1)
80-81: Adds moe_permute_fusion flag — OK. Consistent with test and tooling updates.
examples/configs/recipes/llm/grpo-qwen2.5-7b-instruct-4n8g-megatron.yaml (1)
61-61: Adds moe_permute_fusion flag — OK. No concerns.
examples/configs/recipes/llm/sft-llama3.1-70b-8n8g-tp4pp2-long-megatron.yaml (1)
45-45: Adds moe_permute_fusion flag — OK. Keeps MoE options consistent across large-model recipes.
.gitmodules (1)
4-4: Submodule branches point to feature branches — please confirm CI reproducibility. Tracking yuya/nemo-rl-use-2 and yifu/nemo-rl-use-chunkpatch-ds can drift. Ensure:
- Submodule SHAs in the superproject are pinned to the tested commits.
- Nightly/baseline runs are captured for these exact SHAs.
Would you like a small script to print current submodule SHAs and compare against the bridge PR baselines?
Also applies to: 9-9
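Such a script could parse `git submodule status` output and compare the pinned SHAs against a baseline. A minimal sketch (`parse_submodule_status` is a hypothetical helper; the sample SHAs are the Megatron-Bridge/Megatron-LM pins mentioned in this thread):

```python
import re

def parse_submodule_status(output):
    """Map submodule path -> pinned SHA from `git submodule status` output."""
    result = {}
    for line in output.strip().splitlines():
        # Format: one status char (space/+/-/U), 40-hex SHA, path, optional (ref)
        m = re.match(r"^[ +\-U]?([0-9a-f]{40})\s+(\S+)", line)
        if m:
            sha, path = m.groups()
            result[path] = sha
    return result

sample = (
    " b99e1ca4e45d648fd8e2bffd63f5a41ede04c3da "
    "3rdparty/Megatron-Bridge-workspace/Megatron-Bridge (heads/main)\n"
    "+383d1144c3b3f77096c63b7308402a0ea6ba47dd "
    "3rdparty/Megatron-LM-workspace/Megatron-LM (heads/main)\n"
)
pins = parse_submodule_status(sample)
```

In practice the output would come from `subprocess.run(["git", "submodule", "status"], ...)` and the resulting dict would be diffed against the bridge PR baseline SHAs.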
nemo_rl/models/megatron/common.py (2)
94-99: Potential mismatch: cu_seqlens_padded[-1] forced without validating the total padded length. If pad_packed_seq_to < the already-accumulated padded total, cu_seqlens_padded will become inconsistent with the tensors.
Apply a guard:
```diff
-    if pad_packed_seq_to is not None:
-        cu_seqlens_padded[-1] = pad_packed_seq_to
+    if pad_packed_seq_to is not None:
+        if pad_packed_seq_to < cu_seqlens_padded[-1]:
+            raise ValueError(
+                f"pad_packed_seq_to ({pad_packed_seq_to}) < total padded length ({int(cu_seqlens_padded[-1])})"
+            )
+        cu_seqlens_padded[-1] = pad_packed_seq_to
```
229-247: Unpack path ignores CP when cu_seqlens_padded is None. For CP>1 with no per-seq padding, cu_seqlens_padded can be None, and this branch treats it as "no CP," yet output_tensor is CP-sharded. This will mis-index.
Either:
- Always set cu_seqlens_padded=cu_seqlens.clone() when CP>1, or
- Gate unpacking on CP explicitly (use group/world CP metadata), or
- Document that unpack requires per-seq padding when CP>1.
Would you like a patch to always clone cu_seqlens when cp_size>1?
tools/refit_verifier.py (1)
213-215: Add of moe_permute_fusion looks good. The flag is plumbed consistently with worker usage. No concerns.
examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron.yaml (1)
88-91: Add of moe_permute_fusion is consistent. Matches worker/config plumbing elsewhere. Good default to false.
tests/unit/models/policy/test_megatron_worker.py (1)
98-101: Test config updated with moe_permute_fusion. Looks correct and keeps tests aligned with the runtime config.
examples/configs/recipes/llm/sft-llama3.1-8b-1n8g-megatron-seqpack.yaml (1)
55-58: Add of moe_permute_fusion is fine. Consistent with other YAMLs and the worker mapping.
examples/configs/sft_openmathinstruct2_megatron.yaml (1)
85-88: Add of moe_permute_fusion is fine. Keeps configs uniform.
tests/unit/models/generation/test_vllm_generation.py (1)
168-171: Add of moe_permute_fusion in Megatron test config. Matches policy worker expectations. Looks good.
nemo_rl/models/policy/megatron_policy_worker.py (2)
64-66: FSDP adapter import path change — verify submodule compatibility. Ensure the updated Megatron-LM/Megatron-Bridge pointers include megatron.core.distributed.fsdp.mcore_fsdp_adapter and that CI runs Megatron nightlies per the reviewer request.
Would you run the Megatron nightly suite and the RL unit tests on this PR branch to confirm the import path exists across all test matrices?
566-567: moe_permute_fusion is correctly wired into model_cfg. The mapping from cfg.megatron_cfg to model_cfg is clear. No concerns.
✅ Submodule Fast-Forward Check Results: check based on commit 66bd252 (PR #1059). Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward). All submodule changes look good! ✨
✅ Submodule Fast-Forward Check Results: check based on commit 32eaed7 (PR #1059). Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward). All submodule changes look good! ✨
✅ Submodule Fast-Forward Check Results: check based on commit 26f4f4d (PR #1059). Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward). All submodule changes look good! ✨
✅ Submodule Fast-Forward Check Results: check based on commit b1847d1 (PR #1059). Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward). All submodule changes look good! ✨
Actionable comments posted: 1
🧹 Nitpick comments (3)
3rdparty/Megatron-Bridge-workspace/setup.py (3)
48-48: Don't ship megatron-core's dev extras in runtime deps. Pulling dev drags test/build tooling into user installs and increases resolver churn. Keep mlm only; expose dev via dedicated extras or docs.
```diff
-    "megatron-core[dev,mlm]>=0.14.0a0,<0.16.0",
+    "megatron-core[mlm]>=0.14.0a0,<0.16.0",
```
If the submodule's pyproject currently includes [dev,mlm], update it in tandem so the consistency check still passes.
16-17: Guard toml parsing for Python <3.11 and avoid a hard fail when no parser is available. tomllib is 3.11+. On older Pythons this will crash before we can even decide to skip the check. Add a tomli fallback and skip the consistency check if neither is present.
```diff
-import tomllib
+try:
+    import tomllib  # py311+
+    _HAVE_TOMLLIB = True
+except ModuleNotFoundError:
+    try:
+        import tomli as tomllib  # py<=3.10 fallback
+        _HAVE_TOMLLIB = True
+    except ModuleNotFoundError:
+        _HAVE_TOMLLIB = False
@@
-if os.path.exists(bridge_src_dir):
+if os.path.exists(bridge_src_dir) and _HAVE_TOMLLIB:
@@
-    with open(pyproject_path, "rb") as f:
-        data = tomllib.load(f)
+    with open(pyproject_path, "rb") as f:
+        data = tomllib.load(f)
@@
-    else:
+    else:
         print(
             "[megatron-bridge][setup] Dependency sets are consistent with the submodule pyproject.",
             file=sys.stderr,
         )
+elif os.path.exists(bridge_src_dir):
+    print(
+        "[megatron-bridge][setup] tomllib/tomli not available; skipping dependency consistency check.",
+        file=sys.stderr,
+    )
```
If you prefer enforcing the check, add a conditional dep instead: tomli>=2.0.1; python_version<'3.11' to both this file and the submodule pyproject so the set comparison still matches.
Also applies to: 55-65, 72-101
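The guarded-import pattern can be tried standalone (`load_project_deps` is a hypothetical helper for illustration; tomli is assumed only as a fallback on Python <=3.10):

```python
try:
    import tomllib  # stdlib on Python 3.11+
    HAVE_TOML = True
except ModuleNotFoundError:
    try:
        import tomli as tomllib  # common backport for Python <= 3.10
        HAVE_TOML = True
    except ModuleNotFoundError:
        tomllib = None
        HAVE_TOML = False

def load_project_deps(pyproject_text):
    """Return [project].dependencies, or None when no TOML parser is available."""
    if not HAVE_TOML:
        return None  # caller can then skip the consistency check
    data = tomllib.loads(pyproject_text)
    return data.get("project", {}).get("dependencies", [])

sample = '[project]\nname = "demo"\ndependencies = ["filelock>=3.12"]\n'
```

On 3.11+, `load_project_deps(sample)` yields the dependency list; on older Pythons without tomli it degrades to None instead of crashing at import time.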
52-53: Pin a minimal filelock version (and ensure the submodule pyproject matches). An unpinned filelock can resolve very old releases; set a floor for reproducibility and to ensure modern APIs.
```diff
-    "filelock",
+    "filelock>=3.12",
```
Confirm the same constraint exists in Megatron-Bridge/pyproject.toml to keep the consistency check passing.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge (1 hunks)
- 3rdparty/Megatron-Bridge-workspace/setup.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge
🔇 Additional comments (1)
3rdparty/Megatron-Bridge-workspace/setup.py (1)
47-49: Verify megatron-core version range. The install_requires pin in setup.py (megatron-core[dev,mlm]>=0.14.0a0,<0.16.0) must cover the APIs introduced after commit d07e4e5; if your changes rely on 0.15.x functionality, tighten it to >=0.15.0,<0.16.0.
✅ Submodule Fast-Forward Check Results: check based on commit 2999f2b (PR #1059). Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward). All submodule changes look good! ✨
…#1059) Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> Signed-off-by: Terry Kong <terryk@nvidia.com> Co-authored-by: Ahmad Kiswani <kiswani.ahmad@gmail.com> Co-authored-by: Terry Kong <terryk@nvidia.com>
What does this PR do ?
This PR provides deepseek support using Megatron-Bridge. It also adds CP support for deepseek by using the latest mcore. This PR also includes an update to the latest mcore (taken from #1042).
Closes #1043
Closes #821
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
New Features
Chores
Tests