Add moonlight 16B recipe #1133

Merged: suiyoubi merged 7 commits into main from aot/moonlight-recipe, Oct 31, 2025
@suiyoubi suiyoubi commented Oct 29, 2025

Adds the serialization handling required for the pp (pipeline-parallel) layout.

Signed-off-by: Ao Tang <aot@nvidia.com>
copy-pr-bot bot commented Oct 29, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@suiyoubi suiyoubi requested a review from ananthsub October 29, 2025 21:20
@suiyoubi (Contributor Author)

/ok to test 2426242

Comment on lines +196 to +217

    # Sanitize config for WandB by doing a JSON round-trip
    # This ensures all objects are converted to basic Python types that WandB can handle
    def safe_serialize(obj):
        """Safely convert any object to a JSON-serializable type.

        Handles objects with broken __str__ or __repr__ methods that return
        non-string types (e.g., PipelineParallelLayerLayout returns list).
        """
        try:
            # Try str() first
            result = str(obj)
            # Verify it actually returns a string
            if not isinstance(result, str):
                # __str__ returned non-string type, use type name instead
                return f"<{type(obj).__name__}>"
            return result
        except Exception:
            # __str__ raised an exception, use type name as fallback
            return f"<{type(obj).__name__}>"

    config_dict = self.cfg.to_dict()
    sanitized_config = json.loads(json.dumps(config_dict, default=safe_serialize))

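For readers who want to try the round-trip outside the trainer, here is a minimal, self-contained sketch. It is not the repository's code path: `BrokenLayout` and `Plain` are hypothetical stand-ins, and a plain dict takes the place of `self.cfg.to_dict()`. It only illustrates how `json.dumps(..., default=...)` followed by `json.loads` reduces a config to basic Python types.

```python
import json


class BrokenLayout:
    """Hypothetical stand-in for an object whose __str__ returns a non-string."""

    def __str__(self):
        return ["decoder", "decoder"]  # deliberately wrong: not a str


class Plain:
    """Hypothetical object with a well-behaved __str__."""

    def __str__(self):
        return "plain-object"


def safe_serialize(obj):
    """Fallback for json.dumps: stringify obj, or use its type name if that fails."""
    try:
        result = str(obj)
        if not isinstance(result, str):
            return f"<{type(obj).__name__}>"
        return result
    except Exception:
        # Reached when str() itself raises, e.g. because __str__ returned a list.
        return f"<{type(obj).__name__}>"


# A plain dict standing in for self.cfg.to_dict() (assumption for illustration).
config_dict = {
    "lr": 3e-4,
    "layout": BrokenLayout(),
    "other": Plain(),
    "nested": {"steps": [1, 2]},
}

# json.dumps calls safe_serialize only for values it cannot encode natively;
# json.loads then yields nothing but dicts, lists, strings and numbers.
sanitized_config = json.loads(json.dumps(config_dict, default=safe_serialize))
print(sanitized_config)
```

Values that JSON already understands pass through unchanged; only the two custom objects are replaced, one by its string form and one by a `<TypeName>` placeholder.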
Contributor

Did you see any errors with checkpoint saving as well, or was this only an issue with WandB?

Contributor Author

Checkpointing is fine; I only observed this serialization issue with WandB.

Contributor Author

The root cause is the pp layout (PipelineParallelLayerLayout), whose __str__/__repr__ returns a list.
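A minimal reproduction of that failure mode, assuming (as the docstring in the diff states) that the layout object's `__str__` returns a list. `ListStr` is a hypothetical class, not the Megatron-LM one. In Python, `str()` raises `TypeError` when `__str__` returns a non-string, and `json.dumps` without a `default` hook rejects arbitrary objects:

```python
import json


class ListStr:
    """Hypothetical class; its __str__ wrongly returns a list."""

    def __str__(self):
        return ["a", "b"]


def describe_failure(fn, *args):
    """Call fn(*args) and report the exception type name, or 'ok' on success."""
    try:
        fn(*args)
        return "ok"
    except Exception as exc:
        return type(exc).__name__


# str() enforces that __str__ returns a str and raises otherwise.
str_outcome = describe_failure(str, ListStr())

# Without a default= hook, json.dumps cannot encode a custom object at all.
json_outcome = describe_failure(json.dumps, {"layout": ListStr()})

print(str_outcome, json_outcome)
```

Both calls fail with `TypeError`, which is why a fallback serializer has to catch the exception from `str()` rather than only inspect its return value.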

Contributor

See NVIDIA/Megatron-LM#2055 for the corresponding update to Megatron-LM (MLM).

@suiyoubi suiyoubi added the r0.2.0 Cherry-pick label for r0.2.0 release branch label Oct 30, 2025
@suiyoubi suiyoubi merged commit 5541658 into main Oct 31, 2025
28 checks passed
@suiyoubi suiyoubi deleted the aot/moonlight-recipe branch October 31, 2025 13:28
chtruong814 pushed a commit that referenced this pull request Oct 31, 2025
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>