Add refactored recipe files for pretrain configs of LLMs #2067
Merged
28 commits
687538d Add refactored recipe for qwen2, qwen3 (athitten)
c058e98 Add refactored recipe for qwen3_moe, qwen3_next (athitten)
607c76e Add refactored recipe for llama2, llama3 (athitten)
3aee7d6 Add deepseek_v2, deepseek_v3 refactored recipe (athitten)
98875bc Add refactored recipe for gemma2, gemma3 (athitten)
5a639b1 Add refactored recipe for glm45 (athitten)
7bd57e5 Add GPT OSS refactored recipe (athitten)
e56fbf5 Add kimi k2 refactored recipe (athitten)
127efed Add refactored recipe for moonlight_16b (athitten)
7fd9fd8 Add refactored recipe nemotron nano v2 (athitten)
ea7120a Add refactored recipe for Nemotron-H (athitten)
b571a30 Add OLMoE 7B refactored config (athitten)
6c05491 Add refactored recipe for GPT3 175B (athitten)
59ede93 Add new pretrain_configs to the recipes and remove *_new.py recipes (athitten)
d3d44b7 Fix lint errors (athitten)
68bce75 Add missing _model_config and remove comments (athitten)
30b1680 Add dataset comment (athitten)
e40cd81 Remove unwanted comments (athitten)
5376714 Fix lint error (athitten)
84b4026 Fix recipe tests test_gpt3_175b.py, test_kimi_k2.py (athitten)
16e074b Fix more tests (athitten)
b48c1a5 Fix nemotron recipe unit tests (athitten)
6aea639 Add refactored recipe for nemotron nanov3 and update test file (athitten)
4c6be48 Fix functional tests (athitten)
63e7311 fix lint check error (athitten)
54548b5 Fix functional tests CI errors (athitten)
7c9f47d Fix for test_qat_workflow.py (athitten)
ad67f81 Add lint check fix (athitten)
@@ -0,0 +1,125 @@

```python
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

from megatron.core.distributed import DistributedDataParallelConfig

from megatron.bridge.recipes.utils.optimizer_utils import distributed_fused_adam_with_cosine_annealing
from megatron.bridge.training.config import (
    CheckpointConfig,
    ConfigContainer,
    DistributedInitConfig,
    GPTDatasetConfig,
    LoggerConfig,
    RNGConfig,
    TokenizerConfig,
    TrainingConfig,
)


def _pretrain_common() -> ConfigContainer:
    """Create a base pre-training ConfigContainer with common defaults for any language model.

    This function returns a ConfigContainer template with sensible defaults.
    The caller MUST set `cfg.model` and `cfg.tokenizer.tokenizer_model` before use.

    Returns:
        ConfigContainer: Base configuration template for pre-training.
    """
    # Default output directories
    base_output_dir = os.path.join(os.getcwd(), "nemo_experiments")
    run_output_dir = os.path.join(base_output_dir, "default")
    checkpoint_dir = os.path.join(run_output_dir, "checkpoints")
    tensorboard_dir = os.path.join(run_output_dir, "tb_logs")

    # Default optimizer and scheduler
    opt_cfg, scheduler_cfg = distributed_fused_adam_with_cosine_annealing(
        lr_warmup_iters=500,
        lr_decay_iters=None,  # Defaults to train_iters during validation
        max_lr=3e-4,
        min_lr=3e-5,
    )

    cfg = ConfigContainer(
        # Model - MUST be set by each recipe before use
        model=None,  # type: ignore[arg-type]
        # Training config
        train=TrainingConfig(
            train_iters=300000,
            eval_interval=500,
            eval_iters=32,
            global_batch_size=32,
            micro_batch_size=2,
            manual_gc=True,
            manual_gc_interval=100,
            manual_gc_eval=100,
        ),
        # Optimizer and scheduler
        optimizer=opt_cfg,
        scheduler=scheduler_cfg,
        # DDP config - these are the commonly overridden settings
        ddp=DistributedDataParallelConfig(
            check_for_nan_in_grad=True,
            grad_reduce_in_fp32=True,
            overlap_grad_reduce=True,
            overlap_param_gather=True,
            average_in_collective=True,
            data_parallel_sharding_strategy="optim_grads_params",
            use_distributed_optimizer=True,
        ),
        # Dataset config - uses mock data by default
        dataset=GPTDatasetConfig(
            random_seed=1234,
            reset_attention_mask=False,
            reset_position_ids=False,
            eod_mask_loss=False,
            seq_length=4096,
            num_dataset_builder_threads=1,
            blend=None,  # Mock data mode
            blend_per_split=None,
            split="9999,8,2",
            data_sharding=True,
            dataloader_type="single",
            skip_getting_attention_mask_from_dataset=True,
        ),
        # Logger config
        logger=LoggerConfig(
            log_interval=10,
            tensorboard_dir=tensorboard_dir,
            log_timers_to_tensorboard=True,
        ),
        # Tokenizer - placeholder, each recipe should set tokenizer_model
        tokenizer=TokenizerConfig(
            tokenizer_type="HuggingFaceTokenizer",
            tokenizer_model=None,  # Must be set by each recipe
        ),
        # Checkpoint config
        checkpoint=CheckpointConfig(
            save_interval=500,
            save=checkpoint_dir,
            load=checkpoint_dir,
            ckpt_format="torch_dist",
            fully_parallel_save=True,
        ),
        # RNG config
        rng=RNGConfig(seed=1234),
        # Distributed init config
        dist=DistributedInitConfig(),
        comm_overlap=None,
        # Mixed precision - bf16 by default
        mixed_precision="bf16_mixed",
    )

    return cfg
```
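The docstring of `_pretrain_common()` says the caller must set `cfg.model` and `cfg.tokenizer.tokenizer_model` before use. The sketch below illustrates that template-then-override pattern with small stand-in dataclasses so it runs without Megatron installed. Everything in it (`ConfigStub`, `TrainStub`, `TokenizerStub`, `pretrain_common_stub`, `example_recipe`, `check_required_fields`, the model dictionary and the tokenizer name) is hypothetical and is not part of Megatron Bridge or of this PR.

```python
from dataclasses import dataclass, field
from typing import Optional


# Hypothetical stand-ins for the real config classes; only the fields
# needed for this illustration are modelled.
@dataclass
class TokenizerStub:
    tokenizer_type: str = "HuggingFaceTokenizer"
    tokenizer_model: Optional[str] = None  # placeholder, set by each recipe


@dataclass
class TrainStub:
    global_batch_size: int = 32
    micro_batch_size: int = 2


@dataclass
class ConfigStub:
    model: Optional[object] = None  # placeholder, set by each recipe
    train: TrainStub = field(default_factory=TrainStub)
    tokenizer: TokenizerStub = field(default_factory=TokenizerStub)


def pretrain_common_stub() -> ConfigStub:
    """Return a fresh template whose required fields are still unset."""
    return ConfigStub()


def example_recipe() -> ConfigStub:
    """A hypothetical recipe: start from the template, then fill in the rest."""
    cfg = pretrain_common_stub()
    cfg.model = {"num_layers": 2, "hidden_size": 64}  # made-up model settings
    cfg.tokenizer.tokenizer_model = "example-org/example-tokenizer"  # made-up name
    cfg.train.global_batch_size = 64  # recipe-specific override of a common default
    return cfg


def check_required_fields(cfg: ConfigStub) -> None:
    """Raise ValueError if a recipe forgot to fill in a required placeholder."""
    missing = []
    if cfg.model is None:
        missing.append("model")
    if cfg.tokenizer.tokenizer_model is None:
        missing.append("tokenizer.tokenizer_model")
    if missing:
        raise ValueError("unset required fields: " + ", ".join(missing))
```

Calling `check_required_fields(example_recipe())` passes, while calling it on the bare template raises `ValueError`. A check of this kind is one possible way to catch a recipe that skipped a placeholder; whether the real `ConfigContainer` performs such validation is not shown in this diff.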
@yaoyu-33 won't this be an issue since users now have to re-apply other model configs set as default in the recipe?
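One possible reading of this concern, sketched with a hypothetical `ModelStub` dataclass (not a Megatron Bridge class): if a recipe applies its own model defaults and a user then builds a fresh model config instead of modifying the one the recipe returned, the recipe's defaults are lost and have to be set again. The field names and values below are made up for illustration.

```python
from dataclasses import dataclass, replace


@dataclass
class ModelStub:
    # Class-level defaults of the hypothetical model config.
    num_layers: int = 2
    tensor_model_parallel_size: int = 1
    seq_length: int = 2048


def recipe_model_stub() -> ModelStub:
    """Defaults a hypothetical recipe applies on top of the class defaults."""
    return ModelStub(num_layers=32, tensor_model_parallel_size=4, seq_length=4096)


# Option 1: mutate the recipe's object in place; other recipe defaults survive.
in_place = recipe_model_stub()
in_place.num_layers = 16

# Option 2: copy with dataclasses.replace; other recipe defaults also survive.
copied = replace(recipe_model_stub(), num_layers=16)

# Option 3: build a fresh object; the recipe's defaults are gone and would
# have to be re-applied by hand.
fresh = ModelStub(num_layers=16)
```

In this sketch `in_place` and `copied` keep `tensor_model_parallel_size=4` and `seq_length=4096`, while `fresh` falls back to the class defaults.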