fix: memory optimizations for Nemotron12B 12k seqlen DPO training #926
Merged (28 commits)
Commits (all by ybgao-nvidia):

- 7929d39 memory optimizations for Nemotron12B 12k seqlen DPO training
- 02bee2a implement suggested changes
- fb8c1bb add copyright
- 87f858b make lint pass
- b637b03 Merge branch 'main' into ybgao/aug13-dpo-12k-memory
- 00d5d9a update configuration key and README
- 9be4a2c fix allocator setting
- 63a82de update readme and lint
- a2cdc5a disable expandable segments by default
- 1a38935 Merge branch 'main' into ybgao/aug13-dpo-12k-memory
- 34124c8 Merge branch 'main' into ybgao/aug13-dpo-12k-memory
- d3c9ad7 Update README.md
- ae128cc remove configure_expandable_segments
- caaa87f Update README.md
- 60e2909 fix config schema
- 0b2164a Merge branch 'main' into ybgao/aug13-dpo-12k-memory
- 35918c7 add test script
- a94b4fb Merge branch 'main' into ybgao/aug13-dpo-12k-memory
- e1d0447 will tests pass now?
- 51a2607 make tests pass
- e7f2b26 Merge branch 'main' into ybgao/aug13-dpo-12k-memory
- 28cae68 include field in logger config
- 5562dac Merge branch 'main' into ybgao/aug13-dpo-12k-memory
- 38aca9c remove expandable segments from v2
- 189868b please pass :(
- 7573f6d Merge branch 'main' into ybgao/aug13-dpo-12k-memory
- 3271a08 empty cache
- b97abd2 Merge branch 'main' into ybgao/aug13-dpo-12k-memory
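Several of the commits above touch the CUDA caching allocator ("fix allocator setting", "disable expandable segments by default", "empty cache"). As a hedged illustration of why such settings end up in a recipe's `env_vars` rather than in training code: `PYTORCH_CUDA_ALLOC_CONF` only takes effect if it is set before the CUDA context is created. The `alloc_conf` helper below is illustrative, not part of NeMo RL.

```python
import os

def alloc_conf(**options):
    """Render allocator options into the comma-separated key:value
    format that PyTorch's caching allocator parses."""
    return ",".join(f"{key}:{value}" for key, value in options.items())

# Must happen before any CUDA context exists (i.e. before torch touches
# the GPU), which is why recipes inject it via worker env_vars instead.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = alloc_conf(max_split_size_mb=64)
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])  # max_split_size_mb:64
```

Capping `max_split_size_mb` limits how large a cached block the allocator will split, which can reduce fragmentation-driven OOMs at long sequence lengths at some throughput cost.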
examples/configs/recipes/llm/dpo-mistral-nemo-instruct-2407-1n8g-fsdp2tp8-actckpt-long.yaml (106 additions, 0 deletions)
```yaml
# DPO Algorithm Configuration
dpo:
  max_num_epochs: 1
  max_num_steps: 100
  val_period: 10
  val_batches: 1
  val_global_batch_size: 16
  val_micro_batch_size: 1
  val_at_start: true
  seed: 42

  reference_policy_kl_penalty: 0.1
  preference_average_log_probs: False # whether normalizing log probs according to the sequence length in preference_loss
  sft_average_log_probs: ${.preference_average_log_probs} # whether normalizing log probs according to the sequence length in sft_loss

  preference_loss_weight: 1 # the coefficient of the preference loss
  sft_loss_weight: 0 # the coefficient of the SFT loss

checkpointing:
  enabled: true
  checkpoint_dir: "results/dpo-mistral-nemo-instruct-2407-1n8g-fsdp2tp8-actckpt-long"
  metric_name: "val_loss"
  higher_is_better: false
  keep_top_k: null
  save_period: 50
  checkpoint_must_save_by: null

policy:
  model_name: "mistralai/Mistral-Nemo-Instruct-2407"
  tokenizer:
    name: ${policy.model_name}

  # number of preference samples per batch
  # each preference sample corresponds to a pair of chosen and rejected responses
  # so the actual batch size processed by the model is train_global_batch_size * 2
  train_global_batch_size: 8
  train_micro_batch_size: 1

  #logprob_batch_size: ${policy.train_micro_batch_size}
  max_total_sequence_length: 12288
  precision: "bfloat16"

  dtensor_cfg:
    enabled: true
    cpu_offload: false
    sequence_parallel: false
    activation_checkpointing: true
    tensor_parallel_size: 8
    context_parallel_size: 1
    custom_parallel_plan: null
    env_vars:
      PYTORCH_CUDA_ALLOC_CONF: "max_split_size_mb:64"

  dynamic_batching:
    enabled: false

  sequence_packing:
    enabled: false

  # makes the training sequence length divisible by the tensor parallel size
  # this is useful for sequence parallel training
  make_sequence_length_divisible_by: ${policy.dtensor_cfg.tensor_parallel_size}
  max_grad_norm: 1.0

  optimizer:
    name: "torch.optim.AdamW"
    kwargs:
      lr: 1.0e-6
      weight_decay: 0.01
      betas: [0.9, 0.999]
      eps: 1e-8
      # when using Dtensor, we need to set foreach
      # and fused to False
      foreach: False
      fused: False

  scheduler:
    - name: "torch.optim.lr_scheduler.ConstantLR"
      kwargs:
        factor: 1.0
        total_iters: 10000000000
    - milestones: []

data:
  dataset_name: "HelpSteer3"
  shuffle: False
  max_input_seq_length: ${policy.max_total_sequence_length}

logger:
  log_dir: "logs/dpo-mistral-nemo-instruct-2407-1n8g-fsdp2tp8-actckpt-long" # Base directory for all logs
  wandb_enabled: false # Make sure you do a ``wandb login [Your API key]'' before running
  tensorboard_enabled: false
  mlflow_enabled: false
  monitor_gpus: true # If true, will monitor GPU usage and log to wandb and/or tensorboard
  num_val_samples_to_print: 0 # Number of validation samples to pretty print on terminal
  wandb:
    project: "nemo-rl"
    name: "dpo-mistral-nemo-instruct-2407-1n8g-fsdp2tp8-actckpt-long"
  gpu_monitoring:
    collection_interval: 10 # How often to collect GPU usage metrics (in seconds)
    flush_interval: 10 # How often to flush GPU usage metrics to the loggers (in seconds)

cluster:
  gpus_per_node: 8
  num_nodes: 1
```
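The config comments note that each preference sample is a chosen/rejected pair, so with `train_global_batch_size: 8` the model actually processes 16 sequences per step. A minimal sketch of that flattening, with an illustrative helper name (`flatten_preference_batch` is not NeMo RL's API):

```python
def flatten_preference_batch(preference_samples):
    """Interleave each pair's chosen and rejected responses into one
    flat list, doubling the batch the model forward pass sees."""
    flat = []
    for sample in preference_samples:
        flat.append(sample["chosen"])
        flat.append(sample["rejected"])
    return flat

# 8 preference samples, as in train_global_batch_size: 8 above
batch = [{"chosen": f"c{i}", "rejected": f"r{i}"} for i in range(8)]
model_inputs = flatten_preference_batch(batch)
print(len(model_inputs))  # 16
```

This doubling is why memory pressure at 12k sequence length is felt at what looks like a small batch size: each micro-batch of 1 preference sample still runs two 12288-token sequences through the policy (and the reference model).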
tests/test_suites/llm/dpo-mistral-nemo-instruct-2407-1n8g-fsdp2tp8-actckpt-long.sh (40 additions, 0 deletions)
```bash
#!/bin/bash
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
source $SCRIPT_DIR/common.env

# ===== BEGIN CONFIG =====
NUM_NODES=1
STEPS_PER_RUN=100
MAX_STEPS=100
NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))  # Round up
NUM_MINUTES=45
# ===== END CONFIG =====

exit_if_max_steps_reached

# Run the experiment
cd $PROJECT_ROOT
uv run examples/run_dpo.py \
    --config $CONFIG_PATH \
    dpo.max_num_steps=$MAX_STEPS \
    logger.log_dir=$LOG_DIR \
    logger.wandb_enabled=True \
    logger.wandb.project=nemo-rl \
    logger.wandb.name=$EXP_NAME \
    logger.monitor_gpus=True \
    logger.tensorboard_enabled=True \
    checkpointing.enabled=True \
    checkpointing.checkpoint_dir=$CKPT_DIR \
    $@ \
    2>&1 | tee $RUN_LOG

# Convert tensorboard logs to json
uv run tests/json_dump_tb_logs.py $LOG_DIR --output_path $JSON_METRICS

# Only run metrics if the target step is reached
if [[ $(jq 'to_entries | .[] | select(.key == "train/loss") | .value | keys | map(tonumber) | max' $JSON_METRICS) -ge $MAX_STEPS ]]; then
    uv run tests/check_metrics.py $JSON_METRICS \
        'data["train/loss"]["1"] > 0.6990' \
        'data["train/loss"]["1"] < 0.6992' \
        'data["train/loss"]["100"] < 0.60'
fi
```
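The `jq` gate in the test script only runs `check_metrics.py` once `train/loss` has been logged at or past `MAX_STEPS`. A hedged Python equivalent of that check; the metrics layout (`{metric_name: {step_as_string: value}}`) is inferred from the `jq` expression and the `data[...]["1"]` assertions, not confirmed against `json_dump_tb_logs.py`:

```python
import json

def max_recorded_step(metrics, metric="train/loss"):
    """Highest step at which `metric` was logged; mirrors
    jq's `keys | map(tonumber) | max`."""
    return max(int(step) for step in metrics[metric])

# e.g. metrics = json.load(open(json_metrics_path))
metrics = {"train/loss": {"1": 0.6991, "50": 0.63, "100": 0.55}}
if max_recorded_step(metrics) >= 100:
    print("target step reached; run metric checks")
```

Gating on the recorded step keeps a partially completed (e.g. preempted) run from failing the loss-threshold assertions spuriously.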