Merged
5 changes: 5 additions & 0 deletions docs/model-quirks.md
Original file line number Diff line number Diff line change
@@ -33,6 +33,11 @@ NeMo-RL uses the vLLM V1 runtime for both synchronous and asynchronous inference

- NeMo-RL implements this feature on top of the torch CP [implementation](https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/experimental/_attention.py) and inherits its limitations.
Whether a model supports CP depends only on the arguments passed to `torch.nn.functional.scaled_dot_product_attention`. NeMo-RL currently passes an all-ones attention mask to `model.forward`. Gemma-3 does not ignore the attention mask, so `attn_bias` is not `None`, which torch CP does not support; see this [assertion](https://github.com/pytorch/pytorch/blob/134179474539648ba7dee1317959529fbd0e7f89/torch/distributed/tensor/experimental/_attention.py#L262).
- Context parallelism cannot be used together with sequence packing: sequence packing requires `attn_implementation="flash_attention_2"`, which conflicts with context parallelism's requirement for the SDPA implementation. Refer to [here](https://github.com/huggingface/transformers/blob/bda75b4011239d065de84aa3e744b67ebfa7b245/src/transformers/modeling_utils.py#L2317) for more details.


- It is a known issue that context parallelism cannot be used together with sequence parallelism.
Refer to [here](https://github.com/NVIDIA-NeMo/RL/issues/659) for more details.

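To see why the all-ones mask is the culprit in the Gemma-3 quirk above: a mask of all ones ("attend everywhere") is mathematically a no-op, so most models drop it and call SDPA with `attn_mask=None`, while Gemma-3 forwards it and trips torch CP's `attn_bias is None` assertion. A minimal NumPy sketch (an illustration of the equivalence, not NeMo-RL or PyTorch code):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sdpa(q, k, v, mask=None):
    # Plain scaled-dot-product attention; `mask` marks allowed positions
    # with 1 (bias 0) and blocked positions with 0 (bias -inf).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = scores + np.where(mask == 1, 0.0, -np.inf)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))

no_mask = sdpa(q, k, v)                     # mask omitted entirely
ones_mask = sdpa(q, k, v, np.ones((4, 4)))  # all-ones "attend everywhere" mask

# Identical outputs: the all-ones mask changes nothing numerically,
# but passing it at all is what torch CP's assertion rejects.
assert np.allclose(no_mask, ones_mask)
```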
3 changes: 3 additions & 0 deletions examples/configs/dpo.yaml
@@ -57,6 +57,9 @@ policy:
dynamic_batching:
enabled: false

sequence_packing:
enabled: false

# makes the training sequence length divisible by the tensor parallel size
# this is useful for sequence parallel training
make_sequence_length_divisible_by: ${policy.dtensor_cfg.tensor_parallel_size}
3 changes: 3 additions & 0 deletions examples/configs/grpo-deepscaler-1.5b-8K.yaml
@@ -55,6 +55,9 @@ policy:
dynamic_batching:
enabled: False

sequence_packing:
enabled: False

# makes the training sequence length divisible by the tensor parallel size
# this is useful for sequence parallel training
make_sequence_length_divisible_by: ${policy.dtensor_cfg.tensor_parallel_size}
3 changes: 3 additions & 0 deletions examples/configs/grpo_deepscaler-1.5b-24K.yaml
@@ -21,6 +21,9 @@ policy:
dynamic_batching:
enabled: False

sequence_packing:
enabled: False

optimizer:
name: "torch.optim.AdamW"
kwargs:
10 changes: 10 additions & 0 deletions examples/configs/grpo_math_1B.yaml
@@ -51,16 +51,26 @@ policy:
tensor_parallel_size: 1
context_parallel_size: 1
custom_parallel_plan: null

megatron_cfg:
enabled: false

# dynamic_batching improves performance by ensuring logprob and training microbatches
# have a sufficent number of tokens to maximize GPU utilization. Specifically, variable length
# responses are sorted by sequence length and bucketed into microbatches with a total
# amount of tokens is approximately close to 'train_mb_tokens' and 'logprob_mb_tokens' for the
# training and logprob stages respectively.
dynamic_batching:
enabled: False
train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
sequence_length_round: 64

sequence_packing:
enabled: True
train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
algorithm: "modified_first_fit_decreasing"
sequence_length_round: 64

# makes the training sequence length divisible by the tensor parallel size
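The `sequence_packing` block in this hunk names `modified_first_fit_decreasing` as its packing algorithm. A plain first-fit-decreasing sketch (a simplified illustration of the general technique, not NeMo-RL's actual `modified_first_fit_decreasing` implementation) packs variable-length sequences into bins capped at a token budget such as `train_mb_tokens`:

```python
def first_fit_decreasing(seq_lens, cap):
    """Pack sequence lengths into bins whose token totals stay <= cap.

    Sequences are sorted longest-first, then each is placed into the first
    bin with room; a new bin is opened when none fits.
    """
    bins, totals = [], []
    for length in sorted(seq_lens, reverse=True):
        for i, total in enumerate(totals):
            if total + length <= cap:
                bins[i].append(length)
                totals[i] += length
                break
        else:
            bins.append([length])
            totals.append(length)
    return bins

# e.g. a 1024-token budget per packed microbatch:
packed = first_fit_decreasing([900, 500, 400, 300, 120, 100], cap=1024)
print(packed)  # -> [[900, 120], [500, 400, 100], [300]]
```

Sorting longest-first keeps the large sequences from being stranded in nearly-full bins, which is why first-fit-decreasing typically wastes far fewer padding tokens than packing in arrival order.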
9 changes: 7 additions & 2 deletions examples/configs/grpo_math_1B_megatron.yaml
@@ -49,14 +49,19 @@ policy:
# responses are sorted by sequence length and bucketed into microbatches with a total
# amount of tokens is approximately close to 'train_mb_tokens' and 'logprob_mb_tokens' for the
# training and logprob stages respectively.
#
# We disable it for Megatron as it is incompatible with Pipeline parallelism. Instead, we use sequence packing
dynamic_batching:
enabled: False
train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
sequence_length_round: 64

sequence_packing:
enabled: False # coming soon
enabled: True
train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
algorithm: "modified_ffd"
algorithm: "modified_first_fit_decreasing"
sequence_length_round: 64

max_grad_norm: 1.0
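Dynamic batching as described in the comments in this hunk (sort variable-length responses by length, then bucket them so each microbatch holds roughly `train_mb_tokens`) can be sketched as follows. This is an illustration of the idea, not the NeMo-RL implementation; rounding of sequence lengths (`sequence_length_round`) is left out:

```python
def dynamic_batches(seq_lens, mb_tokens):
    """Sort lengths, then greedily cut consecutive buckets of ~mb_tokens."""
    buckets, current, total = [], [], 0
    for length in sorted(seq_lens):
        # Close the current bucket once adding this sequence would exceed
        # the token budget.
        if current and total + length > mb_tokens:
            buckets.append(current)
            current, total = [], 0
        current.append(length)
        total += length
    if current:
        buckets.append(current)
    return buckets

# Similar lengths end up together, so padding inside each bucket is small:
b = dynamic_batches([120, 900, 300, 500, 100, 400], mb_tokens=1024)
print(b)  # -> [[100, 120, 300, 400], [500], [900]]
```

Because buckets are cut from a length-sorted stream, each microbatch contains sequences of similar length, which is what keeps the per-microbatch token count close to the budget without heavy padding.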
2 changes: 1 addition & 1 deletion examples/configs/grpo_math_8B_megatron.yaml
@@ -72,4 +72,4 @@ policy:

cluster:
gpus_per_node: 8
num_nodes: 1
num_nodes: 1
@@ -43,7 +43,10 @@ policy:
custom_parallel_plan: null

dynamic_batching:
enabled: False
enabled: false

sequence_packing:
enabled: false

make_sequence_length_divisible_by: ${policy.dtensor_cfg.tensor_parallel_size}
max_grad_norm: 1.0
@@ -43,7 +43,10 @@ policy:
custom_parallel_plan: null

dynamic_batching:
enabled: False
enabled: false

sequence_packing:
enabled: false

make_sequence_length_divisible_by: ${policy.dtensor_cfg.tensor_parallel_size}
max_grad_norm: 1.0
@@ -37,7 +37,10 @@ policy:
enabled: false

dynamic_batching:
enabled: False
enabled: false

sequence_packing:
enabled: false

make_sequence_length_divisible_by: ${policy.megatron_cfg.tensor_model_parallel_size}
max_grad_norm: 1.0
@@ -37,7 +37,10 @@ policy:
enabled: false

dynamic_batching:
enabled: False
enabled: false

sequence_packing:
enabled: false

make_sequence_length_divisible_by: ${policy.megatron_cfg.tensor_model_parallel_size}
max_grad_norm: 1.0
@@ -44,7 +44,10 @@ policy:
custom_parallel_plan: null

dynamic_batching:
enabled: False
enabled: false

sequence_packing:
enabled: false

make_sequence_length_divisible_by: ${policy.dtensor_cfg.tensor_parallel_size}
max_grad_norm: 1.0
@@ -49,6 +49,8 @@ policy:
train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
sequence_length_round: 64
sequence_packing:
enabled: false
make_sequence_length_divisible_by: 1
max_grad_norm: 1
optimizer:
@@ -50,6 +50,8 @@ policy:
train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
sequence_length_round: 64
sequence_packing:
enabled: false
make_sequence_length_divisible_by: 8
max_grad_norm: 1
optimizer:
@@ -49,6 +49,8 @@ policy:
train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
sequence_length_round: 64
sequence_packing:
enabled: false
make_sequence_length_divisible_by: 1
max_grad_norm: 1
optimizer:
@@ -49,6 +49,8 @@ policy:
train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
sequence_length_round: 64
sequence_packing:
enabled: false
make_sequence_length_divisible_by: 1
max_grad_norm: 1
optimizer:
@@ -49,6 +49,8 @@ policy:
train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
sequence_length_round: 64
sequence_packing:
enabled: false
make_sequence_length_divisible_by: 8
max_grad_norm: 1
optimizer:
@@ -49,6 +49,8 @@ policy:
train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
sequence_length_round: 64
sequence_packing:
enabled: false
make_sequence_length_divisible_by: 8
max_grad_norm: 1
optimizer:
@@ -49,6 +49,8 @@ policy:
train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
sequence_length_round: 64
sequence_packing:
enabled: false
make_sequence_length_divisible_by: 4
max_grad_norm: 1
optimizer:
@@ -49,6 +49,8 @@ policy:
train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
sequence_length_round: 64
sequence_packing:
enabled: false
make_sequence_length_divisible_by: 1
max_grad_norm: 1
optimizer:
@@ -34,7 +34,9 @@ policy:
context_parallel_size: 1
custom_parallel_plan: null
dynamic_batching:
enabled: False
enabled: false
sequence_packing:
enabled: false
make_sequence_length_divisible_by: 1
max_grad_norm: 1
optimizer:
@@ -34,7 +34,9 @@ policy:
context_parallel_size: 1
custom_parallel_plan: null
dynamic_batching:
enabled: False
enabled: false
sequence_packing:
enabled: false
make_sequence_length_divisible_by: 2
max_grad_norm: 1
optimizer:
@@ -28,7 +28,9 @@ policy:
dtensor_cfg:
enabled: false
dynamic_batching:
enabled: False
enabled: false
sequence_packing:
enabled: false
make_sequence_length_divisible_by: ${policy.megatron_cfg.tensor_model_parallel_size}
max_grad_norm: 1
optimizer: null
@@ -34,7 +34,9 @@ policy:
context_parallel_size: 1
custom_parallel_plan: null
dynamic_batching:
enabled: False
enabled: false
sequence_packing:
enabled: false
make_sequence_length_divisible_by: 1
max_grad_norm: 1
optimizer:
@@ -34,7 +34,9 @@ policy:
context_parallel_size: 1
custom_parallel_plan: null
dynamic_batching:
enabled: False
enabled: false
sequence_packing:
enabled: false
make_sequence_length_divisible_by: 8
max_grad_norm: 1
optimizer:
8 changes: 7 additions & 1 deletion examples/configs/sft.yaml
@@ -44,6 +44,12 @@ policy:
dynamic_batching:
enabled: false

sequence_packing:
enabled: False
train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
algorithm: "modified_first_fit_decreasing"
sequence_length_round: 64

# makes the training sequence length divisible by the tensor parallel size
# this is useful for sequence parallel training
make_sequence_length_divisible_by: ${policy.dtensor_cfg.tensor_parallel_size}
@@ -121,7 +127,7 @@ policy:
average_in_collective: true
data_parallel_sharding_strategy: "optim_grads_params"


data:
max_input_seq_length: ${policy.max_total_sequence_length}
dataset_name: "squad"
3 changes: 3 additions & 0 deletions examples/configs/sft_openmathinstruct2.yaml
@@ -40,6 +40,9 @@ policy:
dynamic_batching:
enabled: false

sequence_packing:
enabled: false

# makes the training sequence length divisible by the tensor parallel size
# this is useful for sequence parallel training
make_sequence_length_divisible_by: ${policy.dtensor_cfg.tensor_parallel_size}
2 changes: 2 additions & 0 deletions examples/run_sft.py
@@ -31,6 +31,8 @@
from nemo_rl.utils.config import load_config, parse_hydra_overrides
from nemo_rl.utils.logger import get_next_experiment_dir

OmegaConf.register_new_resolver("mul", lambda a, b: a * b)


def parse_args():
"""Parse command line arguments."""
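The `mul` resolver added to run_sft.py above is what makes interpolations like `${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}` in the YAML configs resolve to a product. A minimal stand-in sketch of the resolver mechanism (illustration only; the real registration goes through `OmegaConf.register_new_resolver`, and the 4096/4 values here are made-up example config values):

```python
# Map resolver names to callables, the same shape as the lambda registered
# in run_sft.py: OmegaConf.register_new_resolver("mul", lambda a, b: a * b).
resolvers = {"mul": lambda a, b: a * b}

def resolve(name, *args):
    """Evaluate a ${name:arg1,arg2} interpolation once its args are known."""
    return resolvers[name](*args)

# How ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
# is evaluated, e.g. with max_total_sequence_length=4096 and micro batch size 4:
train_mb_tokens = resolve("mul", 4096, 4)
print(train_mb_tokens)  # -> 16384
```

This is why `train_mb_tokens` and `logprob_mb_tokens` in the packing and dynamic-batching blocks track the sequence length and batch size automatically instead of being hard-coded.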