feat: VLM support via megatron backend #1115
Merged

Commits (26, all authored by yfw):

- ac36c67 Fix get_ltor_masks_and_position_ids regression
- 5ae74db Another fix
- b5adb33 ruff
- 7b4ecc6 Initial megatron vlm support
- ba03346 Fix to get sequence packing working
- 8b47c39 Upgrade to transformers==4.55.4
- 8c034be Merge remote-tracking branch 'origin' into yifu/vlm_mcore
- 7c284f0 Reorder megatron-bridge commits
- e361dda Fix megatron-bridge
- f9730c8 Support truncating+skipping long sequences for vlm
- df9d546 Fix edge case for vllm
- d9bf20d Fix typing
- 88e518c Update pyproject.toml
- ba90ff3 Merge remote-tracking branch 'origin' into yifu/vlm_mcore
- 3717c30 Merge remote-tracking branch 'origin' into yifu/vlm_mcore
- 1365d56 uv.lock and minimize cfg
- 8717101 lint
- f625913 Add smolvlm test
- 07b32b7 Copyright
- 6ad5236 Add vlm mcore nightly
- c4df614 Newline
- fcc12ee Fix tests
- 1aa3668 Fix tests
- 1677c38 Add tests
- b32531a Add test
- 4cb0a70 Set position_ids to None in multimodal case
Submodule Megatron-Bridge updated. 19 files changed in this PR.
The main new file is a 183-line GRPO configuration for VLM training on the Megatron backend:

```yaml
# GRPO Algorithm Configuration
defaults: "vlm_grpo_3B.yaml"

policy:
  model_name: "Qwen/Qwen2.5-VL-3B-Instruct"
  tokenizer:
    name: ${policy.model_name} # specify if you'd like to use a tokenizer different from the model's default
  train_global_batch_size: 128
  train_micro_batch_size: 1
  generation_batch_size: 32 # only used when generating with the HF backend
  logprob_batch_size: 4
  max_total_sequence_length: 2048
  precision: "bfloat16"

  dtensor_cfg:
    enabled: false

  # See docs/design-docs/sequence-packing-and-dynamic-batching.md
  # for more details on dynamic batching and sequence packing.
  #
  # We disable dynamic batching for Megatron as it is incompatible with
  # pipeline parallelism. Instead, we use sequence packing.
  dynamic_batching:
    enabled: False
    train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
    logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
    sequence_length_round: 64
```
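The `${mul:...}` entries are OmegaConf-style custom resolvers that multiply their arguments, so `train_mb_tokens` resolves to 2048 * 1 = 2048 here. A minimal sketch of how such a resolver can be registered, assuming plain OmegaConf (the repo's actual resolver setup may differ):

```python
# Sketch only: register a "mul" resolver with OmegaConf so that
# ${mul:a, b} interpolations evaluate to a * b at access time.
from omegaconf import OmegaConf

OmegaConf.register_new_resolver("mul", lambda a, b: a * b)

cfg = OmegaConf.create(
    """
    policy:
      max_total_sequence_length: 2048
      train_micro_batch_size: 1
      dynamic_batching:
        train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
    """
)
print(cfg.policy.dynamic_batching.train_mb_tokens)  # 2048
```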
```yaml
  sequence_packing:
    enabled: False
    train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
    logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
    algorithm: "modified_first_fit_decreasing"
    sequence_length_round: 64

  max_grad_norm: 1.0
  # makes the training sequence length divisible by the tensor parallel size;
  # this is useful for sequence parallel training
  make_sequence_length_divisible_by: ${policy.megatron_cfg.tensor_model_parallel_size}

  optimizer: null # remove default FSDP optimizer
```
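`algorithm: "modified_first_fit_decreasing"` selects the bin-packing heuristic used when sequence packing is enabled: sequences are sorted by length and greedily placed into the first microbatch with enough remaining token budget. A rough illustration of plain first-fit-decreasing (not NeMo-RL's actual implementation, whose "modified" variant may differ in tie-breaking and padding):

```python
# Illustrative first-fit-decreasing packing of sequence lengths into
# bins of `capacity` tokens.
def first_fit_decreasing(seq_lens: list[int], capacity: int) -> list[list[int]]:
    bins: list[list[int]] = []
    loads: list[int] = []
    for length in sorted(seq_lens, reverse=True):
        for i, load in enumerate(loads):
            if load + length <= capacity:  # first bin with room
                bins[i].append(length)
                loads[i] += length
                break
        else:  # no bin fits: open a new one
            bins.append([length])
            loads.append(length)
    return bins

# With train_mb_tokens = 2048 * 1 = 2048:
print(first_fit_decreasing([1800, 900, 700, 400, 300], 2048))
# [[1800], [900, 700, 400], [300]]
```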
```yaml
  megatron_cfg:
    enabled: true
    empty_unused_memory_level: 0
    activation_checkpointing: false
    converter_type: "Qwen2ForCausalLM"
    tensor_model_parallel_size: 1
    expert_tensor_parallel_size: 1
    expert_model_parallel_size: 1
    pipeline_model_parallel_size: 1
    num_layers_in_first_pipeline_stage: null
    num_layers_in_last_pipeline_stage: null
    context_parallel_size: 1
    pipeline_dtype: ${policy.precision}
    sequence_parallel: false
    freeze_moe_router: true
    moe_router_dtype: "fp64"
    moe_router_load_balancing_type: "none" # "seq_aux_loss" causes logprob error divergence for grpo
    moe_router_bias_update_rate: 0.0 # by default, disable bias updates for grpo
    moe_permute_fusion: false
    # gives ~20% training perf speedup with sequence packing
    apply_rope_fusion: True

    optimizer:
      optimizer: "adam"
      lr: 2.0e-7
      min_lr: 2.0e-7
      weight_decay: 0.01
      bf16: true
      fp16: false
      params_dtype: "float32"

      # adam
      adam_beta1: 0.9
      adam_beta2: 0.999
      adam_eps: 1e-8

      # sgd
      sgd_momentum: 0.9

      # distributed optimizer
      use_distributed_optimizer: true
      use_precision_aware_optimizer: true

      clip_grad: ${policy.max_grad_norm}

    scheduler:
      start_weight_decay: ${policy.megatron_cfg.optimizer.weight_decay}
      end_weight_decay: ${policy.megatron_cfg.optimizer.weight_decay}
      weight_decay_incr_style: "constant"
      lr_decay_style: "constant"
      lr_decay_iters: 1000
      lr_warmup_iters: 50
      lr_warmup_init: 2.0e-8

    distributed_data_parallel_config:
      grad_reduce_in_fp32: false
      overlap_grad_reduce: false
      overlap_param_gather: true
      average_in_collective: true
      use_custom_fsdp: false
      data_parallel_sharding_strategy: "optim_grads_params"
```
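With `lr_decay_style: "constant"`, the learning rate ramps linearly from `lr_warmup_init` (2e-8) to `lr` (2e-7) over the first 50 iterations and then stays flat. A hedged sketch of that schedule; Megatron's own scheduler implements the real logic and handles many more cases:

```python
# Warmup-then-constant LR schedule matching the scheduler values above.
def lr_at(step: int, lr: float = 2.0e-7,
          lr_warmup_init: float = 2.0e-8, warmup_iters: int = 50) -> float:
    if step < warmup_iters:
        # linear ramp from lr_warmup_init up to lr
        return lr_warmup_init + (lr - lr_warmup_init) * step / warmup_iters
    return lr  # constant thereafter (lr_decay_style: "constant")

print(lr_at(0))    # 2e-08
print(lr_at(25))   # 1.1e-07
print(lr_at(100))  # 2e-07
```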
```yaml
  generation:
    backend: "vllm"
    # max_new_tokens: ${policy.max_total_sequence_length}
    max_new_tokens: 1024
    temperature: 1.0
    top_p: 1.0
    top_k: null
    stop_token_ids: null
    stop_strings: null
    vllm_cfg:
      async_engine: false # Only for internal testing; will be enabled by https://github.com/NVIDIA/NeMo-RL/issues/447.
      precision: ${policy.precision}
      tensor_parallel_size: 1
      pipeline_parallel_size: 1
      enable_expert_parallel: false
      gpu_memory_utilization: 0.6
      max_model_len: ${policy.max_total_sequence_length}
      enforce_eager: False
    colocated:
      # true: generation shares training GPUs
      # false: uses dedicated generation resources
      enabled: true
      # only relevant when enabled is false
      resources:
        gpus_per_node: null # number of GPUs dedicated to generation when the cluster has one node, i.e. cluster.num_nodes == 1
        num_nodes: null # number of nodes dedicated to generation
```
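Here `temperature: 1.0` leaves the logits unscaled and `top_p: 1.0` keeps the full distribution, so sampling follows the model's probabilities unmodified. For intuition only, a toy sketch of what nucleus (top-p) filtering would do at a lower threshold; vLLM implements this internally:

```python
# Keep the smallest set of highest-probability tokens whose cumulative
# probability reaches top_p, then renormalize. Toy distribution only.
def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    kept: dict[str, float] = {}
    cum = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[tok] = p
        cum += p
        if cum >= top_p:
            break
    norm = sum(kept.values())
    return {tok: p / norm for tok, p in kept.items()}

print(top_p_filter({"a": 0.5, "b": 0.3, "c": 0.2}, top_p=0.8))
# {'a': 0.625, 'b': 0.375}  ('c' is dropped)
```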
```yaml
data:
  max_input_seq_length: ${policy.max_total_sequence_length} # upper bound; real truncation occurs at vllm.max_model_len
  prompt_file: "examples/prompts/clevr_cogent_cot.txt"
  system_prompt_file: null
  dataset_name: "clevr-cogent"
  split: "trainA"
  shuffle: true

env:
  clevr-cogent:
    num_workers: 8
    reward_functions:
      - name: format
        weight: 0.2
      - name: exact_alnum
        weight: 0.8
  geometry3k:
    num_workers: 8
    reward_functions:
      - name: format
        weight: 0.1
      - name: math_expr
        weight: 0.9
  refcoco:
    num_workers: 8
    reward_functions:
      - name: format
        weight: 0.1
      - name: bbox_giou
        weight: 0.9
        kwargs:
          giou_penalty_thres: 0.5
```
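Each environment scores a rollout as a weighted sum of its reward functions, e.g. 0.2 * format + 0.8 * exact_alnum for clevr-cogent. A hedged sketch of that combination; the function names and signatures below are illustrative, not the repo's actual API:

```python
# Illustrative weighted-sum reward combination. The real reward functions
# in NeMo-RL's environment code may be defined and dispatched differently.
def combined_reward(response: str, target: str, reward_fns, weights) -> float:
    return sum(w * fn(response, target) for fn, w in zip(reward_fns, weights))

def format_reward(response: str, target: str) -> float:
    # e.g. 1.0 if the response contains a well-formed answer tag
    return 1.0 if "<answer>" in response and "</answer>" in response else 0.0

def exact_alnum_reward(response: str, target: str) -> float:
    # e.g. exact match after stripping non-alphanumeric characters
    def clean(s: str) -> str:
        return "".join(ch for ch in s.lower() if ch.isalnum())
    return 1.0 if clean(response) == clean(target) else 0.0

# clevr-cogent weighting from the config above: 0.2 * format + 0.8 * exact_alnum
score = combined_reward("<answer>3</answer>", "3",
                        [format_reward, exact_alnum_reward], [0.2, 0.8])
print(score)  # 0.2 (format matches; answer extraction is not modeled here)
```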
```yaml
logger:
  log_dir: "logs" # base directory for all logs
  num_val_samples_to_print: 0 # number of validation samples to pretty-print on the terminal
  wandb_enabled: false
  tensorboard_enabled: true
  mlflow_enabled: false # disable MLflow logging
  monitor_gpus: false # if true, monitors GPU usage and logs to wandb and/or tensorboard
  wandb:
    project: "grpo-dev"
    name: "vlm-grpo-3b-megatron"
  tensorboard: {}
  gpu_monitoring:
    collection_interval: 10 # how often to collect GPU usage metrics (in seconds)
    flush_interval: 10 # how often to flush GPU usage metrics to the loggers (in seconds)

cluster:
  gpus_per_node: 2
  num_nodes: 1
```