Qwen3-VL Sequence Packing Example Scripts #2380
Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>

/ok to test afe227a

Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>

/ok to test d45ca41
📝 Walkthrough

This PR adds sequence-packed parameter-efficient fine-tuning support for Qwen3-VL: it introduces documentation, a new bash script for orchestrating LoRA fine-tuning experiments with sequence-packing configurations, registers a new forward step function in the training dispatcher, and modifies the forward step to reshape `loss_mask` during sequence packing.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/training/run_recipe.py (1)
109: ⚠️ Potential issue | 🟡 Minor

Help text for `--step_func` doesn't mention the new `qwen3_vl_step` option. The help string still only lists `gpt_step`, `vlm_step`, and `llava_step`, so users won't discover `qwen3_vl_step` from `--help`.

Proposed fix:

```diff
- help="Step function: gpt_step (text-only), vlm_step (vision-language), or llava_step (LLaVA models)",
+ help="Step function: gpt_step (text-only), vlm_step (vision-language), llava_step (LLaVA models), or qwen3_vl_step (Qwen3 VL models)",
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/training/run_recipe.py` at line 109, Update the help string for the --step_func argument to include the new qwen3_vl_step option so users see it in --help; locate the argparse add_argument call that defines "--step_func" in run_recipe.py (the help parameter mentioning gpt_step, vlm_step, llava_step) and append ", qwen3_vl_step (Qwen-3 vision-language)" or similar to the existing help text.
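For illustration, the suggested help text drops into a standard `argparse` definition like this. The parser below is a hypothetical minimal sketch, not the actual `run_recipe.py` code; only the help string and choices reflect the review suggestion:

```python
import argparse

# Hypothetical minimal parser; the real run_recipe.py defines many more
# arguments. The --step_func definition shows the updated help text.
parser = argparse.ArgumentParser(description="Run a training recipe")
parser.add_argument(
    "--step_func",
    default="gpt_step",
    choices=["gpt_step", "vlm_step", "llava_step", "qwen3_vl_step"],
    help=(
        "Step function: gpt_step (text-only), vlm_step (vision-language), "
        "llava_step (LLaVA models), or qwen3_vl_step (Qwen3 VL models)"
    ),
)

args = parser.parse_args(["--step_func", "qwen3_vl_step"])
print(args.step_func)  # qwen3_vl_step
```

With `choices` listing every step function, the option also appears in the auto-generated `--help` output, so the help string and the accepted values stay in sync.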
🧹 Nitpick comments (3)
examples/models/vlm/qwen3_vl/seq_packing.sh (3)
1: Add `set -euo pipefail` for safer script execution.

Per the Google Shell Style Guide, scripts should fail early on errors. Without this, a failing training run will be silently ignored and the loop will continue to the next configuration.

Proposed fix:

```diff
 #!/usr/bin/env bash
+set -euo pipefail
 # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```

As per coding guidelines: `**/*.sh`: Follow Google Shell Style Guide.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/models/vlm/qwen3_vl/seq_packing.sh` at line 1, The script lacks strict error handling; update seq_packing.sh to enable safe-failure mode by adding a shell option line immediately after the shebang to set -euo pipefail (and optionally set IFS=$'\n\t') so the script exits on errors, treats unset variables as failures, and propagates pipe errors; place this change at the top of seq_packing.sh (right after #!/usr/bin/env bash) so all subsequent commands run under these safer semantics.
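A toy sketch of what the strict-mode header buys; the loop and echo commands below are placeholders, not the actual training loop from `seq_packing.sh`:

```shell
#!/usr/bin/env bash
# Fail fast: -e exits on any command error, -u treats unset variables as
# errors, and -o pipefail propagates failures through pipelines.
set -euo pipefail

# With these options in effect, a failed command inside the loop aborts
# the script instead of silently continuing to the next configuration.
for cp_size in 1 2 4; do
    echo "running configuration CP=${cp_size}"
    # a failing training command here would now stop the whole script
done
```

Without `set -e`, a non-zero exit from one configuration's training run would be discarded and the loop would march on, which is exactly the silent-failure mode the comment warns about.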
74-87: Hyperparameters are duplicated between the dense and MoE sections.

Lines 22–31 and 77–86 define identical values for `DATASET_NAME`, `SEQ_LENGTH`, `TRAIN_ITERS`, `GLOBAL_BATCH_SIZE`, `MICRO_BATCH_SIZE`, `EVAL_ITERS`, `LR`, `MIN_LR`, `LR_WARMUP_ITERS`, `LOG_INTERVAL`, and `WANDB_PROJECT`. Consider defining shared defaults once at the top and only overriding what differs (checkpoint path, model name, parallelism configs).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/models/vlm/qwen3_vl/seq_packing.sh` around lines 74 - 87, Move the duplicated hyperparameter definitions into a single shared defaults block and reference those variables in both the dense and MoE sections instead of redefining them; specifically keep only differing values like PRETRAINED_CHECKPOINT and MODEL_NAME (and any parallelism-specific vars) in each section and remove the repeated definitions of DATASET_NAME, SEQ_LENGTH, TRAIN_ITERS, GLOBAL_BATCH_SIZE, MICRO_BATCH_SIZE, EVAL_ITERS, LR, MIN_LR, LR_WARMUP_ITERS, LOG_INTERVAL, and WANDB_PROJECT so the top-level defaults are used and individual sections only override what changes.
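The suggested restructuring could look like the sketch below; the variable names come from the comment, while the values and model names are placeholders, not the real ones from the script:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Shared defaults, defined once at the top (placeholder values).
DATASET_NAME="example_dataset"   # hypothetical
SEQ_LENGTH=4096
TRAIN_ITERS=50
GLOBAL_BATCH_SIZE=32
MICRO_BATCH_SIZE=1

# Dense section: override only what actually differs from the defaults.
MODEL_NAME="qwen3-vl-dense"      # hypothetical model name
echo "dense: model=${MODEL_NAME} dataset=${DATASET_NAME} seq_len=${SEQ_LENGTH}"

# MoE section: same shared defaults, different model and parallelism.
MODEL_NAME="qwen3-vl-moe"        # hypothetical model name
echo "moe: model=${MODEL_NAME} iters=${TRAIN_ITERS} gbs=${GLOBAL_BATCH_SIZE} mbs=${MICRO_BATCH_SIZE}"
```

Keeping one block of defaults means a change to, say, `SEQ_LENGTH` is made in exactly one place and both the dense and MoE runs pick it up.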
43: Use `uv run` instead of `python` directly.

The coding guidelines require using `uv run` to execute scripts instead of activating a virtual environment and calling `python` directly. Both invocations, on lines 43 and 98, should be updated.

Proposed fix:

```diff
- python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
+ uv run -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
```

As per coding guidelines: `{**/*.sh,examples/**/*.py}`: Use 'uv run' to execute scripts instead of activating a virtual environment and calling 'python' directly. Also applies to: 98
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/models/vlm/qwen3_vl/seq_packing.sh` at line 43, Replace direct Python invocations with uv run: find the two occurrences of the command "python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py" (the one at line 43 and the one at line 98) and update them to invoke the interpreter via uv run (e.g., "uv run -- python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py") so the script is executed through uv rather than calling python directly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py`:
- Around line 286-287: The reshape of forward_args["loss_mask"] is missing a
None guard and will raise on non-last pipeline-parallel stages; mirror the
existing guard used for forward_args["labels"] (see the check around the labels
reshape in qwen3_vl_step.py) by wrapping the line that does
forward_args["loss_mask"] = forward_args["loss_mask"].reshape(1, -1) in an if
forward_args["loss_mask"] is not None: block so loss_mask is only reshaped when
present.
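The guard described above can be sketched as follows. `reshape_packed_inputs` and the toy `forward_args` dicts are hypothetical stand-ins for the logic in `qwen3_vl_step.py`, and NumPy arrays stand in for torch tensors to keep the sketch dependency-light:

```python
import numpy as np

def reshape_packed_inputs(forward_args: dict) -> dict:
    """Flatten labels and loss_mask to shape (1, -1) for sequence packing,
    guarding against None on non-last pipeline-parallel stages.

    Hypothetical helper illustrating the fix; not the actual code.
    """
    # Existing pattern: labels are only reshaped when present.
    if forward_args.get("labels") is not None:
        forward_args["labels"] = forward_args["labels"].reshape(1, -1)
    # Mirrored guard: on non-last PP stages loss_mask is None, and
    # calling .reshape on None would raise an AttributeError.
    if forward_args.get("loss_mask") is not None:
        forward_args["loss_mask"] = forward_args["loss_mask"].reshape(1, -1)
    return forward_args

# Last pipeline stage: both arrays are present, both get flattened.
args_last = reshape_packed_inputs(
    {"labels": np.zeros((2, 8)), "loss_mask": np.ones((2, 8))}
)
print(args_last["loss_mask"].shape)  # (1, 16)

# Non-last stage: loss_mask is None and is left untouched.
args_mid = reshape_packed_inputs({"labels": None, "loss_mask": None})
print(args_mid["loss_mask"])  # None
```

The second call is the case the review flags: without the `is not None` check, the unconditional reshape would raise on every non-last pipeline-parallel stage.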
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: kamran-nvidia <kjafarisadeg@nvidia.com>
…urations Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>
…Megatron-Bridge into kamran/qwen3_vl_seq_packing Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>
Signed-off-by: Kamran Jafari <kjafarisadeg@nvidia.com>
/ok to test e36e86b
What does this PR do ?
Adds Qwen3 VL scripts for testing sequence packing (`examples/models/vlm/qwen3_vl/`). Results for dense and MoE models with different context-parallel sizes are below:
Changelog
Adds Qwen3 VL scripts for testing sequence packing under `examples/models/vlm/qwen3_vl/`:

- `peft_seq_packed.sh`: verifies sequence packing with LoRA fine-tuning and different CP sizes
- `peft_seq_unpacked.sh`: verifies different parallelism configs with LoRA fine-tuning
- `sft_seq_packed.sh`: verifies sequence packing with full fine-tuning and different CP sizes
- `sft_seq_unpacked.sh`: verifies different parallelism configs with full fine-tuning

GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information