[megatron] feat: model engine support mtp (#5561)
Conversation
Code Review
This pull request introduces support for Multi-Token Prediction (MTP) in the model engine, primarily affecting Megatron-based training and inference. Key changes include refactoring model forward passes to handle MTP-specific preprocessing and postprocessing, updating configuration files to enable MTP, and adding a new example script for MTP training. The changes also include improvements in handling nested tensors and position IDs. However, there are several areas that require attention to improve robustness, maintainability, and portability, particularly concerning hardcoded values, manual configuration steps, and potential behavioral changes in patched functions.
examples/mtp_trainer/test_dapo_mimo_7b_with_mtp_math_megatron_4_4.sh (34)
The comment indicates a manual step to modify max_position_embeddings in config.json. Manual configuration steps are prone to human error and reduce the reproducibility and automation of the setup. This step should ideally be automated within the script or handled programmatically by the model-loading logic.
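As a sketch of how the manual edit could be automated, the helper below (hypothetical, not part of this PR) patches a HF-style config.json in place; the path and target value are assumptions to be wired into the launch script:

```python
import json
from pathlib import Path


def set_max_position_embeddings(config_path: str, value: int) -> None:
    """Patch max_position_embeddings in a HF-style config.json in place."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    config["max_position_embeddings"] = value
    path.write_text(json.dumps(config, indent=2))
```

Calling this from the script (or from the model-loading code) before training starts would remove the manual edit entirely.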
examples/mtp_trainer/test_dapo_mimo_7b_with_mtp_math_megatron_4_4.sh (30-31)
The default values for NNODES and NGPUS_PER_NODE are set to 16 and 8 respectively, which are very high for a test script. These are then overridden to 1 and 4 within the fully_async array (lines 77-80). This inconsistency can lead to confusion regarding the actual resource allocation and might cause unexpected resource consumption or errors if the script is run without careful inspection. It is best to define these variables once or clearly indicate which values take precedence.
NNODES=${NNODES:-1}
NGPUS_PER_NODE=${NGPUS_PER_NODE:-4}
examples/mtp_trainer/test_dapo_mimo_7b_with_mtp_math_megatron_4_4.sh (48-49)
The use of magic numbers 2 and 3 in the calculations for actor_ppo_max_token_len and infer_ppo_max_token_len reduces the readability and maintainability of the script. It would be clearer to define these values as named variables with descriptive names, explaining their purpose.
ACTOR_PPO_FACTOR=2
INFER_PPO_FACTOR=3
actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) * ACTOR_PPO_FACTOR))
infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) * INFER_PPO_FACTOR))
examples/mtp_trainer/test_dapo_mimo_7b_with_mtp_math_megatron_4_4.sh (76)
The calculation 512*100 for rollout.total_rollout_steps is hardcoded. While simple, it would be more explicit and easier to modify if 512 and 100 were defined as named variables; alternatively, if the product is a fixed constant, assign it directly as a literal.
ROLLOUT_BASE_STEPS=512
ROLLOUT_MULTIPLIER=100
rollout.total_rollout_steps=$((ROLLOUT_BASE_STEPS * ROLLOUT_MULTIPLIER))
verl/experimental/fully_async_policy/shell/runtime_env_4_4.yaml (14-15)
The paths for RAY_DATA_HOME and TENSORBOARD_DIR are hardcoded to user-specific directories (/home/hadoop-djst-algoplat). This significantly reduces the portability and reusability of this runtime environment configuration across different systems or users. These paths should be made configurable (e.g., via environment variables that can be set externally) or use more generic relative paths.
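One option is to resolve these paths at launch time from the environment, falling back to generic defaults, instead of baking user-specific directories into the yaml. A minimal sketch (the variable names mirror the yaml; the helper itself is hypothetical and would live in the launcher that builds the Ray runtime environment):

```python
import os


def resolve_runtime_env() -> dict:
    """Build an env_vars mapping for a Ray runtime environment,
    preferring externally set variables over repo-relative defaults."""
    defaults = {
        "RAY_DATA_HOME": os.path.join(os.getcwd(), "data"),
        "TENSORBOARD_DIR": os.path.join(os.getcwd(), "tensorboard"),
    }
    return {"env_vars": {k: os.environ.get(k, v) for k, v in defaults.items()}}
```

With this pattern, a user only sets RAY_DATA_HOME or TENSORBOARD_DIR when the defaults do not fit their cluster.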
verl/models/mcore/model_forward.py (191-192)
In the _convert_to_nested_tensor function, if vi.shape[0] < target_len, the tensor is padded with torch.ones. Depending on the context and the data represented by vi (e.g., token IDs), padding with 1 might be semantically incorrect if 1 is a valid token ID. This could lead to unintended model behavior. It would be safer to pad with a specific pad_token_id from the tokenizer or a value that is guaranteed not to interfere with valid data.
vi = torch.cat([vi, torch.full((target_len - vi.shape[0],), self.pad_token_id, dtype=vi.dtype, device=vi.device)])
verl/models/mcore/mtp_patch.py (78-81)
The refactoring of _megatron_gptmodel_postprocess removes the explicit delegation to self._postprocess_backup for inference paths (when labels is None). While the new logic might cover the training path, it's crucial to verify that the inference behavior remains unchanged. If _postprocess_backup contained specific logic or optimizations for inference that are not replicated in the new combined logic, this change could introduce regressions or alter the model's behavior during evaluation.
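To make the concern concrete, the delegation-preserving pattern the review describes could look like the sketch below; the class and method names are illustrative only, not the actual Megatron/mtp_patch API:

```python
class PatchedModel:
    """Illustrative pattern: the patched postprocess only diverges on the
    MTP training path and delegates inference to the original function,
    so evaluation behavior is guaranteed unchanged."""

    def __init__(self, postprocess_backup):
        # Keep a reference to the unpatched postprocess.
        self._postprocess_backup = postprocess_backup

    def postprocess(self, hidden_states, labels=None):
        if labels is None:
            # Inference/eval path: must match the unpatched model exactly.
            return self._postprocess_backup(hidden_states)
        # Training path: MTP-aware loss computation goes here (sketch only).
        return self._compute_mtp_loss(hidden_states, labels)

    def _compute_mtp_loss(self, hidden_states, labels):
        raise NotImplementedError("sketch only")
```

If the combined logic in the PR is intended to replace this delegation, an equivalence test on the labels-is-None path would be a cheap way to rule out regressions.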
### What does this PR do?

model engine support mtp

verl-project#5323 broke using MTP in mbridge; revert that change.

Unload the KV cache before parameter synchronization (SGLang supports this first).

<img width="696" height="550" alt="image" src="https://github.com/user-attachments/assets/2aeacab4-b466-4d51-85d0-128b54ff13b2" />
<img width="704" height="580" alt="image" src="https://github.com/user-attachments/assets/08ac4490-c41a-4ddd-b522-6a1539e2e229" />

Throughput increased from an initial **3900 token/s** to **4800 token/s**, a **23% improvement**. The speculative acceptance rate increased from 44% to 54%, a 22% improvement.

<img width="1380" height="610" alt="image" src="https://github.com/user-attachments/assets/51da4d2e-3d12-4a71-8f48-f347e1c71896" />
<img width="2774" height="596" alt="image" src="https://github.com/user-attachments/assets/825084af-ba1e-4d58-ac9e-16f1251fd1e3" />

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
    - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
- [x] If your PR is related to the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.