fix(trainer): supplement dfed770 by adding missing update_weights in … by MarkJoson · Pull Request #469 · rllm-org/rllm

MarkJoson · 2026-04-02T05:21:36Z

…sdk trainer to fix vllm engine weight loss and Ascend PositionEmbedding OOB error

Summary

🐛 Bug Description

This PR supplements last week's commit by Star Li (dfed770). In agent_sdk_trainer, weights were not synchronized to the vLLM rollout engine after the initial checkpoint load. As a result, the vLLM rollout engine lost its model weights at startup. At a lower level, particularly on the Ascend platform, this surfaced as an out-of-bounds (OOB) error in the PositionEmbedding operator.

🛠️ Fix Implemented

Added an explicit self.checkpoint_manager.update_weights() call immediately after self._load_checkpoint() during initialization in rllm/trainer/verl/agent_sdk_trainer.py. This ensures the rollout engine receives the latest model weights before the initial val_before_train and the subsequent trajectory generation steps.
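
For illustration only, here is a minimal sketch of the ordering this fix enforces. It is not the code in agent_sdk_trainer.py: the classes FakeRolloutEngine, FakeCheckpointManager, and SketchTrainer, the sync_after_load flag, and the init/validate methods are hypothetical stand-ins. Only the names _load_checkpoint and checkpoint_manager.update_weights come from the description above. The missing-weights failure is modeled as a plain RuntimeError rather than the OOB error seen on Ascend.

```python
class FakeRolloutEngine:
    """Hypothetical stand-in for the vLLM rollout engine; starts with no weights."""

    def __init__(self):
        self.weights = None

    def generate(self, prompt):
        # Assumption: a missing sync is modeled as an explicit error here.
        if self.weights is None:
            raise RuntimeError("rollout engine has no weights loaded")
        return f"{prompt} -> generated with weights v{self.weights['version']}"


class FakeCheckpointManager:
    """Hypothetical manager that copies trainer weights to the rollout engine."""

    def __init__(self, trainer, engine):
        self.trainer = trainer
        self.engine = engine

    def update_weights(self):
        # Push the trainer's current weights to the rollout engine.
        self.engine.weights = dict(self.trainer.actor_weights)


class SketchTrainer:
    """Hypothetical trainer showing load-then-sync ordering."""

    def __init__(self, sync_after_load=True):
        self.actor_weights = None
        self.rollout_engine = FakeRolloutEngine()
        self.checkpoint_manager = FakeCheckpointManager(self, self.rollout_engine)
        self.sync_after_load = sync_after_load

    def _load_checkpoint(self):
        # Pretend a checkpoint was restored into the trainer-side model.
        self.actor_weights = {"version": 1}

    def init(self):
        self._load_checkpoint()
        if self.sync_after_load:
            # The step this PR adds: sync right after the checkpoint load.
            self.checkpoint_manager.update_weights()

    def validate(self):
        # Stands in for the first validation pass before training.
        return self.rollout_engine.generate("val prompt")
```

With sync_after_load=True the first validation call succeeds. With sync_after_load=False it raises, which mirrors the startup failure described above in simplified form.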

🔗 Related

Follows up on commit: dfed770

Type of change

What changed

rllm/trainer/verl/agent_sdk_trainer.py

kylemontgomery1 · 2026-04-04T00:31:39Z

@MarkJoson Can you remove the dashboard code and just leave the changes rllm/trainer/verl/agent_sdk_trainer.py?

MarkJoson · 2026-04-04T02:54:57Z

> @MarkJoson Can you remove the dashboard code and just leave the changes rllm/trainer/verl/agent_sdk_trainer.py?

Sorry about that. I accidentally committed and pushed to the wrong branch, and it included some dashboard-related changes. I’ll clean this up and update the PR so it only contains the changes to rllm/trainer/verl/agent_sdk_trainer.py.

MarkJoson · 2026-04-04T03:04:37Z

> @MarkJoson Can you remove the dashboard code and just leave the changes rllm/trainer/verl/agent_sdk_trainer.py?

I removed the dashboard changes and added two extra fixes to round out the original change. The PR now only touches rllm/trainer/verl/agent_sdk_trainer.py.

kylemontgomery1 · 2026-04-04T18:28:24Z

Thanks!

fix(trainer): supplement dfed770 by adding missing update_weights in …

5b54789

…sdk trainer to fix vllm engine weight loss and Ascend PositionEmbedding OOB error

MarkJoson force-pushed the fix-sdk-rollout-engine-crash branch from d6b90b4 to 5b54789 on April 4, 2026 02:59

additional fixes of sdk trainer

af297ca

kylemontgomery1 merged commit 19618b2 into rllm-org:main Apr 4, 2026

kylemontgomery1 merged 2 commits into rllm-org:main from MarkJoson:fix-sdk-rollout-engine-crash
