fix(trainer): supplement dfed770 by adding missing update_weights in …#469

Merged
kylemontgomery1 merged 2 commits into rllm-org:main from MarkJoson:fix-sdk-rollout-engine-crash
Apr 4, 2026
Conversation

Contributor

@MarkJoson MarkJoson commented Apr 2, 2026

…sdk trainer to fix vllm engine weight loss and Ascend PositionEmbedding OOB error

Summary

🐛 Bug Description: This PR supplements last week's commit by Star Li (dfed770). In agent_sdk_trainer, the weights were not synchronized to the vLLM rollout engine after the initial checkpoint load, which left the rollout engine without its model weights at startup. At a lower level, particularly on the Ascend platform, the missing weights surfaced as an out-of-bounds (OOB) error in the PositionEmbedding operator.

🛠️ Fix Implemented: Added an explicit self.checkpoint_manager.update_weights() call immediately after self._load_checkpoint() during initialization in rllm/trainer/verl/agent_sdk_trainer.py. This ensures that the rollout engine receives the latest model weights before the initial val_before_train and the subsequent trajectory generation steps.
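
The ordering issue can be illustrated with a minimal, self-contained sketch. This is not the code from agent_sdk_trainer.py: the classes FakeRolloutEngine, FakeCheckpointManager, and SketchTrainer, the sync_after_load flag, and the load and validate methods are hypothetical stand-ins invented for illustration. Only the names _load_checkpoint and checkpoint_manager.update_weights come from the description above, and their real signatures may differ.

```python
class FakeRolloutEngine:
    """Hypothetical stand-in for the vLLM rollout engine."""

    def __init__(self):
        self.weights = None  # the engine starts without any weights

    def generate(self, prompt):
        # A real engine would fail much less clearly (e.g. an OOB error
        # inside an operator); here we raise a readable error instead.
        if self.weights is None:
            raise RuntimeError("rollout engine has no weights")
        return f"{prompt} -> generated with weights v{self.weights}"


class FakeCheckpointManager:
    """Hypothetical manager that holds trainer weights and pushes them out."""

    def __init__(self, engine):
        self.engine = engine
        self.trainer_weights = None

    def load(self, version):
        # Loading a checkpoint only updates the trainer-side copy.
        self.trainer_weights = version

    def update_weights(self):
        # Synchronize the trainer-side weights to the rollout engine.
        self.engine.weights = self.trainer_weights


class SketchTrainer:
    """Hypothetical trainer showing the load-then-sync ordering."""

    def __init__(self, sync_after_load=True):
        self.engine = FakeRolloutEngine()
        self.checkpoint_manager = FakeCheckpointManager(self.engine)
        self.sync_after_load = sync_after_load

    def _load_checkpoint(self):
        self.checkpoint_manager.load(version=1)

    def init(self):
        self._load_checkpoint()
        if self.sync_after_load:
            # The step this PR describes adding: sync right after loading.
            self.checkpoint_manager.update_weights()

    def validate(self):
        # Stands in for the initial validation pass before training.
        return self.engine.generate("val prompt")
```

With sync_after_load=True, init() leaves the engine holding the loaded weights, so validate() succeeds. With sync_after_load=False, the engine still has no weights when validate() runs, which mirrors the failure mode described in the bug report.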

🔗 Related

Follows up on commit: dfed770

Type of change

  • Feature
  • [√] Fix
  • Docs
  • Refactor
  • Example / Project
  • Infra / CI

What changed

  • rllm/trainer/verl/agent_sdk_trainer.py

@kylemontgomery1
Collaborator

@MarkJoson Can you remove the dashboard code and just leave the changes rllm/trainer/verl/agent_sdk_trainer.py?

@MarkJoson
Contributor Author

> @MarkJoson Can you remove the dashboard code and just leave the changes rllm/trainer/verl/agent_sdk_trainer.py?

Sorry about that. I accidentally committed and pushed to the wrong branch, and it included some dashboard-related changes. I'll clean this up and update the PR so it only contains the changes to rllm/trainer/verl/agent_sdk_trainer.py.

@MarkJoson MarkJoson force-pushed the fix-sdk-rollout-engine-crash branch from d6b90b4 to 5b54789 Compare April 4, 2026 02:59
@MarkJoson
Contributor Author

MarkJoson commented Apr 4, 2026

> @MarkJoson Can you remove the dashboard code and just leave the changes rllm/trainer/verl/agent_sdk_trainer.py?

I removed the dashboard changes and added two extra fixes to round out the original change. The PR now only touches rllm/trainer/verl/agent_sdk_trainer.py.

@kylemontgomery1
Collaborator

Thanks!

@kylemontgomery1 kylemontgomery1 merged commit 19618b2 into rllm-org:main Apr 4, 2026