vLLM native OpenAI compatible server with weight syncing by BjarniHaukur · Pull Request #63 · PrimeIntellect-ai/verifiers

BjarniHaukur · 2025-05-26T17:50:12Z

This PR adds a new vllm_serve_async.py script to verifiers. It adds:

a fully featured OpenAI compatible endpoint
that mirrors the weight syncing logic from vllm_serve.py
while delegating endpoint complexity to vllm.entrypoints.openai.api_server

CLAassistant · 2025-05-26T17:50:19Z

All committers have signed the CLA.

BjarniHaukur · 2025-05-26T17:58:11Z

I commented out the data parallel tests. I'm baffled as to why they don't work here. I have an open PR to trl where they pass without fail (trl pr)

If this is something you want to merge, I can take another look.

willccbb · 2025-06-02T23:57:02Z

i ended up implementing my own dynamic batching server which builds around LLM() but exposes an async endpoint, i never found AsyncLLM to be reliable for weight-syncing in training runs and it seems like the general guidance is that isn't fully supported without a custom build config/container (e.g. how it's handled in veRL and related libraries)

at some point we'll migrate to AsyncLLM but the current solution works well for now (perf tests are fairly close in my measurements) and offers a bit more control for error handling.

lewtun · 2025-07-21T19:51:18Z

i never found AsyncLLM to be reliable for weight-syncing in training runs

Out of curiosity, what kind of errors did you hit with AsyncLLM?

* Implement HTTPMonitor to send node status and training progress (PrimeIntellect-ai#17) * Implement HTTPMonitor to send node status and training progress to generic HTTP Server * Address PR comments * Track stage of training in monitor * fix logger in http monitor * make default monitor wandb * Send IP information to HTTPMonitor * Fix ruff issues * Separate metric logger and monitors * Minor bug fix * Revert metric_logger setup to initial impl * Update monitor config setup --------- Co-authored-by: Sami Jaghouar <sami.jaghouar@hotmail.fr> * Fix bug where HTTP Monitor wasn't handling async funcs properly (PrimeIntellect-ai#38) * Fix minor rebase issue * Fix ruff issues * Fix rebase issues --------- Co-authored-by: Sami Jaghouar <sami.jaghouar@hotmail.fr>

BjarniHaukur added 2 commits May 26, 2025 17:42

add AsyncLLM

65900f1

normal / tp working, testing another machine

f35fd83

willccbb closed this Jun 2, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vLLM native OpenAI compatible server with weight syncing#63

vLLM native OpenAI compatible server with weight syncing#63
BjarniHaukur wants to merge 2 commits into
PrimeIntellect-ai:mainfrom
BjarniHaukur:main

BjarniHaukur commented May 26, 2025

Uh oh!

CLAassistant commented May 26, 2025 •

edited

Loading

Uh oh!

BjarniHaukur commented May 26, 2025

Uh oh!

willccbb commented Jun 2, 2025

Uh oh!

lewtun commented Jul 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

BjarniHaukur commented May 26, 2025

Uh oh!

CLAassistant commented May 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

BjarniHaukur commented May 26, 2025

Uh oh!

willccbb commented Jun 2, 2025

Uh oh!

lewtun commented Jul 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

CLAassistant commented May 26, 2025 •

edited

Loading