Skip to content

vLLM native OpenAI compatible server with weight syncing#63

Closed
BjarniHaukur wants to merge 2 commits into
PrimeIntellect-ai:mainfrom
BjarniHaukur:main
Closed

vLLM native OpenAI compatible server with weight syncing#63
BjarniHaukur wants to merge 2 commits into
PrimeIntellect-ai:mainfrom
BjarniHaukur:main

Conversation

@BjarniHaukur

Copy link
Copy Markdown

This PR adds a new vllm_serve_async.py script to verifiers. It adds:

  • a fully featured OpenAI compatible endpoint
  • that mirrors the weight syncing logic from vllm_serve.py
  • while delegating endpoint complexity to vllm.entrypoints.openai.api_server

image

@CLAassistant

CLAassistant commented May 26, 2025

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@BjarniHaukur

Copy link
Copy Markdown
Author

I commented out the data parallel tests. I'm baffled as to why they don't work here. I have an open PR to trl where they pass without fail (trl pr)

If this is something you want to merge, I can take another look.

@willccbb

willccbb commented Jun 2, 2025

Copy link
Copy Markdown
Member

i ended up implementing my own dynamic batching server which builds around LLM() but exposes an async endpoint, i never found AsyncLLM to be reliable for weight-syncing in training runs and it seems like the general guidance is that isn't fully supported without a custom build config/container (e.g. how it's handled in veRL and related libraries)

at some point we'll migrate to AsyncLLM but the current solution works well for now (perf tests are fairly close in my measurements) and offers a bit more control for error handling.

@willccbb willccbb closed this Jun 2, 2025
@lewtun

lewtun commented Jul 21, 2025

Copy link
Copy Markdown

i never found AsyncLLM to be reliable for weight-syncing in training runs

Out of curiosity, what kind of errors did you hit with AsyncLLM?

ronaldnetawat pushed a commit to ronaldnetawat/verifiers that referenced this pull request Nov 13, 2025
* Implement HTTPMonitor to send node status and training progress (PrimeIntellect-ai#17)

* Implement HTTPMonitor to send node status and training progress to generic HTTP Server

* Address PR comments

* Track stage of training in monitor

* fix logger in http monitor

* make default monitor wandb

* Send IP information to HTTPMonitor

* Fix ruff issues

* Separate metric logger and monitors

* Minor bug fix

* Revert metric_logger setup to initial impl

* Update monitor config setup

---------

Co-authored-by: Sami Jaghouar <sami.jaghouar@hotmail.fr>

* Fix bug where HTTP Monitor wasn't handling async funcs properly (PrimeIntellect-ai#38)

* Fix minor rebase issue

* Fix ruff issues

* Fix rebase issues

---------

Co-authored-by: Sami Jaghouar <sami.jaghouar@hotmail.fr>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants