
Conversation

@wuxibin89 (Collaborator)

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?

AsyncOpenAI has a very severe performance issue due to httpx, so replace it with an aiohttp client. For train_batch_size=1024, AsyncOpenAI introduces ~25s of overhead per generation phase.
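
For reference, here is a minimal sketch of what a raw aiohttp request to the rollout server's OpenAI-compatible /v1/chat/completions endpoint can look like. The function name mirrors _chat_completions_aiohttp from async_server.py, but the body below is illustrative rather than the exact code in this PR:

import aiohttp

async def _chat_completions_aiohttp(address: str, **chat_complete_request):
    # POST directly to the OpenAI-compatible server instead of going through
    # the httpx-based AsyncOpenAI client.
    async with aiohttp.ClientSession() as session:
        async with session.post(
            f"http://{address}/v1/chat/completions",
            headers={"Authorization": "Bearer token-abc123"},
            json=chat_complete_request,
        ) as resp:
            return await resp.json()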

High-Level Design

Demonstrate the high-level design if this PR is complex.

Specific Changes

List the specific changes.

API

Demonstrate how the API changes if any.

Usage Example

Provide usage example(s) for easier usage.

# Add code snippet or script demonstrating how to use this 

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

Additional Info.

  • Issue Number: Fixes issue # or discussion # if any.
  • Training: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
  • Inference: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • Add CI test(s) if necessary.

@hongpeng-guo (Collaborator) left a comment

LGTM, just one small nit question.

-    timeout=None,
-    max_retries=0
-)
+client = AsyncOpenAI(base_url=f"http://{address}/v1", api_key="token-abc123", timeout=None, max_retries=0)
@hongpeng-guo (Collaborator), May 20, 2025
It seems this line is the same as before, but the format changed. Just want to double check that the current one is linted by the pre-commit hook :)

@wuxibin89 (Collaborator, Author)
Yes, it's auto-formatted by the pre-commit hook.

@vermouth1992 merged commit 3eaaf24 into main on May 20, 2025 (40 of 43 checks passed).
@vermouth1992 deleted the wuxibin/async_vllm_perf branch on May 20, 2025 03:31.
@casper-hansen (Contributor)

@wuxibin89 I found that this PR reintroduced the problem fixed in #1483, because we switched from httpx to aiohttp, which has a default timeout of 5 minutes. Would you mind having a look at this error to fix it? CC @U-rara.

Traceback (most recent call last):
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/verl/workers/rollout/async_server.py", line 188, in submit_chat_completions
    completions = await self._chat_completions_aiohttp(address, **chat_complete_request)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/verl/workers/rollout/async_server.py", line 203, in _chat_completions_aiohttp
    async with session.post(
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/aiohttp/client.py", line 1425, in __aenter__
    self._resp: _RetType = await self._coro
                           ^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/aiohttp/client.py", line 730, in _request
    await resp.start(conn)
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/aiohttp/client_reqrep.py", line 1054, in start
    with self._timer:
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/aiohttp/helpers.py", line 685, in __exit__
    raise asyncio.TimeoutError from exc_val
TimeoutError
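
One way to address this (a sketch only, assuming the ClientSession is created inside _chat_completions_aiohttp; not necessarily the approach taken in the follow-up PR) is to pass an explicit ClientTimeout that disables aiohttp's 5-minute default, mirroring the timeout=None previously passed to AsyncOpenAI:

import aiohttp

# aiohttp.ClientSession applies a default total timeout of 5 minutes per request.
# ClientTimeout(total=None) disables it, restoring the old timeout=None behaviour.
session = aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=None))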

@casper-hansen mentioned this pull request on May 26, 2025.
@casper-hansen (Contributor)

I ended up creating PR #1702 @U-rara @wuxibin89. Please take a look.

