Fix OpenAI ChatCompletion Ignore stop from FastChat Conv Template by xingyaoww · Pull Request #1503 · vllm-project/vllm

xingyaoww · 2023-10-29T18:44:19Z

When using vLLM's OpenAI API server to serve models, I found that ChatCompletion requests by default do not honor the `stop_token_ids` and `stop_str` set by FastChat conversation templates. This causes problems (the model keeps generating irrelevant text) when the vLLM OpenAI API server is used as the OpenAI-compatible backend for FastChat's Gradio interface.

This PR adds a check to the OpenAI ChatCompletion request handling so that the conversation template's `stop_token_ids` and `stop_str` are merged into the request before it is sent to generation, similar to the implementation in FastChat's vllm_worker.
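As a rough illustration of the merge described above (not the actual diff in this PR), the sketch below unions a request's stop settings with those of a conversation template. `ConvTemplate` is a hypothetical stand-in for a FastChat conversation template that carries only the `stop_str` and `stop_token_ids` fields; `merge_stops` and `_as_list` are illustrative names, not vLLM or FastChat functions.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class ConvTemplate:
    """Hypothetical stand-in for a FastChat conversation template."""
    stop_str: Union[str, List[str], None] = None
    stop_token_ids: Optional[List[int]] = None


def _as_list(value):
    """Normalize None, a single item, or a sequence into a new list."""
    if value is None:
        return []
    if isinstance(value, (str, int)):
        return [value]
    return list(value)


def merge_stops(request_stop, request_stop_token_ids, conv):
    """Union the request's stop settings with the template's, keeping order."""
    stop = _as_list(request_stop) + _as_list(conv.stop_str)
    stop_token_ids = (_as_list(request_stop_token_ids)
                      + _as_list(conv.stop_token_ids))
    # dict.fromkeys drops duplicates while preserving first-seen order.
    return list(dict.fromkeys(stop)), list(dict.fromkeys(stop_token_ids))


# Example: the request sets one stop string; the template adds its own.
conv = ConvTemplate(stop_str="</s>", stop_token_ids=[2])
stop, stop_token_ids = merge_stops(["###"], None, conv)
```

The merged lists would then be passed on as the stop settings for generation. Taking the union rather than overriding keeps any stop sequences the client sent explicitly.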

Tostino · 2023-10-29T20:47:22Z

I replaced the implementation of this function already in: #1493

That should be the way to solve this anyways. FastChat is a hack for formatting.

…t#1503)

### What this PR does / why we need it?

This pull request introduces full-graph capture, replacing the previous piecewise-graph approach. Key improvements include:

* **Reduced dispatch latency:** By capturing the entire model execution graph at once, we minimize overhead compared to multiple smaller captures.
* **Stabilized multi-GPU performance:** Eliminates throughput fluctuations during the `MODEL_EXECUTE` phase across multiple cards.
* **Stream resource savings:** Consolidating graph captures frees up streams, allowing more graphs to be captured concurrently.

**Known issues:**

1. Capturing larger or more numerous graphs increases GPU memory usage, which can lead to OOM errors or inference hangs.
2. The new paged-attention implementation relies on the FIA operator, which in certain workloads is slower than the previous approach, resulting in a regression in end-to-end throughput.

There may be other undiscovered corner cases. This PR is the first in a planned series; we will continue to iterate on and address any remaining issues in subsequent submissions.

### Does this PR introduce _any_ user-facing change?

```python
compilation_config={
    "full_cuda_graph": True,
},
```

### How was this patch tested?

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>

xingyaoww added 2 commits October 29, 2023 18:27

add stop tokens from conv template for chat completion

8f5a827

fix line too long

8264204

WoosukKwon closed this Mar 13, 2024

xingyaoww wants to merge 2 commits into vllm-project:main from xingyaoww:fix-chatcompletion-ignore-conv-template-stop
