
Fix OpenAI ChatCompletion Ignore stop from FastChat Conv Template #1503

Closed
xingyaoww wants to merge 2 commits into vllm-project:main from xingyaoww:fix-chatcompletion-ignore-conv-template-stop

Conversation

@xingyaoww

When using vLLM's OpenAI API server to serve models, I found that ChatCompletion requests do not, by default, honor the stop_token_ids and stop_str set by FastChat conversation templates. This causes problems (the model keeps generating irrelevant text) when the vLLM-served API is used as the OpenAI-compatible backend for FastChat's Gradio interface.

This PR adds a check to the OpenAI ChatCompletion request handling so that the template's stop_token_ids and stop_str are merged into the request before it is sent to generation, similar to the implementation in FastChat's vllm_worker.
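
The merge described above can be sketched roughly as follows. This is a standalone illustration, not the code from this PR: the helper name `merge_stop_conditions` and its parameter names are hypothetical, and the sketch assumes that a stop value may be `None`, a single string, or a list of strings, while stop token IDs may be `None` or a list of integers.

```python
from typing import Iterable, List, Optional, Tuple, Union

StrOrList = Union[str, List[str], None]


def _as_list(value: StrOrList) -> List[str]:
    """Normalize None, a single string, or a list of strings to a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def merge_stop_conditions(
    request_stop: StrOrList,
    request_stop_token_ids: Optional[Iterable[int]],
    template_stop_str: StrOrList,
    template_stop_token_ids: Optional[Iterable[int]],
) -> Tuple[List[str], List[int]]:
    """Hypothetical helper: union the request's stop conditions with the
    conversation template's, keeping request values first."""
    # dict.fromkeys removes duplicates while preserving insertion order.
    stop = list(
        dict.fromkeys(_as_list(request_stop) + _as_list(template_stop_str))
    )
    # Drop empty strings so they are not passed on as stop sequences.
    stop = [s for s in stop if s]
    stop_token_ids = list(
        dict.fromkeys(
            list(request_stop_token_ids or [])
            + list(template_stop_token_ids or [])
        )
    )
    return stop, stop_token_ids


# Example: the request sets one stop string, the template adds another
# string plus an end-of-sequence token ID.
stop, stop_token_ids = merge_stop_conditions("###", None, ["</s>", "###"], [2])
```

Taking the union rather than overwriting means a caller's explicit `stop` value is still respected, while the template's terminators are always applied.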

@Tostino
Contributor

Tostino commented Oct 29, 2023

I already replaced the implementation of this function in #1493.

That should be the way to solve this anyways. FastChat is a hack for formatting.

@WoosukKwon WoosukKwon closed this Mar 13, 2024
amy-why-3459 pushed a commit to amy-why-3459/vllm that referenced this pull request Sep 15, 2025
…t#1503)

### What this PR does / why we need it?

This pull request introduces full-graph capture, replacing the previous
piecewise-graph approach. Key improvements include:

* **Reduced dispatch latency:** By capturing the entire model execution
graph at once, we minimize overhead compared to multiple smaller
captures.
* **Stabilized multi-GPU performance:** Eliminates throughput
fluctuations during the `MODEL_EXECUTE` phase across multiple cards.
* **Stream resource savings:** Consolidating graph captures frees up
streams, allowing more graphs to be captured concurrently.

**Known issues:**

1. Capturing larger or more numerous graphs increases GPU memory usage,
which can lead to OOM errors or inference hangs.
2. The new paged-attention implementation relies on the FIA operator,
which in certain workloads is slower than the previous
approach, resulting in a regression in end-to-end throughput.

There may be other undiscovered corner cases. This PR is the first in a
planned series; we will continue to iterate on and address any remaining
issues in subsequent submissions.

### Does this PR introduce _any_ user-facing change?

```python
compilation_config={
    "full_cuda_graph": True,
},
```

### How was this patch tested?

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
