Conversation

@comaniac
Contributor

Closes #26 (OpenAI Chat Completion Endpoint)

  • Support the OpenAI-compatible chat API, with and without streaming (see the sketch below).
  • Support built-in and dynamically registered chat templates.
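
For illustration, a minimal sketch of exercising the endpoint over plain HTTP (the port and model name are assumptions, not fixed by this PR; the request/response shapes follow the OpenAI spec):

import json
import requests

BASE = "http://localhost:30000"  # assumed server address

messages = [
    {"role": "system", "content": "You are a helpful AI assistant"},
    {"role": "user", "content": "List 3 countries and their capitals."},
]

# Non-streaming: a single JSON response with the full completion.
resp = requests.post(
    f"{BASE}/v1/chat/completions",
    json={"model": "default", "messages": messages},
)
print(resp.json()["choices"][0]["message"]["content"])

# Streaming: server-sent events, each payload line prefixed with "data: ".
with requests.post(
    f"{BASE}/v1/chat/completions",
    json={"model": "default", "messages": messages, "stream": True},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if line.startswith(b"data: ") and line != b"data: [DONE]":
            chunk = json.loads(line[len(b"data: "):])
            print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)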

comaniac requested a review from merrymercy on January 19, 2024 01:07
@merrymercy
Contributor

For the chat template, can we use the Hugging Face tokenizer by default? https://huggingface.co/docs/transformers/main/en/chat_templating#how-do-i-use-chat-templates

@comaniac
Contributor Author

For the chat template, can we use the Hugging Face tokenizer by default? https://huggingface.co/docs/transformers/main/en/chat_templating#how-do-i-use-chat-templates

Good idea. Let me add this.

@comaniac
Contributor Author

comaniac commented Jan 19, 2024

Now if --chat-template is not specified, we use the tokenizer's built-in chat template, so most users should not need to worry about the chat template for HF models. Meanwhile, I found that some tokenizers, such as TinyLlama's, do not have a chat template, so we get the following warning when calling apply_chat_template(messages, tokenize=False, add_generation_prompt=True):

No chat template is defined for this tokenizer - using the default template for the LlamaTokenizerFast class. If the default is not appropriate for your model, please set `tokenizer.chat_template` to an appropriate template. See https://huggingface.co/docs/transformers/main/chat_templating for more information.

'<s>[INST] <<SYS>>\nYou are a helpful AI assistant\n<</SYS>>\n\nList 3 countries and their capitals. [/INST]'
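
For reference, a minimal sketch of the call in question (the checkpoint name is illustrative; with tokenizer.chat_template unset, Transformers at the time fell back to the class-default template and emitted the warning quoted above):

from transformers import AutoTokenizer

# Illustrative checkpoint; any tokenizer without a chat_template behaves this way.
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant"},
    {"role": "user", "content": "List 3 countries and their capitals."},
]

# tokenize=False returns the rendered prompt string; add_generation_prompt=True
# appends the assistant prefix so the model starts its reply.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # -> the '<s>[INST] <<SYS>> ...' string shown above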

And this template is actually incorrect for this model, so you will get the following response in the unit test:

 <<SYS>>
You are a helpful AI assistant
<</SYS>>

List 4 cities with more than 100,000 people. [/INST] <<SYS>>
You are a helpful AI assistant
<</SYS>>

List 5

The following response is expected with the ChatML template:

Here are three countries and their capitals:

1. Canada - Ottawa
2. Australia - Canberra
3. South Korea - Seoul
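
For context, ChatML wraps each turn in explicit delimiters, so the same messages render roughly as follows (a sketch; exact whitespace may vary by template):

<|im_start|>system
You are a helpful AI assistant<|im_end|>
<|im_start|>user
List 3 countries and their capitals.<|im_end|>
<|im_start|>assistant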

Another issue is that apply_chat_template doesn't provide stop strings, so the response may be undesired. For example, the Llama-2 template should come with the stop strings "[INST]", "[/INST]", "<<SYS>>", and "<</SYS>>". In general, I would still suggest specifying the chat template explicitly, but maybe we could cover this issue in the troubleshooting docs.
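
As a stopgap, a client can supply the stop strings itself via the OpenAI-compatible stop parameter (a sketch; the field is part of the OpenAI spec, but whether this server forwards it to generation is an assumption here):

payload = {
    "model": "default",
    "messages": [{"role": "user", "content": "List 3 countries and their capitals."}],
    # Supply the Llama-2 stop strings explicitly, since
    # apply_chat_template does not provide them.
    "stop": ["[INST]", "[/INST]", "<<SYS>>", "<</SYS>>"],
}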

@@ -0,0 +1,381 @@
# Adapted from
# https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
Contributor


We can consider importing instead of copying later.

merrymercy merged commit 23471f9 into main on January 19, 2024
merrymercy deleted the cody/openai-chat branch on January 19, 2024 07:43