Conversation

@comaniac
Contributor

Closes #26 (OpenAI Chat Completion Endpoint)

  • Support the OpenAI-compatible chat API, with and without streaming (see the sketch below).
  • Support built-in and dynamically registered chat templates.
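
For illustration, a minimal sketch of exercising the endpoint over plain HTTP (the port and model name are assumptions, not fixed by this PR; the request/response shapes follow the OpenAI spec):

import json
import requests

BASE = "http://localhost:30000"  # assumed server address

messages = [
    {"role": "system", "content": "You are a helpful AI assistant"},
    {"role": "user", "content": "List 3 countries and their capitals."},
]

# Non-streaming: a single JSON response with the full completion.
resp = requests.post(
    f"{BASE}/v1/chat/completions",
    json={"model": "default", "messages": messages},
)
print(resp.json()["choices"][0]["message"]["content"])

# Streaming: server-sent events, each payload line prefixed with "data: ".
with requests.post(
    f"{BASE}/v1/chat/completions",
    json={"model": "default", "messages": messages, "stream": True},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if line.startswith(b"data: ") and line != b"data: [DONE]":
            chunk = json.loads(line[len(b"data: "):])
            print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)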

comaniac requested a review from merrymercy on January 19, 2024 01:07
@merrymercy
Contributor

For the chat template, can we use the Hugging Face tokenizer by default? https://huggingface.co/docs/transformers/main/en/chat_templating#how-do-i-use-chat-templates

@comaniac
Contributor Author

For the chat template, can we use the Hugging Face tokenizer by default? https://huggingface.co/docs/transformers/main/en/chat_templating#how-do-i-use-chat-templates

Good idea. Let me add this.

@comaniac
Contributor Author

comaniac commented Jan 19, 2024

Now if --chat-template is not specified, we use the tokenizer's built-in chat template, so most users should not need to worry about the chat template for HF models. Meanwhile, I found that some tokenizers, such as TinyLlama's, do not have a chat template, so we get the following warning when calling apply_chat_template(messages, tokenize=False, add_generation_prompt=True):

No chat template is defined for this tokenizer - using the default template for the LlamaTokenizerFast class. If the default is not appropriate for your model, please set `tokenizer.chat_template` to an appropriate template. See https://huggingface.co/docs/transformers/main/chat_templating for more information.

'<s>[INST] <<SYS>>\nYou are a helpful AI assistant\n<</SYS>>\n\nList 3 countries and their capitals. [/INST]'
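
For reference, a minimal sketch of the call in question (the checkpoint name is illustrative; with tokenizer.chat_template unset, Transformers at the time fell back to the class-default template and emitted the warning quoted above):

from transformers import AutoTokenizer

# Illustrative checkpoint; any tokenizer without a chat_template behaves this way.
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant"},
    {"role": "user", "content": "List 3 countries and their capitals."},
]

# tokenize=False returns the rendered prompt string; add_generation_prompt=True
# appends the assistant prefix so the model starts its reply.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # -> the '<s>[INST] <<SYS>> ...' string shown above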

And this template is actually incorrect for this model, so you will get the following response in the unit test:

 <<SYS>>
You are a helpful AI assistant
<</SYS>>

List 4 cities with more than 100,000 people. [/INST] <<SYS>>
You are a helpful AI assistant
<</SYS>>

List 5

The following response is expected with the ChatML template:

Here are three countries and their capitals:

1. Canada - Ottawa
2. Australia - Canberra
3. South Korea - Seoul
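
For context, ChatML wraps each turn in explicit delimiters, so the same messages render roughly as follows (a sketch; exact whitespace may vary by template):

<|im_start|>system
You are a helpful AI assistant<|im_end|>
<|im_start|>user
List 3 countries and their capitals.<|im_end|>
<|im_start|>assistant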

Another issue is that apply_chat_template doesn't provide stop strings, so the response may be undesired. For example, the Llama-2 template should come with the stop strings "[INST]", "[/INST]", "<<SYS>>", and "<</SYS>>". In general, I would still suggest specifying the chat template explicitly, but maybe we could cover this issue in the troubleshooting docs.
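
As a stopgap, a client can supply the stop strings itself via the OpenAI-compatible stop parameter (a sketch; the field is part of the OpenAI spec, but whether this server forwards it to generation is an assumption here):

payload = {
    "model": "default",
    "messages": [{"role": "user", "content": "List 3 countries and their capitals."}],
    # Supply the Llama-2 stop strings explicitly, since
    # apply_chat_template does not provide them.
    "stop": ["[INST]", "[/INST]", "<<SYS>>", "<</SYS>>"],
}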

@@ -0,0 +1,381 @@
# Adapted from
# https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
Contributor


We can consider importing instead of copying later.

merrymercy merged commit 23471f9 into main on January 19, 2024
merrymercy deleted the cody/openai-chat branch on January 19, 2024 07:43