From be94aadeeb6138303c33d0de5baebcb7c24d7cb9 Mon Sep 17 00:00:00 2001 From: Mick Date: Fri, 9 May 2025 10:27:38 +0800 Subject: [PATCH 1/7] doc: update developer guide regarding mllms --- docs/supported_models/support_new_models.md | 81 +++++++++++++-------- 1 file changed, 52 insertions(+), 29 deletions(-) diff --git a/docs/supported_models/support_new_models.md b/docs/supported_models/support_new_models.md index ae8b664bd4f..d31918550e2 100644 --- a/docs/supported_models/support_new_models.md +++ b/docs/supported_models/support_new_models.md @@ -1,40 +1,56 @@ # How to Support New Models -This document explains how to add support for new language models and vision‐language models (VLMs) in SGLang. It also covers how to test new models and register external implementations. +This document explains how to add support for new language models and vision‐language models (VLMs) in SGLang. It also +covers how to test new models and register external implementations. ## How to Support a new Language Model -To support a new model in SGLang, you only need to add a single file under the [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/models). You can learn from existing model implementations and create a new file for your model. For most models, you should be able to find a similar model to start with (e.g., starting from Llama). Also refer how to [port a Model from vLLM to SGLang](#port-a-model-from-vllm-to-sglang) +To support a new model in SGLang, you only need to add a single file under +the [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/models). You can learn +from existing model implementations and create a new file for your model. For most models, you should be able to find a +similar model to start with (e.g., starting from Llama). Also refer how +to [port a Model from vLLM to SGLang](#port-a-model-from-vllm-to-sglang) -## How to Support a new Vision-Language model +## How to Support a new Multimodal Large Language Model -To support a new vision-language model (vLM) in SGLang, there are several key components in addition to the standard LLM support: +To support a new multimodal large language model (MLLM) in SGLang, there are several key components in addition to the +standard LLM support: 1. **Register your new model as multimodal**: - Extend `is_multimodal_model` in [model_config.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/configs/model_config.py) to return `True` for your model. + Extend `is_multimodal_model` + in [model_config.py](https://github.com/sgl-project/sglang/blob/0ab3f437aba729b348a683ab32b35b214456efc7/python/sglang/srt/configs/model_config.py#L561) + to return `True` for your model. -2. **Process Images**: - Define a new `Processor` class that inherits from `BaseProcessor` and register this processor as your model’s dedicated processor. See [multimodal_processor.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/multimodal_processor.py) for more details. +2. **Process Multimodal Data**: + Define a new `Processor` class that inherits from `BaseMultimodalProcessor` and register this processor as your + model’s dedicated processor. + See [multimodal_processor.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/multimodal_processor.py) + for more details. -3. **Handle Image Tokens**: - Implement a `pad_input_ids` function for your new model. 
In this function, image tokens in the prompt should be expanded and replaced with image-hashes so that SGLang can recognize different images when using `RadixAttention`. +3. **Handle Multimodal Tokens**: + Implement a `pad_input_ids` function for your new model. In this function, multimodal tokens in the prompt should be + expanded (if necessary) and padded with multimodal-data-hashes so that SGLang can recognize different multimodal data + with `RadixAttention`. -4. **Replace Vision Attention**: - Replace the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`. +4. **Adapt with Vision Attention**: + Adapt the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`. -You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or other vLM implementations. These models demonstrate how to correctly handle both multimodal and textual inputs. +You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or +other vLM implementations. These models demonstrate how to correctly handle both multimodal and textual inputs. -You should test the new vLM locally against Hugging Face models. See the [`mmmu`](https://github.com/sgl-project/sglang/tree/main/benchmark/mmmu) benchmark for an example. +You should test the new MLLM locally against Hugging Face models. See the [ +`mmmu`](https://github.com/sgl-project/sglang/tree/main/benchmark/mmmu) benchmark for an example. ## Test the Correctness ### Interactive Debugging -For interactive debugging, compare the outputs of Hugging Face/Transformers and SGLang. The following two commands should give the same text output and very similar prefill logits: +For interactive debugging, compare the outputs of Hugging Face/Transformers and SGLang. The following two commands +should give the same text output and very similar prefill logits: - Get the reference output: ```bash - python3 scripts/playground/reference_hf.py --model-path [new model] --model-type {text,vlm} + python3 scripts/playground/reference_hf.py --model-path [new model] --model-type {text,mllm} ``` - Get the SGLang output: ```bash @@ -43,7 +59,10 @@ For interactive debugging, compare the outputs of Hugging Face/Transformers and ### Add the Model to the Test Suite -To ensure the new model is well maintained, add it to the test suite by including it in the `ALL_OTHER_MODELS` list in the [test_generation_models.py](https://github.com/sgl-project/sglang/blob/main/test/srt/models/test_generation_models.py) file, test the new model on your local machine and report the results on demonstrative benchmarks (GSM8K, MMLU, MMMU, MMMU-Pro, etc.) in your PR. +To ensure the new model is well maintained, add it to the test suite by including it in the `ALL_OTHER_MODELS` list in +the [test_generation_models.py](https://github.com/sgl-project/sglang/blob/main/test/srt/models/test_generation_models.py) +file, test the new model on your local machine and report the results on demonstrative benchmarks (GSM8K, MMLU, MMMU, +MMMU-Pro, etc.) in your PR. This is the command to test a new model on your local machine: @@ -53,26 +72,29 @@ ONLY_RUN=Qwen/Qwen2-1.5B python3 -m unittest test_generation_models.TestGenerati ## Port a Model from vLLM to SGLang -The [vLLM Models Directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) is a valuable resource, as vLLM covers many models. SGLang reuses vLLM’s interface and some layers, making it easier to port models from vLLM to SGLang. 
+The [vLLM Models Directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) is a valuable +resource, as vLLM covers many models. SGLang reuses vLLM’s interface and some layers, making it easier to port models +from vLLM to SGLang. To port a model from vLLM to SGLang: - Compare these two files for guidance: - - [SGLang Llama Implementation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama.py) - - [vLLM Llama Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py) + - [SGLang Llama Implementation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama.py) + - [vLLM Llama Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py) - The major differences include: - - **Replace vLLM’s `Attention` with `RadixAttention`** (ensure you pass `layer_id` to `RadixAttention`). - - **Replace vLLM’s `LogitsProcessor` with SGLang’s `LogitsProcessor`.** - - **Replace the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`.** - - **Replace other vLLM layers** (such as `RMSNorm`, `SiluAndMul`) with SGLang layers. - - **Remove `Sample`.** - - **Change the `forward()` functions** and add a `forward_batch()` method. - - **Add `EntryClass`** at the end. - - **Ensure that the new implementation uses only SGLang components** and does not rely on any vLLM components. + - **Replace vLLM’s `Attention` with `RadixAttention`** (ensure you pass `layer_id` to `RadixAttention`). + - **Replace vLLM’s `LogitsProcessor` with SGLang’s `LogitsProcessor`.** + - **Replace the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`.** + - **Replace other vLLM layers** (such as `RMSNorm`, `SiluAndMul`) with SGLang layers. + - **Remove `Sample`.** + - **Change the `forward()` functions** and add a `forward_batch()` method. + - **Add `EntryClass`** at the end. + - **Ensure that the new implementation uses only SGLang components** and does not rely on any vLLM components. ## Registering an External Model Implementation -In addition to the methods above, you can register your new model with the `ModelRegistry` before launching the server. This allows you to integrate your model without modifying the source code. +In addition to the methods above, you can register your new model with the `ModelRegistry` before launching the server. +This allows you to integrate your model without modifying the source code. For example: @@ -101,4 +123,5 @@ launch_server(server_args) --- -By following these guidelines, you can add support for new language models and vision-language models in SGLang and ensure they are thoroughly tested and easily integrated into the system. +By following these guidelines, you can add support for new language models and vision-language models in SGLang and +ensure they are thoroughly tested and easily integrated into the system. 
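The `pad_input_ids` step described in the patch above is the least obvious one, so here is a stand-alone sketch of the idea: each image placeholder token is expanded, and the expanded span is filled with a value derived from that image's content hash. This is a simplified illustration only, not the actual SGLang hook (which is a method on the model class and receives SGLang's multimodal input object); the `image_token_id`, `tokens_per_image`, and `image_pad_values` parameters are invented for the example.

```python
from typing import List


def pad_input_ids(
    input_ids: List[int],
    image_token_id: int,
    tokens_per_image: int,
    image_pad_values: List[int],
) -> List[int]:
    """Expand each image placeholder and fill it with that image's pad value."""
    padded: List[int] = []
    image_idx = 0
    for tok in input_ids:
        if tok == image_token_id:
            # One placeholder becomes `tokens_per_image` copies of a value
            # derived from the image's content hash, so the padded prompt
            # encodes *which* image sits at this position.
            padded.extend([image_pad_values[image_idx]] * tokens_per_image)
            image_idx += 1
        else:
            padded.append(tok)
    return padded


if __name__ == "__main__":
    # Prompt "<image> describe this picture" tokenized as [101, 7, 8, 9],
    # where 101 is the image placeholder and 12345 is the image's hash value.
    print(pad_input_ids([101, 7, 8, 9], image_token_id=101,
                        tokens_per_image=4, image_pad_values=[12345]))
    # -> [12345, 12345, 12345, 12345, 7, 8, 9]
```

Identical images yield identical padded spans, which is what lets `RadixAttention`'s prefix cache reuse them, while different images can never be confused with one another.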
From e70d9e0437536c9d1cdbd4f030c950fee4495a08 Mon Sep 17 00:00:00 2001 From: Mick Date: Fri, 9 May 2025 10:30:13 +0800 Subject: [PATCH 2/7] remove `vision` --- docs/supported_models/support_new_models.md | 49 ++++++++------------- 1 file changed, 18 insertions(+), 31 deletions(-) diff --git a/docs/supported_models/support_new_models.md b/docs/supported_models/support_new_models.md index d31918550e2..f0a911f271d 100644 --- a/docs/supported_models/support_new_models.md +++ b/docs/supported_models/support_new_models.md @@ -1,15 +1,10 @@ # How to Support New Models -This document explains how to add support for new language models and vision‐language models (VLMs) in SGLang. It also -covers how to test new models and register external implementations. +This document explains how to add support for new language models and multimodal large language models (mllms) in SGLang. It also covers how to test new models and register external implementations. ## How to Support a new Language Model -To support a new model in SGLang, you only need to add a single file under -the [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/models). You can learn -from existing model implementations and create a new file for your model. For most models, you should be able to find a -similar model to start with (e.g., starting from Llama). Also refer how -to [port a Model from vLLM to SGLang](#port-a-model-from-vllm-to-sglang) +To support a new model in SGLang, you only need to add a single file under the [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/models). You can learn from existing model implementations and create a new file for your model. For most models, you should be able to find a similar model to start with (e.g., starting from Llama). Also refer how to [port a Model from vLLM to SGLang](#port-a-model-from-vllm-to-sglang) ## How to Support a new Multimodal Large Language Model @@ -36,7 +31,7 @@ standard LLM support: Adapt the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`. You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or -other vLM implementations. These models demonstrate how to correctly handle both multimodal and textual inputs. +other mllm implementations. These models demonstrate how to correctly handle both multimodal and textual inputs. You should test the new MLLM locally against Hugging Face models. See the [ `mmmu`](https://github.com/sgl-project/sglang/tree/main/benchmark/mmmu) benchmark for an example. @@ -45,8 +40,7 @@ You should test the new MLLM locally against Hugging Face models. See the [ ### Interactive Debugging -For interactive debugging, compare the outputs of Hugging Face/Transformers and SGLang. The following two commands -should give the same text output and very similar prefill logits: +For interactive debugging, compare the outputs of Hugging Face/Transformers and SGLang. 
The following two commands should give the same text output and very similar prefill logits: - Get the reference output: ```bash @@ -59,10 +53,7 @@ should give the same text output and very similar prefill logits: ### Add the Model to the Test Suite -To ensure the new model is well maintained, add it to the test suite by including it in the `ALL_OTHER_MODELS` list in -the [test_generation_models.py](https://github.com/sgl-project/sglang/blob/main/test/srt/models/test_generation_models.py) -file, test the new model on your local machine and report the results on demonstrative benchmarks (GSM8K, MMLU, MMMU, -MMMU-Pro, etc.) in your PR. +To ensure the new model is well maintained, add it to the test suite by including it in the `ALL_OTHER_MODELS` list in the [test_generation_models.py](https://github.com/sgl-project/sglang/blob/main/test/srt/models/test_generation_models.py) file, test the new model on your local machine and report the results on demonstrative benchmarks (GSM8K, MMLU, MMMU, MMMU-Pro, etc.) in your PR. This is the command to test a new model on your local machine: @@ -72,29 +63,26 @@ ONLY_RUN=Qwen/Qwen2-1.5B python3 -m unittest test_generation_models.TestGenerati ## Port a Model from vLLM to SGLang -The [vLLM Models Directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) is a valuable -resource, as vLLM covers many models. SGLang reuses vLLM’s interface and some layers, making it easier to port models -from vLLM to SGLang. +The [vLLM Models Directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) is a valuable resource, as vLLM covers many models. SGLang reuses vLLM’s interface and some layers, making it easier to port models from vLLM to SGLang. To port a model from vLLM to SGLang: - Compare these two files for guidance: - - [SGLang Llama Implementation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama.py) - - [vLLM Llama Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py) + - [SGLang Llama Implementation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama.py) + - [vLLM Llama Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py) - The major differences include: - - **Replace vLLM’s `Attention` with `RadixAttention`** (ensure you pass `layer_id` to `RadixAttention`). - - **Replace vLLM’s `LogitsProcessor` with SGLang’s `LogitsProcessor`.** - - **Replace the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`.** - - **Replace other vLLM layers** (such as `RMSNorm`, `SiluAndMul`) with SGLang layers. - - **Remove `Sample`.** - - **Change the `forward()` functions** and add a `forward_batch()` method. - - **Add `EntryClass`** at the end. - - **Ensure that the new implementation uses only SGLang components** and does not rely on any vLLM components. + - **Replace vLLM’s `Attention` with `RadixAttention`** (ensure you pass `layer_id` to `RadixAttention`). + - **Replace vLLM’s `LogitsProcessor` with SGLang’s `LogitsProcessor`.** + - **Replace the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`.** + - **Replace other vLLM layers** (such as `RMSNorm`, `SiluAndMul`) with SGLang layers. + - **Remove `Sample`.** + - **Change the `forward()` functions** and add a `forward_batch()` method. + - **Add `EntryClass`** at the end. + - **Ensure that the new implementation uses only SGLang components** and does not rely on any vLLM components. 
## Registering an External Model Implementation -In addition to the methods above, you can register your new model with the `ModelRegistry` before launching the server. -This allows you to integrate your model without modifying the source code. +In addition to the methods above, you can register your new model with the `ModelRegistry` before launching the server. This allows you to integrate your model without modifying the source code. For example: @@ -123,5 +111,4 @@ launch_server(server_args) --- -By following these guidelines, you can add support for new language models and vision-language models in SGLang and -ensure they are thoroughly tested and easily integrated into the system. +By following these guidelines, you can add support for new language models and multimodal large language models in SGLang and ensure they are thoroughly tested and easily integrated into the system. From 655c3cfe29d129a019eedfe138eb012448da616c Mon Sep 17 00:00:00 2001 From: Mick Date: Fri, 9 May 2025 10:32:36 +0800 Subject: [PATCH 3/7] remove `vision` --- .../multimodal_language_models.md | 33 +++++++++++++++++++ .../vision_language_models.md | 27 --------------- 2 files changed, 33 insertions(+), 27 deletions(-) create mode 100644 docs/supported_models/multimodal_language_models.md diff --git a/docs/supported_models/multimodal_language_models.md b/docs/supported_models/multimodal_language_models.md new file mode 100644 index 00000000000..e34bdca0067 --- /dev/null +++ b/docs/supported_models/multimodal_language_models.md @@ -0,0 +1,33 @@ +# Multimodal Language Models + +These models accept multi-modal inputs (e.g., images and text) and generate text output. They augment language models +with multimodal encoders and require a specific chat template for handling multimodal prompts. + +```{important} +We need to specify `--chat-template` for VLMs because the chat template provided in HuggingFace tokenizer only supports text. If you do not specify a multimodal model’s `--chat-template`, the server uses HuggingFace’s default template, which only supports text and the images won’t be passed in. +``` + +## Example launch Command + +```shell +python3 -m sglang.launch_server \ + --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \ # example HF/local path + --chat-template llama_3_vision \ # required chat template + --host 0.0.0.0 \ + --port 30000 \ +``` + +## Supporting Matrixs + +| Model Family (Variants) | Example HuggingFace Identifier | Chat Template | Description | +|----------------------------|--------------------------------------------|------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| **Qwen-VL** (Qwen2 series) | `Qwen/Qwen2.5-VL-7B-Instruct` | `qwen2-vl` | Alibaba’s vision-language extension of Qwen; for example, Qwen2.5-VL (7B and larger variants) can analyze and converse about image content. | +| **DeepSeek-VL2** | `deepseek-ai/deepseek-vl2` | `deepseek-vl2` | Vision-language variant of DeepSeek (with a dedicated image processor), enabling advanced multimodal reasoning on image and text inputs. | +| **Janus-Pro** (1B, 7B) | `deepseek-ai/Janus-Pro-7B` | `janus-pro` | DeepSeek’s open-source multimodal model capable of both image understanding and generation. Janus-Pro employs a decoupled architecture for separate visual encoding paths, enhancing performance in both tasks. 
| +| **MiniCPM-V / MiniCPM-o** | `openbmb/MiniCPM-V-2_6` | `minicpmv` | MiniCPM-V (2.6, ~8B) supports image inputs, and MiniCPM-o adds audio/video; these multimodal LLMs are optimized for end-side deployment on mobile/edge devices. | +| **Llama 3.2 Vision** (11B) | `meta-llama/Llama-3.2-11B-Vision-Instruct` | `llama_3_vision` | Vision-enabled variant of Llama 3 (11B) that accepts image inputs for visual question answering and other multimodal tasks. | +| **LLaVA** (v1.5 & v1.6) | *e.g.* `liuhaotian/llava-v1.5-13b` | `vicuna_v1.1` | Open vision-chat models that add an image encoder to LLaMA/Vicuna (e.g. LLaMA2 13B) for following multimodal instruction prompts. | +| **LLaVA-NeXT** (8B, 72B) | `lmms-lab/llava-next-72b` | `chatml-llava` | Improved LLaVA models (with an 8B Llama3 version and a 72B version) offering enhanced visual instruction-following and accuracy on multimodal benchmarks. | +| **LLaVA-OneVision** | `lmms-lab/llava-onevision-qwen2-7b-ov` | `chatml-llava` | Enhanced LLaVA variant integrating Qwen as the backbone; supports multiple images (and even video frames) as inputs via an OpenAI Vision API-compatible format. | +| **Gemma 3 (Multimodal)** | `google/gemma-3-4b-it` | `gemma-it` | Gemma 3’s larger models (4B, 12B, 27B) accept images (each image encoded as 256 tokens) alongside text in a combined 128K-token context. | +| **Kimi-VL** (A3B) | `moonshotai/Kimi-VL-A3B-Instruct` | `kimi-vl` | Kimi-VL is a multimodal model that can understand and generate text from images. | diff --git a/docs/supported_models/vision_language_models.md b/docs/supported_models/vision_language_models.md index a9f4a819792..e69de29bb2d 100644 --- a/docs/supported_models/vision_language_models.md +++ b/docs/supported_models/vision_language_models.md @@ -1,27 +0,0 @@ -# Vision Language Models - -These models accept multi-modal inputs (e.g., images and text) and generate text output. They augment language models with visual encoders and require a specific chat template for handling vision prompts. - -## Example launch Command - -```shell -python3 -m sglang.launch_server \ - --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \ # example HF/local path - --host 0.0.0.0 \ - --port 30000 \ -``` - -## Supporting Matrixs - -| Model Family (Variants) | Example HuggingFace Identifier | Chat Template | Description | -|--------------------------------|--------------------------------------------------|----------------------|----------------------------------------------------------------------------------------| -| **Qwen-VL** (Qwen2 series) | `Qwen/Qwen2.5-VL-7B-Instruct` | `qwen2-vl` | Alibaba’s vision-language extension of Qwen; for example, Qwen2.5-VL (7B and larger variants) can analyze and converse about image content. | -| **DeepSeek-VL2** | `deepseek-ai/deepseek-vl2` | `deepseek-vl2` | Vision-language variant of DeepSeek (with a dedicated image processor), enabling advanced multimodal reasoning on image and text inputs. | -| **Janus-Pro** (1B, 7B) | `deepseek-ai/Janus-Pro-7B` | `janus-pro` | DeepSeek’s open-source multimodal model capable of both image understanding and generation. Janus-Pro employs a decoupled architecture for separate visual encoding paths, enhancing performance in both tasks. | -| **MiniCPM-V / MiniCPM-o** | `openbmb/MiniCPM-V-2_6` | `minicpmv` | MiniCPM-V (2.6, ~8B) supports image inputs, and MiniCPM-o adds audio/video; these multimodal LLMs are optimized for end-side deployment on mobile/edge devices. 
| -| **Llama 3.2 Vision** (11B) | `meta-llama/Llama-3.2-11B-Vision-Instruct` | `llama_3_vision` | Vision-enabled variant of Llama 3 (11B) that accepts image inputs for visual question answering and other multimodal tasks. | -| **LLaVA** (v1.5 & v1.6) | *e.g.* `liuhaotian/llava-v1.5-13b` | `vicuna_v1.1` | Open vision-chat models that add an image encoder to LLaMA/Vicuna (e.g. LLaMA2 13B) for following multimodal instruction prompts. | -| **LLaVA-NeXT** (8B, 72B) | `lmms-lab/llava-next-72b` | `chatml-llava` | Improved LLaVA models (with an 8B Llama3 version and a 72B version) offering enhanced visual instruction-following and accuracy on multimodal benchmarks. | -| **LLaVA-OneVision** | `lmms-lab/llava-onevision-qwen2-7b-ov` | `chatml-llava` | Enhanced LLaVA variant integrating Qwen as the backbone; supports multiple images (and even video frames) as inputs via an OpenAI Vision API-compatible format. | -| **Gemma 3 (Multimodal)** | `google/gemma-3-4b-it` | `gemma-it` | Gemma 3’s larger models (4B, 12B, 27B) accept images (each image encoded as 256 tokens) alongside text in a combined 128K-token context. | -| **Kimi-VL** (A3B) | `moonshotai/Kimi-VL-A3B-Instruct` | `kimi-vl` | Kimi-VL is a multimodal model that can understand and generate text from images. | From 4ef3b1c81a6b2d767598e47ed2139aaeb4fea3f1 Mon Sep 17 00:00:00 2001 From: Mick Date: Fri, 9 May 2025 10:33:56 +0800 Subject: [PATCH 4/7] minor --- docs/supported_models/support_new_models.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/supported_models/support_new_models.md b/docs/supported_models/support_new_models.md index f0a911f271d..8b3aa7810a7 100644 --- a/docs/supported_models/support_new_models.md +++ b/docs/supported_models/support_new_models.md @@ -16,7 +16,7 @@ standard LLM support: in [model_config.py](https://github.com/sgl-project/sglang/blob/0ab3f437aba729b348a683ab32b35b214456efc7/python/sglang/srt/configs/model_config.py#L561) to return `True` for your model. -2. **Process Multimodal Data**: +2. **Multimodal Data Processor**: Define a new `Processor` class that inherits from `BaseMultimodalProcessor` and register this processor as your model’s dedicated processor. See [multimodal_processor.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/multimodal_processor.py) @@ -27,7 +27,7 @@ standard LLM support: expanded (if necessary) and padded with multimodal-data-hashes so that SGLang can recognize different multimodal data with `RadixAttention`. -4. **Adapt with Vision Attention**: +4. **Adapt to Vision Attention**: Adapt the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`. You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or From be2d78ee65004f774156a74810ddc4b978f821f0 Mon Sep 17 00:00:00 2001 From: Mick Date: Fri, 9 May 2025 12:28:31 +0800 Subject: [PATCH 5/7] chat-template --- .../multimodal_language_models.md | 9 +-- docs/supported_models/support_new_models.md | 56 ++++++++++++------- 2 files changed, 38 insertions(+), 27 deletions(-) diff --git a/docs/supported_models/multimodal_language_models.md b/docs/supported_models/multimodal_language_models.md index e34bdca0067..8aa34aa02d5 100644 --- a/docs/supported_models/multimodal_language_models.md +++ b/docs/supported_models/multimodal_language_models.md @@ -1,23 +1,18 @@ # Multimodal Language Models These models accept multi-modal inputs (e.g., images and text) and generate text output. 
They augment language models -with multimodal encoders and require a specific chat template for handling multimodal prompts. - -```{important} -We need to specify `--chat-template` for VLMs because the chat template provided in HuggingFace tokenizer only supports text. If you do not specify a multimodal model’s `--chat-template`, the server uses HuggingFace’s default template, which only supports text and the images won’t be passed in. -``` +with multimodal encoders. ## Example launch Command ```shell python3 -m sglang.launch_server \ --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \ # example HF/local path - --chat-template llama_3_vision \ # required chat template --host 0.0.0.0 \ --port 30000 \ ``` -## Supporting Matrixs +## Supporting Matrics | Model Family (Variants) | Example HuggingFace Identifier | Chat Template | Description | |----------------------------|--------------------------------------------|------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| diff --git a/docs/supported_models/support_new_models.md b/docs/supported_models/support_new_models.md index 8b3aa7810a7..d7610d60030 100644 --- a/docs/supported_models/support_new_models.md +++ b/docs/supported_models/support_new_models.md @@ -1,10 +1,15 @@ # How to Support New Models -This document explains how to add support for new language models and multimodal large language models (mllms) in SGLang. It also covers how to test new models and register external implementations. +This document explains how to add support for new language models and multimodal large language models (mllms) in +SGLang. It also covers how to test new models and register external implementations. ## How to Support a new Language Model -To support a new model in SGLang, you only need to add a single file under the [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/models). You can learn from existing model implementations and create a new file for your model. For most models, you should be able to find a similar model to start with (e.g., starting from Llama). Also refer how to [port a Model from vLLM to SGLang](#port-a-model-from-vllm-to-sglang) +To support a new model in SGLang, you only need to add a single file under +the [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/models). You can learn +from existing model implementations and create a new file for your model. For most models, you should be able to find a +similar model to start with (e.g., starting from Llama). Also refer how +to [port a Model from vLLM to SGLang](#port-a-model-from-vllm-to-sglang) ## How to Support a new Multimodal Large Language Model @@ -16,18 +21,21 @@ standard LLM support: in [model_config.py](https://github.com/sgl-project/sglang/blob/0ab3f437aba729b348a683ab32b35b214456efc7/python/sglang/srt/configs/model_config.py#L561) to return `True` for your model. -2. **Multimodal Data Processor**: +2. **Register a new chat-template** + See [conversation.py](https://github.com/sgl-project/sglang/blob/86a779dbe9e815c02f71ea82574608f6eae016b5/python/sglang/srt/conversation.py) + +3. **Multimodal Data Processor**: Define a new `Processor` class that inherits from `BaseMultimodalProcessor` and register this processor as your model’s dedicated processor. 
See [multimodal_processor.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/multimodal_processor.py) for more details. -3. **Handle Multimodal Tokens**: +4. **Handle Multimodal Tokens**: Implement a `pad_input_ids` function for your new model. In this function, multimodal tokens in the prompt should be expanded (if necessary) and padded with multimodal-data-hashes so that SGLang can recognize different multimodal data with `RadixAttention`. -4. **Adapt to Vision Attention**: +5. **Adapt to Vision Attention**: Adapt the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`. You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or @@ -40,7 +48,8 @@ You should test the new MLLM locally against Hugging Face models. See the [ ### Interactive Debugging -For interactive debugging, compare the outputs of Hugging Face/Transformers and SGLang. The following two commands should give the same text output and very similar prefill logits: +For interactive debugging, compare the outputs of Hugging Face/Transformers and SGLang. The following two commands +should give the same text output and very similar prefill logits: - Get the reference output: ```bash @@ -53,7 +62,10 @@ For interactive debugging, compare the outputs of Hugging Face/Transformers and ### Add the Model to the Test Suite -To ensure the new model is well maintained, add it to the test suite by including it in the `ALL_OTHER_MODELS` list in the [test_generation_models.py](https://github.com/sgl-project/sglang/blob/main/test/srt/models/test_generation_models.py) file, test the new model on your local machine and report the results on demonstrative benchmarks (GSM8K, MMLU, MMMU, MMMU-Pro, etc.) in your PR. +To ensure the new model is well maintained, add it to the test suite by including it in the `ALL_OTHER_MODELS` list in +the [test_generation_models.py](https://github.com/sgl-project/sglang/blob/main/test/srt/models/test_generation_models.py) +file, test the new model on your local machine and report the results on demonstrative benchmarks (GSM8K, MMLU, MMMU, +MMMU-Pro, etc.) in your PR. This is the command to test a new model on your local machine: @@ -63,26 +75,29 @@ ONLY_RUN=Qwen/Qwen2-1.5B python3 -m unittest test_generation_models.TestGenerati ## Port a Model from vLLM to SGLang -The [vLLM Models Directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) is a valuable resource, as vLLM covers many models. SGLang reuses vLLM’s interface and some layers, making it easier to port models from vLLM to SGLang. +The [vLLM Models Directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) is a valuable +resource, as vLLM covers many models. SGLang reuses vLLM’s interface and some layers, making it easier to port models +from vLLM to SGLang. 
To port a model from vLLM to SGLang: - Compare these two files for guidance: - - [SGLang Llama Implementation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama.py) - - [vLLM Llama Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py) + - [SGLang Llama Implementation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama.py) + - [vLLM Llama Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py) - The major differences include: - - **Replace vLLM’s `Attention` with `RadixAttention`** (ensure you pass `layer_id` to `RadixAttention`). - - **Replace vLLM’s `LogitsProcessor` with SGLang’s `LogitsProcessor`.** - - **Replace the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`.** - - **Replace other vLLM layers** (such as `RMSNorm`, `SiluAndMul`) with SGLang layers. - - **Remove `Sample`.** - - **Change the `forward()` functions** and add a `forward_batch()` method. - - **Add `EntryClass`** at the end. - - **Ensure that the new implementation uses only SGLang components** and does not rely on any vLLM components. + - **Replace vLLM’s `Attention` with `RadixAttention`** (ensure you pass `layer_id` to `RadixAttention`). + - **Replace vLLM’s `LogitsProcessor` with SGLang’s `LogitsProcessor`.** + - **Replace the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`.** + - **Replace other vLLM layers** (such as `RMSNorm`, `SiluAndMul`) with SGLang layers. + - **Remove `Sample`.** + - **Change the `forward()` functions** and add a `forward_batch()` method. + - **Add `EntryClass`** at the end. + - **Ensure that the new implementation uses only SGLang components** and does not rely on any vLLM components. ## Registering an External Model Implementation -In addition to the methods above, you can register your new model with the `ModelRegistry` before launching the server. This allows you to integrate your model without modifying the source code. +In addition to the methods above, you can register your new model with the `ModelRegistry` before launching the server. +This allows you to integrate your model without modifying the source code. For example: @@ -111,4 +126,5 @@ launch_server(server_args) --- -By following these guidelines, you can add support for new language models and multimodal large language models in SGLang and ensure they are thoroughly tested and easily integrated into the system. +By following these guidelines, you can add support for new language models and multimodal large language models in +SGLang and ensure they are thoroughly tested and easily integrated into the system. From 4a224d8bdb598b7be9cd101a17ae5417f65b9f92 Mon Sep 17 00:00:00 2001 From: Xinyuan Tong Date: Wed, 14 May 2025 06:29:43 +0000 Subject: [PATCH 6/7] Update documentation to replace vision_language_models with multimodal_language_models and remove the obsolete vision_language_models file. 
Signed-off-by: Xinyuan Tong --- docs/index.rst | 2 +- docs/supported_models/vision_language_models.md | 0 2 files changed, 1 insertion(+), 1 deletion(-) delete mode 100644 docs/supported_models/vision_language_models.md diff --git a/docs/index.rst b/docs/index.rst index e98d5d95ba2..eac4cbd8f10 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -38,7 +38,7 @@ The core features include: :caption: Supported Models supported_models/generative_models.md - supported_models/vision_language_models.md + supported_models/multimodal_language_models.md supported_models/embedding_models.md supported_models/reward_models.md supported_models/support_new_models.md diff --git a/docs/supported_models/vision_language_models.md b/docs/supported_models/vision_language_models.md deleted file mode 100644 index e69de29bb2d..00000000000 From c80275fd8c81fda3b18e731925bbb9a6363be657 Mon Sep 17 00:00:00 2001 From: Xinyuan Tong Date: Wed, 14 May 2025 06:35:07 +0000 Subject: [PATCH 7/7] typo Signed-off-by: Xinyuan Tong --- docs/supported_models/multimodal_language_models.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/supported_models/multimodal_language_models.md b/docs/supported_models/multimodal_language_models.md index 8aa34aa02d5..f42c8a0ecba 100644 --- a/docs/supported_models/multimodal_language_models.md +++ b/docs/supported_models/multimodal_language_models.md @@ -12,7 +12,7 @@ python3 -m sglang.launch_server \ --port 30000 \ ``` -## Supporting Matrics +## Supporting Metrics | Model Family (Variants) | Example HuggingFace Identifier | Chat Template | Description | |----------------------------|--------------------------------------------|------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
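For step 2 of the updated guide (registering a chat template for a new MLLM), a rough sketch of what a registration in `conversation.py` tends to look like is shown below. The `Conversation` fields and the `SeparatorStyle` value are assumptions based on the FastChat-style dataclass that file vendors — check every field name, and the multimodal `image_token` field in particular, against the linked `conversation.py` before relying on it.

```python
from sglang.srt.conversation import (
    Conversation,
    SeparatorStyle,
    register_conv_template,
)

# Hypothetical template for a new model; all field values are illustrative.
register_conv_template(
    Conversation(
        name="my-new-vlm",
        system_message="You are a helpful assistant.",
        roles=("<|user|>", "<|assistant|>"),
        sep="<|end|>\n",
        sep_style=SeparatorStyle.ADD_NEW_LINE_SINGLE,
        stop_str=["<|end|>"],
        image_token="<image>",  # placeholder the multimodal processor expands
    )
)
```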
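Finally, as a quick end-to-end sanity check of the launch command in the new `multimodal_language_models.md` page, any OpenAI-compatible client can send text plus an image to the server. The snippet below assumes that server is running locally on port 30000 with `meta-llama/Llama-3.2-11B-Vision-Instruct`; the image URL is a placeholder to replace with a reachable one.

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    # Placeholder: point this at any reachable image.
                    "image_url": {"url": "https://example.com/example.jpg"},
                },
            ],
        }
    ],
    temperature=0,
)
print(response.choices[0].message.content)
```

A sensible one-sentence description in the response indicates that the chat template and the image-processing path are wired up correctly.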