diff --git a/docs/index.rst b/docs/index.rst
index e98d5d95ba2..eac4cbd8f10 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -38,7 +38,7 @@ The core features include:
    :caption: Supported Models
 
    supported_models/generative_models.md
-   supported_models/vision_language_models.md
+   supported_models/multimodal_language_models.md
    supported_models/embedding_models.md
    supported_models/reward_models.md
    supported_models/support_new_models.md
diff --git a/docs/supported_models/multimodal_language_models.md b/docs/supported_models/multimodal_language_models.md
new file mode 100644
index 00000000000..f42c8a0ecba
--- /dev/null
+++ b/docs/supported_models/multimodal_language_models.md
@@ -0,0 +1,28 @@
+# Multimodal Language Models
+
+These models accept multimodal inputs (e.g., images and text) and generate text output. They augment language models
+with multimodal encoders.
+
+## Example Launch Command
+
+```shell
+python3 -m sglang.launch_server \
+  --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
+  --host 0.0.0.0 \
+  --port 30000  # the model path can be a Hugging Face identifier or a local path
+```
+
+## Support Matrix
+
+| Model Family (Variants)    | Example HuggingFace Identifier             | Chat Template    | Description |
+|----------------------------|--------------------------------------------|------------------|-------------|
+| **Qwen-VL** (Qwen2 series) | `Qwen/Qwen2.5-VL-7B-Instruct`              | `qwen2-vl`       | Alibaba’s vision-language extension of Qwen; for example, Qwen2.5-VL (7B and larger variants) can analyze and converse about image content. |
+| **DeepSeek-VL2**           | `deepseek-ai/deepseek-vl2`                 | `deepseek-vl2`   | Vision-language variant of DeepSeek (with a dedicated image processor), enabling advanced multimodal reasoning on image and text inputs. |
+| **Janus-Pro** (1B, 7B)     | `deepseek-ai/Janus-Pro-7B`                 | `janus-pro`      | DeepSeek’s open-source multimodal model capable of both image understanding and generation. Janus-Pro employs a decoupled architecture for separate visual encoding paths, enhancing performance in both tasks. |
+| **MiniCPM-V / MiniCPM-o**  | `openbmb/MiniCPM-V-2_6`                    | `minicpmv`       | MiniCPM-V (2.6, ~8B) supports image inputs, and MiniCPM-o adds audio/video; these multimodal LLMs are optimized for end-side deployment on mobile/edge devices. |
+| **Llama 3.2 Vision** (11B) | `meta-llama/Llama-3.2-11B-Vision-Instruct` | `llama_3_vision` | Vision-enabled variant of Llama 3 (11B) that accepts image inputs for visual question answering and other multimodal tasks. |
+| **LLaVA** (v1.5 & v1.6)    | *e.g.* `liuhaotian/llava-v1.5-13b`         | `vicuna_v1.1`    | Open vision-chat models that add an image encoder to LLaMA/Vicuna (e.g. LLaMA2 13B) for following multimodal instruction prompts. |
+| **LLaVA-NeXT** (8B, 72B)   | `lmms-lab/llava-next-72b`                  | `chatml-llava`   | Improved LLaVA models (with an 8B Llama3 version and a 72B version) offering enhanced visual instruction-following and accuracy on multimodal benchmarks. |
+| **LLaVA-OneVision**        | `lmms-lab/llava-onevision-qwen2-7b-ov`     | `chatml-llava`   | Enhanced LLaVA variant integrating Qwen as the backbone; supports multiple images (and even video frames) as inputs via an OpenAI Vision API-compatible format. |
+| **Gemma 3 (Multimodal)**   | `google/gemma-3-4b-it`                     | `gemma-it`       | Gemma 3’s larger models (4B, 12B, 27B) accept images (each image encoded as 256 tokens) alongside text in a combined 128K-token context. |
+| **Kimi-VL** (A3B)          | `moonshotai/Kimi-VL-A3B-Instruct`          | `kimi-vl`        | Kimi-VL is a multimodal model that can understand and generate text from images. |
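+
+## Example Query
+
+A minimal request sketch, assuming the server launched above is reachable on port 30000 and the `openai` Python client
+is installed. It uses the OpenAI-compatible chat completions endpoint with an image URL; the URL and the `api_key`
+value are placeholders, and `model` should match whatever you passed to `--model-path`.
+
+```python
+from openai import OpenAI
+
+# Point the standard OpenAI client at the local SGLang server.
+client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="None")
+
+response = client.chat.completions.create(
+    model="meta-llama/Llama-3.2-11B-Vision-Instruct",  # the model you served
+    messages=[
+        {
+            "role": "user",
+            "content": [
+                # Placeholder URL -- replace with a reachable image.
+                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
+                {"type": "text", "text": "Describe this image in one sentence."},
+            ],
+        }
+    ],
+    max_tokens=64,
+)
+print(response.choices[0].message.content)
+```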
diff --git a/docs/supported_models/support_new_models.md b/docs/supported_models/support_new_models.md
index ae8b664bd4f..d7610d60030 100644
--- a/docs/supported_models/support_new_models.md
+++ b/docs/supported_models/support_new_models.md
@@ -1,40 +1,59 @@
 # How to Support New Models
 
-This document explains how to add support for new language models and vision‐language models (VLMs) in SGLang. It also covers how to test new models and register external implementations.
+This document explains how to add support for new language models and multimodal large language models (MLLMs) in
+SGLang. It also covers how to test new models and register external implementations.
 
 ## How to Support a new Language Model
 
-To support a new model in SGLang, you only need to add a single file under the [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/models). You can learn from existing model implementations and create a new file for your model. For most models, you should be able to find a similar model to start with (e.g., starting from Llama). Also refer how to [port a Model from vLLM to SGLang](#port-a-model-from-vllm-to-sglang)
+To support a new model in SGLang, you only need to add a single file under
+the [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/models). You can learn
+from existing model implementations and create a new file for your model. For most models, you should be able to find a
+similar model to start with (e.g., starting from Llama). Also see
+[how to port a model from vLLM to SGLang](#port-a-model-from-vllm-to-sglang) below.
 
-## How to Support a new Vision-Language model
+## How to Support a new Multimodal Large Language Model
 
-To support a new vision-language model (vLM) in SGLang, there are several key components in addition to the standard LLM support:
+To support a new multimodal large language model (MLLM) in SGLang, there are several key components in addition to the
+standard LLM support:
 
 1. **Register your new model as multimodal**:
-   Extend `is_multimodal_model` in [model_config.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/configs/model_config.py) to return `True` for your model.
+   Extend `is_multimodal_model`
+   in [model_config.py](https://github.com/sgl-project/sglang/blob/0ab3f437aba729b348a683ab32b35b214456efc7/python/sglang/srt/configs/model_config.py#L561)
+   to return `True` for your model.
 
-2. **Process Images**:
-   Define a new `Processor` class that inherits from `BaseProcessor` and register this processor as your model’s dedicated processor. See [multimodal_processor.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/multimodal_processor.py) for more details.
+2. **Register a new chat template**:
+   See [conversation.py](https://github.com/sgl-project/sglang/blob/86a779dbe9e815c02f71ea82574608f6eae016b5/python/sglang/srt/conversation.py).
 
-3. **Handle Image Tokens**:
-   Implement a `pad_input_ids` function for your new model. In this function, image tokens in the prompt should be expanded and replaced with image-hashes so that SGLang can recognize different images when using `RadixAttention`.
+3. **Multimodal Data Processor**:
+   Define a new `Processor` class that inherits from `BaseMultimodalProcessor` and register this processor as your
+   model’s dedicated processor.
+   See [multimodal_processor.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/multimodal_processor.py)
+   for more details.
 
-4. **Replace Vision Attention**:
-   Replace the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`.
+4. **Handle Multimodal Tokens**:
+   Implement a `pad_input_ids` function for your new model. In this function, multimodal tokens in the prompt should be
+   expanded (if necessary) and padded with multimodal data hashes so that SGLang can recognize different multimodal data
+   when using `RadixAttention`.
 
-You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or other vLM implementations. These models demonstrate how to correctly handle both multimodal and textual inputs.
+5. **Adapt to Vision Attention**:
+   Replace the multi-headed `Attention` of the ViT with SGLang’s `VisionAttention`.
 
-You should test the new vLM locally against Hugging Face models. See the [`mmmu`](https://github.com/sgl-project/sglang/tree/main/benchmark/mmmu) benchmark for an example.
+You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or
+other MLLM implementations. These models demonstrate how to correctly handle both multimodal and textual inputs.
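+
+To make step 4 above concrete, the snippet below is a standalone sketch of the padding idea in plain Python. It is not
+SGLang’s actual `pad_input_ids` interface, and every name in it is hypothetical; follow an existing model such as
+Qwen2VL for the real signature. Each placeholder token is expanded to the length of its encoded features and filled
+with a hash-derived value, so `RadixAttention` never treats two different images as the same cached prefix.
+
+```python
+from typing import List
+
+
+def pad_multimodal_tokens(
+    input_ids: List[int],
+    placeholder_id: int,         # hypothetical token id marking one multimodal item in the prompt
+    feature_lengths: List[int],  # number of positions each item occupies after encoding
+    data_hashes: List[int],      # one hash per item, e.g. a hash of the raw image bytes
+) -> List[int]:
+    """Expand each placeholder token and tag it with a value derived from its data hash."""
+    out: List[int] = []
+    item = 0
+    for tok in input_ids:
+        if tok == placeholder_id and item < len(feature_lengths):
+            # Different images get different pad values, so radix-cache prefix matching
+            # can tell them apart.
+            pad_value = data_hashes[item] % (1 << 30)
+            out.extend([pad_value] * feature_lengths[item])
+            item += 1
+        else:
+            out.append(tok)
+    return out
+
+
+# One image placeholder (id 9999) expanded to four positions:
+print(pad_multimodal_tokens([1, 9999, 42], 9999, [4], [hash(b"raw-image-bytes")]))
+```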
+
+You should test the new MLLM locally against Hugging Face models. See the
+[`mmmu`](https://github.com/sgl-project/sglang/tree/main/benchmark/mmmu) benchmark for an example.
 
 ## Test the Correctness
 
 ### Interactive Debugging
 
-For interactive debugging, compare the outputs of Hugging Face/Transformers and SGLang. The following two commands should give the same text output and very similar prefill logits:
+For interactive debugging, compare the outputs of Hugging Face/Transformers and SGLang. The following two commands
+should give the same text output and very similar prefill logits:
 
 - Get the reference output:
   ```bash
-  python3 scripts/playground/reference_hf.py --model-path [new model] --model-type {text,vlm}
+  python3 scripts/playground/reference_hf.py --model-path [new model] --model-type {text,mllm}
   ```
 - Get the SGLang output:
   ```bash
@@ -43,7 +62,10 @@ For interactive debugging, compare the outputs of Hugging Face/Transformers and
 
 ### Add the Model to the Test Suite
 
-To ensure the new model is well maintained, add it to the test suite by including it in the `ALL_OTHER_MODELS` list in the [test_generation_models.py](https://github.com/sgl-project/sglang/blob/main/test/srt/models/test_generation_models.py) file, test the new model on your local machine and report the results on demonstrative benchmarks (GSM8K, MMLU, MMMU, MMMU-Pro, etc.) in your PR.
+To ensure the new model is well maintained, add it to the test suite by including it in the `ALL_OTHER_MODELS` list in
+the [test_generation_models.py](https://github.com/sgl-project/sglang/blob/main/test/srt/models/test_generation_models.py)
+file, test the new model on your local machine, and report the results on demonstrative benchmarks (GSM8K, MMLU, MMMU,
+MMMU-Pro, etc.) in your PR.
 
 This is the command to test a new model on your local machine:
 
@@ -53,26 +75,29 @@ ONLY_RUN=Qwen/Qwen2-1.5B python3 -m unittest test_generation_models.TestGenerati
 
 ## Port a Model from vLLM to SGLang
 
-The [vLLM Models Directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) is a valuable resource, as vLLM covers many models. SGLang reuses vLLM’s interface and some layers, making it easier to port models from vLLM to SGLang.
+The [vLLM Models Directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) is a valuable
+resource, as vLLM covers many models. SGLang reuses vLLM’s interface and some layers, making it easier to port models
+from vLLM to SGLang.
 
 To port a model from vLLM to SGLang:
 
 - Compare these two files for guidance:
-  - [SGLang Llama Implementation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama.py)
-  - [vLLM Llama Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py)
+    - [SGLang Llama Implementation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama.py)
+    - [vLLM Llama Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py)
 - The major differences include:
-  - **Replace vLLM’s `Attention` with `RadixAttention`** (ensure you pass `layer_id` to `RadixAttention`).
-  - **Replace vLLM’s `LogitsProcessor` with SGLang’s `LogitsProcessor`.**
-  - **Replace the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`.**
-  - **Replace other vLLM layers** (such as `RMSNorm`, `SiluAndMul`) with SGLang layers.
-  - **Remove `Sample`.**
-  - **Change the `forward()` functions** and add a `forward_batch()` method.
-  - **Add `EntryClass`** at the end.
-  - **Ensure that the new implementation uses only SGLang components** and does not rely on any vLLM components.
+    - **Replace vLLM’s `Attention` with `RadixAttention`** (ensure you pass `layer_id` to `RadixAttention`).
+    - **Replace vLLM’s `LogitsProcessor` with SGLang’s `LogitsProcessor`.**
+    - **Replace the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`.**
+    - **Replace other vLLM layers** (such as `RMSNorm`, `SiluAndMul`) with SGLang layers.
+    - **Remove `Sample`.**
+    - **Change the `forward()` functions** and add a `forward_batch()` method.
+    - **Add `EntryClass`** at the end.
+    - **Ensure that the new implementation uses only SGLang components** and does not rely on any vLLM components.
 
 ## Registering an External Model Implementation
 
-In addition to the methods above, you can register your new model with the `ModelRegistry` before launching the server. This allows you to integrate your model without modifying the source code.
+In addition to the methods above, you can register your new model with the `ModelRegistry` before launching the server.
+This allows you to integrate your model without modifying the source code.
 
 For example:
 
@@ -101,4 +126,5 @@ launch_server(server_args)
 
 ---
 
-By following these guidelines, you can add support for new language models and vision-language models in SGLang and ensure they are thoroughly tested and easily integrated into the system.
+By following these guidelines, you can add support for new language models and multimodal large language models in
+SGLang and ensure they are thoroughly tested and easily integrated into the system.
diff --git a/docs/supported_models/vision_language_models.md b/docs/supported_models/vision_language_models.md
deleted file mode 100644
index d2d9f1d80d4..00000000000
--- a/docs/supported_models/vision_language_models.md
+++ /dev/null
@@ -1,28 +0,0 @@
-# Vision Language Models
-
-These models accept multi-modal inputs (e.g., images and text) and generate text output. They augment language models with visual encoders and require a specific chat template for handling vision prompts.
-
-## Example launch Command
-
-```shell
-python3 -m sglang.launch_server \
-  --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \ # example HF/local path
-  --host 0.0.0.0 \
-  --port 30000 \
-```
-
-## Supporting Matrixs
-
-| Model Family (Variants)        | Example HuggingFace Identifier             | Chat Template    | Description |
-|--------------------------------|--------------------------------------------|------------------|-------------|
-| **Qwen-VL** (Qwen2 series)     | `Qwen/Qwen2.5-VL-7B-Instruct`              | `qwen2-vl`       | Alibaba’s vision-language extension of Qwen; for example, Qwen2.5-VL (7B and larger variants) can analyze and converse about image content. |
-| **DeepSeek-VL2**               | `deepseek-ai/deepseek-vl2`                 | `deepseek-vl2`   | Vision-language variant of DeepSeek (with a dedicated image processor), enabling advanced multimodal reasoning on image and text inputs. |
-| **Janus-Pro** (1B, 7B)         | `deepseek-ai/Janus-Pro-7B`                 | `janus-pro`      | DeepSeek’s open-source multimodal model capable of both image understanding and generation. Janus-Pro employs a decoupled architecture for separate visual encoding paths, enhancing performance in both tasks. |
-| **MiniCPM-V / MiniCPM-o**      | `openbmb/MiniCPM-V-2_6`                    | `minicpmv`       | MiniCPM-V (2.6, ~8B) supports image inputs, and MiniCPM-o adds audio/video; these multimodal LLMs are optimized for end-side deployment on mobile/edge devices. |
-| **Llama 3.2 Vision** (11B)     | `meta-llama/Llama-3.2-11B-Vision-Instruct` | `llama_3_vision` | Vision-enabled variant of Llama 3 (11B) that accepts image inputs for visual question answering and other multimodal tasks. |
-| **Pixtral** (12B, 124B)        | `mistral-community/pixtral-12b`            | `mistral`        | Pixtral is a vision-language model from Mistral AI that can process both text and images. |
-| **LLaVA** (v1.5 & v1.6)        | *e.g.* `liuhaotian/llava-v1.5-13b`         | `vicuna_v1.1`    | Open vision-chat models that add an image encoder to LLaMA/Vicuna (e.g. LLaMA2 13B) for following multimodal instruction prompts. |
-| **LLaVA-NeXT** (8B, 72B)       | `lmms-lab/llava-next-72b`                  | `chatml-llava`   | Improved LLaVA models (with an 8B Llama3 version and a 72B version) offering enhanced visual instruction-following and accuracy on multimodal benchmarks. |
-| **LLaVA-OneVision**            | `lmms-lab/llava-onevision-qwen2-7b-ov`     | `chatml-llava`   | Enhanced LLaVA variant integrating Qwen as the backbone; supports multiple images (and even video frames) as inputs via an OpenAI Vision API-compatible format. |
-| **Gemma 3 (Multimodal)**       | `google/gemma-3-4b-it`                     | `gemma-it`       | Gemma 3’s larger models (4B, 12B, 27B) accept images (each image encoded as 256 tokens) alongside text in a combined 128K-token context. |
-| **Kimi-VL** (A3B)              | `moonshotai/Kimi-VL-A3B-Instruct`          | `kimi-vl`        | Kimi-VL is a multimodal model that can understand and generate text from images. |