2 changes: 1 addition & 1 deletion README.md
@@ -31,7 +31,7 @@ Please submit requests for new models [here](https://github.com/EricLBuehler/mis
- Check out UQFF for prequantized models of various methods!
- Models can be found [here](https://huggingface.co/collections/EricB/uqff-670e4a49d56ecdd3f7f0fd4c).

- 💎💎💎 Run the **Gemma 3** Model (*text only for now, vision coming very soon!*):
- 💎💎💎 Run the **Gemma 3** Model with 128k context length and vision support: [documentation](docs/GEMMA3.md)

```
./mistralrs-server -i vision-plain -m google/gemma-3-4b-it -a gemma3
192 changes: 192 additions & 0 deletions docs/GEMMA3.md
@@ -0,0 +1,192 @@
# Gemma 3 Model: [`google/gemma-3-4b-it`](https://huggingface.co/google/gemma-3-4b-it)

Gemma 3 is a family of multimodal (text+vision) models with a 128k context length. The collection can be found [here](https://huggingface.co/collections/google/gemma-3-release-67c6c6f89c4f76621268bb6d), with model sizes ranging from 4B to 27B.

We support the Gemma 3 Model in the Rust, Python, and HTTP APIs, including ISQ for increased performance.

The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- [Base64](https://en.wikipedia.org/wiki/Base64) encoded string

The Rust API takes an image from the [image](https://docs.rs/image/latest/image/index.html) crate.
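
For the Python and HTTP APIs, a local image can be referenced by its path or encoded to base64 first. Below is a minimal sketch of building both kinds of `image_url` content parts; the file name is hypothetical, and the exact base64 format accepted by the server is shown in the linked base64 examples further down.

```py
import base64

# Hypothetical local file; any JPEG/PNG on disk works the same way.
with open("mount_washington.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

# Either value can be used as the "url" field of an image_url content part.
image_by_path = {"type": "image_url", "image_url": {"url": "mount_washington.jpg"}}
image_by_base64 = {"type": "image_url", "image_url": {"url": encoded}}
```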

## HTTP server
You can find this example [here](../examples/server/gemma3.py).

We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.

> Note: The image_url may be either a path, URL, or a base64 encoded string.

---

**Image:**
<img src="https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg" alt="Mount Washington" width = "1000" height = "666">
<h6><a href = "https://www.nhmagazine.com/mount-washington/">Credit</a></h6>

**Prompt:**
```
What is this?
```

**Output:**
```
The image shows Mount Washington in New Hampshire, USA. It's a prominent peak in the White Mountains, known for its extreme weather conditions and being the highest peak in the Northeastern United States. The image captures it covered in snow with a dramatic sky above. The structures at the summit are communication towers.

The winding path visible on the mountain slopes appears to be part of the Mount Washington Auto Road, a historic road that allows vehicles to drive to the summit.
```

---

1) Start the server

> [!NOTE]
> You should replace `--features ...` with one of the features specified [here](../README.md#supported-accelerators), or remove it for pure CPU inference.

```
cargo run --release --features ... -- --port 1234 vision-plain -m google/gemma-3-12b-it -a gemma3
```

2) Send a request

```py
from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

completion = client.chat.completions.create(
    model="gemma3",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                    },
                },
                {
                    "type": "text",
                    "text": "What is this?",
                },
            ],
        },
    ],
    max_tokens=256,
    frequency_penalty=1.0,
    top_p=0.1,
    temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
```

- You can find an example of encoding the [image via base64 here](../examples/server/phi3v_base64.py).
- You can find an example of loading an [image locally here](../examples/server/phi3v_local_img.py).
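
If you prefer not to use the OpenAI SDK, the same request can be issued against the `/v1/chat/completions` endpoint with any HTTP client. A rough sketch with `requests`, assuming the server from step 1 is listening on port 1234 and accepts the same payload shape as the SDK example above:

```py
import requests

payload = {
    "model": "gemma3",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                    },
                },
                {"type": "text", "text": "What is this?"},
            ],
        }
    ],
    "max_tokens": 256,
}

# Post directly to the OpenAI-compatible endpoint and print the reply text.
resp = requests.post("http://localhost:1234/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```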

---

## Rust
You can find this example [here](../mistralrs/examples/gemma3/main.rs).

This is a minimal example of running the Gemma 3 model on an image downloaded from a URL.

```rust
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionLoaderType, VisionMessages, VisionModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model =
        VisionModelBuilder::new("google/gemma-3-12b-it", VisionLoaderType::Gemma3)
            .with_isq(IsqType::Q4K)
            .with_logging()
            .build()
            .await?;

    let bytes = match reqwest::blocking::get(
        "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg",
    ) {
        Ok(http_resp) => http_resp.bytes()?.to_vec(),
        Err(e) => anyhow::bail!(e),
    };
    let image = image::load_from_memory(&bytes)?;

    let messages = VisionMessages::new().add_image_message(
        TextMessageRole::User,
        "What is depicted here? Please describe the scene in detail.",
        image,
        &model,
    )?;

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}
```

## Python
You can find this example [here](../examples/python/gemma3.py).

This example demonstrates loading and sending a chat completion request with an image.

> Note: the image_url may be either a path, URL, or a base64 encoded string.

```py
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

runner = Runner(
    which=Which.VisionPlain(
        model_id="google/gemma-3-12b-it",
        arch=VisionArchitecture.Gemma3,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="gemma3",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                        },
                    },
                    {
                        "type": "text",
                        "text": "What is this?",
                    },
                ],
            }
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)
```

- You can find an example of encoding the [image via base64 here](../examples/python/phi3v_base64.py).
- You can find an example of loading an [image locally here](../examples/python/phi3v_local_img.py).
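
The Rust example above enables ISQ with `IsqType::Q4K`. If the Python API exposes in-situ quantization for this architecture the same way it does for other models, it can presumably be enabled when constructing the `Runner`; a sketch, assuming the `in_situ_quant` keyword argument accepts the quantization name used in the Rust example:

```py
from mistralrs import Runner, Which, VisionArchitecture

# Assumption: in_situ_quant mirrors the IsqType::Q4K setting from the Rust example.
runner = Runner(
    which=Which.VisionPlain(
        model_id="google/gemma-3-12b-it",
        arch=VisionArchitecture.Gemma3,
    ),
    in_situ_quant="Q4K",
)
```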
37 changes: 37 additions & 0 deletions examples/python/gemma3.py
@@ -0,0 +1,37 @@
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

runner = Runner(
    which=Which.VisionPlain(
        model_id="google/gemma-3-12b-it",
        arch=VisionArchitecture.Gemma3,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="gemma3",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                        },
                    },
                    {
                        "type": "text",
                        "text": "What is this?",
                    },
                ],
            }
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)
63 changes: 63 additions & 0 deletions examples/server/gemma3.py
@@ -0,0 +1,63 @@
from openai import OpenAI
import httpx
import textwrap
import json


def log_response(response: httpx.Response):
    request = response.request
    print(f"Request: {request.method} {request.url}")
    print(" Headers:")
    for key, value in request.headers.items():
        if key.lower() == "authorization":
            value = "[...]"
        if key.lower() == "cookie":
            value = value.split("=")[0] + "=..."
        print(f" {key}: {value}")
    print(" Body:")
    try:
        request_body = json.loads(request.content)
        print(textwrap.indent(json.dumps(request_body, indent=2), " "))
    except json.JSONDecodeError:
        print(textwrap.indent(request.content.decode(), " "))
    print(f"Response: status_code={response.status_code}")
    print(" Headers:")
    for key, value in response.headers.items():
        if key.lower() == "set-cookie":
            value = value.split("=")[0] + "=..."
        print(f" {key}: {value}")


client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

# Enable this to log requests and responses
# client._client = httpx.Client(
#     event_hooks={"request": [print], "response": [log_response]}
# )

completion = client.chat.completions.create(
    model="gemma3",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                    },
                },
                {
                    "type": "text",
                    "text": "What is this?",
                },
            ],
        },
    ],
    max_tokens=256,
    frequency_penalty=1.0,
    top_p=0.1,
    temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
2 changes: 1 addition & 1 deletion mistralrs-core/src/attention.rs
@@ -226,7 +226,7 @@ fn naive_sdpa(

    candle_nn::ops::inplace_attn_softmax_last_dim(
        &mut att,
        &mask,
        &mask.contiguous()?,
        sdpa_params.softmax_scale / sdpa_params.softcap.unwrap_or(1.0),
    )?;

4 changes: 2 additions & 2 deletions mistralrs-core/src/pipeline/loaders/vision_loaders.rs
@@ -3134,11 +3134,11 @@ impl VisionModelLoader for Gemma3Loader {
    fn get_processor(
        &self,
        _model_config: &str,
        _processor_config: Option<ProcessorConfig>,
        processor_config: Option<ProcessorConfig>,
        _preprocessor_config: PreProcessorConfig,
        _max_edge: Option<u32>,
    ) -> Arc<dyn Processor + Send + Sync> {
        Arc::new(Gemma3Processor)
        Arc::new(Gemma3Processor::new(processor_config.unwrap()))
    }
    fn supports_paged_attention(&self) -> bool {
        true
4 changes: 4 additions & 0 deletions mistralrs-core/src/vision_models/gemma3/config.rs
@@ -3,6 +3,7 @@ use mistralrs_quant::QuantizedConfig;
use crate::{
    layers::{Activation, Gemma3RopeScalingConfig},
    serde_default_fn,
    vision_models::siglip::SiglipVisionConfig,
};

serde_default_fn!(bool, attention_bias, false);
@@ -63,4 +64,7 @@ pub struct Gemma3TextConfig {
#[derive(Debug, Clone, serde::Deserialize)]
pub struct Gemma3Config {
    pub text_config: Gemma3TextConfig,
    pub vision_config: SiglipVisionConfig,
    pub image_token_index: usize,
    pub mm_tokens_per_image: usize,
}