2 changes: 1 addition & 1 deletion README.md
@@ -31,7 +31,7 @@ Please submit requests for new models [here](https://github.com/EricLBuehler/mis
- Check out UQFF for prequantized models of various methods!
- Models can be found [here](https://huggingface.co/collections/EricB/uqff-670e4a49d56ecdd3f7f0fd4c).

- 💎💎💎 Run the **Gemma 3** Model (*text only for now, vision coming very soon!*):
- 💎💎💎 Run the **Gemma 3** Model with 128k context length and vision support: [documentation](docs/GEMMA3.md)

```
./mistralrs-server -i vision-plain -m google/gemma-3-4b-it -a gemma3
192 changes: 192 additions & 0 deletions docs/GEMMA3.md
@@ -0,0 +1,192 @@
# Gemma 3 Model: [`google/gemma-3-4b-it`](https://huggingface.co/google/gemma-3-4b-it)

Gemma 3 is a family of multimodal (text+vision) models with a 128k context length. The collection can be found [here](https://huggingface.co/collections/google/gemma-3-release-67c6c6f89c4f76621268bb6d), with model sizes ranging from 4B to 27B.

We support the Gemma 3 Model in the Rust, Python, and HTTP APIs, including ISQ for increased performance.

The Python and HTTP APIs support sending images as:
- URL
- Path to a local image
- [Base64](https://en.wikipedia.org/wiki/Base64) encoded string

The Rust API takes an image from the [image](https://docs.rs/image/latest/image/index.html) crate.
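
For the Python and HTTP APIs, a local image can be referenced by its path or encoded to base64 first. Below is a minimal sketch of building both kinds of `image_url` content parts; the file name is hypothetical, and the exact base64 format accepted by the server is shown in the linked base64 examples further down.

```py
import base64

# Hypothetical local file; any JPEG/PNG on disk works the same way.
with open("mount_washington.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

# Either value can be used as the "url" field of an image_url content part.
image_by_path = {"type": "image_url", "image_url": {"url": "mount_washington.jpg"}}
image_by_base64 = {"type": "image_url", "image_url": {"url": encoded}}
```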

## HTTP server
You can find this example [here](../examples/server/gemma3.py).

We support an OpenAI compatible HTTP API for vision models. This example demonstrates sending a chat completion request with an image.

> Note: The image_url may be either a path, URL, or a base64 encoded string.

---

**Image:**
<img src="https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg" alt="Mount Washington" width = "1000" height = "666">
<h6><a href = "https://www.nhmagazine.com/mount-washington/">Credit</a></h6>

**Prompt:**
```
What is this?
```

**Output:**
```
The image shows Mount Washington in New Hampshire, USA. It's a prominent peak in the White Mountains, known for its extreme weather conditions and being the highest peak in the Northeastern United States. The image captures it covered in snow with a dramatic sky above. The structures at the summit are communication towers.

The winding path visible on the mountain slopes appears to be part of the Mount Washington Auto Road, a historic road that allows vehicles to drive to the summit.
```

---

1) Start the server

> [!NOTE]
> You should replace `--features ...` with one of the features specified [here](../README.md#supported-accelerators), or remove it for pure CPU inference.

```
cargo run --release --features ... -- --port 1234 vision-plain -m google/gemma-3-12b-it -a gemma3
```

2) Send a request

```py
from openai import OpenAI

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

completion = client.chat.completions.create(
    model="gemma3",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                    },
                },
                {
                    "type": "text",
                    "text": "What is this?",
                },
            ],
        },
    ],
    max_tokens=256,
    frequency_penalty=1.0,
    top_p=0.1,
    temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
```

- You can find an example of encoding the [image via base64 here](../examples/server/phi3v_base64.py).
- You can find an example of loading an [image locally here](../examples/server/phi3v_local_img.py).
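
If you prefer not to use the OpenAI SDK, the same request can be issued against the `/v1/chat/completions` endpoint with any HTTP client. A rough sketch with `requests`, assuming the server from step 1 is listening on port 1234 and accepts the same payload shape as the SDK example above:

```py
import requests

payload = {
    "model": "gemma3",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                    },
                },
                {"type": "text", "text": "What is this?"},
            ],
        }
    ],
    "max_tokens": 256,
}

# Post directly to the OpenAI-compatible endpoint and print the reply text.
resp = requests.post("http://localhost:1234/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```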

---

## Rust
You can find this example [here](../mistralrs/examples/gemma3/main.rs).

This is a minimal example of running the Gemma 3 model on an image downloaded from a URL.

```rust
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionLoaderType, VisionMessages, VisionModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model =
        VisionModelBuilder::new("google/gemma-3-12b-it", VisionLoaderType::Gemma3)
            .with_isq(IsqType::Q4K)
            .with_logging()
            .build()
            .await?;

    let bytes = match reqwest::blocking::get(
        "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg",
    ) {
        Ok(http_resp) => http_resp.bytes()?.to_vec(),
        Err(e) => anyhow::bail!(e),
    };
    let image = image::load_from_memory(&bytes)?;

    let messages = VisionMessages::new().add_image_message(
        TextMessageRole::User,
        "What is depicted here? Please describe the scene in detail.",
        image,
        &model,
    )?;

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}
```

## Python
You can find this example [here](../examples/python/gemma3.py).

This example demonstrates loading and sending a chat completion request with an image.

> Note: the image_url may be either a path, URL, or a base64 encoded string.

```py
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

runner = Runner(
    which=Which.VisionPlain(
        model_id="google/gemma-3-12b-it",
        arch=VisionArchitecture.Gemma3,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="gemma3",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                        },
                    },
                    {
                        "type": "text",
                        "text": "What is this?",
                    },
                ],
            }
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)
```

- You can find an example of encoding the [image via base64 here](../examples/python/phi3v_base64.py).
- You can find an example of loading an [image locally here](../examples/python/phi3v_local_img.py).
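
The Rust example above enables ISQ with `IsqType::Q4K`. If the Python API exposes in-situ quantization for this architecture the same way it does for other models, it can presumably be enabled when constructing the `Runner`; a sketch, assuming the `in_situ_quant` keyword argument accepts the quantization name used in the Rust example:

```py
from mistralrs import Runner, Which, VisionArchitecture

# Assumption: in_situ_quant mirrors the IsqType::Q4K setting from the Rust example.
runner = Runner(
    which=Which.VisionPlain(
        model_id="google/gemma-3-12b-it",
        arch=VisionArchitecture.Gemma3,
    ),
    in_situ_quant="Q4K",
)
```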
37 changes: 37 additions & 0 deletions examples/python/gemma3.py
@@ -0,0 +1,37 @@
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

runner = Runner(
    which=Which.VisionPlain(
        model_id="google/gemma-3-12b-it",
        arch=VisionArchitecture.Gemma3,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="gemma3",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                        },
                    },
                    {
                        "type": "text",
                        "text": "What is this?",
                    },
                ],
            }
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)
63 changes: 63 additions & 0 deletions examples/server/gemma3.py
@@ -0,0 +1,63 @@
from openai import OpenAI
import httpx
import textwrap
import json


def log_response(response: httpx.Response):
    request = response.request
    print(f"Request: {request.method} {request.url}")
    print(" Headers:")
    for key, value in request.headers.items():
        if key.lower() == "authorization":
            value = "[...]"
        if key.lower() == "cookie":
            value = value.split("=")[0] + "=..."
        print(f" {key}: {value}")
    print(" Body:")
    try:
        request_body = json.loads(request.content)
        print(textwrap.indent(json.dumps(request_body, indent=2), " "))
    except json.JSONDecodeError:
        print(textwrap.indent(request.content.decode(), " "))
    print(f"Response: status_code={response.status_code}")
    print(" Headers:")
    for key, value in response.headers.items():
        if key.lower() == "set-cookie":
            value = value.split("=")[0] + "=..."
        print(f" {key}: {value}")


client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

# Enable this to log requests and responses
# client._client = httpx.Client(
#     event_hooks={"request": [print], "response": [log_response]}
# )

completion = client.chat.completions.create(
    model="gemma3",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                    },
                },
                {
                    "type": "text",
                    "text": "What is this?",
                },
            ],
        },
    ],
    max_tokens=256,
    frequency_penalty=1.0,
    top_p=0.1,
    temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
2 changes: 1 addition & 1 deletion mistralrs-core/src/attention.rs
@@ -226,7 +226,7 @@ fn naive_sdpa(

    candle_nn::ops::inplace_attn_softmax_last_dim(
        &mut att,
        &mask,
        &mask.contiguous()?,
        sdpa_params.softmax_scale / sdpa_params.softcap.unwrap_or(1.0),
    )?;

4 changes: 2 additions & 2 deletions mistralrs-core/src/pipeline/loaders/vision_loaders.rs
@@ -3134,11 +3134,11 @@ impl VisionModelLoader for Gemma3Loader {
    fn get_processor(
        &self,
        _model_config: &str,
        _processor_config: Option<ProcessorConfig>,
        processor_config: Option<ProcessorConfig>,
        _preprocessor_config: PreProcessorConfig,
        _max_edge: Option<u32>,
    ) -> Arc<dyn Processor + Send + Sync> {
        Arc::new(Gemma3Processor)
        Arc::new(Gemma3Processor::new(processor_config.unwrap()))
    }
    fn supports_paged_attention(&self) -> bool {
        true
4 changes: 4 additions & 0 deletions mistralrs-core/src/vision_models/gemma3/config.rs
@@ -3,6 +3,7 @@ use mistralrs_quant::QuantizedConfig;
use crate::{
    layers::{Activation, Gemma3RopeScalingConfig},
    serde_default_fn,
    vision_models::siglip::SiglipVisionConfig,
};

serde_default_fn!(bool, attention_bias, false);
@@ -63,4 +64,7 @@ pub struct Gemma3TextConfig {
#[derive(Debug, Clone, serde::Deserialize)]
pub struct Gemma3Config {
    pub text_config: Gemma3TextConfig,
    pub vision_config: SiglipVisionConfig,
    pub image_token_index: usize,
    pub mm_tokens_per_image: usize,
}