Changes from all commits (77 commits)
f4c8197
feat: added gema 27b
blefo Aug 25, 2025
617a590
feat: implement multimodal content support with image URL validation
blefo Aug 25, 2025
941f784
refactor: added multimodal parameter + web search with image in query…
blefo Aug 25, 2025
a74acbc
refactor: update chat completion message structure and change model t…
blefo Aug 26, 2025
7fe8325
feat: add Docker Compose configuration for gemma-4b in ci pipeline fo…
blefo Aug 26, 2025
ae4422a
fix: ruff format
blefo Aug 26, 2025
7c8b61e
refactor: remove unused import in e2e and unit tests
blefo Aug 26, 2025
92e8ece
test: add rate limit checks to multimodal chat completion tests
blefo Aug 26, 2025
27a91b9
fix: ruff format
blefo Aug 26, 2025
2fcce35
refactor: update message model structure
blefo Aug 26, 2025
3e3cc56
fix: ruff format
blefo Aug 26, 2025
2a9669c
test: enhance chat completion tests with multimodal model integration
blefo Aug 26, 2025
3c62952
fix: web search + multimodal with 3 sources
blefo Aug 26, 2025
a52ec27
chore: stop tracking docker/compose/docker-compose.gemma-4b-gpu.ci.ym…
blefo Aug 26, 2025
ad4fa11
refactor: clean up imports in tests
blefo Aug 26, 2025
f0e5848
fix: add type ignore for role in Message class
blefo Aug 26, 2025
338a5ec
refactor: update Message class content type to use new ChatCompletion…
blefo Aug 26, 2025
e85a711
feat: add content extractor utility for processing text and image con…
blefo Aug 26, 2025
758d809
refactor: improve web search handling and enhance message context wit…
blefo Aug 26, 2025
fc9ea5d
refactor: integrate content extraction into user query handling and e…
blefo Aug 26, 2025
758dc0d
feat: add functions to handle multimodal content and extract the last…
blefo Aug 26, 2025
eaf2e96
refactor: remove deprecated image support handler and enhance multimo…
blefo Aug 26, 2025
7973642
refactor: clean up unused imports in web search and content extractor…
blefo Aug 26, 2025
f9b71cd
feat: implement chat completion tests with image support and error ha…
blefo Aug 26, 2025
746edbf
feat: enhance chat completion tests with rate limit configurations an…
blefo Aug 27, 2025
52e3ad9
refactor: streamline rate limiting logic by removing unused wait_for_…
blefo Aug 27, 2025
b688bc0
refactor: remove unused import of asyncio in rate limiting module
blefo Aug 27, 2025
49515d0
refactor: simplify rate limiting tests by consolidating success and r…
blefo Aug 27, 2025
9d40015
refactor: remove redundant blank line in web search test file
blefo Aug 27, 2025
bbb3d4b
fix: unused imports
blefo Aug 27, 2025
a93a72b
refactor: update type annotations in Message class
blefo Aug 27, 2025
83cd373
fix: ruff
blefo Aug 27, 2025
0b3d622
fix: handle None content in message processing and update content ext…
blefo Aug 27, 2025
3288a36
fix: improve multimodal content handling and simplify logic in chat c…
blefo Aug 27, 2025
d2f24c9
fix: ruff format
blefo Aug 27, 2025
ed8f179
fix: enhance multimodal content detection to specifically check for i…
blefo Aug 27, 2025
af31527
fix: ruff format
blefo Aug 27, 2025
bd462a7
refactor: streamline user query extraction and enhance web search
blefo Aug 27, 2025
d5a11a8
chore: remove gemma model entry from E2E config
blefo Aug 27, 2025
2817955
feat: add gemma-4b-gpu model support and update CI workflow
blefo Aug 29, 2025
3fe98f2
refactor: remove unused import
blefo Aug 29, 2025
2eb2549
test#1: gemma-4 test
blefo Aug 29, 2025
d6979a1
fix: ci yml
blefo Aug 29, 2025
0119c3c
Merge branch 'main' into feat-multimodal-endpoint
blefo Aug 29, 2025
9d60103
fix: ruff check
blefo Aug 29, 2025
328fd64
Merge branch 'feat-multimodal-endpoint' of https://github.com/Nillion…
blefo Aug 29, 2025
ed4735d
fix: ci flag
blefo Aug 29, 2025
37e5709
fix: ci model
blefo Aug 29, 2025
023dec0
fix: ci gemma configuration
blefo Aug 29, 2025
da0602f
test#2: remove llama-1b
blefo Aug 29, 2025
fccbd1d
fix: update the script for gemma
blefo Aug 29, 2025
03ca9eb
fix#2
blefo Aug 29, 2025
359f518
fix: add service startup logs
blefo Aug 29, 2025
967eace
fix: update gemma ci config
blefo Aug 29, 2025
8b5a073
fix: added logs for services
blefo Aug 29, 2025
e9cc0da
fix: gemma config
blefo Aug 29, 2025
3dcd4e5
fix: gemma config
blefo Aug 29, 2025
86c8e0a
fix: gemma config
blefo Aug 29, 2025
cfc2e07
fix: gemma config
blefo Aug 29, 2025
96f78af
fix: gemma config
blefo Aug 29, 2025
f1c7b4d
fix: gemma config
blefo Aug 29, 2025
f4451ca
fix: gemma config
blefo Sep 1, 2025
25bea10
fix: update gemma config
blefo Sep 1, 2025
bd8ba99
fix: gemma config
blefo Sep 1, 2025
eb4dbe0
fix: gemma config
blefo Sep 1, 2025
eb3f3de
fix: use qwen-2b instead of gemma-4b for ci pipeline
blefo Sep 1, 2025
b0f36c6
fix: update qwen config
blefo Sep 1, 2025
7c2b140
fix: qwen config
blefo Sep 1, 2025
4375c03
fix: update qwen config
blefo Sep 1, 2025
471c7cb
fix: config as list
blefo Sep 1, 2025
a842da8
fix: qwen config
blefo Sep 1, 2025
9115580
fix: avoid parsing error
blefo Sep 1, 2025
30fb2c9
fix: qwen config format
blefo Sep 1, 2025
aa43f24
fix: update config
blefo Sep 1, 2025
8f6b06e
fix: update config
blefo Sep 1, 2025
244d14d
fix: enfore eager
blefo Sep 1, 2025
fd9ab43
fix: api model fixes
jcabrero Sep 2, 2025
2 changes: 1 addition & 1 deletion .github/workflows/cicd.yml
@@ -137,7 +137,7 @@ jobs:
           sed -i 's/BRAVE_SEARCH_API=.*/BRAVE_SEARCH_API=${{ secrets.BRAVE_SEARCH_API }}/' .env
 
       - name: Compose docker-compose.yml
-        run: python3 ./scripts/docker-composer.py --dev -f docker/compose/docker-compose.llama-1b-gpu.ci.yml -o development-compose.yml
+        run: python3 ./scripts/docker-composer.py --dev -f docker/compose/docker-compose.llama-1b-cpu.ci.yml -o development-compose.yml
 
       - name: GPU stack versions (non-fatal)
         shell: bash
1 change: 1 addition & 0 deletions .gitignore
@@ -179,3 +179,4 @@ private_key.key.lock
 
 development-compose.yml
 production-compose.yml
+docker/compose/docker-compose.gemma-4b-gpu.ci.yml
4 changes: 0 additions & 4 deletions docker-compose.dev.yml
@@ -2,10 +2,6 @@ services:
   caddy:
     env_file:
       - .env
-    ports:
-      - "80:80"
-      - "443:443"
-      - "443:443/udp"
     volumes:
       - ./caddy/Caddyfile:/etc/caddy/Caddyfile
   api:
45 changes: 45 additions & 0 deletions docker/compose/docker-compose.gemma-27b-gpu.yml
@@ -0,0 +1,45 @@
services:
  gemma_27b_gpu:
    image: nillion/nilai-vllm:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ipc: host
    ulimits:
      memlock: -1
      stack: 67108864
    env_file:
      - .env
    restart: unless-stopped
    depends_on:
      etcd:
        condition: service_healthy
    command: >
      --model google/gemma-3-27b-it
      --gpu-memory-utilization 0.79
      --max-model-len 60000
      --max-num-batched-tokens 8192
      --dtype bfloat16
      --kv-cache-dtype fp8
      --uvicorn-log-level warning
    environment:
      - SVC_HOST=gemma_27b_gpu
      - SVC_PORT=8000
      - ETCD_HOST=etcd
      - ETCD_PORT=2379
      - TOOL_SUPPORT=false
      - MULTIMODAL_SUPPORT=true
    volumes:
      - hugging_face_models:/root/.cache/huggingface
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      retries: 3
      start_period: 60s
      timeout: 10s
volumes:
  hugging_face_models:
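This service advertises MULTIMODAL_SUPPORT=true, so the model is meant to accept OpenAI-style image parts. A minimal sketch of such a request against vLLM's OpenAI-compatible API; the base URL and API key are placeholders, not values from this PR:

from openai import OpenAI

# Placeholder endpoint: point this at wherever the gemma_27b_gpu service is exposed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="google/gemma-3-27b-it",  # matches the --model flag above
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)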
47 changes: 47 additions & 0 deletions docker/compose/docker-compose.gemma-4b-gpu.ci.yml
@@ -0,0 +1,47 @@
services:
  gemma_4b_gpu:
    image: nillion/nilai-vllm:latest
    container_name: nilai-gemma_4b_gpu
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

    ulimits:
      memlock: -1
      stack: 67108864
    env_file:
      - .env
    restart: unless-stopped
    depends_on:
      etcd:
        condition: service_healthy
    command: >
      --model google/gemma-3-4b-it
      --max-model-len 30000
      --max-num-batched-tokens 8192

      --uvicorn-log-level warning
    environment:
      - SVC_HOST=gemma_4b_gpu
      - SVC_PORT=8000
      - ETCD_HOST=etcd
      - ETCD_PORT=2379
      - TOOL_SUPPORT=false
      - MULTIMODAL_SUPPORT=true
      - CUDA_LAUNCH_BLOCKING=1
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    volumes:
      - hugging_face_models:/root/.cache/huggingface
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      retries: 3
      start_period: 60s
      timeout: 10s
volumes:
  hugging_face_models:
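The healthcheck above polls vLLM's /health route. The same probe can be run by hand when debugging CI startup; a small sketch, with the address as a placeholder for wherever the container is reachable:

import urllib.request

# Placeholder address; inside the compose network this would be http://gemma_4b_gpu:8000/health.
with urllib.request.urlopen("http://localhost:8000/health", timeout=10) as resp:
    print(resp.status)  # vLLM returns 200 once the model has loaded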
2 changes: 1 addition & 1 deletion docker/compose/docker-compose.llama-8b-gpu.yml
@@ -20,7 +20,7 @@ services:
         condition: service_healthy
     command: >
       --model meta-llama/Llama-3.1-8B-Instruct
-      --gpu-memory-utilization 0.21
+      --gpu-memory-utilization 0.20
       --max-model-len 10000
       --max-num-batched-tokens 10000
       --tensor-parallel-size 1
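--gpu-memory-utilization is the fraction of total VRAM that vLLM may claim, so this change slightly shrinks the reservation to leave headroom for co-scheduled services. A worked example, assuming an 80 GB card; the actual hardware is not stated in this diff:

# Hypothetical sizing: the 80 GB figure is an assumption, not from the PR.
total_vram_gb = 80
for utilization in (0.21, 0.20):
    print(f"{utilization:.2f} -> {total_vram_gb * utilization:.1f} GB reserved")
# 0.21 -> 16.8 GB reserved
# 0.20 -> 16.0 GB reserved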
64 changes: 64 additions & 0 deletions docker/compose/docker-compose.qwen-2b-gpu.ci.yml
@@ -0,0 +1,64 @@
version: "3.8"

services:
  qwen2vl_2b_gpu:
    image: nillion/nilai-vllm:latest
    container_name: qwen2vl_2b_gpu
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ulimits:
      memlock: -1
      stack: 67108864
    env_file:
      - .env
    restart: unless-stopped
    depends_on:
      etcd:
        condition: service_healthy
    command:
      [
        "--model", "Qwen/Qwen2-VL-2B-Instruct-AWQ",
        "--model-impl", "vllm",
        "--tensor-parallel-size", "1",
        "--trust-remote-code",
        "--quantization", "awq",

        "--max-model-len", "1280",
        "--max-num-batched-tokens", "1280",
        "--max-num-seqs", "1",

        "--gpu-memory-utilization", "0.75",
        "--swap-space", "8",
        "--uvicorn-log-level", "warning",

        "--limit-mm-per-prompt", "{\"image\":1,\"video\":0}",
        "--skip-mm-profiling",
        "--enforce-eager"
      ]

    environment:
      SVC_HOST: qwen2vl_2b_gpu
      SVC_PORT: "8000"
      ETCD_HOST: etcd
      ETCD_PORT: "2379"
      TOOL_SUPPORT: "true"
      MULTIMODAL_SUPPORT: "true"
      CUDA_LAUNCH_BLOCKING: "1"
      VLLM_ALLOW_LONG_MAX_MODEL_LEN: "1"
      PYTORCH_CUDA_ALLOC_CONF: "expandable_segments:True"
    volumes:
      - hugging_face_models:/root/.cache/huggingface
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      retries: 3
      start_period: 60s
      timeout: 10s

volumes:
  hugging_face_models:
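The --limit-mm-per-prompt flag caps every prompt at one image and no video, which keeps the tiny 1280-token context workable in CI. A hedged client-side guard mirroring that limit; count_image_parts is an illustrative helper, not part of this PR:

from typing import Any

def count_image_parts(message: dict[str, Any]) -> int:
    """Count image_url parts in an OpenAI-style message; plain-string content has none."""
    content = message.get("content")
    if not isinstance(content, list):
        return 0
    return sum(1 for part in content if part.get("type") == "image_url")

message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe the chart."},
        {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
    ],
}
assert count_image_parts(message) <= 1  # within the one-image-per-prompt limit above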
29 changes: 15 additions & 14 deletions nilai-api/src/nilai_api/handlers/nilrag.py
@@ -1,11 +1,11 @@
@@ -1,11 +1,11 @@
 import logging
-from typing import Union
 
 import nilrag
 
-from nilai_common import ChatRequest, Message
+from nilai_common import ChatRequest, MessageAdapter
 from fastapi import HTTPException, status
 from sentence_transformers import SentenceTransformer
+from typing import Union
 
 logger = logging.getLogger(__name__)
@@ -63,13 +63,9 @@ async def handle_nilrag(req: ChatRequest):
 
     # Get user query
     logger.debug("Extracting user query")
-    query = None
-    for message in req.messages:
-        if message.role == "user":
-            query = message.content
-            break
+    query = req.get_last_user_query()
 
-    if query is None:
+    if not query:
         raise HTTPException(status_code=400, detail="No user query found")
 
     # Get number of chunks to include
@@ -85,20 +81,25 @@
     relevant_context = f"\n\nRelevant Context:\n{formatted_results}"
 
     # Step 4: Update system message
-    for message in req.messages:
+    for message in req.adapted_messages:
         if message.role == "system":
-            if message.content is None:
+            content = message.content
+            if content is None:
                 raise HTTPException(
                     status_code=status.HTTP_400_BAD_REQUEST,
                     detail="system message is empty",
                 )
-            message.content += (
-                relevant_context  # Append the context to the system message
-            )
+
+            if isinstance(content, str):
+                message.content = content + relevant_context
+            elif isinstance(content, list):
+                content.append({"type": "text", "text": relevant_context})
             break
     else:
         # If no system message exists, add one
-        req.messages.insert(0, Message(role="system", content=relevant_context))
+        req.messages.insert(
+            0, MessageAdapter.new_message(role="system", content=relevant_context)
+        )
 
     logger.debug(f"System message updated with relevant context:\n {req.messages}")
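For context, the new branch in handle_nilrag handles both plain-string and multimodal (list-of-parts) message content. Below is a sketch of what the two helpers referenced in the diff might do; get_last_user_query and MessageAdapter.new_message do exist in the PR, but these bodies are assumptions, not the actual implementation:

from types import SimpleNamespace
from typing import Any, Optional, Union

Content = Union[str, list[dict[str, Any]]]  # str or OpenAI-style content parts

def get_last_user_query(messages: list[Any]) -> Optional[str]:
    # Assumed behavior: walk backwards and return the text of the newest user message.
    for message in reversed(messages):
        if message.role != "user":
            continue
        content: Content = message.content
        if isinstance(content, str):
            return content
        # Multimodal content: keep only the text parts.
        texts = [part["text"] for part in content if part.get("type") == "text"]
        return " ".join(texts) if texts else None
    return None

# Quick check with stand-in message objects:
msgs = [SimpleNamespace(role="user", content=[{"type": "text", "text": "hi"}])]
assert get_last_user_query(msgs) == "hi"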