A Docker-based solution for running NVIDIA TensorRT-LLM optimized models with Open WebUI. This project was specifically created to enable RTX 5090 and other Blackwell-architecture GPUs to run FP4 quantized models efficiently, addressing challenges with VLLM compatibility.
- Blackwell GPU Support: Optimized for RTX 5090 and other Blackwell cards
- FP4 Model Support: Run FP4 quantized models that may not work well with VLLM
- Easy Setup: Get up and running with just a few commands
- OpenAI-Compatible API: Works with Open WebUI and other OpenAI-compatible clients
Born out of pure frustration at 2am!
After spending hours trying to get VLLM working on Blackwell with FP4 models (and failing spectacularly), I had an epiphany: "There MUST be an alternative!" And there was: NVIDIA's own TensorRT-LLM. Sometimes the solution is hiding in plain sight; you just need enough sleep deprivation to find it.
If you've been wrestling with VLLM on Blackwell GPUs, you're not alone. This project exists so you don't have to stay up till 2am like I did!
- Docker & Docker Compose
- NVIDIA GPU with appropriate drivers (especially Blackwell-series like RTX 5090)
- NVIDIA Container Toolkit
- Hugging Face account and token (only required for gated models like Llama; public models work without it)
1. Clone or navigate to this repository

2. Create your environment file

   ```bash
   cp .env.example .env
   ```

3. Edit `.env` and set your Hugging Face token

   ```bash
   nano .env  # or use your preferred editor
   ```

   Get your token from: https://huggingface.co/settings/tokens

4. Choose your model (optional)

   Edit the `MODEL_NAME` in `.env` or keep the default. Browse available models at: https://huggingface.co/nvidia

   Look for models optimized for TensorRT-LLM, especially:

   - FP4 quantized models (e.g., models with `FP4` in the name)
   - Models tagged with `tensorrt-llm`
   - Models optimized for inference

5. Start the services

   ```bash
   docker-compose up -d
   ```

6. Access Open WebUI

   Open your browser and navigate to: http://localhost:3000
```
tensorrt-openwebui/
├── docker-compose.yaml       # Service configuration
├── .env                      # Your environment variables (create from .env.example)
├── .env.example              # Template for environment variables
├── extra-llm-api-config.yml  # Multimodal model configuration
├── data/                     # Persistent data (auto-created)
│   ├── hf/                   # Hugging Face model data
│   ├── hf-cache/             # Hugging Face cache
│   └── trtllm/               # TensorRT-LLM engine cache
└── README.md                 # This file
```
Edit `.env` to customize your setup:

| Variable | Description | Default |
|---|---|---|
| `HUGGING_FACE_HUB_TOKEN` | Your HF token (required for gated models) | (empty) |
| `MODEL_NAME` | The Hugging Face model to serve | `nvidia/Qwen3-30B-A3B-FP4` |
| `MULTIMODAL_CONFIG` | Enable multimodal model support (set to `true` for vision/audio models) | (empty) |
| `TRTLLM_PORT` | Port for the TensorRT-LLM API | `8000` |
| `WEBUI_PORT` | Port for Open WebUI | `3000` |
| `TLLM_LOG_LEVEL` | Logging level (`DEBUG`, `INFO`, `WARNING`, `ERROR`) | `INFO` |
| `WEBUI_AUTH` | Enable/disable WebUI authentication | `false` |
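For reference, a filled-in `.env` might look like this (the values below are illustrative and the token is a placeholder, not a real secret):

```bash
# Example .env - adjust values to your setup
# Leave the token empty for public models
HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
MODEL_NAME=nvidia/Qwen3-30B-A3B-FP4
# Set to true only when serving vision/audio models (see the multimodal section)
MULTIMODAL_CONFIG=
TRTLLM_PORT=8000
WEBUI_PORT=3000
TLLM_LOG_LEVEL=INFO
WEBUI_AUTH=false
```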
Status: Multimodal support is still being investigated and is not fully working yet.
According to NVIDIA's documentation, multimodal models require special configuration:
- TRT-LLM multimodal is not compatible with `kv_cache_reuse`
- Multimodal models require `chat_template` (only the Chat API is supported)
Current setup (not yet working):
- Set `MULTIMODAL_CONFIG=true` in your `.env` file
- The config file `extra-llm-api-config.yml` disables block reuse as required (see the sketch below)
- Restart the services
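For reference, the override in `extra-llm-api-config.yml` boils down to turning off KV-cache block reuse. A minimal sketch, assuming the key names used by current TensorRT-LLM releases (they may differ in your version, so compare against the file shipped in this repo):

```yaml
# extra-llm-api-config.yml (sketch) - disable KV-cache block reuse,
# which NVIDIA documents as incompatible with multimodal models
kv_cache_config:
  enable_block_reuse: false
```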
However, in testing, multimodal models like nvidia/Qwen2-VL-7B-Instruct and nvidia/Phi-4-multimodal-instruct-FP4 still don't work properly. Additional configuration or compatibility fixes are needed.
Help wanted! If you figure out how to get multimodal models working with TensorRT-LLM, please open a PR or issue!
To switch models:
- Update `MODEL_NAME` in `.env`
- Restart the services:

  ```bash
  docker-compose down
  docker-compose up -d
  ```
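Put together, a model switch might look like this. The model name below is a placeholder, and clearing the engine cache (also covered under maintenance) avoids reusing engines built for the previous model:

```bash
# Point MODEL_NAME at a different Hugging Face model (placeholder shown)
sed -i 's|^MODEL_NAME=.*|MODEL_NAME=nvidia/<another-FP4-model>|' .env

# Rebuild with fresh engines
docker-compose down
rm -rf ./data/trtllm/*
docker-compose up -d
```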
Browse NVIDIA's TensorRT-LLM optimized models:
- FP4 Quantized: Best for Blackwell GPUs like RTX 5090
  - Example: `nvidia/Qwen3-30B-A3B-FP4`
- AWQ/INT4: Alternative quantization formats
- Original Models: Any model that TensorRT-LLM supports
TensorRT-LLM uses Hugging Face model identifiers:
```
<organization>/<model-name>
```

Examples:

- `nvidia/Qwen3-30B-A3B-FP4`
- `meta-llama/Meta-Llama-3-8B-Instruct` (if you have access)
Here's what we've tested on RTX 5090 (32GB VRAM):
| Model | Status | Notes |
|---|---|---|
| `nvidia/Qwen3-30B-A3B-FP4` | ✅ Works | Recommended! Solid performance |
| `nvidia/Qwen2.5-VL-7B-Instruct-FP4` | ❌ Didn't work | Multimodal - not working yet despite config attempts |
| `nvidia/Llama-4-Scout-17B-16E-Instruct-FP4` | ❌ Needs 56GB VRAM | Don't let the "17B" fool you: MoE models are HUGE. Requires HF token. |
| `nvidia/Phi-4-multimodal-instruct-FP4` | ❌ Didn't work | Multimodal - not working yet despite config attempts |
💡 Pro Tips:
- Text-only models work great! Stick with these for now
- Multimodal models are a work in progress - configuration needs more investigation
- MoE (Mixture of Experts) models like Llama-Scout have massive VRAM requirements despite "small" parameter counts
- Always check the model card for actual memory requirements!
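To check how much VRAM is actually free before pulling a large model, `nvidia-smi` can report per-GPU memory:

```bash
# Show total, used, and free VRAM for each GPU
nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv
```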
Tested your own model? Open a PR and add it to the table!
The first time you start the services:
- Model Download: TensorRT-LLM will download the model from Hugging Face (can take several minutes depending on model size)
- Engine Building: TensorRT will build optimized engines for your GPU (can take 10-30 minutes)
- Service Ready: Once complete, the API and WebUI will be available
Monitor the logs:
```bash
docker-compose logs -f trtllm
```

- Open WebUI: http://localhost:3000
- TensorRT-LLM API: http://localhost:8000
- API Documentation: http://localhost:8000/docs
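Once the engines are built, a quick way to confirm the server is up is to query it from the command line. This assumes the standard OpenAI-style `/v1/models` route is exposed (the `/docs` page above lists the exact routes for your version):

```bash
# List the models currently being served
curl -s http://localhost:8000/v1/models \
  -H "Authorization: Bearer tensorrt_llm"
```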
The TensorRT-LLM server exposes an OpenAI-compatible API:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer tensorrt_llm" \
  -d '{
    "model": "nvidia/Qwen3-30B-A3B-FP4",
    "messages": [
      {"role": "user", "content": "Hello! How are you?"}
    ]
  }'
```

Note: Getting NVIDIA drivers working on Blackwell GPUs can be a pain! Here's what worked:
The key requirements for RTX 5090 on Ubuntu 24.04:
- Use the open-source driver (`nvidia-driver-570-open` or newer, like `nvidia-driver-580-open`)
- Upgrade to kernel 6.11+ (6.14+ recommended for best stability)
- Enable Resize Bar in BIOS/UEFI (critical!)
1. Install NVIDIA Open Driver (580 or newer)
```bash
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo apt remove --purge nvidia*
sudo nvidia-installer --uninstall  # If you have it
sudo apt autoremove
sudo apt install nvidia-driver-580-open
```

2. Upgrade Linux Kernel to 6.11+ (for Ubuntu 24.04 LTS)
```bash
sudo apt install --install-recommends linux-generic-hwe-24.04 linux-headers-generic-hwe-24.04
sudo update-initramfs -u
sudo apt autoremove
```

3. Reboot

```bash
sudo reboot
```

4. Enable Resize Bar in UEFI/BIOS
- Restart and enter UEFI (usually F2, Del, or F12 during boot)
- Find and enable "Resize Bar" or "Smart Access Memory"
- This will also enable "Above 4G Decoding" and disable "CSM" (Compatibility Support Module); that's expected!
- Save and exit
5. Verify Installation
```bash
nvidia-smi
```

You should see your RTX 5090 listed!
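Since everything here runs in containers, it's also worth confirming that the NVIDIA Container Toolkit can see the GPU. NVIDIA's documented sample workload is a containerized `nvidia-smi`:

```bash
# Run nvidia-smi inside a throwaway container to verify GPU passthrough
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```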
💡 Why open drivers? The open kernel modules simply have better Blackwell support than the proprietary ones. Without Resize Bar enabled, you'll get a black screen even with correct drivers!
Credit: Solution adapted from this Reddit thread.
- Ensure your `HUGGING_FACE_HUB_TOKEN` is set in `.env`
- Check that you have access to the model on Hugging Face
- For gated models, accept the model terms on HF website first
- Try a smaller model or one with more aggressive quantization
- Increase Docker's shared memory: edit `shm_size` in `docker-compose.yaml` (see the snippet below)
- Check GPU memory usage: `nvidia-smi`
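For the shared-memory tweak, a minimal sketch of the relevant `docker-compose.yaml` excerpt (the `trtllm` service name matches the logs commands in this README; pick a size that fits your RAM):

```yaml
# docker-compose.yaml (excerpt) - give the inference container a larger /dev/shm
services:
  trtllm:
    shm_size: "16gb"
```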
This is normal! Engine building can take 10-30 minutes on first run. Subsequent starts will be much faster as engines are cached in ./data/trtllm/.
Known issue: Multimodal models (vision, audio) are not currently working with this setup, despite following NVIDIA's documentation.
If you want to experiment:
- Enable multimodal config: Set `MULTIMODAL_CONFIG=true` in `.env`
- Clear engine cache: Multimodal models need fresh engines

  ```bash
  docker-compose down
  rm -rf ./data/trtllm/*
  docker-compose up -d
  ```

- Check logs: Look for specific error messages

  ```bash
  docker-compose logs -f trtllm
  ```
For now, stick with text-only models which work reliably. If you solve multimodal support, please share your solution!
To view logs:

```bash
# All services
docker-compose logs -f

# Just TensorRT-LLM
docker-compose logs -f trtllm

# Just Open WebUI
docker-compose logs -f openwebui
```

To update the container images:

```bash
docker-compose pull
docker-compose up -d
```

If you switch models or update TensorRT, clear the engine cache:

```bash
# Stop services
docker-compose down

# Remove engine cache
rm -rf ./data/trtllm/*

# Restart
docker-compose up -d
```

Your model cache and engines are stored in ./data/. To back up:

```bash
tar -czf tensorrt-backup-$(date +%Y%m%d).tar.gz data/
```

This setup is optimized for the NVIDIA Blackwell architecture:
- Native FP4 Support: Hardware-accelerated FP4 computation
- High Memory Bandwidth: Excellent for large models
- TensorRT Optimizations: Custom kernels for Blackwell
While VLLM is excellent, it can have compatibility issues with:
- Newer GPU architectures (like Blackwell)
- Certain quantization formats (especially FP4)
- Rapid driver/CUDA version changes
TensorRT-LLM is NVIDIA's official inference solution and generally has better support for the latest hardware.
This project was developed and tested on:
Hardware:
- GPU: NVIDIA GeForce RTX 5090 (32GB VRAM)
- Driver: 580.95.05 (Open Source)
- CUDA: 13.0
Software:
- OS: Ubuntu 24.04.3 LTS
- Kernel: 6.14.0-33-generic
- Docker with NVIDIA Container Toolkit
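If you test on different hardware and want to include your setup in a PR, these commands capture the same details:

```bash
# GPU model and driver version
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
# Kernel, distribution, and Docker versions
uname -r
lsb_release -ds
docker --version
```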
Found a better model configuration? Have optimization tips? Tested more models? PRs welcome!
This configuration is provided as-is. Respect the licenses of:
- TensorRT-LLM (NVIDIA)
- Open WebUI
- Individual models you download
Project Creation:
- Docker Compose configuration: GPT-5 (OpenAI)
- Documentation & packaging: Claude 4.5 Sonnet (Anthropic)
- Late-night frustration & vision: The human who refused to give up at 2am!
Built to provide a working alternative to VLLM for Blackwell GPUs and FP4 models. Because sometimes the best solutions come from pure stubbornness!
Happy inferencing!