A Docker-based solution for running NVIDIA TensorRT-LLM optimized models with Open WebUI. This project was specifically created to enable RTX 5090 and other Blackwell-architecture GPUs to run FP4 quantized models efficiently, addressing challenges with VLLM compatibility.
- Blackwell GPU Support: Optimized for RTX 5090 and other Blackwell cards
- FP4 Model Support: Run FP4 quantized models that may not work well with VLLM
- Easy Setup: Get up and running with just a few commands
- OpenAI-Compatible API: Works with Open WebUI and other OpenAI-compatible clients
Born out of pure frustration at 2am!
After spending hours trying to get VLLM working on Blackwell with FP4 models (and failing spectacularly), I had an epiphany: "There MUST be an alternative!" And there was: NVIDIA's own TensorRT-LLM. Sometimes the solution is hiding in plain sight; you just need enough sleep deprivation to find it.
If you've been wrestling with VLLM on Blackwell GPUs, you're not alone. This project exists so you don't have to stay up till 2am like I did!
- Docker & Docker Compose
- NVIDIA GPU with appropriate drivers (especially Blackwell-series like RTX 5090)
- NVIDIA Container Toolkit
- Hugging Face account and token (only required for gated models like Llama; public models work without it)
1. Clone or navigate to this repository

2. Create your environment file

   ```bash
   cp .env.example .env
   ```

3. Edit `.env` and set your Hugging Face token

   ```bash
   nano .env  # or use your preferred editor
   ```

   Get your token from: https://huggingface.co/settings/tokens

4. Choose your model (optional)

   Edit the `MODEL_NAME` in `.env` or keep the default. Browse available models at: https://huggingface.co/nvidia

   Look for models optimized for TensorRT-LLM, especially:

   - FP4 quantized models (e.g., models with `FP4` in the name)
   - Models tagged with `tensorrt-llm`
   - Models optimized for inference

5. Start the services

   ```bash
   docker-compose up -d
   ```

6. Access Open WebUI

   Open your browser and navigate to: http://localhost:3000
```
tensorrt-openwebui/
├── docker-compose.yaml       # Service configuration
├── .env                      # Your environment variables (create from .env.example)
├── .env.example              # Template for environment variables
├── extra-llm-api-config.yml  # Multimodal model configuration
├── data/                     # Persistent data (auto-created)
│   ├── hf/                   # Hugging Face model data
│   ├── hf-cache/             # Hugging Face cache
│   └── trtllm/               # TensorRT-LLM engine cache
└── README.md                 # This file
```
Edit `.env` to customize your setup:

| Variable | Description | Default |
|---|---|---|
| `HUGGING_FACE_HUB_TOKEN` | Your HF token (required for gated models) | (empty) |
| `MODEL_NAME` | The Hugging Face model to serve | `nvidia/Qwen3-30B-A3B-FP4` |
| `MULTIMODAL_CONFIG` | Enable multimodal model support (set to `true` for vision/audio models) | (empty) |
| `TRTLLM_PORT` | Port for the TensorRT-LLM API | `8000` |
| `WEBUI_PORT` | Port for Open WebUI | `3000` |
| `TLLM_LOG_LEVEL` | Logging level (`DEBUG`, `INFO`, `WARNING`, `ERROR`) | `INFO` |
| `WEBUI_AUTH` | Enable/disable WebUI authentication | `false` |
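For reference, a filled-in `.env` might look like this (the values below are illustrative and the token is a placeholder, not a real secret):

```bash
# Example .env - adjust values to your setup
# Leave the token empty for public models
HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
MODEL_NAME=nvidia/Qwen3-30B-A3B-FP4
# Set to true only when serving vision/audio models (see the multimodal section)
MULTIMODAL_CONFIG=
TRTLLM_PORT=8000
WEBUI_PORT=3000
TLLM_LOG_LEVEL=INFO
WEBUI_AUTH=false
```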
Status: Multimodal support is still being investigated and is not fully working yet.
According to NVIDIA's documentation, multimodal models require special configuration:
- TRT-LLM multimodal is not compatible with `kv_cache_reuse`
- Multimodal models require `chat_template` (only the Chat API is supported)
Current setup (not yet working):
- Set `MULTIMODAL_CONFIG=true` in your `.env` file
- The config file `extra-llm-api-config.yml` disables block reuse as required (see the sketch below)
- Restart the services
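For reference, the override in `extra-llm-api-config.yml` boils down to turning off KV-cache block reuse. A minimal sketch, assuming the key names used by current TensorRT-LLM releases (they may differ in your version, so compare against the file shipped in this repo):

```yaml
# extra-llm-api-config.yml (sketch) - disable KV-cache block reuse,
# which NVIDIA documents as incompatible with multimodal models
kv_cache_config:
  enable_block_reuse: false
```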
However, in testing, multimodal models like nvidia/Qwen2-VL-7B-Instruct and nvidia/Phi-4-multimodal-instruct-FP4 still don't work properly. Additional configuration or compatibility fixes are needed.
Help wanted! If you figure out how to get multimodal models working with TensorRT-LLM, please open a PR or issue!
To switch models:
- Update `MODEL_NAME` in `.env`
- Restart the services:

  ```bash
  docker-compose down
  docker-compose up -d
  ```
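Put together, a model switch might look like this. The model name below is a placeholder, and clearing the engine cache (also covered under maintenance) avoids reusing engines built for the previous model:

```bash
# Point MODEL_NAME at a different Hugging Face model (placeholder shown)
sed -i 's|^MODEL_NAME=.*|MODEL_NAME=nvidia/<another-FP4-model>|' .env

# Rebuild with fresh engines
docker-compose down
rm -rf ./data/trtllm/*
docker-compose up -d
```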
Browse NVIDIA's TensorRT-LLM optimized models:
- FP4 Quantized: Best for Blackwell GPUs like RTX 5090
  - Example: `nvidia/Qwen3-30B-A3B-FP4`
- AWQ/INT4: Alternative quantization formats
- Original Models: Any model that TensorRT-LLM supports
TensorRT-LLM uses Hugging Face model identifiers:
```
<organization>/<model-name>
```

Examples:

- `nvidia/Qwen3-30B-A3B-FP4`
- `meta-llama/Meta-Llama-3-8B-Instruct` (if you have access)
Here's what we've tested on RTX 5090 (32GB VRAM):
| Model | Status | Notes |
|---|---|---|
| `nvidia/Qwen3-30B-A3B-FP4` | ✅ Works | Recommended! Solid performance |
| `nvidia/Qwen2.5-VL-7B-Instruct-FP4` | ❌ Didn't work | Multimodal - not working yet despite config attempts |
| `nvidia/Llama-4-Scout-17B-16E-Instruct-FP4` | ❌ Needs 56GB VRAM | Don't let the "17B" fool you: MoE models are HUGE. Requires HF token. |
| `nvidia/Phi-4-multimodal-instruct-FP4` | ❌ Didn't work | Multimodal - not working yet despite config attempts |
💡 Pro Tips:
- Text-only models work great! Stick with these for now
- Multimodal models are a work in progress - configuration needs more investigation
- MoE (Mixture of Experts) models like Llama-Scout have massive VRAM requirements despite "small" parameter counts
- Always check the model card for actual memory requirements!
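To check how much VRAM is actually free before pulling a large model, `nvidia-smi` can report per-GPU memory:

```bash
# Show total, used, and free VRAM for each GPU
nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv
```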
Tested your own model? Open a PR and add it to the table!
The first time you start the services:
- Model Download: TensorRT-LLM will download the model from Hugging Face (can take several minutes depending on model size)
- Engine Building: TensorRT will build optimized engines for your GPU (can take 10-30 minutes)
- Service Ready: Once complete, the API and WebUI will be available
Monitor the logs:
```bash
docker-compose logs -f trtllm
```

- Open WebUI: http://localhost:3000
- TensorRT-LLM API: http://localhost:8000
- API Documentation: http://localhost:8000/docs
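Once the engines are built, a quick way to confirm the server is up is to query it from the command line. This assumes the standard OpenAI-style `/v1/models` route is exposed (the `/docs` page above lists the exact routes for your version):

```bash
# List the models currently being served
curl -s http://localhost:8000/v1/models \
  -H "Authorization: Bearer tensorrt_llm"
```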
The TensorRT-LLM server exposes an OpenAI-compatible API:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer tensorrt_llm" \
  -d '{
    "model": "nvidia/Qwen3-30B-A3B-FP4",
    "messages": [
      {"role": "user", "content": "Hello! How are you?"}
    ]
  }'
```

Note: Getting NVIDIA drivers working on Blackwell GPUs can be a pain! Here's what worked:
The key requirements for RTX 5090 on Ubuntu 24.04:
- Use the open-source driver (`nvidia-driver-570-open` or newer, like `nvidia-driver-580-open`)
- Upgrade to kernel 6.11+ (6.14+ recommended for best stability)
- Enable Resize Bar in BIOS/UEFI (critical!)
1. Install NVIDIA Open Driver (580 or newer)
```bash
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo apt remove --purge nvidia*
sudo nvidia-installer --uninstall  # If you have it
sudo apt autoremove
sudo apt install nvidia-driver-580-open
```

2. Upgrade Linux Kernel to 6.11+ (for Ubuntu 24.04 LTS)
```bash
sudo apt install --install-recommends linux-generic-hwe-24.04 linux-headers-generic-hwe-24.04
sudo update-initramfs -u
sudo apt autoremove
```

3. Reboot

```bash
sudo reboot
```

4. Enable Resize Bar in UEFI/BIOS
- Restart and enter UEFI (usually F2, Del, or F12 during boot)
- Find and enable "Resize Bar" or "Smart Access Memory"
- This will also enable "Above 4G Decoding" and disable "CSM" (Compatibility Support Module); that's expected!
- Save and exit
5. Verify Installation
```bash
nvidia-smi
```

You should see your RTX 5090 listed!
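Since everything here runs in containers, it's also worth confirming that the NVIDIA Container Toolkit can see the GPU. NVIDIA's documented sample workload is a containerized `nvidia-smi`:

```bash
# Run nvidia-smi inside a throwaway container to verify GPU passthrough
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```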
💡 Why open drivers? The open kernel modules simply have better Blackwell support than the proprietary ones. Without Resize Bar enabled, you'll get a black screen even with correct drivers!
Credit: Solution adapted from this Reddit thread.
- Ensure your `HUGGING_FACE_HUB_TOKEN` is set in `.env`
- Check that you have access to the model on Hugging Face
- For gated models, accept the model terms on HF website first
- Try a smaller model or one with more aggressive quantization
- Increase Docker's shared memory: edit `shm_size` in `docker-compose.yaml` (see the snippet below)
- Check GPU memory usage: `nvidia-smi`
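For the shared-memory tweak, a minimal sketch of the relevant `docker-compose.yaml` excerpt (the `trtllm` service name matches the logs commands in this README; pick a size that fits your RAM):

```yaml
# docker-compose.yaml (excerpt) - give the inference container a larger /dev/shm
services:
  trtllm:
    shm_size: "16gb"
```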
This is normal! Engine building can take 10-30 minutes on first run. Subsequent starts will be much faster as engines are cached in ./data/trtllm/.
Known issue: Multimodal models (vision, audio) are not currently working with this setup, despite following NVIDIA's documentation.
If you want to experiment:
- Enable multimodal config: Set `MULTIMODAL_CONFIG=true` in `.env`
- Clear engine cache: Multimodal models need fresh engines

  ```bash
  docker-compose down
  rm -rf ./data/trtllm/*
  docker-compose up -d
  ```

- Check logs: Look for specific error messages

  ```bash
  docker-compose logs -f trtllm
  ```
For now, stick with text-only models which work reliably. If you solve multimodal support, please share your solution!
To view logs:

```bash
# All services
docker-compose logs -f

# Just TensorRT-LLM
docker-compose logs -f trtllm

# Just Open WebUI
docker-compose logs -f openwebui
```

To update the container images:

```bash
docker-compose pull
docker-compose up -d
```

If you switch models or update TensorRT, clear the engine cache:

```bash
# Stop services
docker-compose down

# Remove engine cache
rm -rf ./data/trtllm/*

# Restart
docker-compose up -d
```

Your model cache and engines are stored in ./data/. To back up:

```bash
tar -czf tensorrt-backup-$(date +%Y%m%d).tar.gz data/
```

This setup is optimized for the NVIDIA Blackwell architecture:
- Native FP4 Support: Hardware-accelerated FP4 computation
- High Memory Bandwidth: Excellent for large models
- TensorRT Optimizations: Custom kernels for Blackwell
While VLLM is excellent, it can have compatibility issues with:
- Newer GPU architectures (like Blackwell)
- Certain quantization formats (especially FP4)
- Rapid driver/CUDA version changes
TensorRT-LLM is NVIDIA's official inference solution and generally has better support for the latest hardware.
This project was developed and tested on:
Hardware:
- GPU: NVIDIA GeForce RTX 5090 (32GB VRAM)
- Driver: 580.95.05 (Open Source)
- CUDA: 13.0
Software:
- OS: Ubuntu 24.04.3 LTS
- Kernel: 6.14.0-33-generic
- Docker with NVIDIA Container Toolkit
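If you test on different hardware and want to include your setup in a PR, these commands capture the same details:

```bash
# GPU model and driver version
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
# Kernel, distribution, and Docker versions
uname -r
lsb_release -ds
docker --version
```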
Found a better model configuration? Have optimization tips? Tested more models? PRs welcome!
This configuration is provided as-is. Respect the licenses of:
- TensorRT-LLM (NVIDIA)
- Open WebUI
- Individual models you download
Project Creation:
- Docker Compose configuration: GPT-5 (OpenAI)
- Documentation & packaging: Claude 4.5 Sonnet (Anthropic)
- Late-night frustration & vision: The human who refused to give up at 2am!
Built to provide a working alternative to VLLM for Blackwell GPUs and FP4 models. Because sometimes the best solutions come from pure stubbornness!
Happy inferencing!