TensorRT-LLM + Open WebUI

A Docker-based solution for running NVIDIA TensorRT-LLM optimized models with Open WebUI. This project was specifically created to enable RTX 5090 and other Blackwell-architecture GPUs to run FP4 quantized models efficiently, addressing challenges with VLLM compatibility.

🚀 Why This Project?

  • Blackwell GPU Support: Optimized for RTX 5090 and other Blackwell cards
  • FP4 Model Support: Run FP4 quantized models that may not work well with VLLM
  • Easy Setup: Get up and running with just a few commands
  • OpenAI-Compatible API: Works with Open WebUI and other OpenAI-compatible clients

The Origin Story

Born out of pure frustration at 2am! 🌙☕

After spending hours trying to get VLLM working on Blackwell with FP4 models (and failing spectacularly), I had an epiphany: "There MUST be an alternative!" And there was: NVIDIA's own TensorRT-LLM. Sometimes the solution is hiding in plain sight; you just need enough sleep deprivation to find it.

If you've been wrestling with VLLM on Blackwell GPUs, you're not alone. This project exists so you don't have to stay up till 2am like I did!

🎯 Quick Start

Prerequisites

  • Docker & Docker Compose
  • NVIDIA GPU with appropriate drivers (especially Blackwell-series like RTX 5090)
  • NVIDIA Container Toolkit (a quick sanity check follows this list)
  • Hugging Face account and token (only required for gated models like Llama; public models work without it)
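
To confirm Docker can actually see your GPU before pulling large images, you can run a throwaway CUDA container (the image tag below is just an example; any recent nvidia/cuda tag works):

    docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu24.04 nvidia-smi

If this prints your GPU table, the NVIDIA Container Toolkit is wired up correctly.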

Setup

  1. Clone or navigate to this repository

  2. Create your environment file

    cp .env.example .env
  3. Edit .env and set your Hugging Face token

    nano .env  # or use your preferred editor

    Get your token from: https://huggingface.co/settings/tokens

  4. Choose your model (optional)

    Edit the MODEL_NAME in .env or keep the default. Browse available models at:

    πŸ” https://huggingface.co/nvidia

    Look for models optimized for TensorRT-LLM, especially:

    • FP4 quantized models (e.g., models with FP4 in the name)
    • Models tagged with tensorrt-llm
    • Models optimized for inference
  5. Start the services

    docker-compose up -d
  6. Access Open WebUI

    Open your browser and navigate to: http://localhost:3000
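
Once the containers are up, you can also confirm from the command line that both services are running:

    docker-compose ps

The TensorRT-LLM container may take a while to become ready on first start while it downloads the model and builds engines (see First Run below).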

πŸ“ Project Structure

tensorrt-openwebui/
├── docker-compose.yaml           # Service configuration
├── .env                          # Your environment variables (create from .env.example)
├── .env.example                  # Template for environment variables
├── extra-llm-api-config.yml      # Multimodal model configuration
├── data/                         # Persistent data (auto-created)
│   ├── hf/                       # Hugging Face model data
│   ├── hf-cache/                 # Hugging Face cache
│   └── trtllm/                   # TensorRT-LLM engine cache
└── README.md                     # This file

🔧 Configuration

Environment Variables

Edit .env to customize your setup:

Variable | Description | Default
HUGGING_FACE_HUB_TOKEN | Your HF token (required for gated models) | (empty)
MODEL_NAME | The Hugging Face model to serve | nvidia/Qwen3-30B-A3B-FP4
MULTIMODAL_CONFIG | Enable multimodal model support (set to true for vision/audio models) | (empty)
TRTLLM_PORT | Port for the TensorRT-LLM API | 8000
WEBUI_PORT | Port for Open WebUI | 3000
TLLM_LOG_LEVEL | Logging level (DEBUG, INFO, WARNING, ERROR) | INFO
WEBUI_AUTH | Enable/disable WebUI authentication | false
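
Putting these together, a typical .env might look like this (the token value is a placeholder and is only needed for gated models):

    HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
    MODEL_NAME=nvidia/Qwen3-30B-A3B-FP4
    MULTIMODAL_CONFIG=
    TRTLLM_PORT=8000
    WEBUI_PORT=3000
    TLLM_LOG_LEVEL=INFO
    WEBUI_AUTH=false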

Multimodal Models (Vision, Audio, etc.) - ⚠️ Work in Progress

Status: Multimodal support is still being investigated and is not fully working yet.

According to NVIDIA's documentation, multimodal models require special configuration:

  • TRT-LLM multimodal is not compatible with kv_cache_reuse
  • Multimodal models require chat_template (only Chat API supported)

Current setup (not yet working):

  1. Set MULTIMODAL_CONFIG=true in your .env file
  2. The config file extra-llm-api-config.yml disables block reuse as required (a sketch of this file appears at the end of this section)
  3. Restart the services

However, in testing, multimodal models like nvidia/Qwen2-VL-7B-Instruct and nvidia/Phi-4-multimodal-instruct-FP4 still don't work properly. Additional configuration or compatibility fixes are needed.

Help wanted! If you figure out how to get multimodal models working with TensorRT-LLM, please open a PR or issue!
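
For reference, a minimal extra-llm-api-config.yml that disables block reuse might look like the sketch below. The key names are based on TensorRT-LLM's kv_cache_config options and may differ between releases, so treat this as illustrative and check the file shipped in this repo:

    # extra-llm-api-config.yml (illustrative; exact keys may vary by TensorRT-LLM version)
    kv_cache_config:
      enable_block_reuse: false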

Changing Models

To switch models:

  1. Update MODEL_NAME in .env
  2. Restart the services:
    docker-compose down
    docker-compose up -d

πŸ” Finding Models

Browse NVIDIA's TensorRT-LLM optimized models:

https://huggingface.co/nvidia

Recommended Model Types

  • FP4 Quantized: Best for Blackwell GPUs like RTX 5090

    • Example: nvidia/Qwen3-30B-A3B-FP4
  • AWQ/INT4: Alternative quantization formats

  • Original Models: Any model that TensorRT-LLM supports

Model Naming

TensorRT-LLM uses Hugging Face model identifiers:

<organization>/<model-name>

Examples:

  • nvidia/Qwen3-30B-A3B-FP4
  • meta-llama/Meta-Llama-3-8B-Instruct (if you have access)

🧪 Models Tested

Here's what we've tested on RTX 5090 (32GB VRAM):

Model | Status | Notes
nvidia/Qwen3-30B-A3B-FP4 | ✅ Works | Recommended! Solid performance
nvidia/Qwen2.5-VL-7B-Instruct-FP4 | ❌ Didn't work | Multimodal; not working yet despite config attempts
nvidia/Llama-4-Scout-17B-16E-Instruct-FP4 | ⚠️ Too large | 56GB VRAM required! Don't let the "17B" fool you: MoE models are HUGE. Requires HF token.
nvidia/Phi-4-multimodal-instruct-FP4 | ❌ Didn't work | Multimodal; not working yet despite config attempts

💡 Pro Tips:

  • Text-only models work great! Stick with these for now
  • Multimodal models are a work in progress - configuration needs more investigation
  • MoE (Mixture of Experts) models like Llama-Scout have massive VRAM requirements despite "small" parameter counts
  • Always check the model card for actual memory requirements!

Tested your own model? Open a PR and add it to the table!

🖥️ Usage

First Run

The first time you start the services:

  1. Model Download: TensorRT-LLM will download the model from Hugging Face (can take several minutes depending on model size)
  2. Engine Building: TensorRT will build optimized engines for your GPU (can take 10-30 minutes)
  3. Service Ready: Once complete, the API and WebUI will be available

Monitor the logs:

docker-compose logs -f trtllm

Accessing Services

  • Open WebUI: http://localhost:3000 (or your configured WEBUI_PORT)
  • TensorRT-LLM API: http://localhost:8000 (or your configured TRTLLM_PORT)

API Usage

The TensorRT-LLM server exposes an OpenAI-compatible API:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer tensorrt_llm" \
  -d '{
    "model": "nvidia/Qwen3-30B-A3B-FP4",
    "messages": [
      {"role": "user", "content": "Hello! How are you?"}
    ]
  }'
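
If you need to confirm the exact model id the server exposes (the "model" field above), the standard OpenAI-style listing endpoint should also work, assuming the server implements it:

    curl http://localhost:8000/v1/models \
      -H "Authorization: Bearer tensorrt_llm"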

πŸ› οΈ Troubleshooting

Installing NVIDIA Drivers on Ubuntu (Blackwell/RTX 5090)

Note: Getting NVIDIA drivers working on Blackwell GPUs can be a pain! Here's what worked:

The key requirements for RTX 5090 on Ubuntu 24.04:

  1. Use the open-source driver (nvidia-driver-570-open or newer, like nvidia-driver-580-open)
  2. Upgrade to kernel 6.11+ (6.14+ recommended for best stability)
  3. Enable Resize Bar in BIOS/UEFI (critical!)

Step-by-Step Instructions

1. Install NVIDIA Open Driver (580 or newer)

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo apt remove --purge 'nvidia*'
sudo nvidia-installer --uninstall  # If you have it
sudo apt autoremove
sudo apt install nvidia-driver-580-open

2. Upgrade Linux Kernel to 6.11+ (for Ubuntu 24.04 LTS)

sudo apt install --install-recommends linux-generic-hwe-24.04 linux-headers-generic-hwe-24.04
sudo update-initramfs -u
sudo apt autoremove

3. Reboot

sudo reboot

4. Enable Resize Bar in UEFI/BIOS

  • Restart and enter UEFI (usually F2, Del, or F12 during boot)
  • Find and enable "Resize Bar" or "Smart Access Memory"
  • This will also enable "Above 4G Decoding" and disable "CSM" (Compatibility Support Module); that's expected!
  • Save and exit

5. Verify Installation

nvidia-smi

You should see your RTX 5090 listed!
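
You can also double-check that the HWE kernel from step 2 is the one actually running:

    uname -r

It should report 6.11 or newer.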

💡 Why open drivers? NVIDIA now recommends (and defaults to) the open kernel modules for recent GPU architectures, and they have better support for Blackwell GPUs. Without Resize Bar enabled, you'll get a black screen even with correct drivers!

Credit: Solution adapted from a Reddit thread.

Model Download Issues

  • Ensure your HUGGING_FACE_HUB_TOKEN is set in .env
  • Check that you have access to the model on Hugging Face
  • For gated models, accept the model terms on HF website first

Out of Memory

  • Try a smaller model or one with more aggressive quantization
  • Increase Docker's shared memory: edit shm_size in docker-compose.yaml (see the snippet after this list)
  • Check GPU memory usage: nvidia-smi
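
As a rough sketch, the shared-memory setting lives on the TensorRT-LLM service in docker-compose.yaml (the value below is only an example; size it to your system RAM):

    services:
      trtllm:
        shm_size: "16gb"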

Slow First Startup

This is normal! Engine building can take 10-30 minutes on first run. Subsequent starts will be much faster as engines are cached in ./data/trtllm/.

Multimodal Models Not Working

Known issue: Multimodal models (vision, audio) are not currently working with this setup, despite following NVIDIA's documentation.

If you want to experiment:

  1. Enable multimodal config: Set MULTIMODAL_CONFIG=true in .env
  2. Clear engine cache: Multimodal models need fresh engines
    docker-compose down
    rm -rf ./data/trtllm/*
    docker-compose up -d
  3. Check logs: Look for specific error messages
    docker-compose logs -f trtllm

For now, stick with text-only models which work reliably. If you solve multimodal support, please share your solution!

Check Logs

# All services
docker-compose logs -f

# Just TensorRT-LLM
docker-compose logs -f trtllm

# Just Open WebUI
docker-compose logs -f openwebui

🔄 Maintenance

Update Images

docker-compose pull
docker-compose up -d

Clean Engine Cache

If you switch models or update TensorRT, clear the engine cache:

# Stop services
docker-compose down

# Remove engine cache
rm -rf ./data/trtllm/*

# Restart
docker-compose up -d

Backup Your Data

Your model cache and engines are stored in ./data/. To backup:

tar -czf tensorrt-backup-$(date +%Y%m%d).tar.gz data/
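
To restore a backup, stop the services and extract the archive back into the project directory:

    docker-compose down
    tar -xzf tensorrt-backup-YYYYMMDD.tar.gz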

📊 Performance Notes

RTX 5090 & Blackwell Cards

This setup is optimized for NVIDIA Blackwell architecture:

  • Native FP4 Support: Hardware-accelerated FP4 computation
  • High Memory Bandwidth: Excellent for large models
  • TensorRT Optimizations: Custom kernels for Blackwell

Why Not VLLM?

While VLLM is excellent, it can have compatibility issues with:

  • Newer GPU architectures (like Blackwell)
  • Certain quantization formats (especially FP4)
  • Rapid driver/CUDA version changes

TensorRT-LLM is NVIDIA's official inference solution and generally has better support for the latest hardware.

💻 Development Rig

This project was developed and tested on:

Hardware:

  • GPU: NVIDIA GeForce RTX 5090 (32GB VRAM)
  • Driver: 580.95.05 (Open Source)
  • CUDA: 13.0

Software:

  • OS: Ubuntu 24.04.3 LTS
  • Kernel: 6.14.0-33-generic
  • Docker with NVIDIA Container Toolkit

🤝 Contributing

Found a better model configuration? Have optimization tips? Tested more models? PRs welcome!

πŸ“ License

This configuration is provided as-is. Respect the licenses of:

  • TensorRT-LLM (NVIDIA)
  • Open WebUI
  • Individual models you download

πŸ™ Credits

Project Creation:

  • Docker Compose configuration: GPT-5 (OpenAI)
  • Documentation & packaging: Claude 4.5 Sonnet (Anthropic)
  • Late-night frustration & vision: The human who refused to give up at 2am!

Built to provide a working alternative to VLLM for Blackwell GPUs and FP4 models. Because sometimes the best solutions come from pure stubbornness! 😤


Happy inferencing! 🚀
