THIS PROJECT IS UNDER ACTIVE DEVELOPMENT. It is not yet ready for production use, but it serves as a good reference for how to run Whisper on Qualcomm and Intel NPUs.
A portable, packaged OpenAI-compatible server for Windows desktop applications. LLM chat completions, audio transcription, embeddings, and a vector DB, all out of the box.
Note that this requires you to run the .exe as a separate async process, like a local server running alongside your application, and to make HTTP requests to it for inference.
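For example, a host application can launch the server as a child process, wait for the health endpoint to come up, and then send requests. The sketch below is a minimal Python illustration; the executable path, host, and port are assumptions for a typical setup, not fixed values.

```python
import subprocess
import time
import urllib.request

# Assumed paths/ports for this sketch; adjust for your install.
SERVER_EXE = r".\dist\fluid-server.exe"
BASE_URL = "http://localhost:8080"

# Launch fluid-server as a separate process alongside the application.
server = subprocess.Popen([SERVER_EXE, "--host", "127.0.0.1", "--port", "8080"])

# Poll the health endpoint until the server is ready to accept requests.
for _ in range(60):
    try:
        with urllib.request.urlopen(f"{BASE_URL}/health", timeout=1) as resp:
            if resp.status == 200:
                break
    except OSError:
        pass
    time.sleep(1)
else:
    server.terminate()
    raise RuntimeError("fluid-server did not become healthy in time")

# ... make OpenAI-compatible requests against BASE_URL/v1 here ...

# Shut the server down when the application exits.
server.terminate()
```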
Core Capabilities
- LLM Chat Completions - OpenAI-compatible API with streaming, backed by llama.cpp and OpenVINO
- Audio Transcription - Whisper models with NPU acceleration, backed by OpenVINO and Qualcomm QNN
- Text Embeddings - Vector embeddings for search and RAG (a short example follows this list)
- Vector Database - LanceDB integration for multimodal storage
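Chat and transcription calls are shown further down; as a rough sketch of the embeddings capability, the snippet below posts to the OpenAI-compatible /v1/embeddings endpoint. The model name is a placeholder assumption; check /v1/models for what is actually available.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

# "my-embedding-model" is a placeholder; pick a real model from /v1/models.
response = client.embeddings.create(
    model="my-embedding-model",
    input=["fluid-server runs LLMs, Whisper, and embeddings locally"],
)
vector = response.data[0].embedding
print(len(vector), "dimensions")
```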
Hardware Acceleration
- Intel NPU via OpenVINO backend
- Qualcomm NPU via QNN (Snapdragon X Elite)
- Vulkan GPU via llama-cpp
Option A: Download Release
- Download fluid-server.exe from releases
Option B: Run from Source
```
# Install dependencies and run
uv sync
uv run
```
Running the Server
```
# Run with default settings
.\dist\fluid-server.exe

# Or with custom options
.\dist\fluid-server.exe --host 127.0.0.1 --port 8080
```
- Health Check: http://localhost:8080/health
- API Docs: http://localhost:8080/docs
- Models: http://localhost:8080/v1/models
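A quick way to confirm the server is reachable is to list the loaded models through the OpenAI client (a small sketch; it simply wraps GET /v1/models):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

# Lists whatever models the server has discovered (GET /v1/models).
for model in client.models.list():
    print(model.id)
```

The chat completion and transcription examples below use the same base URL.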
```
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-8b-int8-ov", "messages": [{"role": "user", "content": "Hello!"}]}'
```
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")
# Chat with streaming
for chunk in client.chat.completions.create(
    model="qwen3-8b-int8-ov",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
):
    print(chunk.choices[0].delta.content or "", end="")
```
```
curl -X POST http://localhost:8080/v1/audio/transcriptions \
  -F "[email protected]" \
  -F "model=whisper-large-v3-turbo-qnn"
```
📖 Comprehensive Guides
- NPU Support Guide - Intel & Qualcomm NPU configuration
- Integration Guide - Python, .NET, Node.js examples
- Development Guide - Setup, building, and contributing
- LanceDB Integration - Vector database and embeddings
- GGUF Model Support - Using any GGUF model
- Compilation Guide - Build system details
Why Python? Best ML ecosystem support and PyInstaller packaging.
Why not llama.cpp? We support multiple runtimes and AI accelerators beyond GGML.
Built using ty, FastAPI, Pydantic, ONNX Runtime, OpenAI Whisper, and various other AI libraries.
Runtime Technologies:
- OpenVINO - Intel NPU and GPU acceleration
- Qualcomm QNN - Snapdragon NPU optimization with HTP backend
- ONNX Runtime - Cross-platform AI inference