System Info
OS version
- Operating System: Amazon Linux 2023.8.20250915
- CPE OS Name: cpe:2.3:o:amazon:amazon_linux:2023
- Kernel: Linux 6.1.150-174.273.amzn2023.x86_64
- Architecture: x86-64
- Hardware Vendor: Amazon EC2
- Hardware Model: g4dn.12xlarge
GPU
1x Tesla T4
Model
Qwen/Qwen3-Embedding-0.6B
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
When running Qwen/Qwen3-Embedding-0.6B with the NVIDIA Turing Docker image (i.e. ghcr.io/huggingface/text-embeddings-inference:turing-1.8.3), I run into a CUDA_ERROR_OUT_OF_MEMORY error. However, when using the generic CUDA Docker image (i.e. ghcr.io/huggingface/text-embeddings-inference:cuda-1.8.3), it works correctly.
- Error with:
  docker run --gpus all -e MODEL_ID=Qwen/Qwen3-Embedding-0.6B -e AUTO_TRUNCATE=true ghcr.io/huggingface/text-embeddings-inference:turing-1.8.3
- Works with:
  docker run --gpus all -e MODEL_ID=Qwen/Qwen3-Embedding-0.6B -e AUTO_TRUNCATE=true ghcr.io/huggingface/text-embeddings-inference:cuda-1.8.3
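For reference, this is how I confirm the working container actually becomes healthy. The `-p 8080:80` port mapping is an addition for illustration only; it is not part of the commands above.

```sh
# Assumption: container port 80 published to host port 8080 (not in the original command).
docker run --gpus all -p 8080:80 \
  -e MODEL_ID=Qwen/Qwen3-Embedding-0.6B -e AUTO_TRUNCATE=true \
  ghcr.io/huggingface/text-embeddings-inference:cuda-1.8.3

# Once the model has loaded, TEI's health route returns HTTP 200.
curl -i http://localhost:8080/health
```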
Error: Model backend is not healthy
Caused by:
DriverError(CUDA_ERROR_OUT_OF_MEMORY, "out of memory")
Expected behavior
TEI should launch correctly with docker run --gpus all -e MODEL_ID=Qwen/Qwen3-Embedding-0.6B -e AUTO_TRUNCATE=true ghcr.io/huggingface/text-embeddings-inference:turing-1.8.3 on a Tesla T4 GPU, as it does with the cuda-1.8.3 image.
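Once the turing-1.8.3 container launches as expected, a minimal embedding request like the sketch below should return a vector. The port mapping (-p 8080:80) and the input text are illustrative assumptions, not part of the original report.

```sh
# Assumption: the container was started with -p 8080:80 added to the command above.
curl http://localhost:8080/embed \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"inputs": "What is Deep Learning?"}'
```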