Description
Feature request
Would it be possible to add a CLI warm-up option that sends warm-up requests to arbitrary models before the server starts accepting traffic? From what I understand, the warm-up requests would cover the embed, classify, and rerank task types. Each would essentially just send a small dummy request so that the first real inference request is not slow.
Motivation
Currently, I believe only the Flash implementations of certain models perform an automatic warm-up. It would be nice to have a CLI argument that can perform a warm-up call for any model served using TEI.
Right now, most models have a very slow first request:
For sentence-transformers/all-MiniLM-L6-v2 - 1.6s
{"timestamp":"2025-04-11T17:02:47.952264Z","level":"INFO","message":"Args { model_id: \"/dat*/*****/***-******-*6-v2\", revision: None, tokenization_workers: Some(2), dtype: Some(Float32), pooling: Some(Mean), max_concurrent_requests: 512, max_batch_tokens: 65536, max_batch_requests: None, max_client_batch_size: 128, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: \"0.0.0.0\", port: 8000, uds_path: \"/tmp/text-embeddings-inference-server\", huggingface_hub_cache: Some(\"/root/.cache\"), payload_limit: 2000000, api_key: None, json_output: true, otlp_endpoint: None, otlp_service_name: \"text-embeddings-inference.server\", cors_allow_origin: None }","target":"text_embeddings_router","filename":"router/src/main.rs","line_number":175}
{"timestamp":"2025-04-11T17:02:47.962852Z","level":"WARN","message":"Could not find a Sentence Transformers config","target":"text_embeddings_router","filename":"router/src/lib.rs","line_number":184}
{"timestamp":"2025-04-11T17:02:47.962879Z","level":"INFO","message":"Maximum number of tokens per request: 512","target":"text_embeddings_router","filename":"router/src/lib.rs","line_number":188}
{"timestamp":"2025-04-11T17:02:47.962896Z","level":"INFO","message":"Starting 2 tokenization workers","target":"text_embeddings_core::tokenization","filename":"core/src/tokenization.rs","line_number":28}
{"timestamp":"2025-04-11T17:02:47.975107Z","level":"INFO","message":"Starting model backend","target":"text_embeddings_router","filename":"router/src/lib.rs","line_number":230}
{"timestamp":"2025-04-11T17:02:48.353249Z","level":"INFO","message":"Starting Bert model on Cuda(CudaDevice(DeviceId(1)))","target":"text_embeddings_backend_candle","filename":"backends/candle/src/lib.rs","line_number":275}
{"timestamp":"2025-04-11T17:03:03.652186Z","level":"INFO","message":"Starting HTTP server: 0.0.0.0:8000","target":"text_embeddings_router::http::server","filename":"router/src/http/server.rs","line_number":1812}
{"timestamp":"2025-04-11T17:03:03.652213Z","level":"INFO","message":"Ready","target":"text_embeddings_router::http::server","filename":"router/src/http/server.rs","line_number":1813}
{"timestamp":"2025-04-11T17:03:07.979256Z","level":"INFO","message":"Success","target":"text_embeddings_router::http::server","filename":"router/src/http/server.rs","line_number":714,"span":{"inference_time":"1.660399408s","queue_time":"400.617µs","tokenization_time":"171.673µs","total_time":"1.661061439s","name":"embed"},"spans":
BAAI/bge-reranker-base - 1.2s
{"timestamp":"2025-04-11T16:17:27.084973Z","level":"INFO","message":"Args { model_id: \"/dat*/*****/***-********-*ase\", revision: None, tokenization_workers: Some(2), dtype: Some(Float32), pooling: None, max_concurrent_requests: 512, max_batch_tokens: 65536, max_batch_requests: None, max_client_batch_size: 128, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: \"0.0.0.0\", port: 8000, uds_path: \"/tmp/text-embeddings-inference-server\", huggingface_hub_cache: Some(\"/root/.cache\"), payload_limit: 2000000, api_key: None, json_output: true, otlp_endpoint: None, otlp_service_name: \"text-embeddings-inference.server\", cors_allow_origin: None }","target":"text_embeddings_router","filename":"router/src/main.rs","line_number":175}
{"timestamp":"2025-04-11T16:17:27.569870Z","level":"WARN","message":"Could not find a Sentence Transformers config","target":"text_embeddings_router","filename":"router/src/lib.rs","line_number":184}
{"timestamp":"2025-04-11T16:17:27.569895Z","level":"INFO","message":"Maximum number of tokens per request: 512","target":"text_embeddings_router","filename":"router/src/lib.rs","line_number":188}
{"timestamp":"2025-04-11T16:17:27.569918Z","level":"INFO","message":"Starting 2 tokenization workers","target":"text_embeddings_core::tokenization","filename":"core/src/tokenization.rs","line_number":28}
{"timestamp":"2025-04-11T16:17:28.303617Z","level":"INFO","message":"Starting model backend","target":"text_embeddings_router","filename":"router/src/lib.rs","line_number":230}
{"timestamp":"2025-04-11T16:17:28.449523Z","level":"INFO","message":"Starting Bert model on Cuda(CudaDevice(DeviceId(1)))","target":"text_embeddings_backend_candle","filename":"backends/candle/src/lib.rs","line_number":297}
{"timestamp":"2025-04-11T16:17:37.199346Z","level":"INFO","message":"Starting HTTP server: 0.0.0.0:8000","target":"text_embeddings_router::http::server","filename":"router/src/http/server.rs","line_number":1812}
{"timestamp":"2025-04-11T16:17:37.199367Z","level":"INFO","message":"Ready","target":"text_embeddings_router::http::server","filename":"router/src/http/server.rs","line_number":1813}
{"timestamp":"2025-04-11T16:17:39.675045Z","level":"INFO","message":"Success","target":"text_embeddings_router::http::server","filename":"router/src/http/server.rs","line_number":459,"span":{"inference_time":"1.227423716s","queue_time":"243.605µs","tokenization_time":"195.464µs","total_time":"1.227960617s","name":"rerank"},"spans":[{"inference_time":"1.227423716s","queue_time":"243.605µs","tokenization_time":"195.464µs","total_time":"1.227960617s","name":"rerank"}]}
I can provide examples for more models if required.
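For reference, the kind of warm-up the flag would automate can already be done client-side today. Below is a minimal sketch using only the Python standard library. The endpoint paths (`/embed`, `/rerank`, `/predict`) follow the TEI HTTP API; the dummy payload contents, the helper names, and the base URL are assumptions for illustration, not part of any existing CLI.

```python
import json
import urllib.request


def warmup_payloads() -> dict:
    """One tiny dummy request body per task type (illustrative payloads)."""
    return {
        "/embed": {"inputs": "warmup"},
        "/rerank": {"query": "warmup", "texts": ["warmup"]},
        "/predict": {"inputs": "warmup"},
    }


def warmup(base_url: str = "http://0.0.0.0:8000") -> None:
    """Send one small request to each endpoint so the first real call is fast.

    A given deployment only serves one task type, so endpoints the model does
    not support will return an error; we ignore those here.
    """
    for path, payload in warmup_payloads().items():
        req = urllib.request.Request(
            base_url + path,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        try:
            urllib.request.urlopen(req, timeout=30).read()
        except Exception:
            pass  # endpoint not supported by this model, or not up yet
```

A built-in flag would avoid every deployment having to ship a sidecar script like this and would let the server report "Ready" only after the warm-up pass has completed.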
Your contribution
I can help with testing and verifying the fix if required!
cc @Narsil @alvarobartt