CLI parameter to enable warm-up #580

@vrdn-23

Feature request

Would it be possible to have a CLI parameter that sends warm-up requests before the server starts accepting traffic, for arbitrary models? From what I understand, the warm-up requests would be of type embed, classify, or rerank, depending on the model. It would essentially just send a small dummy request so that the first real inference request is not slow.
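
To make the request concrete, here is a rough client-side sketch of what such a warm-up could look like, run against an already started server. It is not how TEI does or would implement this. The routes (`/embed`, `/predict`, `/rerank`) and JSON bodies reflect my understanding of TEI's HTTP API, and the helper names `warmup_request` and `warm_up`, the dummy text, and the base URL are all hypothetical choices made for illustration.

```python
import json
import urllib.request

# Assumed base URL; the logs in this issue show the server listening on port 8000.
BASE_URL = "http://localhost:8000"


def warmup_request(task: str) -> tuple[str, dict]:
    """Return (route, JSON body) for a tiny dummy request of the given task.

    The routes and payload shapes are assumptions about TEI's HTTP API.
    """
    if task == "embed":
        return "/embed", {"inputs": "warm-up"}
    if task == "classify":
        return "/predict", {"inputs": "warm-up"}
    if task == "rerank":
        return "/rerank", {"query": "warm-up", "texts": ["warm-up"]}
    raise ValueError(f"unknown task: {task!r}")


def warm_up(task: str, base_url: str = BASE_URL, timeout: float = 60.0) -> int:
    """Send one dummy request and return the HTTP status code."""
    route, body = warmup_request(task)
    req = urllib.request.Request(
        base_url + route,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # The generous timeout allows for the slow first inference this issue describes.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Calling `warm_up("embed")` (or `"rerank"`, `"classify"`) once the server logs `Ready` would absorb the slow first inference before real traffic arrives. A built-in CLI option would be preferable because it could run before the server reports ready at all.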

Motivation

Currently, I think only the Flash implementations of certain models do an automatic warm-up, but it would be nice to have a CLI argument that performs a warm-up call for any model served using TEI.
At the moment, most models have a very slow first request, as the `inference_time` in the last log line of each example below shows:

For sentence-transformers/all-MiniLM-L6-v2 - 1.6s

{"timestamp":"2025-04-11T17:02:47.952264Z","level":"INFO","message":"Args { model_id: \"/dat*/*****/***-******-*6-v2\", revision: None, tokenization_workers: Some(2), dtype: Some(Float32), pooling: Some(Mean), max_concurrent_requests: 512, max_batch_tokens: 65536, max_batch_requests: None, max_client_batch_size: 128, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: \"0.0.0.0\", port: 8000, uds_path: \"/tmp/text-embeddings-inference-server\", huggingface_hub_cache: Some(\"/root/.cache\"), payload_limit: 2000000, api_key: None, json_output: true, otlp_endpoint: None, otlp_service_name: \"text-embeddings-inference.server\", cors_allow_origin: None }","target":"text_embeddings_router","filename":"router/src/main.rs","line_number":175}
{"timestamp":"2025-04-11T17:02:47.962852Z","level":"WARN","message":"Could not find a Sentence Transformers config","target":"text_embeddings_router","filename":"router/src/lib.rs","line_number":184}
{"timestamp":"2025-04-11T17:02:47.962879Z","level":"INFO","message":"Maximum number of tokens per request: 512","target":"text_embeddings_router","filename":"router/src/lib.rs","line_number":188}
{"timestamp":"2025-04-11T17:02:47.962896Z","level":"INFO","message":"Starting 2 tokenization workers","target":"text_embeddings_core::tokenization","filename":"core/src/tokenization.rs","line_number":28}
{"timestamp":"2025-04-11T17:02:47.975107Z","level":"INFO","message":"Starting model backend","target":"text_embeddings_router","filename":"router/src/lib.rs","line_number":230}
{"timestamp":"2025-04-11T17:02:48.353249Z","level":"INFO","message":"Starting Bert model on Cuda(CudaDevice(DeviceId(1)))","target":"text_embeddings_backend_candle","filename":"backends/candle/src/lib.rs","line_number":275}
{"timestamp":"2025-04-11T17:03:03.652186Z","level":"INFO","message":"Starting HTTP server: 0.0.0.0:8000","target":"text_embeddings_router::http::server","filename":"router/src/http/server.rs","line_number":1812}
{"timestamp":"2025-04-11T17:03:03.652213Z","level":"INFO","message":"Ready","target":"text_embeddings_router::http::server","filename":"router/src/http/server.rs","line_number":1813}
{"timestamp":"2025-04-11T17:03:07.979256Z","level":"INFO","message":"Success","target":"text_embeddings_router::http::server","filename":"router/src/http/server.rs","line_number":714,"span":{"inference_time":"1.660399408s","queue_time":"400.617µs","tokenization_time":"171.673µs","total_time":"1.661061439s","name":"embed"},"spans":

BAAI/bge-reranker-base - 1.2s

{"timestamp":"2025-04-11T16:17:27.084973Z","level":"INFO","message":"Args { model_id: \"/dat*/*****/***-********-*ase\", revision: None, tokenization_workers: Some(2), dtype: Some(Float32), pooling: None, max_concurrent_requests: 512, max_batch_tokens: 65536, max_batch_requests: None, max_client_batch_size: 128, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: \"0.0.0.0\", port: 8000, uds_path: \"/tmp/text-embeddings-inference-server\", huggingface_hub_cache: Some(\"/root/.cache\"), payload_limit: 2000000, api_key: None, json_output: true, otlp_endpoint: None, otlp_service_name: \"text-embeddings-inference.server\", cors_allow_origin: None }","target":"text_embeddings_router","filename":"router/src/main.rs","line_number":175}
{"timestamp":"2025-04-11T16:17:27.569870Z","level":"WARN","message":"Could not find a Sentence Transformers config","target":"text_embeddings_router","filename":"router/src/lib.rs","line_number":184}
{"timestamp":"2025-04-11T16:17:27.569895Z","level":"INFO","message":"Maximum number of tokens per request: 512","target":"text_embeddings_router","filename":"router/src/lib.rs","line_number":188}
{"timestamp":"2025-04-11T16:17:27.569918Z","level":"INFO","message":"Starting 2 tokenization workers","target":"text_embeddings_core::tokenization","filename":"core/src/tokenization.rs","line_number":28}
{"timestamp":"2025-04-11T16:17:28.303617Z","level":"INFO","message":"Starting model backend","target":"text_embeddings_router","filename":"router/src/lib.rs","line_number":230}
{"timestamp":"2025-04-11T16:17:28.449523Z","level":"INFO","message":"Starting Bert model on Cuda(CudaDevice(DeviceId(1)))","target":"text_embeddings_backend_candle","filename":"backends/candle/src/lib.rs","line_number":297}
{"timestamp":"2025-04-11T16:17:37.199346Z","level":"INFO","message":"Starting HTTP server: 0.0.0.0:8000","target":"text_embeddings_router::http::server","filename":"router/src/http/server.rs","line_number":1812}
{"timestamp":"2025-04-11T16:17:37.199367Z","level":"INFO","message":"Ready","target":"text_embeddings_router::http::server","filename":"router/src/http/server.rs","line_number":1813}
{"timestamp":"2025-04-11T16:17:39.675045Z","level":"INFO","message":"Success","target":"text_embeddings_router::http::server","filename":"router/src/http/server.rs","line_number":459,"span":{"inference_time":"1.227423716s","queue_time":"243.605µs","tokenization_time":"195.464µs","total_time":"1.227960617s","name":"rerank"},"spans":[{"inference_time":"1.227423716s","queue_time":"243.605µs","tokenization_time":"195.464µs","total_time":"1.227960617s","name":"rerank"}]}

I can provide examples for more models if required.

Your contribution

I can help with testing and verifying the fix if required!
cc @Narsil @alvarobartt
