Add resumable model download with retry, timeout, and offline mode (#77)
Conversation

@janhilgard force-pushed from 47e726b to 5b9db2b

You're right, sorry about that! I've squashed everything into a single clean commit now.
@janhilgard force-pushed from 5b9db2b to ee5d6be
Large model downloads via huggingface_hub often hang or fail around 10GB. This adds a pre-download step with configurable retry/timeout before `load_model()` is called, so interrupted downloads can be resumed.

- New CLI flags for `serve`: `--download-timeout`, `--download-retries`, `--offline`
- New subcommand: `vllm-mlx download <model>` for pre-warming caches

Closes waybarrios#75

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
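The pre-download code itself isn't shown in this conversation; as a minimal sketch of the retry shape the description names (function and parameter names are hypothetical — the actual PR wraps `huggingface_hub.snapshot_download()`, which resumes partially downloaded files on the next attempt):

```python
import time

def download_with_retry(download_fn, retries=3, base_delay=1.0,
                        retry_on=(OSError, TimeoutError)):
    """Retry a download callable with exponential backoff.

    Hypothetical sketch of the pre-download step described above.
    Because huggingface_hub resumes partial files, each retry picks
    up roughly where the interrupted transfer stopped.
    """
    for attempt in range(retries + 1):
        try:
            return download_fn()
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # back off 1s, 2s, 4s, ... between attempts
            time.sleep(base_delay * (2 ** attempt))
```

The key design point is that the retry loop lives *before* `load_model()`, so a failed transfer never leaves the model loader in a half-initialized state.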
@janhilgard force-pushed from 8e75792 to a510953
@waybarrios Gentle ping on this one — I've been using the resumable download in production for a while now and it's been solid. The HF download hangs around 10GB+ are a real pain point (especially with 40-70GB MLX models), and the retry + resume in this PR has handled them cleanly. Commits are squashed as you requested. Any feedback on the implementation, or is this good to merge?
@waybarrios, @janhilgard: brief endorsement plus cross-link. The retry-with-exponential-backoff plus configurable timeout pattern is the right shape for HuggingFace download reliability. It may also partially address issue #134 (IvoLeist, "vlm-mlx serve suddenly gets stuck when getting models from the mlx-community"), if the stuck behavior is download-side rather than load-side. With this PR a stuck download would time out instead of stalling indefinitely. Last activity Mar 21, ~3 weeks ago.
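To illustrate how a configurable per-attempt timeout turns an indefinite stall into a retryable error, here is one common way to bound a blocking call in Python — a sketch with hypothetical names, not necessarily how this PR implements it:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as DownloadTimeout

def call_with_timeout(fn, timeout):
    """Run fn() but stop waiting after `timeout` seconds.

    Python can't forcibly kill the worker thread, but the caller
    gets a DownloadTimeout it can retry or surface, instead of
    hanging forever on a stuck network transfer.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn).result(timeout=timeout)
    finally:
        pool.shutdown(wait=False)  # don't block on a hung worker
```

Combined with a retry loop, a download that stalls indefinitely (the issue #134 symptom, if download-side) becomes a bounded wait followed by a resume attempt.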
@Thump604 Hey, could you take a look at this PR when you get a chance? Thanks! |
Incorporates 53 upstream commits including:

- O(1) state-machine reasoning parser (PR waybarrios#234)
- Resumable model download (PR waybarrios#77)
- Block-aware prefix cache (PR waybarrios#217)
- Message normalization (PR waybarrios#240)
- Full sampling params (PR waybarrios#258)
- ThinkRouter for Anthropic streaming
- 22 new test files
- License file, docs updates

Conflict resolution: preserved production features (frequency_penalty conversion, tool markup safety nets, openai_to_anthropic import) while adopting upstream improvements (Gemma4 parser rewrite, cleaner logging, _model_name in streaming chunks).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- Adds a pre-download step with configurable retry/timeout before `load_model()` is called, so interrupted downloads of large models can be resumed
- New CLI flags for `serve`: `--download-timeout`, `--download-retries`, `--offline`
- New subcommand: `vllm-mlx download <model>` for pre-warming HF caches (useful for CI/CD)
- Replaces the direct `snapshot_download()` call in the tokenizer fallback path with the new retry-aware wrapper

Motivation
Addresses #75 — HuggingFace downloads hang or fail around 10GB for large models with no way to resume.
Usage
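The original usage examples did not survive the export; based on the flags listed in the summary, invocations would presumably look like this (model name and values illustrative):

```shell
# Pre-warm the HF cache before deployment (useful for CI/CD)
vllm-mlx download mlx-community/Qwen3-0.6B-4bit

# Serve with a bounded download: per-attempt timeout plus retry count
vllm-mlx serve mlx-community/Qwen3-0.6B-4bit --download-timeout 300 --download-retries 5

# Offline mode: use only the local cache, never hit the network
vllm-mlx serve mlx-community/Qwen3-0.6B-4bit --offline
```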
Test plan
- Unit tests pass (`pytest tests/test_download.py -v`)
- `vllm-mlx download mlx-community/Qwen3-0.6B-4bit` succeeds
- `ruff check` and `black` pass on all changed files

🤖 Generated with Claude Code