
Add resumable model download with retry, timeout, and offline mode #77

Merged
Thump604 merged 1 commit into waybarrios:main from janhilgard:feat/resumable-download
Apr 11, 2026

Conversation

@janhilgard
Collaborator

Summary

  • Adds a pre-download step with configurable retry (exponential backoff) and timeout before load_model() is called, so interrupted downloads of large models can be resumed
  • New CLI flags for serve: --download-timeout, --download-retries, --offline
  • New standalone subcommand: vllm-mlx download <model> for pre-warming HF caches (useful for CI/CD)
  • Replaces direct snapshot_download() call in tokenizer fallback path with the new retry-aware wrapper
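The wrapper itself isn't shown in this thread, but the pattern described (exponential backoff, resumable re-invocation) can be sketched roughly as follows. Names here (`download_with_retry`, the injected `download_fn`) are illustrative, not the PR's actual API:

```python
import time

def download_with_retry(download_fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Invoke download_fn, retrying with exponential backoff on failure.

    download_fn should resume rather than restart when re-invoked --
    huggingface_hub's snapshot_download picks up partially downloaded
    files, which is what makes retrying a 10GB+ download worthwhile.
    """
    for attempt in range(retries + 1):
        try:
            return download_fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the original error
            # back off 1s, 2s, 4s, ... between attempts
            sleep(base_delay * (2 ** attempt))
```

In the real CLI this would presumably wrap `snapshot_download(model)` with the values from `--download-retries`.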

Motivation

Addresses #75 — HuggingFace downloads hang or fail around 10GB for large models with no way to resume.

Usage

# Download model to cache without starting server
vllm-mlx download mlx-community/Qwen3-Next-80B-A3B-Instruct-6bit

# Serve with custom retry/timeout
vllm-mlx serve <model> --download-timeout 600 --download-retries 5

# Offline mode (only locally cached models)
vllm-mlx serve <model> --offline

Test plan

  • 12 unit tests pass (pytest tests/test_download.py -v)
  • Manual test: vllm-mlx download mlx-community/Qwen3-0.6B-4bit succeeds
  • Manual test: nonexistent model fails with clear error message after retries
  • ruff check and black pass on all changed files

🤖 Generated with Claude Code

@waybarrios
Owner

@janhilgard
For next time, could you please organize your commits a bit better? Having so many commits in a single PR makes the changes difficult to review. I recommend squashing them all into one commit for this and future PRs.

@janhilgard janhilgard force-pushed the feat/resumable-download branch from 47e726b to 5b9db2b on February 13, 2026 at 08:48
@janhilgard
Collaborator Author

You're right, sorry about that! I've squashed everything into a single clean commit now.

@janhilgard janhilgard force-pushed the feat/resumable-download branch from 5b9db2b to ee5d6be on February 13, 2026 at 08:51
Large model downloads via huggingface_hub often hang or fail around 10GB.
This adds a pre-download step with configurable retry/timeout before
load_model() is called, so interrupted downloads can be resumed.

New CLI flags for `serve`: --download-timeout, --download-retries, --offline
New subcommand: `vllm-mlx download <model>` for pre-warming caches

Closes waybarrios#75

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@janhilgard janhilgard force-pushed the feat/resumable-download branch from 8e75792 to a510953 on February 15, 2026 at 17:15
@janhilgard
Collaborator Author

@waybarrios Gentle ping on this one — I've been using the resumable download in production for a while now and it's been solid. The HF download hangs around 10GB+ are a real pain point (especially with 40-70GB MLX models), and the retry + resume via snapshot_download with resume_download=True handles it well.

Commits are squashed as you requested. Any feedback on the implementation, or is this good to merge?

@Thump604
Collaborator

Thump604 commented Apr 8, 2026

@waybarrios, @janhilgard: brief endorsement plus cross-link.

The retry-with-exponential-backoff plus configurable timeout pattern is the right shape for HuggingFace download reliability. The new --download-timeout, --download-retries, --offline CLI flags and the standalone vllm-mlx download <model> subcommand for pre-warming caches are useful for CI/CD and for systems with intermittent network. Mergeable on current main.

May also partially address issue #134 (IvoLeist, "vlm-mlx serve suddenly gets stuck when getting models from the mlx-community"), if the stuck behavior is download-side rather than load-side. With this PR a stuck download would time out instead of stalling indefinitely.

Last activity Mar 21, ~3 weeks ago.

@janhilgard
Collaborator Author

@Thump604 Hey, could you take a look at this PR when you get a chance? Thanks!

@Thump604 Thump604 merged commit eff899e into waybarrios:main Apr 11, 2026
7 checks passed
janhilgard added a commit to janhilgard/vllm-mlx that referenced this pull request Apr 11, 2026
Incorporates 53 upstream commits including:
- O(1) state-machine reasoning parser (PR waybarrios#234)
- Resumable model download (PR waybarrios#77)
- Block-aware prefix cache (PR waybarrios#217)
- Message normalization (PR waybarrios#240)
- Full sampling params (PR waybarrios#258)
- ThinkRouter for Anthropic streaming
- 22 new test files
- License file, docs updates

Conflict resolution: preserved production features
(frequency_penalty conversion, tool markup safety nets,
openai_to_anthropic import) while adopting upstream
improvements (Gemma4 parser rewrite, cleaner logging,
_model_name in streaming chunks).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>