Add resumable model download with retry, timeout, and offline mode (#77)
Conversation

@janhilgard force-pushed from 47e726b to 5b9db2b

You're right, sorry about that! I've squashed everything into a single clean commit now.
@janhilgard force-pushed from 5b9db2b to ee5d6be
Large model downloads via huggingface_hub often hang or fail around 10GB. This adds a pre-download step with configurable retry/timeout before `load_model()` is called, so interrupted downloads can be resumed.

- New CLI flags for `serve`: `--download-timeout`, `--download-retries`, `--offline`
- New subcommand: `vllm-mlx download <model>` for pre-warming caches

Closes waybarrios#75

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
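The pre-download code itself isn't shown in this conversation; as a minimal sketch of the retry shape the description names (function and parameter names are hypothetical — the actual PR wraps `huggingface_hub.snapshot_download()`, which resumes partially downloaded files on the next attempt):

```python
import time

def download_with_retry(download_fn, retries=3, base_delay=1.0,
                        retry_on=(OSError, TimeoutError)):
    """Retry a download callable with exponential backoff.

    Hypothetical sketch of the pre-download step described above.
    Because huggingface_hub resumes partial files, each retry picks
    up roughly where the interrupted transfer stopped.
    """
    for attempt in range(retries + 1):
        try:
            return download_fn()
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # back off 1s, 2s, 4s, ... between attempts
            time.sleep(base_delay * (2 ** attempt))
```

The key design point is that the retry loop lives *before* `load_model()`, so a failed transfer never leaves the model loader in a half-initialized state.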
@janhilgard force-pushed from 8e75792 to a510953
@waybarrios Gentle ping on this one — I've been using the resumable download in production for a while now and it's been solid. The HF download hangs around 10GB+ are a real pain point (especially with 40-70GB MLX models), and the retry + resume in this PR has handled them cleanly. Commits are squashed as you requested. Any feedback on the implementation, or is this good to merge?
@waybarrios, @janhilgard: brief endorsement plus cross-link. The retry-with-exponential-backoff plus configurable timeout pattern is the right shape for HuggingFace download reliability. It may also partially address issue #134 (IvoLeist, "vlm-mlx serve suddenly gets stuck when getting models from the mlx-community"), if the stuck behavior is download-side rather than load-side. With this PR a stuck download would time out instead of stalling indefinitely. Last activity Mar 21, ~3 weeks ago.
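To illustrate how a configurable per-attempt timeout turns an indefinite stall into a retryable error, here is one common way to bound a blocking call in Python — a sketch with hypothetical names, not necessarily how this PR implements it:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as DownloadTimeout

def call_with_timeout(fn, timeout):
    """Run fn() but stop waiting after `timeout` seconds.

    Python can't forcibly kill the worker thread, but the caller
    gets a DownloadTimeout it can retry or surface, instead of
    hanging forever on a stuck network transfer.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn).result(timeout=timeout)
    finally:
        pool.shutdown(wait=False)  # don't block on a hung worker
```

Combined with a retry loop, a download that stalls indefinitely (the issue #134 symptom, if download-side) becomes a bounded wait followed by a resume attempt.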
@Thump604 Hey, could you take a look at this PR when you get a chance? Thanks! |
Incorporates 53 upstream commits including:

- O(1) state-machine reasoning parser (PR waybarrios#234)
- Resumable model download (PR waybarrios#77)
- Block-aware prefix cache (PR waybarrios#217)
- Message normalization (PR waybarrios#240)
- Full sampling params (PR waybarrios#258)
- ThinkRouter for Anthropic streaming
- 22 new test files
- License file, docs updates

Conflict resolution: preserved production features (frequency_penalty conversion, tool markup safety nets, openai_to_anthropic import) while adopting upstream improvements (Gemma4 parser rewrite, cleaner logging, _model_name in streaming chunks).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- Adds a pre-download step with configurable retry/timeout before `load_model()` is called, so interrupted downloads of large models can be resumed
- New CLI flags for `serve`: `--download-timeout`, `--download-retries`, `--offline`
- New subcommand: `vllm-mlx download <model>` for pre-warming HF caches (useful for CI/CD)
- Replaces the direct `snapshot_download()` call in the tokenizer fallback path with the new retry-aware wrapper

Motivation
Addresses #75 — HuggingFace downloads hang or fail around 10GB for large models with no way to resume.
Usage
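The original usage examples did not survive the export; based on the flags listed in the summary, invocations would presumably look like this (model name and values illustrative):

```shell
# Pre-warm the HF cache before deployment (useful for CI/CD)
vllm-mlx download mlx-community/Qwen3-0.6B-4bit

# Serve with a bounded download: per-attempt timeout plus retry count
vllm-mlx serve mlx-community/Qwen3-0.6B-4bit --download-timeout 300 --download-retries 5

# Offline mode: use only the local cache, never hit the network
vllm-mlx serve mlx-community/Qwen3-0.6B-4bit --offline
```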
Test plan
- Unit tests pass (`pytest tests/test_download.py -v`)
- `vllm-mlx download mlx-community/Qwen3-0.6B-4bit` succeeds
- `ruff check` and `black` pass on all changed files

🤖 Generated with Claude Code