Layered Dockerfile for smaller size and faster image pulling #22377
aoyshi wants to merge 4 commits
Signed-off-by: aoyshi <37639117+aoyshi@users.noreply.github.com>
Code Review
This PR introduces a layered Dockerfile to create a smaller and faster-pulling image, which is a great initiative. The approach of using a multi-stage build with a slim base image and layering dependencies is effective. My review focuses on some critical correctness and maintainability issues that should be addressed to make this Dockerfile robust and production-ready. The main concerns are an incompatible Python wheel, improper signal handling in the entrypoint, and opportunities to further reduce image size and improve maintainability.
|
Hi @aoyshi, thank you for this PR. A better approach would be to make the image slim directly. Can you clear these caches in
|
hi @aoyshi Dockerfile |
|
some info about installing flash infer: #1454 https://github.com/substratusai/vllm-docker/blob/main/Dockerfile.cuda-arm |
address PR feedback (install flashinfer and torch-c-dlpack-ext, update vllm to latest) Signed-off-by: aoyshi <37639117+aoyshi@users.noreply.github.com>
|
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you! |
|
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you! |
UPDATE (12/10/2025)
Addressed PR feedback:
Size of the official out-of-the-box `vllm-openai:latest` Docker image: 28.6 GB
Size of our slim Docker image after adding FlashInfer and updating to the latest vLLM (0.12.0): 22 GB
Overview
This PR introduces a smaller, layered alternative to the existing vLLM project's Dockerfile.
We propose that this optimized Dockerfile be made available as part of the vLLM project.
For production environments, a smaller Docker image that is pulled quickly can help with faster scale-up of new instances. We think that the proposed optimizations can also be helpful for use cases of vLLM Production Stack.
We chose to create a separate `Dockerfile.slim` instead of editing the main Dockerfile, since this method is for a more targeted use case that prioritizes a smaller size and faster image pull. We are open to discussing how to better merge these optimizations with the existing Dockerfile if possible, for a lighter production-ready version of the image that is compatible with a wider range of architectures.
Before the changes: Using existing Dockerfile in vLLM repo:
vllm-otb-img (where otb stands for 'Out-of-The-Box'):
Size: 21.2 GB
- Uses the `nvidia/cuda` base image.
- Does not use `--no-cache-dir` for pip installs, which would reduce size further.
After the changes: Using python-slim base image and chunking into layers for concurrent pulls:
vllm-slim-img:
Size: 10.0 GB
Changes made:
- `--no-cache-dir` reduces size considerably
- Chunking into layers makes `docker pull` faster (6 sequential extraction steps; extraction will still happen one by one, since Docker does not yet allow concurrent extraction)
Docker Image Pull Times
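To illustrate why splitting an image into several layers can shorten a pull even though extraction stays sequential, here is a toy model in Python. It is only a sketch under stated assumptions: the function name `estimate_pull_seconds`, the layer sizes, and the throughput figures are all hypothetical and were not measured in this PR. It assumes each download connection is individually rate-limited and that at most three layers download at once (the default for the Docker daemon's `max-concurrent-downloads` setting, as far as we know).

```python
import heapq


def estimate_pull_seconds(layer_sizes_gb, per_conn_gb_per_s=0.1,
                          extract_gb_per_s=0.5, max_concurrent=3):
    """Toy estimate of pull time: concurrent downloads, sequential extraction.

    All default rates are hypothetical placeholders, not measurements.
    """
    # Min-heap of times at which each download slot becomes free.
    slots = [0.0] * max_concurrent
    heapq.heapify(slots)

    download_done = []
    for size in layer_sizes_gb:
        start = heapq.heappop(slots)           # earliest free slot
        finish = start + size / per_conn_gb_per_s
        download_done.append(finish)
        heapq.heappush(slots, finish)

    # Layers are extracted one by one, in order, once downloaded.
    clock = 0.0
    for size, done in zip(layer_sizes_gb, download_done):
        clock = max(clock, done) + size / extract_gb_per_s
    return clock


# Same 10 GB total, as one layer versus six hypothetical layers.
single_layer = estimate_pull_seconds([10.0])
six_layers = estimate_pull_seconds([2, 2, 2, 2, 1, 1])
print(f"single layer: {single_layer:.0f} s, six layers: {six_layers:.0f} s")
```

In this model the six-layer image finishes sooner because downloads overlap with each other and with extraction of earlier layers. If the bottleneck were total network bandwidth rather than per-connection throughput, the gain would be smaller.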
Time taken for existing image before the changes: ~2 min
Time taken for slim image after the changes: ~1 min
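For reference, the relative savings implied by the figures reported in this description can be worked out as below. The helper name `percent_reduction` is ours, for illustration only. The inputs are the image sizes and approximate pull times quoted above, so the pull speedup in particular is only as precise as the "~2 min" and "~1 min" estimates.

```python
def percent_reduction(before, after):
    """Return how much smaller `after` is than `before`, as a percentage."""
    return (before - after) / before * 100.0


# Image sizes in GB, as reported in this PR description.
original_saving = percent_reduction(21.2, 10.0)  # vllm-otb-img vs vllm-slim-img
updated_saving = percent_reduction(28.6, 22.0)   # vllm-openai:latest vs slim image with FlashInfer

# Approximate pull times in minutes (~2 min before, ~1 min after).
pull_speedup = 2.0 / 1.0

print(f"original measurement: {original_saving:.1f}% smaller")  # 52.8% smaller
print(f"12/10/2025 update: {updated_saving:.1f}% smaller")      # 23.1% smaller
print(f"pull speedup: about {pull_speedup:.0f}x")
```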
Inference Performance
We also did not notice any significant difference in inference performance (latency, throughput) when using the Dockerfile before and after the changes.
Hardware:
`--dtype=bfloat16 --gpu-memory-utilization=0.95`
Test tool:
Perf Test Results:
Before the changes (existing Dockerfile, vllm-otb-img):
After the changes (modified Dockerfile, vllm-slim-img):
Signed-off-by: The MathWorks, Inc.
Arunika Oyshi: aoyshi@mathworks.com