Layered Dockerfile for smaller size and faster image pulling#22377

Closed
aoyshi wants to merge 4 commits into
vllm-project:mainfrom
aoyshi:slim-docker-image

@aoyshi aoyshi commented Aug 6, 2025

UPDATE (12/10/2025)

Addressed PR feedback:

  1. Installed FlashInfer in the same way as the main Dockerfile: https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile#L456
  2. As suggested by the vLLM startup logs, installed torch-c-dlpack-ext (the logs refer to JIT-compiling torch-c-dlpack-ext to cache in order to enable EnvTensorAllocator).

Size of official out-of-the-box vllm-openai:latest docker image: 28.6 GB

FROM vllm/vllm-openai:latest
ENV MODEL_PATH="/app/models/custom_model"
ENTRYPOINT python3 -m vllm.entrypoints.openai.api_server --model $MODEL_PATH $VLLM_ARGS

Size of our slim Docker image after adding FlashInfer and updating to the latest vLLM (0.12.0): 22 GB


Overview

This PR introduces a smaller, layered alternative to the vLLM project's existing Dockerfile. It aims to:

  1. Reduce image size by about 47% by using a python-slim base image.
  2. Reduce pull time by about 50% by creating smaller layers that can be pulled in parallel.

We propose that this optimized Dockerfile be made available as part of the vLLM project.

For production environments, a smaller Docker image that is pulled quickly can help with faster scale-up of new instances. We think that the proposed optimizations can also be helpful for use cases of vLLM Production Stack.

We chose to create a separate Dockerfile.slim instead of editing the main Dockerfile, since this approach targets a narrower use case that prioritizes a smaller size and faster image pulls. We are open to discussing how to better merge these optimizations into the existing Dockerfile, if possible, to produce a lighter production-ready version of the image that is compatible with a wider range of architectures.

Before the changes: Using existing Dockerfile in vLLM repo:

vllm-otb-img (where otb stands for 'Out-of-The-Box'):

FROM vllm/vllm-openai:latest

ENV MODEL_PATH="/app/models/custom_model"
ENTRYPOINT python3 -m vllm.entrypoints.openai.api_server --model $MODEL_PATH $VLLM_ARGS

Size: 21.2 GB

  • Most of the bloat comes from the nvidia/cuda base image.
  • The pip installs do not use --no-cache-dir, which would reduce the size further.

$ dive vllm-otb-img
Permission     UID:GID       Size  Filetree                                                        
drwxr-xr-x         0:0      20 GB  ├── usr                                                         
drwxr-xr-x         0:0      18 GB  │   ├── local                                                   
drwxr-xr-x         0:0      11 GB  │   │   ├── lib                                                 
drwxr-xr-x         0:0      11 GB  │   │   │   ├── python3.12                                      
drwxr-xr-x         0:0      11 GB  │   │   │   │   └── dist-packages                               
drwxr-xr-x         0:0     4.0 GB  │   │   │   │       ├─⊕ nvidia                                  
drwxr-xr-x         0:0     1.9 GB  │   │   │   │       ├─⊕ torch                                   
drwxr-xr-x         0:0     1.2 GB  │   │   │   │       ├─⊕ vllm                                    
drwxr-xr-x         0:0     916 MB  │   │   │   │       ├─⊕ flashinfer                              
drwxr-xr-x         0:0     563 MB  │   │   │   │       ├─⊕ triton                                  
drwxr-xr-x         0:0     243 MB  │   │   │   │       ├─⊕ bitsandbytes 
...
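
The per-directory sizes above come from dive. As a rough cross-check without dive, a minimal stdlib sketch like the one below could rank the largest top-level packages in a package directory, which is one way to pick candidates for separate layers. The function names `dir_size_bytes` and `largest_packages` are hypothetical, and the totals only approximate dive's (hard-linked files are counted once per link, and filesystem overhead is ignored).

```python
import os
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    # os.walk does not follow directory symlinks by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def largest_packages(package_dir: Path, top: int = 5) -> list[tuple[str, int]]:
    """Return the `top` largest immediate subdirectories as (name, bytes)."""
    sizes = [
        (entry.name, dir_size_bytes(entry))
        for entry in package_dir.iterdir()
        if entry.is_dir() and not entry.is_symlink()
    ]
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]


if __name__ == "__main__":
    # Path taken from the dive output above; adjust it for other images.
    target = Path("/usr/local/lib/python3.12/dist-packages")
    if target.is_dir():
        for name, size in largest_packages(target):
            print(f"{size / 1e9:6.2f} GB  {name}")
```

Run inside the container, this prints one line per package directory, largest first.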

After the changes: Using python-slim base image and chunking into layers for concurrent pulls:

vllm-slim-img:

FROM python:3.12.9-slim AS builder
 
WORKDIR /tmp
 
# Install packages in a temporary directory
RUN pip install --no-cache-dir vllm==0.10.0 -t /tmp/python-packages
 
# Separate the nvidia packages (2.7 GB) into cudnn (1 GB), cublas (600 MB), and all else (1.2 GB)
# rm -rf needed at the end to remove the now-empty dirs after mv
RUN mkdir -p /chunk-nvidia/chunk-cudnn && \
  mkdir -p /chunk-nvidia/chunk-cublas && \
  mkdir -p /chunk-nvidia/other && \
  mv /tmp/python-packages/nvidia/cudnn /chunk-nvidia/chunk-cudnn && \
  mv /tmp/python-packages/nvidia/cublas /chunk-nvidia/chunk-cublas && \
  mv /tmp/python-packages/nvidia/* /chunk-nvidia/other && \
  rm -rf /chunk-nvidia/other/cudnn /chunk-nvidia/other/cublas
 
# Separate the torch packages (1.7 GB)
RUN mkdir -p /chunk-torch && \
  mv /tmp/python-packages/torch /chunk-torch/
 
# Separate the vllm packages (800 MB)
RUN mkdir -p /chunk-vllm && \
  mv /tmp/python-packages/vllm /chunk-vllm/
 
# Move the rest of the packages (1.8 GB)
# rm -rf needed at the end to remove the now-empty dirs after mv
RUN mkdir -p /chunk-other && \
  mv /tmp/python-packages/* /chunk-other/ && \
  rm -rf /chunk-other/nvidia /chunk-other/torch /chunk-other/vllm
 
# This is the final image
FROM python:3.12.9-slim
 
WORKDIR /app
 
# Copy each chunk into the final image, reassembling site-packages.
# Each COPY creates its own layer, which docker pull can download concurrently.
COPY --from=builder /chunk-nvidia/chunk-cudnn/cudnn /usr/local/lib/python3.12/site-packages/nvidia/cudnn
COPY --from=builder /chunk-nvidia/chunk-cublas/cublas /usr/local/lib/python3.12/site-packages/nvidia/cublas
COPY --from=builder /chunk-nvidia/other /usr/local/lib/python3.12/site-packages/nvidia/
COPY --from=builder /chunk-torch /usr/local/lib/python3.12/site-packages/
COPY --from=builder /chunk-vllm /usr/local/lib/python3.12/site-packages/
COPY --from=builder /chunk-other /usr/local/lib/python3.12/site-packages/
 
# Install FlashInfer
RUN pip install "https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.6.post1%2Bcu128torch2.7-cp39-abi3-linux_x86_64.whl"
 
ENV MODEL_PATH="/app/models/custom_model"
 
# Install GCC
RUN apt-get update && apt-get install -y build-essential
 
ENTRYPOINT ["sh", "-c", "python3 -m vllm.entrypoints.openai.api_server --model $MODEL_PATH $VLLM_ARGS"]

Size: 10.0 GB

Changes made:

  • Disabling pip's cache with --no-cache-dir reduces size considerably
  • Using a slimmer base image significantly reduces size of final image
  • Chunking and layering creates 6 layers that docker pull can download concurrently, which makes the pull faster. Extraction of those 6 layers still happens sequentially, one by one, since Docker does not yet allow concurrent extraction.

$ dive vllm-slim-img

Permission     UID:GID       Size  Filetree                                                             
drwxr-xr-x         0:0     9.7 GB  ├── usr                                                         
drwxr-xr-x         0:0     9.3 GB  │   ├── local
drwxr-xr-x         0:0     9.3 GB  │   │   ├── lib                                                 
drwxr-xr-x         0:0     9.3 GB  │   │   │   ├── python3.12                                      
drwxr-xr-x         0:0     9.3 GB  │   │   │   │   ├── site-packages                               
drwxr-xr-x         0:0     2.8 GB  │   │   │   │   │   ├── nvidia                                  
drwxr-xr-x         0:0     875 MB  │   │   │   │   │   │   ├─⊕ cudnn                               
drwxr-xr-x         0:0     601 MB  │   │   │   │   │   │   ├─⊕ cublas                              
                                                           ...                    
drwxr-xr-x         0:0     1.6 GB  │   │   │   │   │   ├─⊕ torch
drwxr-xr-x         0:0     1.2 GB  │   │   │   │   │   ├─⊕ vllm                                    
drwxr-xr-x         0:0     911 MB  │   │   │   │   │   ├─⊕ flashinfer                              
drwxr-xr-x         0:0     564 MB  │   │   │   │   │   ├─⊕ triton
...                                                                                            
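
To illustrate why splitting the packages into several layers can shorten the download phase, here is a toy model, not a measurement. It assumes each connection has a fixed throughput cap and ignores extraction, compression, and shared-bandwidth effects. The layer sizes are the approximate figures from the Dockerfile comments; the function name `pull_makespan` and the 0.1 GB/s per-connection rate are hypothetical. The Docker daemon also limits parallel layer downloads (its `max-concurrent-downloads` setting, 3 by default), so the model takes the worker count as a parameter.

```python
import heapq


def pull_makespan(layer_sizes_gb: list[float], workers: int, gb_per_sec: float) -> float:
    """Seconds to download all layers when each worker handles one layer at a time.

    Layers start in the given order; the next download begins as soon as a
    worker becomes free. Extraction time is deliberately ignored.
    """
    if workers < 1 or gb_per_sec <= 0:
        raise ValueError("workers must be >= 1 and gb_per_sec must be > 0")
    # Min-heap holding the time at which each worker becomes free.
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    finish = 0.0
    for size in layer_sizes_gb:
        start = heapq.heappop(free_at)
        end = start + size / gb_per_sec
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


if __name__ == "__main__":
    # cudnn, cublas, other nvidia, torch, vllm, remaining packages (GB, approximate).
    layers = [1.0, 0.6, 1.2, 1.7, 0.8, 1.8]
    for workers in (1, 3, 6):
        seconds = pull_makespan(layers, workers, gb_per_sec=0.1)
        print(f"{workers} concurrent download(s): {seconds:.0f} s")
```

Under these assumptions, the download time is bounded below by the largest single layer, which is why the split keeps every chunk under about 2 GB.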

Docker Image Pull Times

Time taken for existing image before the changes: ~2 min

$ time docker pull private-docker.repositories.private.com/com.private.llmserving.vllm-otb-image:latest
latest: Pulling from com.private.llmserving.vllm-otb-image
23828d760c7b: Pull complete 
edd1dba56169: Pull complete 
e06eb1b5c4cc: Pull complete 
7f308a765276: Pull complete 
3af11d09e9cd: Pull complete 
42896cdfd7b6: Pull complete 
600519079558: Pull complete 
0ae42424cadf: Pull complete 
73b7968785dc: Pull complete 
80150f70fb1e: Pull complete 
3bd5db8307cf: Pull complete 
62e3cec31574: Pull complete 
be24ca11895c: Pull complete 
9770c15f94eb: Pull complete 
02c834bfce5a: Pull complete 
7ba71cdfa783: Pull complete 
766fd898109b: Pull complete 
ac693ee3141c: Pull complete 
1eca19995d3c: Pull complete 
2dd2a38f2767: Pull complete 
51a57ac495a5: Pull complete 
dbf1ab618a8a: Pull complete 
1e3d86e47f15: Pull complete 
Digest: sha256:70be1ae3e73586ee15d9cc26d322335d6cedf29208f716dccf88d3c2a0280d17

real    2m3.209s
user    0m0.192s
sys     0m0.143s

Time taken for slim image after the changes: ~1 min

$ time docker pull private-docker.repositories.private.com/com.private.llmserving.vllm-slim-img:latest
latest: Pulling from com.private.llmserving.vllm-slim-img
8a628cdd7ccc: Pull complete 
d9612276b664: Pull complete 
b365a43716b1: Pull complete 
e639439a2713: Pull complete 
f70b3e250ce6: Pull complete 
34f7cc35bfd3: Pull complete 
a7668597d77d: Pull complete 
63f4d6948f5a: Pull complete 
e0520688ea2d: Pull complete 
c45cfaa0539c: Pull complete 
41a658338ac3: Pull complete 
f6657868f9cb: Pull complete 
2cf38c5130df: Pull complete 
Digest: sha256:7462cf24d6964b5d49c71e67683ee926949bb09cdf5e72ce863811327bb8e93e

real    1m9.668s
user    0m0.074s
sys     0m0.100s
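
As a worked check of the two `real` timings above, the following sketch converts the shell `time` output to seconds and computes the relative reduction. The helper names `parse_real_time` and `percent_reduction` are hypothetical, and each timing is a single run, so the result is indicative only.

```python
import re


def parse_real_time(value: str) -> float:
    """Convert a shell `time` value such as '2m3.209s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognized time format: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


before_s = parse_real_time("2m3.209s")  # vllm-otb-image pull
after_s = parse_real_time("1m9.668s")   # vllm-slim-img pull
print(f"before: {before_s:.1f} s, after: {after_s:.1f} s")
print(f"reduction: {percent_reduction(before_s, after_s):.1f}%")
```

For these two runs the reduction comes out at roughly 43%, consistent with the "~2 min" versus "~1 min" summary.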

Inference Performance

We did not notice any significant difference in inference performance (latency, throughput) between the images built from the Dockerfile before and after the changes.

Hardware and configuration:

  • GPU: 1 A100/80GB
  • vCPUs: 24
  • vLLM version: 0.10.0
  • vLLM engine arguments: --dtype=bfloat16 --gpu-memory-utilization=0.95

Test tool:

genai-perf profile \
  --model /app/models/custom_model \
  --tokenizer Qwen/Qwen-tokenizer \
  --verbose \
  --service-kind openai \
  --endpoint-type completions \
  --url localhost:8000 \
  --num-dataset-entries 5000 \
  --synthetic-input-tokens-mean 250 \
  --output-tokens-mean 25 \
  --request-rate 25 \
  --request-count 1000

Perf Test Results:

Before the changes (existing Dockerfile, vllm-otb-img):

NVIDIA GenAI-Perf | LLM Metrics                                 
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃                         Statistic ┃      avg ┃    min ┃      max ┃      p99 ┃    p90 ┃    p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│              Request Latency (ms) │   546.60 │  80.86 │ 1,079.46 │ 1,019.52 │ 559.23 │ 555.55 │
│   Output Sequence Length (tokens) │    24.88 │   2.00 │    26.00 │    25.00 │  25.00 │  25.00 │
│    Input Sequence Length (tokens) │   250.00 │ 250.00 │   250.00 │   250.00 │ 250.00 │ 250.00 │
│ Output Token Throughput (per sec) │   617.35 │    N/A │      N/A │      N/A │    N/A │    N/A │
│      Request Throughput (per sec) │    24.82 │    N/A │      N/A │      N/A │    N/A │    N/A │
│             Request Count (count) │ 1,000.00 │    N/A │      N/A │      N/A │    N/A │    N/A │
└───────────────────────────────────┴──────────┴────────┴──────────┴──────────┴────────┴────────┘
Prefix Cache Hit Rate: 0.1%

After the changes (modified Dockerfile, vllm-slim-img):

NVIDIA GenAI-Perf | LLM Metrics                                 
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃                         Statistic ┃      avg ┃    min ┃      max ┃      p99 ┃    p90 ┃    p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│              Request Latency (ms) │   558.81 │ 338.19 │ 1,083.17 │ 1,047.95 │ 564.04 │ 561.15 │
│   Output Sequence Length (tokens) │    24.96 │  16.00 │    31.00 │    25.00 │  25.00 │  25.00 │
│    Input Sequence Length (tokens) │   250.00 │ 250.00 │   250.00 │   250.00 │ 250.00 │ 250.00 │
│ Output Token Throughput (per sec) │   619.48 │    N/A │      N/A │      N/A │    N/A │    N/A │
│      Request Throughput (per sec) │    24.82 │    N/A │      N/A │      N/A │    N/A │    N/A │
│             Request Count (count) │ 1,000.00 │    N/A │      N/A │      N/A │    N/A │    N/A │
└───────────────────────────────────┴──────────┴────────┴──────────┴──────────┴────────┴────────┘
Prefix Cache Hit Rate: 0.1%

Signed-off-by: The MathWorks, Inc.
Arunika Oyshi: aoyshi@mathworks.com

Signed-off-by: aoyshi <37639117+aoyshi@users.noreply.github.com>

github-actions Bot commented Aug 6, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify Bot added the ci/build label Aug 6, 2025

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This PR introduces a layered Dockerfile to create a smaller and faster-pulling image, which is a great initiative. The approach of using a multi-stage build with a slim base image and layering dependencies is effective. My review focuses on some critical correctness and maintainability issues that should be addressed to make this Dockerfile robust and production-ready. The main concerns are an incompatible Python wheel, improper signal handling in the entrypoint, and opportunities to further reduce image size and improve maintainability.

Comment thread docker/Dockerfile.slim Outdated
Comment thread docker/Dockerfile.slim
Comment thread docker/Dockerfile.slim Outdated
Comment thread docker/Dockerfile.slim Outdated
aoyshi added 2 commits August 7, 2025 15:19
Signed-off-by: aoyshi <37639117+aoyshi@users.noreply.github.com>
Signed-off-by: aoyshi <37639117+aoyshi@users.noreply.github.com>

simon-mo commented Aug 8, 2025

Collaborator

Hi @aoyshi, thank you for this PR. A better approach will be directly make the image slim, can you clear these caches in Dockerfile? That will really help!

@geraldstanje

Hi @aoyshi,
I'm running into an issue when using vLLM 0.11 to run the Gemma 3 4B model. Any idea? I also see an error with vLLM 0.10.

./run.sh 
gemma3_vllm_pytorch_2.9_server
INFO 10-18 17:54:13 [__init__.py:216] Automatically detected platform cuda.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 42, in <module>
    from vllm.config import VllmConfig
  File "/usr/local/lib/python3.12/site-packages/vllm/config/__init__.py", line 34, in <module>
    from vllm.config.lora import LoRAConfig
  File "/usr/local/lib/python3.12/site-packages/vllm/config/lora.py", line 14, in <module>
    from vllm.platforms import current_platform
  File "/usr/local/lib/python3.12/site-packages/vllm/platforms/__init__.py", line 248, in __getattr__
    _current_platform = resolve_obj_by_qualname(
                        ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2680, in resolve_obj_by_qualname
    module = importlib.import_module(module_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vllm/platforms/cuda.py", line 18, in <module>
    import vllm._C  # noqa
    ^^^^^^^^^^^^^^
ImportError: /usr/local/lib/python3.12/site-packages/vllm/_C.abi3.so: undefined symbol: _ZN3c104cuda9SetDeviceEab
Server running on http://localhost:8000

Dockerfile

FROM python:3.12.9-slim AS builder
 
WORKDIR /tmp
 
# Install packages in a temporary directory
RUN pip install --no-cache-dir vllm==0.11.0 -t /tmp/python-packages
 
# Separate the nvidia packages (2.7 GB) into cudnn (1 GB), cublas (600 MB), and all else (1.2 GB)
# rm -rf needed at the end to remove the now-empty dirs after mv
RUN mkdir -p /chunk-nvidia/chunk-cudnn && \
  mkdir -p /chunk-nvidia/chunk-cublas && \
  mkdir -p /chunk-nvidia/other && \
  mv /tmp/python-packages/nvidia/cudnn /chunk-nvidia/chunk-cudnn && \
  mv /tmp/python-packages/nvidia/cublas /chunk-nvidia/chunk-cublas && \
  mv /tmp/python-packages/nvidia/* /chunk-nvidia/other && \
  rm -rf /chunk-nvidia/other/cudnn /chunk-nvidia/other/cublas
 
# Separate the torch packages (1.7 GB)
RUN mkdir -p /chunk-torch && \
  mv /tmp/python-packages/torch /chunk-torch/
 
# Separate the vllm packages (800 MB)
RUN mkdir -p /chunk-vllm && \
  mv /tmp/python-packages/vllm /chunk-vllm/
 
# Move the rest of the packages (1.8 GB)
# rm -rf needed at the end to remove the now-empty dirs after mv
RUN mkdir -p /chunk-other && \
  mv /tmp/python-packages/* /chunk-other/ && \
  rm -rf /chunk-other/nvidia /chunk-other/torch /chunk-other/vllm
 
# This is the final image
FROM python:3.12.9-slim
 
WORKDIR /app
 
# Copy each chunk into the final image into cohesive wholes
# each of these will be pulled concurrently during docker pull
COPY --from=builder /chunk-nvidia/chunk-cudnn/cudnn /usr/local/lib/python3.12/site-packages/nvidia/cudnn
COPY --from=builder /chunk-nvidia/chunk-cublas/cublas /usr/local/lib/python3.12/site-packages/nvidia/cublas
COPY --from=builder /chunk-nvidia/other /usr/local/lib/python3.12/site-packages/nvidia/
COPY --from=builder /chunk-torch /usr/local/lib/python3.12/site-packages/
COPY --from=builder /chunk-vllm /usr/local/lib/python3.12/site-packages/
COPY --from=builder /chunk-other /usr/local/lib/python3.12/site-packages/
 
# Install FlashInfer
RUN pip install "https://d...content-available-to-author-only...h.org/whl/cu128/flashinfer/flashinfer_python-0.2.6.post1%2Bcu128torch2.7-cp39-abi3-linux_x86_64.whl"
 
COPY gemma-3-4b-it /app/gemma-3-4b-it
ENV MODEL_PATH="/app/gemma-3-4b-it"
 
# Install GCC
RUN apt-get update && apt-get install -y build-essential
 
ENTRYPOINT ["sh", "-c", "python3 -m vllm.entrypoints.openai.api_server --model $MODEL_PATH $VLLM_ARGS"]

@geraldstanje

Some info about installing FlashInfer: #1454 and https://github.com/substratusai/vllm-docker/blob/main/Dockerfile.cuda-arm

address PR feedback (install flashinfer and torch-c-dlpack-ext, update vllm to latest)

Signed-off-by: aoyshi <37639117+aoyshi@users.noreply.github.com>
@github-actions

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions Bot added the stale Over 90 days of inactivity label Mar 23, 2026
@github-actions

This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!

@github-actions github-actions Bot closed this Apr 23, 2026
@benoittgt

Contributor

> Hi @aoyshi, thank you for this PR. A better approach will be directly make the image slim, can you clear these caches in Dockerfile? That will really help!

I had a look at this. I made a PR in #41134

Labels

ci/build stale Over 90 days of inactivity

5 participants