Layered Dockerfile for smaller size and faster image pulling by aoyshi · Pull Request #22377 · vllm-project/vllm

aoyshi · 2025-08-06T17:49:42Z

UPDATE (12/10/2025)

Addressed PR feedback:

Installed FlashInfer similarly to how it's done in https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile#L456
As suggested by the vLLM startup logs (which mention JIT-compiling torch-c-dlpack-ext to cache to enable EnvTensorAllocator), installed torch-c-dlpack-ext

Size of the official out-of-the-box vllm-openai:latest Docker image: 28.6 GB

FROM vllm/vllm-openai:latest
ENV MODEL_PATH="/app/models/custom_model"
ENTRYPOINT python3 -m vllm.entrypoints.openai.api_server --model $MODEL_PATH $VLLM_ARGS

Size of our slim Docker image after adding FlashInfer and updating to the latest vLLM (0.12.0): 22 GB

Overview

This PR introduces a smaller, layered alternative to the existing vLLM project's Dockerfile.

Reduces image size by about 53% (21.2 GB to 10.0 GB) by using a python-slim base image.
Reduces pull time by about 43% (2m03s to 1m10s in our test) by creating smaller layers that can be downloaded in parallel.

We propose that this optimized Dockerfile be made available as part of the vLLM project.

For production environments, a smaller Docker image that pulls quickly allows new instances to scale up faster. We think the proposed optimizations can also help use cases of the vLLM Production Stack.

We chose to create a separate Dockerfile.slim instead of editing the main Dockerfile, since this method serves a more targeted use case that prioritizes a smaller size and a faster image pull. We are open to discussing how to better merge these optimizations with the existing Dockerfile, if possible, to produce a lighter production-ready version of the image that is compatible with a wider range of architectures.

Before the changes: Using existing Dockerfile in vLLM repo:

vllm-otb-img (where otb stands for 'Out-of-The-Box'):

FROM vllm/vllm-openai:latest

ENV MODEL_PATH="/app/models/custom_model"
ENTRYPOINT python3 -m vllm.entrypoints.openai.api_server --model $MODEL_PATH $VLLM_ARGS

Size: 21.2 GB

Most of the bloat comes from the nvidia/cuda base image.
The pip installs do not use --no-cache-dir, which would reduce the size further.

$ dive vllm-otb-img
Permission     UID:GID       Size  Filetree                                                        
drwxr-xr-x         0:0      20 GB  ├── usr                                                         
drwxr-xr-x         0:0      18 GB  │   ├── local                                                   
drwxr-xr-x         0:0      11 GB  │   │   ├── lib                                                 
drwxr-xr-x         0:0      11 GB  │   │   │   ├── python3.12                                      
drwxr-xr-x         0:0      11 GB  │   │   │   │   └── dist-packages                               
drwxr-xr-x         0:0     4.0 GB  │   │   │   │       ├─⊕ nvidia                                  
drwxr-xr-x         0:0     1.9 GB  │   │   │   │       ├─⊕ torch                                   
drwxr-xr-x         0:0     1.2 GB  │   │   │   │       ├─⊕ vllm                                    
drwxr-xr-x         0:0     916 MB  │   │   │   │       ├─⊕ flashinfer                              
drwxr-xr-x         0:0     563 MB  │   │   │   │       ├─⊕ triton                                  
drwxr-xr-x         0:0     243 MB  │   │   │   │       ├─⊕ bitsandbytes 
...
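The missing --no-cache-dir flag noted above can be checked mechanically. The following is a minimal sketch, not part of the PR: the helper names and the sample Dockerfile text (including the placeholder wheel URL) are hypothetical, and the check is a plain text heuristic that only looks at RUN instructions.

```python
import re

# Hypothetical sample input; the wheel URL is a placeholder, not a real package.
SAMPLE_DOCKERFILE = """\
FROM python:3.12.9-slim
RUN pip install --no-cache-dir vllm==0.10.0 -t /tmp/python-packages
RUN pip install \\
    "https://example.invalid/some_package.whl"
"""


def logical_instructions(dockerfile_text):
    """Join backslash-continued lines and normalize whitespace."""
    joined = re.sub(r"\\[ \t]*\n", " ", dockerfile_text)
    return [" ".join(line.split()) for line in joined.splitlines() if line.strip()]


def uncached_pip_installs(dockerfile_text):
    """Return RUN instructions that call `pip install` without --no-cache-dir."""
    flagged = []
    for instruction in logical_instructions(dockerfile_text):
        if not instruction.upper().startswith("RUN "):
            continue  # skip FROM, COPY, comments, and so on
        calls_pip = re.search(r"\bpip3?\s+install\b", instruction)
        if calls_pip and "--no-cache-dir" not in instruction:
            flagged.append(instruction)
    return flagged


if __name__ == "__main__":
    for instruction in uncached_pip_installs(SAMPLE_DOCKERFILE):
        print("pip install without --no-cache-dir:", instruction)
```

This heuristic does not account for other ways of disabling the cache, such as the PIP_NO_CACHE_DIR environment variable, so a flagged line is a prompt to look closer rather than a definite problem.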

After the changes: Using python-slim base image and chunking into layers for concurrent pulls:

vllm-slim-img:

FROM python:3.12.9-slim AS builder
 
WORKDIR /tmp
 
# Install packages in a temporary directory
RUN pip install --no-cache-dir vllm==0.10.0 -t /tmp/python-packages
 
# Separate the nvidia packages (2.7 GB) into cudnn (1 GB), cublas (600 MB), and all else (1.2 GB)
# rm -rf needed at the end to remove the now-empty dirs after mv
RUN mkdir -p /chunk-nvidia/chunk-cudnn && \
  mkdir -p /chunk-nvidia/chunk-cublas && \
  mkdir -p /chunk-nvidia/other && \
  mv /tmp/python-packages/nvidia/cudnn /chunk-nvidia/chunk-cudnn && \
  mv /tmp/python-packages/nvidia/cublas /chunk-nvidia/chunk-cublas && \
  mv /tmp/python-packages/nvidia/* /chunk-nvidia/other && \
  rm -rf /chunk-nvidia/other/cudnn /chunk-nvidia/other/cublas
 
# Separate the torch packages (1.7 GB)
RUN mkdir -p /chunk-torch && \
  mv /tmp/python-packages/torch /chunk-torch/
 
# Separate the vllm packages (800 MB)
RUN mkdir -p /chunk-vllm && \
  mv /tmp/python-packages/vllm /chunk-vllm/
 
# Move the rest of the packages (1.8 GB)
# rm -rf needed at the end to remove the now-empty dirs after mv
RUN mkdir -p /chunk-other && \
  mv /tmp/python-packages/* /chunk-other/ && \
  rm -rf /chunk-other/nvidia /chunk-other/torch /chunk-other/vllm
 
# This is the final image
FROM python:3.12.9-slim
 
WORKDIR /app
 
# Copy each chunk into the final image as its own layer, reassembling site-packages
# Each COPY layer can be downloaded concurrently during docker pull
COPY --from=builder /chunk-nvidia/chunk-cudnn/cudnn /usr/local/lib/python3.12/site-packages/nvidia/cudnn
COPY --from=builder /chunk-nvidia/chunk-cublas/cublas /usr/local/lib/python3.12/site-packages/nvidia/cublas
COPY --from=builder /chunk-nvidia/other /usr/local/lib/python3.12/site-packages/nvidia/
COPY --from=builder /chunk-torch /usr/local/lib/python3.12/site-packages/
COPY --from=builder /chunk-vllm /usr/local/lib/python3.12/site-packages/
COPY --from=builder /chunk-other /usr/local/lib/python3.12/site-packages/
 
# Install FlashInfer
RUN pip install "https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.6.post1%2Bcu128torch2.7-cp39-abi3-linux_x86_64.whl"
 
ENV MODEL_PATH="/app/models/custom_model"
 
# Install GCC
RUN apt-get update && apt-get install -y build-essential
 
ENTRYPOINT ["sh", "-c", "python3 -m vllm.entrypoints.openai.api_server --model $MODEL_PATH $VLLM_ARGS"]

Size: 10.0 GB

Changes made:

Disabling the pip cache with --no-cache-dir reduces the size considerably.
Using a slimmer base image significantly reduces the size of the final image.
Chunking and layering creates 6 layers that can be downloaded concurrently, which makes docker pull faster. Extraction still happens sequentially, one layer at a time, since Docker does not yet support concurrent extraction.
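To illustrate why splitting into layers can shorten the download phase, here is a toy model, not a measurement. It assumes each download connection is limited to a fixed throughput (100 MB/s is an arbitrary, hypothetical figure), takes the approximate layer sizes from the Dockerfile comments above, and ignores extraction time. The function name and the concurrency limits tried are illustrative only.

```python
import heapq

# Approximate chunk sizes in MB, taken from the comments in the Dockerfile:
# cudnn, cublas, other nvidia, torch, vllm, everything else.
LAYER_SIZES_MB = [1000, 600, 1200, 1700, 800, 1800]


def download_makespan(layer_sizes_mb, max_concurrent, mb_per_s_per_connection):
    """Seconds until all layers are downloaded in this simplified model.

    Layers are started in order; each one goes to whichever download slot
    becomes free first. Extraction is not modelled.
    """
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    slots = [0.0] * min(max_concurrent, len(layer_sizes_mb))
    heapq.heapify(slots)
    finished_at = 0.0
    for size_mb in layer_sizes_mb:
        start = heapq.heappop(slots)  # earliest free slot
        end = start + size_mb / mb_per_s_per_connection
        heapq.heappush(slots, end)
        finished_at = max(finished_at, end)
    return finished_at


if __name__ == "__main__":
    for limit in (1, 3, 6):
        seconds = download_makespan(LAYER_SIZES_MB, limit, 100)
        print(f"{limit} concurrent download(s): {seconds:.0f} s")
```

The model only shows a benefit when per-connection throughput is the bottleneck. If the total link bandwidth is the limit, splitting the same bytes across more layers would not be expected to help the download phase.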

$ dive vllm-slim-img

Permission     UID:GID       Size  Filetree                                                             
drwxr-xr-x         0:0     9.7 GB  ├── usr                                                         
drwxr-xr-x         0:0     9.3 GB  │   ├── local
drwxr-xr-x         0:0     9.3 GB  │   │   ├── lib                                                 
drwxr-xr-x         0:0     9.3 GB  │   │   │   ├── python3.12                                      
drwxr-xr-x         0:0     9.3 GB  │   │   │   │   ├── site-packages                               
drwxr-xr-x         0:0     2.8 GB  │   │   │   │   │   ├── nvidia                                  
drwxr-xr-x         0:0     875 MB  │   │   │   │   │   │   ├─⊕ cudnn                               
drwxr-xr-x         0:0     601 MB  │   │   │   │   │   │   ├─⊕ cublas                              
                                                           ...                    
drwxr-xr-x         0:0     1.6 GB  │   │   │   │   │   ├─⊕ torch
drwxr-xr-x         0:0     1.2 GB  │   │   │   │   │   ├─⊕ vllm                                    
drwxr-xr-x         0:0     911 MB  │   │   │   │   │   ├─⊕ flashinfer                              
drwxr-xr-x         0:0     564 MB  │   │   │   │   │   ├─⊕ triton
...

Docker Image Pull Times

Time taken for existing image before the changes: ~2 min

$ time docker pull private-docker.repositories.private.com/com.private.llmserving.vllm-otb-image:latest
latest: Pulling from com.private.llmserving.vllm-otb-image
23828d760c7b: Pull complete 
edd1dba56169: Pull complete 
e06eb1b5c4cc: Pull complete 
7f308a765276: Pull complete 
3af11d09e9cd: Pull complete 
42896cdfd7b6: Pull complete 
600519079558: Pull complete 
0ae42424cadf: Pull complete 
73b7968785dc: Pull complete 
80150f70fb1e: Pull complete 
3bd5db8307cf: Pull complete 
62e3cec31574: Pull complete 
be24ca11895c: Pull complete 
9770c15f94eb: Pull complete 
02c834bfce5a: Pull complete 
7ba71cdfa783: Pull complete 
766fd898109b: Pull complete 
ac693ee3141c: Pull complete 
1eca19995d3c: Pull complete 
2dd2a38f2767: Pull complete 
51a57ac495a5: Pull complete 
dbf1ab618a8a: Pull complete 
1e3d86e47f15: Pull complete 
Digest: sha256:70be1ae3e73586ee15d9cc26d322335d6cedf29208f716dccf88d3c2a0280d17

real    2m3.209s
user    0m0.192s
sys     0m0.143s

Time taken for slim image after the changes: ~1 min

$ time docker pull private-docker.repositories.private.com/com.private.llmserving.vllm-slim-img:latest
latest: Pulling from com.private.llmserving.vllm-slim-img
8a628cdd7ccc: Pull complete 
d9612276b664: Pull complete 
b365a43716b1: Pull complete 
e639439a2713: Pull complete 
f70b3e250ce6: Pull complete 
34f7cc35bfd3: Pull complete 
a7668597d77d: Pull complete 
63f4d6948f5a: Pull complete 
e0520688ea2d: Pull complete 
c45cfaa0539c: Pull complete 
41a658338ac3: Pull complete 
f6657868f9cb: Pull complete 
2cf38c5130df: Pull complete 
Digest: sha256:7462cf24d6964b5d49c71e67683ee926949bb09cdf5e72ce863811327bb8e93e

real    1m9.668s
user    0m0.074s
sys     0m0.100s
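The headline reductions can be recomputed from the reported numbers. This is a small sketch using the sizes (21.2 GB and 10.0 GB) and the `real` times from the two pulls above; the helper names are hypothetical.

```python
import re


def parse_real_time(text):
    """Convert a `time` value such as '2m3.209s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def percent_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100.0


size_reduction = percent_reduction(21.2, 10.0)
pull_reduction = percent_reduction(
    parse_real_time("2m3.209s"), parse_real_time("1m9.668s")
)

if __name__ == "__main__":
    print(f"image size reduction: {size_reduction:.1f}%")
    print(f"pull time reduction:  {pull_reduction:.1f}%")
```

With these inputs the size reduction comes out at about 53% and the pull time reduction at roughly 43%. Each pull was timed once, so the pull figure is indicative only.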

Inference Performance

We did not notice any significant difference in inference performance (latency, throughput) between the images built before and after the changes.

Hardware:

GPU: 1 A100/80GB
vCPUs: 24
vLLM version: 0.10.0
vLLM engine arguments: --dtype=bfloat16 --gpu-memory-utilization=0.95

Test tool:

genai-perf

genai-perf profile \
  --model /app/models/custom_model \
  --tokenizer Qwen/Qwen-tokenizer \
  --verbose \
  --service-kind openai \
  --endpoint-type completions \
  --url localhost:8000 \
  --num-dataset-entries 5000 \
  --synthetic-input-tokens-mean 250 \
  --output-tokens-mean 25 \
  --request-rate 25 \
  --request-count 1000

Perf Test Results:

Before the changes (existing Dockerfile, vllm-otb-img):

NVIDIA GenAI-Perf | LLM Metrics                                 
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃                         Statistic ┃      avg ┃    min ┃      max ┃      p99 ┃    p90 ┃    p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│              Request Latency (ms) │   546.60 │  80.86 │ 1,079.46 │ 1,019.52 │ 559.23 │ 555.55 │
│   Output Sequence Length (tokens) │    24.88 │   2.00 │    26.00 │    25.00 │  25.00 │  25.00 │
│    Input Sequence Length (tokens) │   250.00 │ 250.00 │   250.00 │   250.00 │ 250.00 │ 250.00 │
│ Output Token Throughput (per sec) │   617.35 │    N/A │      N/A │      N/A │    N/A │    N/A │
│      Request Throughput (per sec) │    24.82 │    N/A │      N/A │      N/A │    N/A │    N/A │
│             Request Count (count) │ 1,000.00 │    N/A │      N/A │      N/A │    N/A │    N/A │
└───────────────────────────────────┴──────────┴────────┴──────────┴──────────┴────────┴────────┘
Prefix Cache Hit Rate: 0.1%

After the changes (modified Dockerfile, vllm-slim-img):

NVIDIA GenAI-Perf | LLM Metrics                                 
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃                         Statistic ┃      avg ┃    min ┃      max ┃      p99 ┃    p90 ┃    p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│              Request Latency (ms) │   558.81 │ 338.19 │ 1,083.17 │ 1,047.95 │ 564.04 │ 561.15 │
│   Output Sequence Length (tokens) │    24.96 │  16.00 │    31.00 │    25.00 │  25.00 │  25.00 │
│    Input Sequence Length (tokens) │   250.00 │ 250.00 │   250.00 │   250.00 │ 250.00 │ 250.00 │
│ Output Token Throughput (per sec) │   619.48 │    N/A │      N/A │      N/A │    N/A │    N/A │
│      Request Throughput (per sec) │    24.82 │    N/A │      N/A │      N/A │    N/A │    N/A │
│             Request Count (count) │ 1,000.00 │    N/A │      N/A │      N/A │    N/A │    N/A │
└───────────────────────────────────┴──────────┴────────┴──────────┴──────────┴────────┴────────┘
Prefix Cache Hit Rate: 0.1%
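To put a number on "no significant difference", the relative change per metric can be computed from the two tables. This is a sketch with values copied from the `avg` and `p99` columns above; the dictionary keys and function name are hypothetical.

```python
# Values copied from the GenAI-Perf tables above.
BEFORE = {
    "avg_latency_ms": 546.60,
    "p99_latency_ms": 1019.52,
    "output_tokens_per_s": 617.35,
    "requests_per_s": 24.82,
}
AFTER = {
    "avg_latency_ms": 558.81,
    "p99_latency_ms": 1047.95,
    "output_tokens_per_s": 619.48,
    "requests_per_s": 24.82,
}


def relative_change_percent(before, after):
    """Percent change of each metric from `before` to `after`."""
    return {
        name: (after[name] - before[name]) / before[name] * 100.0
        for name in before
    }


changes = relative_change_percent(BEFORE, AFTER)

if __name__ == "__main__":
    for name, change in changes.items():
        print(f"{name}: {change:+.2f}%")
```

The latency metrics differ by less than 3% and the throughput metrics by less than 0.5%. Each configuration was run once, so run-to-run noise is not quantified here.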

Signed-off-by: The MathWorks, Inc.
Arunika Oyshi: aoyshi@mathworks.com

Signed-off-by: aoyshi <37639117+aoyshi@users.noreply.github.com>

github-actions · 2025-08-06T17:49:50Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

gemini-code-assist

Code Review

This PR introduces a layered Dockerfile to create a smaller and faster-pulling image, which is a great initiative. The approach of using a multi-stage build with a slim base image and layering dependencies is effective. My review focuses on some critical correctness and maintainability issues that should be addressed to make this Dockerfile robust and production-ready. The main concerns are an incompatible Python wheel, improper signal handling in the entrypoint, and opportunities to further reduce image size and improve maintainability.

simon-mo · 2025-08-08T16:19:12Z

Hi @aoyshi, thank you for this PR. A better approach would be to make the image slim directly. Can you clear these caches in the Dockerfile? That will really help!

geraldstanje · 2025-10-18T17:56:56Z

Hi @aoyshi,
I'm running into an issue when using vLLM 0.11 to run the Gemma 3 4B model. Any idea? I also see an error with vLLM 0.10.

./run.sh 
gemma3_vllm_pytorch_2.9_server
INFO 10-18 17:54:13 [__init__.py:216] Automatically detected platform cuda.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 42, in <module>
    from vllm.config import VllmConfig
  File "/usr/local/lib/python3.12/site-packages/vllm/config/__init__.py", line 34, in <module>
    from vllm.config.lora import LoRAConfig
  File "/usr/local/lib/python3.12/site-packages/vllm/config/lora.py", line 14, in <module>
    from vllm.platforms import current_platform
  File "/usr/local/lib/python3.12/site-packages/vllm/platforms/__init__.py", line 248, in __getattr__
    _current_platform = resolve_obj_by_qualname(
                        ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2680, in resolve_obj_by_qualname
    module = importlib.import_module(module_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/vllm/platforms/cuda.py", line 18, in <module>
    import vllm._C  # noqa
    ^^^^^^^^^^^^^^
ImportError: /usr/local/lib/python3.12/site-packages/vllm/_C.abi3.so: undefined symbol: _ZN3c104cuda9SetDeviceEab
Server running on http://localhost:8000

Dockerfile

FROM python:3.12.9-slim AS builder
 
WORKDIR /tmp
 
# Install packages in a temporary directory
RUN pip install --no-cache-dir vllm==0.11.0 -t /tmp/python-packages
 
# Separate the nvidia packages (2.7 GB) into cudnn (1 GB), cublas (600 MB), and all else (1.2 GB)
# rm -rf needed at the end to remove the now-empty dirs after mv
RUN mkdir -p /chunk-nvidia/chunk-cudnn && \
  mkdir -p /chunk-nvidia/chunk-cublas && \
  mkdir -p /chunk-nvidia/other && \
  mv /tmp/python-packages/nvidia/cudnn /chunk-nvidia/chunk-cudnn && \
  mv /tmp/python-packages/nvidia/cublas /chunk-nvidia/chunk-cublas && \
  mv /tmp/python-packages/nvidia/* /chunk-nvidia/other && \
  rm -rf /chunk-nvidia/other/cudnn /chunk-nvidia/other/cublas
 
# Separate the torch packages (1.7 GB)
RUN mkdir -p /chunk-torch && \
  mv /tmp/python-packages/torch /chunk-torch/
 
# Separate the vllm packages (800 MB)
RUN mkdir -p /chunk-vllm && \
  mv /tmp/python-packages/vllm /chunk-vllm/
 
# Move the rest of the packages (1.8 GB)
# rm -rf needed at the end to remove the now-empty dirs after mv
RUN mkdir -p /chunk-other && \
  mv /tmp/python-packages/* /chunk-other/ && \
  rm -rf /chunk-other/nvidia /chunk-other/torch /chunk-other/vllm
 
# This is the final image
FROM python:3.12.9-slim
 
WORKDIR /app
 
# Copy each chunk into the final image into cohesive wholes
# each of these will be pulled concurrently during docker pull
COPY --from=builder /chunk-nvidia/chunk-cudnn/cudnn /usr/local/lib/python3.12/site-packages/nvidia/cudnn
COPY --from=builder /chunk-nvidia/chunk-cublas/cublas /usr/local/lib/python3.12/site-packages/nvidia/cublas
COPY --from=builder /chunk-nvidia/other /usr/local/lib/python3.12/site-packages/nvidia/
COPY --from=builder /chunk-torch /usr/local/lib/python3.12/site-packages/
COPY --from=builder /chunk-vllm /usr/local/lib/python3.12/site-packages/
COPY --from=builder /chunk-other /usr/local/lib/python3.12/site-packages/
 
# Install FlashInfer
RUN pip install "https://d...content-available-to-author-only...h.org/whl/cu128/flashinfer/flashinfer_python-0.2.6.post1%2Bcu128torch2.7-cp39-abi3-linux_x86_64.whl"
 
COPY gemma-3-4b-it /app/gemma-3-4b-it
ENV MODEL_PATH="/app/gemma-3-4b-it"
 
# Install GCC
RUN apt-get update && apt-get install -y build-essential
 
ENTRYPOINT ["sh", "-c", "python3 -m vllm.entrypoints.openai.api_server --model $MODEL_PATH $VLLM_ARGS"]
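As a first diagnostic step for the ImportError above, the mangled symbol name can be decoded to see which function is missing. The sketch below handles only a small subset of the Itanium C++ name mangling scheme (a nested name followed by builtin parameter types), which is enough for this symbol; the function name is hypothetical and this is not a general demangler.

```python
import re

# Builtin type codes from the Itanium C++ ABI mangling scheme (subset).
BUILTIN_TYPES = {
    "v": "void", "b": "bool", "c": "char", "a": "signed char",
    "h": "unsigned char", "s": "short", "t": "unsigned short",
    "i": "int", "j": "unsigned int", "l": "long", "m": "unsigned long",
    "x": "long long", "y": "unsigned long long", "f": "float", "d": "double",
}


def demangle_simple(symbol):
    """Decode symbols of the form _ZN<len><name>...E<builtin type codes>."""
    if not symbol.startswith("_ZN"):
        raise ValueError("only nested names (_ZN...E) are supported")
    pos = 3
    parts = []
    # Each name component is a decimal length followed by that many characters.
    while pos < len(symbol) and symbol[pos] != "E":
        length_match = re.match(r"\d+", symbol[pos:])
        if length_match is None:
            raise ValueError(f"unsupported construct at offset {pos}")
        length = int(length_match.group())
        pos += len(length_match.group())
        parts.append(symbol[pos:pos + length])
        pos += length
    if pos >= len(symbol):
        raise ValueError("missing terminating 'E'")
    pos += 1  # skip the 'E' that closes the nested name
    params = []
    for code in symbol[pos:]:
        if code not in BUILTIN_TYPES:
            raise ValueError(f"unsupported parameter type code: {code!r}")
        params.append(BUILTIN_TYPES[code])
    if params == ["void"]:
        params = []  # a lone 'v' encodes an empty parameter list
    return "::".join(parts) + "(" + ", ".join(params) + ")"


if __name__ == "__main__":
    print(demangle_simple("_ZN3c104cuda9SetDeviceEab"))
```

The symbol decodes to `c10::cuda::SetDevice(signed char, bool)`. The `c10` namespace belongs to PyTorch, so an undefined symbol like this commonly indicates that `vllm/_C.abi3.so` was built against a different torch version than the one present in the image. That is a likely cause worth checking, not a confirmed diagnosis for this report.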

geraldstanje · 2025-10-22T19:31:33Z

Some info about installing FlashInfer: #1454 https://github.com/substratusai/vllm-docker/blob/main/Dockerfile.cuda-arm

github-actions · 2026-03-23T02:17:48Z

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

github-actions · 2026-04-23T02:18:14Z

This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!

benoittgt · 2026-04-28T13:11:43Z

> Hi @aoyshi, thank you for this PR. A better approach would be to make the image slim directly. Can you clear these caches in the Dockerfile? That will really help!

I had a look at this. I made a PR in #41134

Create Dockerfile.slim

4621831

Signed-off-by: aoyshi <37639117+aoyshi@users.noreply.github.com>

mergify Bot added the ci/build label Aug 6, 2025

gemini-code-assist Bot reviewed Aug 6, 2025

aoyshi mentioned this pull request Aug 6, 2025

[Misc]: How to reduce docker image size for vllm with python-slim? #13112

Closed

1 task

aoyshi added 2 commits August 7, 2025 15:19

address code review feedback

0b30f3d

Signed-off-by: aoyshi <37639117+aoyshi@users.noreply.github.com>

flashinfer as todo item

6a81c09

Signed-off-by: aoyshi <37639117+aoyshi@users.noreply.github.com>

geraldstanje mentioned this pull request Oct 18, 2025

[Installation]: How to reduce the vllm image #27154

Closed

1 task

Update Dockerfile.slim

16e7576

address PR feedback (install flashinfer and torch-c-dlpack-ext, update vllm to latest) Signed-off-by: aoyshi <37639117+aoyshi@users.noreply.github.com>

github-actions Bot added the stale label (Over 90 days of inactivity) Mar 23, 2026

github-actions Bot closed this Apr 23, 2026

aoyshi wants to merge 4 commits into vllm-project:main from aoyshi:slim-docker-image