
Conversation

@kush-gupt (Contributor) commented Jun 20, 2025

This pull request introduces support for the vLLM runtime. The changes include logic for mounting safetensors directories, handling model paths correctly whether they are store-backed or direct file paths (both individual files and directories), and exposing vLLM's maximum model length via a new --max-model-len CLI flag. The documentation has been updated accordingly, and tests were adjusted to reflect vLLM's inability to serve GGUFs.

Once the ramalama-vllm images are building, I can submit tests for vLLM serving; I also have a series of bugs to file about kube generation in general.

Example output with Mac ARM:

❯ ramalama --runtime vllm --image quay.io/kugupta/vllm-cpu-arm --container serve hf://TinyLlama/TinyLlama-1.1B-Chat-v1.0
INFO 06-20 02:16:56 [__init__.py:239] Automatically detected platform cpu.
INFO 06-20 02:16:57 [api_server.py:1043] vLLM API server version 0.1.dev5905+g686623c
INFO 06-20 02:16:57 [api_server.py:1044] args: Namespace(host=None, port=8080, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/mnt/models/snapshots/sha256-9737c9ce5074907aa44d575918c5367cd877471f', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=2048, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, speculative_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['TinyLlama-1.1B-Chat-v1.0'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, enable_chunked_prefill=None, multi_step_stream_outputs=True, scheduling_policy='fcfs', disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
INFO 06-20 02:17:00 [config.py:713] This model supports multiple tasks: {'reward', 'score', 'embed', 'generate', 'classify'}. Defaulting to 'generate'.
WARNING 06-20 02:17:00 [arg_utils.py:1736] device type=cpu is not supported by the V1 Engine. Falling back to V0. 
INFO 06-20 02:17:00 [config.py:1776] Disabled the custom all-reduce kernel because it is not supported on current platform.
WARNING 06-20 02:17:00 [cpu.py:106] Environment variable VLLM_CPU_KVCACHE_SPACE (GiB) for CPU backend is not set, using 4 by default.
WARNING 06-20 02:17:00 [cpu.py:119] uni is not supported on CPU, fallback to mp distributed executor backend.
INFO 06-20 02:17:00 [api_server.py:246] Started engine process with PID 15
INFO 06-20 02:17:01 [__init__.py:239] Automatically detected platform cpu.
INFO 06-20 02:17:01 [llm_engine.py:243] Initializing a V0 LLM engine (v0.1.dev5905+g686623c) with config: model='/mnt/models/snapshots/sha256-9737c9ce5074907aa44d575918c5367cd877471f', speculative_config=None, tokenizer='/mnt/models/snapshots/sha256-9737c9ce5074907aa44d575918c5367cd877471f', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=TinyLlama-1.1B-Chat-v1.0, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
INFO 06-20 02:17:02 [cpu.py:45] Using Torch SDPA backend.
INFO 06-20 02:17:02 [importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 06-20 02:17:02 [parallel_state.py:946] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.91s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.91s/it]

INFO 06-20 02:17:04 [loader.py:458] Loading weights took 1.94 seconds
INFO 06-20 02:17:04 [executor_base.py:112] # cpu blocks: 11915, # CPU blocks: 0
INFO 06-20 02:17:04 [executor_base.py:117] Maximum concurrency for 2048 tokens per request: 93.09x
INFO 06-20 02:17:05 [llm_engine.py:449] init engine (profile, create kv cache, warmup model) took 1.32 seconds
INFO 06-20 02:17:05 [api_server.py:1090] Starting vLLM API server on http://0.0.0.0:8080
INFO 06-20 02:17:05 [launcher.py:28] Available routes are:
INFO 06-20 02:17:05 [launcher.py:36] Route: /openapi.json, Methods: HEAD, GET
INFO 06-20 02:17:05 [launcher.py:36] Route: /docs, Methods: HEAD, GET
INFO 06-20 02:17:05 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 06-20 02:17:05 [launcher.py:36] Route: /redoc, Methods: HEAD, GET
INFO 06-20 02:17:05 [launcher.py:36] Route: /health, Methods: GET
INFO 06-20 02:17:05 [launcher.py:36] Route: /load, Methods: GET
INFO 06-20 02:17:05 [launcher.py:36] Route: /ping, Methods: POST, GET
INFO 06-20 02:17:05 [launcher.py:36] Route: /tokenize, Methods: POST
INFO 06-20 02:17:05 [launcher.py:36] Route: /detokenize, Methods: POST
INFO 06-20 02:17:05 [launcher.py:36] Route: /v1/models, Methods: GET
INFO 06-20 02:17:05 [launcher.py:36] Route: /version, Methods: GET
INFO 06-20 02:17:05 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
INFO 06-20 02:17:05 [launcher.py:36] Route: /v1/completions, Methods: POST
INFO 06-20 02:17:05 [launcher.py:36] Route: /v1/embeddings, Methods: POST
INFO 06-20 02:17:05 [launcher.py:36] Route: /pooling, Methods: POST
INFO 06-20 02:17:05 [launcher.py:36] Route: /score, Methods: POST
INFO 06-20 02:17:05 [launcher.py:36] Route: /v1/score, Methods: POST
INFO 06-20 02:17:05 [launcher.py:36] Route: /v1/audio/transcriptions, Methods: POST
INFO 06-20 02:17:05 [launcher.py:36] Route: /rerank, Methods: POST
INFO 06-20 02:17:05 [launcher.py:36] Route: /v1/rerank, Methods: POST
INFO 06-20 02:17:05 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 06-20 02:17:05 [launcher.py:36] Route: /invocations, Methods: POST
INFO 06-20 02:17:05 [launcher.py:36] Route: /metrics, Methods: GET
INFO:     Started server process [2]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

Summary by Sourcery

Enable full support for vLLM runtime by resolving mount points for store-backed and direct model paths, configuring serve arguments including a new max model length flag, and updating documentation and system tests to reflect vLLM limitations.

New Features:

  • Add vLLM runtime mode with dedicated mount logic for both store-backed and local model paths
  • Introduce --max-model-len CLI option to expose vLLM's maximum model length

Enhancements:

  • Resolve container model path from snapshots or host paths and merge custom runtime args for vLLM serving
  • Add error handling when a valid host directory cannot be determined for vLLM mounts
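A minimal sketch of how the path resolution and error handling described above might look; the store API (get_ref_file, model_base_directory) and the mount-flag format are assumptions drawn from the reviewer's guide and the log output, not the merged code:

import os

MNT_DIR = "/mnt/models"  # assumed container-side mount point

def resolve_vllm_host_dir(model_path, store=None):
    """Pick the host directory to bind-mount for vLLM.

    Store-backed models mount their snapshot directory; direct paths mount
    either the file's parent directory or the directory itself.
    """
    if store is not None:
        ref_file = store.get_ref_file()  # hypothetical store API
        return os.path.join(store.model_base_directory, "snapshots", f"sha256-{ref_file.hash}")
    if os.path.isfile(model_path):
        return os.path.dirname(model_path)  # single safetensors file: mount its parent
    if os.path.isdir(model_path):
        return model_path  # safetensors directory: mount it as-is
    return None

def vllm_mount_arg(model_path, store=None):
    host_dir = resolve_vllm_host_dir(model_path, store)
    if not host_dir or not os.path.isdir(host_dir):
        raise ValueError(f"Could not determine a valid host directory to mount for {model_path}")
    return f"--mount=type=bind,src={host_dir},destination={MNT_DIR},ro"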

Documentation:

  • Document the --max-model-len flag in run and serve CLI man pages

Tests:

  • Skip GGUF serving in vLLM system tests and adjust cleanup steps in 040-serve.bats

kush-gupt and others added 2 commits June 19, 2025 22:04
* vllm mount fixes for safetensor directories

Signed-off-by: Kush Gupta <[email protected]>

* Update ramalama/model.py for better file detection

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* make format

Signed-off-by: Kush Gupta <[email protected]>

* improve mount for files

Signed-off-by: Kush Gupta <[email protected]>

* fix docs for new vllm param

Signed-off-by: Kush Gupta <[email protected]>

* add error handling

Signed-off-by: Kush Gupta <[email protected]>

* fix cli param default implementation

Signed-off-by: Kush Gupta <[email protected]>

* adjust error message string

Signed-off-by: Kush Gupta <[email protected]>

* skip broken test

Signed-off-by: Kush Gupta <[email protected]>

---------

Signed-off-by: Kush Gupta <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@sourcery-ai bot (Contributor) commented Jun 20, 2025

Reviewer's Guide

Adds comprehensive support for the vLLM runtime by implementing custom model mounting and runtime argument handling, introducing a configurable max-model-len flag, and updating documentation and tests to reflect vLLM’s capabilities and limitations.

Sequence diagram for vLLM model serving with custom mounting and max model length

sequenceDiagram
    actor User
    participant CLI as CLI
    participant ModelHandler as ModelHandler
    participant vLLM as vLLM Runtime
    User->>CLI: run serve --runtime vllm --max-model-len 4096 --model /path/to/model
    CLI->>ModelHandler: setup_mounts(model_path, args)
    ModelHandler->>ModelHandler: Determine model_base (store-backed or direct path)
    ModelHandler->>ModelHandler: Mount model directory
    CLI->>ModelHandler: build_exec_args_serve(args, exec_model_path)
    ModelHandler->>ModelHandler: handle_runtime(args, exec_args, exec_model_path)
    ModelHandler->>vLLM: Start vLLM with --model and --max_model_len
    vLLM-->>User: Model serving started

Entity relationship diagram for model store and vLLM mounting

erDiagram
    STORE ||--o| REFFILE : contains
    REFFILE {
        string hash
    }
    STORE {
        string model_base_directory
    }
    MODELHANDLER {
        string model_tag
        string model_name
    }
    MODELHANDLER }o--|| STORE : uses
    MODELHANDLER }o--o| REFFILE : references

Class diagram for updated model mounting and runtime handling

classDiagram
    class ModelHandler {
        +setup_mounts(model_path, args)
        +build_exec_args_serve(args, exec_model_path, chat_template_path, mm)
        +handle_runtime(args, exec_args, exec_model_path)
        +store
        +model_tag
        +engine
        +model_name
        +mnt_path
        +get_model_path(args)
    }
    ModelHandler <|-- vLLMRuntimeHandler : uses
    class vLLMRuntimeHandler {
        +setup_mounts(model_path, args)
        +handle_runtime(args, exec_args, exec_model_path)
        +vllm_max_model_len
    }
    ModelHandler o-- Store : store
    Store <|-- RefFile : get_ref_file()
    RefFile : +hash

File-Level Changes

Change | Details | Files
Introduce vLLM-specific mounting logic in setup_mounts
  • Branch on vLLM runtime in setup_mounts
  • Derive host model directory from store or direct path
  • Bind-mount determined directory or error if unresolved
ramalama/model.py
Implement vLLM runtime argument handling in handle_runtime (see the sketch after this table)
  • Resolve container model path from store snapshot or host path
  • Configure default and CLI-driven max_model_len
  • Assemble exec_args with port, model, length and model name
  • Append any additional runtime_args
ramalama/model.py
Add --max-model-len CLI flag for vLLM
  • Define vllm_max_model_len argument in runtime options parser
ramalama/cli.py
Update documentation for the new max-model-len option
  • Document flag in ramalama-run.1.md
  • Document flag in ramalama-serve.1.md
docs/ramalama-run.1.md
docs/ramalama-serve.1.md
Adjust system tests to skip unsupported vLLM GGUF scenarios
  • Skip GGUF serving tests for vLLM
  • Clean up image and YAML removal logic
test/system/040-serve.bats
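
As a rough illustration of the handle_runtime flow referenced above (the attribute names and the 2048 default come from the quoted code context and logs; everything else is an assumption, not the merged implementation):

import os

MNT_DIR = "/mnt/models"  # assumed container-side mount point

def build_vllm_serve_args(args, host_model_path, snapshot_dir=None):
    # Path as seen inside the container: a store snapshot under the mount
    # point, or the basename of a directly mounted file/directory.
    if snapshot_dir is not None:
        container_model_path = snapshot_dir
    else:
        container_model_path = os.path.join(MNT_DIR, os.path.basename(host_model_path))

    # Default context length, overridable via --max-model-len.
    vllm_max_model_len = 2048
    if getattr(args, "vllm_max_model_len", None) is not None:
        vllm_max_model_len = args.vllm_max_model_len

    # "vllm serve" itself is omitted because the image entrypoint already starts it.
    exec_args = [
        "--port", str(args.port),
        "--model", container_model_path,
        "--max_model_len", str(vllm_max_model_len),
        "--served-model-name", args.model_name,  # hypothetical attribute name
    ]

    runtime_args = getattr(args, "runtime_args", None)
    if runtime_args:
        exec_args.extend(list(runtime_args))
    return exec_args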


@gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @kush-gupt, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the integration of the vLLM runtime, enabling more flexible model serving and configuration. It introduces a new command-line option for vLLM's maximum model length, refines the logic for mounting various model types (including store-backed and direct file paths), and updates documentation and tests to reflect these changes and vLLM's specific model format requirements.

Highlights

  • vLLM Runtime Support: Expanded support for the vLLM runtime, including robust handling of model paths from both internal stores and direct file system locations (individual files or directories).
  • New CLI Flag for vLLM: Introduced a --max-model-len CLI flag to configure vLLM's maximum model length, providing more control to users over the context window.
  • Documentation and Test Updates: Updated documentation for the new CLI flag and adjusted system tests to account for vLLM's current inability to serve GGUF models, explicitly noting the requirement for safetensor models.

@sourcery-ai bot (Contributor) left a comment

Hey @kush-gupt - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `ramalama/model.py:585` </location>
<code_context>
+                self.model_name,
+            ]
+
+            if hasattr(args, 'runtime_args') and args.runtime_args:
+                exec_args.extend(args.runtime_args)
         else:
</code_context>

<issue_to_address>
Appending runtime_args directly may cause type issues if not a list.

If runtime_args is not a list, extend() will add each element individually (e.g., each character of a string). Ensure runtime_args is a list before extending exec_args.
</issue_to_address>
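
One way to guard against that, sketched under the assumption that runtime_args may arrive as a single string rather than a list:

runtime_args = getattr(args, "runtime_args", None)
if runtime_args:
    if isinstance(runtime_args, str):
        # A bare string would otherwise be split into characters by extend().
        runtime_args = runtime_args.split()
    exec_args.extend(runtime_args)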

### Comment 2
<location> `ramalama/model.py:570` </location>
<code_context>
+                else:
+                    container_model_path = os.path.join(MNT_DIR, os.path.basename(current_model_host_path))
+
+            vllm_max_model_len = 2048
+            if args.vllm_max_model_len:
+                vllm_max_model_len = args.vllm_max_model_len
+
+            exec_args = [
</code_context>

<issue_to_address>
Default value for vllm_max_model_len may be overridden by a falsy value.

Check specifically for None instead of relying on truthiness to prevent unintended overrides when the value is 0 or another falsy value.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
            vllm_max_model_len = 2048
            if args.vllm_max_model_len:
                vllm_max_model_len = args.vllm_max_model_len
=======
            vllm_max_model_len = 2048
            if args.vllm_max_model_len is not None:
                vllm_max_model_len = args.vllm_max_model_len
>>>>>>> REPLACE

</suggested_fix>

### Comment 3
<location> `ramalama/cli.py:857` </location>
<code_context>
         parser.add_argument(
             "--rag", help="RAG vector database or OCI Image to be served with the model", completer=local_models
         )
+        parser.add_argument(
+            "--max-model-len",
+            dest="vllm_max_model_len",
+            type=int,
+            help="Maximum model length for vLLM",
+            completer=suppressCompleter,
+        )
     if command in ["perplexity", "run", "serve"]:
</code_context>

<issue_to_address>
The help text for --max-model-len could be more descriptive.

Specify that this argument is only relevant for the vLLM runtime and explain the effects of changing its value to help prevent user misconfiguration.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
        parser.add_argument(
            "--max-model-len",
            dest="vllm_max_model_len",
            type=int,
            help="Maximum model length for vLLM",
            completer=suppressCompleter,
        )
=======
        parser.add_argument(
            "--max-model-len",
            dest="vllm_max_model_len",
            type=int,
            help=(
                "Maximum model length (in tokens) for the vLLM runtime. "
                "This argument is only relevant when using the vLLM runtime. "
                "Increasing this value allows processing longer sequences, but may increase memory usage. "
                "Setting this too high may cause out-of-memory errors, while setting it too low may truncate input or output."
            ),
            completer=suppressCompleter,
        )
>>>>>>> REPLACE

</suggested_fix>


self.engine.exec(stdout2null=args.noout)
return True

def setup_mounts(self, model_path, args):
issue (code-quality): Low code quality found in Model.setup_mounts - 16% (low-code-quality)


Explanation: The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

  • Reduce the function length by extracting pieces of functionality out into
    their own functions. This is the most important thing you can do - ideally a
    function should be less than 10 lines.
  • Reduce nesting, perhaps by introducing guard clauses to return early.
  • Ensure that variables are tightly scoped, so that code using related concepts
    sits together within the function rather than being scattered.
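
For example, a generic sketch of that guard-clause and extraction style; the helper names are hypothetical, not taken from the PR:

def setup_mounts(self, model_path, args):
    # Guard clause: runtimes with no special needs return early.
    if args.runtime != "vllm":
        return self._default_mounts(model_path, args)   # hypothetical helper

    # Extracted helper keeps the path logic in one tightly scoped place.
    host_dir = self._resolve_vllm_host_dir(model_path)  # hypothetical helper
    if host_dir is None:
        raise ValueError("Could not determine a valid host directory to mount")

    return self._bind_mount(host_dir, args)             # hypothetical helper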

exec_model_path = os.path.dirname(exec_model_path)
# Left out "vllm", "serve" the image entrypoint already starts it
exec_args = ["--port", args.port, "--model", MNT_FILE, "--max_model_len", "2048"]
container_model_path = ""
issue (code-quality): We've found these issues:

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces support for the vLLM runtime, including logic for mounting model directories, handling model paths, and a new CLI flag --max-model-len. The changes appear well-structured and cover store-backed models as well as direct file/directory paths. Documentation and tests have been updated accordingly.

@rhatdan (Member) commented Jun 20, 2025

Overall LGTM. Please respond to @afazekas and to the Sourcery suggestions.

@ericcurtin (Member) left a comment

LGTM, let's alias --ctx-size and --max-model-len if they are the same.
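
If they do turn out to be the same knob, a minimal argparse sketch of aliasing two spellings onto one destination (the dest name and default here are assumptions, not ramalama's actual parser):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--ctx-size", "--max-model-len",  # two spellings, one destination
    dest="context",
    type=int,
    default=2048,
    help="Context size / maximum model length in tokens",
)

args = parser.parse_args(["--max-model-len", "4096"])
print(args.context)  # 4096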

@rhatdan (Member) commented Jun 20, 2025

LGTM

@ericcurtin merged commit aa29aa6 into containers:main on Jun 20, 2025
12 of 17 checks passed