
Conversation

@kush-gupt (Contributor) commented Jun 20, 2025

This pull request introduces support for the vLLM runtime. The changes include logic for mounting safetensors directories, handling model paths correctly whether they are store-backed or direct file paths (both individual files and directories), and exposing vLLM's maximum model length via a new --max-model-len CLI flag. The documentation has been updated accordingly, and tests were adjusted to reflect vLLM's inability to serve GGUFs.

Once the ramalama-vllm images are building, I can submit tests for vLLM serving; I also have a series of bugs to file about kube generation in general.

Example output with Mac ARM:

❯ ramalama --runtime vllm --image quay.io/kugupta/vllm-cpu-arm --container serve hf://TinyLlama/TinyLlama-1.1B-Chat-v1.0
INFO 06-20 02:16:56 [__init__.py:239] Automatically detected platform cpu.
INFO 06-20 02:16:57 [api_server.py:1043] vLLM API server version 0.1.dev5905+g686623c
INFO 06-20 02:16:57 [api_server.py:1044] args: Namespace(host=None, port=8080, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/mnt/models/snapshots/sha256-9737c9ce5074907aa44d575918c5367cd877471f', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=2048, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, speculative_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['TinyLlama-1.1B-Chat-v1.0'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, enable_chunked_prefill=None, multi_step_stream_outputs=True, scheduling_policy='fcfs', disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
INFO 06-20 02:17:00 [config.py:713] This model supports multiple tasks: {'reward', 'score', 'embed', 'generate', 'classify'}. Defaulting to 'generate'.
WARNING 06-20 02:17:00 [arg_utils.py:1736] device type=cpu is not supported by the V1 Engine. Falling back to V0. 
INFO 06-20 02:17:00 [config.py:1776] Disabled the custom all-reduce kernel because it is not supported on current platform.
WARNING 06-20 02:17:00 [cpu.py:106] Environment variable VLLM_CPU_KVCACHE_SPACE (GiB) for CPU backend is not set, using 4 by default.
WARNING 06-20 02:17:00 [cpu.py:119] uni is not supported on CPU, fallback to mp distributed executor backend.
INFO 06-20 02:17:00 [api_server.py:246] Started engine process with PID 15
INFO 06-20 02:17:01 [__init__.py:239] Automatically detected platform cpu.
INFO 06-20 02:17:01 [llm_engine.py:243] Initializing a V0 LLM engine (v0.1.dev5905+g686623c) with config: model='/mnt/models/snapshots/sha256-9737c9ce5074907aa44d575918c5367cd877471f', speculative_config=None, tokenizer='/mnt/models/snapshots/sha256-9737c9ce5074907aa44d575918c5367cd877471f', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=TinyLlama-1.1B-Chat-v1.0, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
INFO 06-20 02:17:02 [cpu.py:45] Using Torch SDPA backend.
INFO 06-20 02:17:02 [importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 06-20 02:17:02 [parallel_state.py:946] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.91s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.91s/it]

INFO 06-20 02:17:04 [loader.py:458] Loading weights took 1.94 seconds
INFO 06-20 02:17:04 [executor_base.py:112] # cpu blocks: 11915, # CPU blocks: 0
INFO 06-20 02:17:04 [executor_base.py:117] Maximum concurrency for 2048 tokens per request: 93.09x
INFO 06-20 02:17:05 [llm_engine.py:449] init engine (profile, create kv cache, warmup model) took 1.32 seconds
INFO 06-20 02:17:05 [api_server.py:1090] Starting vLLM API server on http://0.0.0.0:8080
INFO 06-20 02:17:05 [launcher.py:28] Available routes are:
INFO 06-20 02:17:05 [launcher.py:36] Route: /openapi.json, Methods: HEAD, GET
INFO 06-20 02:17:05 [launcher.py:36] Route: /docs, Methods: HEAD, GET
INFO 06-20 02:17:05 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 06-20 02:17:05 [launcher.py:36] Route: /redoc, Methods: HEAD, GET
INFO 06-20 02:17:05 [launcher.py:36] Route: /health, Methods: GET
INFO 06-20 02:17:05 [launcher.py:36] Route: /load, Methods: GET
INFO 06-20 02:17:05 [launcher.py:36] Route: /ping, Methods: POST, GET
INFO 06-20 02:17:05 [launcher.py:36] Route: /tokenize, Methods: POST
INFO 06-20 02:17:05 [launcher.py:36] Route: /detokenize, Methods: POST
INFO 06-20 02:17:05 [launcher.py:36] Route: /v1/models, Methods: GET
INFO 06-20 02:17:05 [launcher.py:36] Route: /version, Methods: GET
INFO 06-20 02:17:05 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
INFO 06-20 02:17:05 [launcher.py:36] Route: /v1/completions, Methods: POST
INFO 06-20 02:17:05 [launcher.py:36] Route: /v1/embeddings, Methods: POST
INFO 06-20 02:17:05 [launcher.py:36] Route: /pooling, Methods: POST
INFO 06-20 02:17:05 [launcher.py:36] Route: /score, Methods: POST
INFO 06-20 02:17:05 [launcher.py:36] Route: /v1/score, Methods: POST
INFO 06-20 02:17:05 [launcher.py:36] Route: /v1/audio/transcriptions, Methods: POST
INFO 06-20 02:17:05 [launcher.py:36] Route: /rerank, Methods: POST
INFO 06-20 02:17:05 [launcher.py:36] Route: /v1/rerank, Methods: POST
INFO 06-20 02:17:05 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 06-20 02:17:05 [launcher.py:36] Route: /invocations, Methods: POST
INFO 06-20 02:17:05 [launcher.py:36] Route: /metrics, Methods: GET
INFO:     Started server process [2]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

Summary by Sourcery

Enable full support for vLLM runtime by resolving mount points for store-backed and direct model paths, configuring serve arguments including a new max model length flag, and updating documentation and system tests to reflect vLLM limitations.

New Features:

  • Add vLLM runtime mode with dedicated mount logic for both store-backed and local model paths
  • Introduce --max-model-len CLI option to expose vLLM's maximum model length

Enhancements:

  • Resolve container model path from snapshots or host paths and merge custom runtime args for vLLM serving
  • Add error handling when a valid host directory cannot be determined for vLLM mounts
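A minimal sketch of how the path resolution and error handling described above might look; the store API (get_ref_file, model_base_directory) and the mount-flag format are assumptions drawn from the reviewer's guide and the log output, not the merged code:

import os

MNT_DIR = "/mnt/models"  # assumed container-side mount point

def resolve_vllm_host_dir(model_path, store=None):
    """Pick the host directory to bind-mount for vLLM.

    Store-backed models mount their snapshot directory; direct paths mount
    either the file's parent directory or the directory itself.
    """
    if store is not None:
        ref_file = store.get_ref_file()  # hypothetical store API
        return os.path.join(store.model_base_directory, "snapshots", f"sha256-{ref_file.hash}")
    if os.path.isfile(model_path):
        return os.path.dirname(model_path)  # single safetensors file: mount its parent
    if os.path.isdir(model_path):
        return model_path  # safetensors directory: mount it as-is
    return None

def vllm_mount_arg(model_path, store=None):
    host_dir = resolve_vllm_host_dir(model_path, store)
    if not host_dir or not os.path.isdir(host_dir):
        raise ValueError(f"Could not determine a valid host directory to mount for {model_path}")
    return f"--mount=type=bind,src={host_dir},destination={MNT_DIR},ro"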

Documentation:

  • Document the --max-model-len flag in run and serve CLI man pages

Tests:

  • Skip GGUF serving in vLLM system tests and adjust cleanup steps in 040-serve.bats

kush-gupt and others added 2 commits June 19, 2025 22:04
* vllm mount fixes for safetensor directories

Signed-off-by: Kush Gupta <[email protected]>

* Update ramalama/model.py for better file detection

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* make format

Signed-off-by: Kush Gupta <[email protected]>

* improve mount for files

Signed-off-by: Kush Gupta <[email protected]>

* fix docs for new vllm param

Signed-off-by: Kush Gupta <[email protected]>

* add error handling

Signed-off-by: Kush Gupta <[email protected]>

* fix cli param default implementation

Signed-off-by: Kush Gupta <[email protected]>

* adjust error message string

Signed-off-by: Kush Gupta <[email protected]>

* skip broken test

Signed-off-by: Kush Gupta <[email protected]>

---------

Signed-off-by: Kush Gupta <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@sourcery-ai bot (Contributor) commented Jun 20, 2025

Reviewer's Guide

Adds comprehensive support for the vLLM runtime by implementing custom model mounting and runtime argument handling, introducing a configurable max-model-len flag, and updating documentation and tests to reflect vLLM’s capabilities and limitations.

Sequence diagram for vLLM model serving with custom mounting and max model length

sequenceDiagram
    actor User
    participant CLI as CLI
    participant ModelHandler as ModelHandler
    participant vLLM as vLLM Runtime
    User->>CLI: run serve --runtime vllm --max-model-len 4096 --model /path/to/model
    CLI->>ModelHandler: setup_mounts(model_path, args)
    ModelHandler->>ModelHandler: Determine model_base (store-backed or direct path)
    ModelHandler->>ModelHandler: Mount model directory
    CLI->>ModelHandler: build_exec_args_serve(args, exec_model_path)
    ModelHandler->>ModelHandler: handle_runtime(args, exec_args, exec_model_path)
    ModelHandler->>vLLM: Start vLLM with --model and --max_model_len
    vLLM-->>User: Model serving started

Entity relationship diagram for model store and vLLM mounting

erDiagram
    STORE ||--o| REFFILE : contains
    REFFILE {
        string hash
    }
    STORE {
        string model_base_directory
    }
    MODELHANDLER {
        string model_tag
        string model_name
    }
    MODELHANDLER }o--|| STORE : uses
    MODELHANDLER }o--o| REFFILE : references

Class diagram for updated model mounting and runtime handling

classDiagram
    class ModelHandler {
        +setup_mounts(model_path, args)
        +build_exec_args_serve(args, exec_model_path, chat_template_path, mm)
        +handle_runtime(args, exec_args, exec_model_path)
        +store
        +model_tag
        +engine
        +model_name
        +mnt_path
        +get_model_path(args)
    }
    ModelHandler <|-- vLLMRuntimeHandler : uses
    class vLLMRuntimeHandler {
        +setup_mounts(model_path, args)
        +handle_runtime(args, exec_args, exec_model_path)
        +vllm_max_model_len
    }
    ModelHandler o-- Store : store
    Store <|-- RefFile : get_ref_file()
    RefFile : +hash

File-Level Changes

Change | Details | Files
Introduce vLLM-specific mounting logic in setup_mounts
  • Branch on vLLM runtime in setup_mounts
  • Derive host model directory from store or direct path
  • Bind-mount determined directory or error if unresolved
ramalama/model.py
Implement vLLM runtime argument handling in handle_runtime (see the sketch after this table)
  • Resolve container model path from store snapshot or host path
  • Configure default and CLI-driven max_model_len
  • Assemble exec_args with port, model, length and model name
  • Append any additional runtime_args
ramalama/model.py
Add --max-model-len CLI flag for vLLM
  • Define vllm_max_model_len argument in runtime options parser
ramalama/cli.py
Update documentation for the new max-model-len option
  • Document flag in ramalama-run.1.md
  • Document flag in ramalama-serve.1.md
docs/ramalama-run.1.md
docs/ramalama-serve.1.md
Adjust system tests to skip unsupported vLLM GGUF scenarios
  • Skip GGUF serving tests for vLLM
  • Clean up image and YAML removal logic
test/system/040-serve.bats
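
As a rough illustration of the handle_runtime flow referenced above (the attribute names and the 2048 default come from the quoted code context and logs; everything else is an assumption, not the merged implementation):

import os

MNT_DIR = "/mnt/models"  # assumed container-side mount point

def build_vllm_serve_args(args, host_model_path, snapshot_dir=None):
    # Path as seen inside the container: a store snapshot under the mount
    # point, or the basename of a directly mounted file/directory.
    if snapshot_dir is not None:
        container_model_path = snapshot_dir
    else:
        container_model_path = os.path.join(MNT_DIR, os.path.basename(host_model_path))

    # Default context length, overridable via --max-model-len.
    vllm_max_model_len = 2048
    if getattr(args, "vllm_max_model_len", None) is not None:
        vllm_max_model_len = args.vllm_max_model_len

    # "vllm serve" itself is omitted because the image entrypoint already starts it.
    exec_args = [
        "--port", str(args.port),
        "--model", container_model_path,
        "--max_model_len", str(vllm_max_model_len),
        "--served-model-name", args.model_name,  # hypothetical attribute name
    ]

    runtime_args = getattr(args, "runtime_args", None)
    if runtime_args:
        exec_args.extend(list(runtime_args))
    return exec_args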


@gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @kush-gupt, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the integration of the vLLM runtime, enabling more flexible model serving and configuration. It introduces a new command-line option for vLLM's maximum model length, refines the logic for mounting various model types (including store-backed and direct file paths), and updates documentation and tests to reflect these changes and vLLM's specific model format requirements.

Highlights

  • vLLM Runtime Support: Expanded support for the vLLM runtime, including robust handling of model paths from both internal stores and direct file system locations (individual files or directories).
  • New CLI Flag for vLLM: Introduced a --max-model-len CLI flag to configure vLLM's maximum model length, providing more control to users over the context window.
  • Documentation and Test Updates: Updated documentation for the new CLI flag and adjusted system tests to account for vLLM's current inability to serve GGUF models, explicitly noting the requirement for safetensor models.

@sourcery-ai bot (Contributor) left a comment

Hey @kush-gupt - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `ramalama/model.py:585` </location>
<code_context>
+                self.model_name,
+            ]
+
+            if hasattr(args, 'runtime_args') and args.runtime_args:
+                exec_args.extend(args.runtime_args)
         else:
</code_context>

<issue_to_address>
Appending runtime_args directly may cause type issues if not a list.

If runtime_args is not a list, extend() will add each element individually (e.g., each character of a string). Ensure runtime_args is a list before extending exec_args.
</issue_to_address>
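
One way to guard against that, sketched under the assumption that runtime_args may arrive as a single string rather than a list:

runtime_args = getattr(args, "runtime_args", None)
if runtime_args:
    if isinstance(runtime_args, str):
        # A bare string would otherwise be split into characters by extend().
        runtime_args = runtime_args.split()
    exec_args.extend(runtime_args)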

### Comment 2
<location> `ramalama/model.py:570` </location>
<code_context>
+                else:
+                    container_model_path = os.path.join(MNT_DIR, os.path.basename(current_model_host_path))
+
+            vllm_max_model_len = 2048
+            if args.vllm_max_model_len:
+                vllm_max_model_len = args.vllm_max_model_len
+
+            exec_args = [
</code_context>

<issue_to_address>
Default value for vllm_max_model_len may be overridden by a falsy value.

Check specifically for None instead of relying on truthiness to prevent unintended overrides when the value is 0 or another falsy value.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
            vllm_max_model_len = 2048
            if args.vllm_max_model_len:
                vllm_max_model_len = args.vllm_max_model_len
=======
            vllm_max_model_len = 2048
            if args.vllm_max_model_len is not None:
                vllm_max_model_len = args.vllm_max_model_len
>>>>>>> REPLACE

</suggested_fix>

### Comment 3
<location> `ramalama/cli.py:857` </location>
<code_context>
         parser.add_argument(
             "--rag", help="RAG vector database or OCI Image to be served with the model", completer=local_models
         )
+        parser.add_argument(
+            "--max-model-len",
+            dest="vllm_max_model_len",
+            type=int,
+            help="Maximum model length for vLLM",
+            completer=suppressCompleter,
+        )
     if command in ["perplexity", "run", "serve"]:
</code_context>

<issue_to_address>
The help text for --max-model-len could be more descriptive.

Specify that this argument is only relevant for the vLLM runtime and explain the effects of changing its value to help prevent user misconfiguration.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
        parser.add_argument(
            "--max-model-len",
            dest="vllm_max_model_len",
            type=int,
            help="Maximum model length for vLLM",
            completer=suppressCompleter,
        )
=======
        parser.add_argument(
            "--max-model-len",
            dest="vllm_max_model_len",
            type=int,
            help=(
                "Maximum model length (in tokens) for the vLLM runtime. "
                "This argument is only relevant when using the vLLM runtime. "
                "Increasing this value allows processing longer sequences, but may increase memory usage. "
                "Setting this too high may cause out-of-memory errors, while setting it too low may truncate input or output."
            ),
            completer=suppressCompleter,
        )
>>>>>>> REPLACE

</suggested_fix>


self.engine.exec(stdout2null=args.noout)
return True

def setup_mounts(self, model_path, args):
issue (code-quality): Low code quality found in Model.setup_mounts - 16% (low-code-quality)


Explanation: The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

  • Reduce the function length by extracting pieces of functionality out into
    their own functions. This is the most important thing you can do - ideally a
    function should be less than 10 lines.
  • Reduce nesting, perhaps by introducing guard clauses to return early.
  • Ensure that variables are tightly scoped, so that code using related concepts
    sits together within the function rather than being scattered.
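
For example, a generic sketch of that guard-clause and extraction style; the helper names are hypothetical, not taken from the PR:

def setup_mounts(self, model_path, args):
    # Guard clause: runtimes with no special needs return early.
    if args.runtime != "vllm":
        return self._default_mounts(model_path, args)   # hypothetical helper

    # Extracted helper keeps the path logic in one tightly scoped place.
    host_dir = self._resolve_vllm_host_dir(model_path)  # hypothetical helper
    if host_dir is None:
        raise ValueError("Could not determine a valid host directory to mount")

    return self._bind_mount(host_dir, args)             # hypothetical helper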

exec_model_path = os.path.dirname(exec_model_path)
# Left out "vllm", "serve" the image entrypoint already starts it
exec_args = ["--port", args.port, "--model", MNT_FILE, "--max_model_len", "2048"]
container_model_path = ""
issue (code-quality): We've found these issues:

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces support for the vLLM runtime, including logic for mounting model directories, handling model paths, and a new CLI flag --max-model-len. The changes appear well-structured and cover store-backed models as well as direct file/directory paths. Documentation and tests have been updated accordingly.

@rhatdan (Member) commented Jun 20, 2025

Overall LGTM. Please respond to @afazekas and to the Sourcery suggestions.

@ericcurtin (Member) left a comment

LGTM, let's alias --ctx-size and --max-model-len if they are the same.
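
If they do turn out to be the same knob, a minimal argparse sketch of aliasing two spellings onto one destination (the dest name and default here are assumptions, not ramalama's actual parser):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--ctx-size", "--max-model-len",  # two spellings, one destination
    dest="context",
    type=int,
    default=2048,
    help="Context size / maximum model length in tokens",
)

args = parser.parse_args(["--max-model-len", "4096"])
print(args.context)  # 4096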

@rhatdan (Member) commented Jun 20, 2025

LGTM

@ericcurtin merged commit aa29aa6 into containers:main on Jun 20, 2025
12 of 17 checks passed