fix: vLLM serving and model mounting #1571
Conversation
* vllm mount fixes for safetensor directories
  Signed-off-by: Kush Gupta <[email protected]>
* Update ramalama/model.py for better file detection
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* make format
  Signed-off-by: Kush Gupta <[email protected]>
* improve mount for files
  Signed-off-by: Kush Gupta <[email protected]>
* fix docs for new vllm param
  Signed-off-by: Kush Gupta <[email protected]>
* add error handling
  Signed-off-by: Kush Gupta <[email protected]>
* fix cli param default implementation
  Signed-off-by: Kush Gupta <[email protected]>
* adjust error message string
  Signed-off-by: Kush Gupta <[email protected]>
* skip broken test
  Signed-off-by: Kush Gupta <[email protected]>

---------

Signed-off-by: Kush Gupta <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Reviewer's Guide

Adds comprehensive support for the vLLM runtime by implementing custom model mounting and runtime argument handling, introducing a configurable max-model-len flag, and updating documentation and tests to reflect vLLM’s capabilities and limitations.

Sequence diagram for vLLM model serving with custom mounting and max model length

```mermaid
sequenceDiagram
actor User
participant CLI as CLI
participant ModelHandler as ModelHandler
participant vLLM as vLLM Runtime
User->>CLI: run serve --runtime vllm --max-model-len 4096 --model /path/to/model
CLI->>ModelHandler: setup_mounts(model_path, args)
ModelHandler->>ModelHandler: Determine model_base (store-backed or direct path)
ModelHandler->>ModelHandler: Mount model directory
CLI->>ModelHandler: build_exec_args_serve(args, exec_model_path)
ModelHandler->>ModelHandler: handle_runtime(args, exec_args, exec_model_path)
ModelHandler->>vLLM: Start vLLM with --model and --max_model_len
vLLM-->>User: Model serving started
```
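For readers skimming the diagram, a rough sketch of what the serve-argument step might look like is shown below. The flag names mirror the diff context quoted later in this review (note the underscore spelling of --max_model_len and the 2048 default), while the function name and the --served-model-name pairing are illustrative assumptions rather than the exact ramalama implementation.

```python
# Illustrative sketch only: flag names follow the diagram and the diff context
# quoted later in this review, not necessarily the exact ramalama code. The
# vLLM container entrypoint is assumed to already run `vllm serve`, so only
# the flags are assembled here.
def build_vllm_serve_args(port, container_model_path, model_name, vllm_max_model_len=None):
    # Fall back to 2048 only when no value was given, so an explicit value wins.
    max_model_len = vllm_max_model_len if vllm_max_model_len is not None else 2048
    return [
        "--port", str(port),
        "--model", container_model_path,
        "--served-model-name", model_name,
        "--max_model_len", str(max_model_len),
    ]

# e.g. build_vllm_serve_args(8080, "/mnt/models", "my-model", 4096)
```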
Entity relationship diagram for model store and vLLM mounting

```mermaid
erDiagram
STORE ||--o| REFFILE : contains
REFFILE {
string hash
}
STORE {
string model_base_directory
}
MODELHANDLER {
string model_tag
string model_name
}
MODELHANDLER }o--|| STORE : uses
MODELHANDLER }o--o| REFFILE : references
```
Class diagram for updated model mounting and runtime handling

```mermaid
classDiagram
class ModelHandler {
+setup_mounts(model_path, args)
+build_exec_args_serve(args, exec_model_path, chat_template_path, mm)
+handle_runtime(args, exec_args, exec_model_path)
+store
+model_tag
+engine
+model_name
+mnt_path
+get_model_path(args)
}
ModelHandler <|-- vLLMRuntimeHandler : uses
class vLLMRuntimeHandler {
+setup_mounts(model_path, args)
+handle_runtime(args, exec_args, exec_model_path)
+vllm_max_model_len
}
ModelHandler o-- Store : store
Store <|-- RefFile : get_ref_file()
RefFile : +hash
```
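For readers who prefer code to diagrams, a hypothetical Python skeleton of the same relationships is sketched below; only names that appear in the diagrams are used, and the bodies are placeholders rather than the real ramalama classes.

```python
from typing import Optional

# Hypothetical skeleton mirroring the diagrams above; bodies are placeholders,
# not the actual ramalama implementation.
class RefFile:
    def __init__(self, hash: str):
        self.hash = hash  # content hash that locates the model on disk

class Store:
    def __init__(self, model_base_directory: str):
        self.model_base_directory = model_base_directory

    def get_ref_file(self, model_tag: str) -> Optional[RefFile]:
        raise NotImplementedError  # look up the ref file for a tagged model

class ModelHandler:
    def __init__(self, store: Store, model_tag: str, model_name: str):
        self.store = store
        self.model_tag = model_tag
        self.model_name = model_name
        self.mnt_path = None

    def setup_mounts(self, model_path, args):
        """Mount a store-backed or direct model path (file or directory)."""

    def build_exec_args_serve(self, args, exec_model_path, chat_template_path=None, mm=None):
        """Assemble serve arguments, delegating runtime specifics to handle_runtime."""

    def handle_runtime(self, args, exec_args, exec_model_path):
        """Apply runtime-specific flags such as vLLM's --max_model_len."""
```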
File-Level Changes
Possibly linked issues
Tips and commands

Interacting with Sourcery
Customizing Your Experience

Access your dashboard to:
Getting Help
Summary of Changes
Hello @kush-gupt, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the integration of the vLLM runtime, enabling more flexible model serving and configuration. It introduces a new command-line option for vLLM's maximum model length, refines the logic for mounting various model types (including store-backed and direct file paths), and updates documentation and tests to reflect these changes and vLLM's specific model format requirements.
Highlights
- vLLM Runtime Support: Expanded support for the vLLM runtime, including robust handling of model paths from both internal stores and direct file system locations (individual files or directories).
- New CLI Flag for vLLM: Introduced a --max-model-len CLI flag to configure vLLM's maximum model length, giving users more control over the context window.
- Documentation and Test Updates: Updated documentation for the new CLI flag and adjusted system tests to account for vLLM's current inability to serve GGUF models, explicitly noting the requirement for safetensor models.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Hey @kush-gupt - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `ramalama/model.py:585` </location>
<code_context>
+ self.model_name,
+ ]
+
+ if hasattr(args, 'runtime_args') and args.runtime_args:
+ exec_args.extend(args.runtime_args)
else:
</code_context>
<issue_to_address>
Appending runtime_args directly may cause type issues if not a list.
If runtime_args is not a list, extend() will add each element individually (e.g., each character of a string). Ensure runtime_args is a list before extending exec_args.
</issue_to_address>
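A minimal way to implement that guard could look like the following sketch, with `args` and `exec_args` as in the snippet above; splitting a string value with `shlex` is an assumption about how a string-typed runtime_args should be interpreted.

```python
import shlex

# Sketch of the suggested guard: normalize runtime_args into a list of strings
# before extending exec_args, so a plain string is not spread character by
# character by list.extend().
runtime_args = getattr(args, "runtime_args", None)
if runtime_args:
    if isinstance(runtime_args, str):
        runtime_args = shlex.split(runtime_args)  # assumption: split like a shell would
    exec_args.extend(str(a) for a in runtime_args)
```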
### Comment 2
<location> `ramalama/model.py:570` </location>
<code_context>
+ else:
+ container_model_path = os.path.join(MNT_DIR, os.path.basename(current_model_host_path))
+
+ vllm_max_model_len = 2048
+ if args.vllm_max_model_len:
+ vllm_max_model_len = args.vllm_max_model_len
+
+ exec_args = [
</code_context>
<issue_to_address>
Default value for vllm_max_model_len may be overridden by a falsy value.
Check specifically for None instead of relying on truthiness to prevent unintended overrides when the value is 0 or another falsy value.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
vllm_max_model_len = 2048
if args.vllm_max_model_len:
vllm_max_model_len = args.vllm_max_model_len
=======
vllm_max_model_len = 2048
if args.vllm_max_model_len is not None:
vllm_max_model_len = args.vllm_max_model_len
>>>>>>> REPLACE
</suggested_fix>
### Comment 3
<location> `ramalama/cli.py:857` </location>
<code_context>
parser.add_argument(
"--rag", help="RAG vector database or OCI Image to be served with the model", completer=local_models
)
+ parser.add_argument(
+ "--max-model-len",
+ dest="vllm_max_model_len",
+ type=int,
+ help="Maximum model length for vLLM",
+ completer=suppressCompleter,
+ )
if command in ["perplexity", "run", "serve"]:
</code_context>
<issue_to_address>
The help text for --max-model-len could be more descriptive.
Specify that this argument is only relevant for the vLLM runtime and explain the effects of changing its value to help prevent user misconfiguration.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
parser.add_argument(
"--max-model-len",
dest="vllm_max_model_len",
type=int,
help="Maximum model length for vLLM",
completer=suppressCompleter,
)
=======
parser.add_argument(
"--max-model-len",
dest="vllm_max_model_len",
type=int,
help=(
"Maximum model length (in tokens) for the vLLM runtime. "
"This argument is only relevant when using the vLLM runtime. "
"Increasing this value allows processing longer sequences, but may increase memory usage. "
"Setting this too high may cause out-of-memory errors, while setting it too low may truncate input or output."
),
completer=suppressCompleter,
)
>>>>>>> REPLACE
</suggested_fix>

Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
self.engine.exec(stdout2null=args.noout)
return True

def setup_mounts(self, model_path, args):
issue (code-quality): Low code quality found in Model.setup_mounts - 16% (low-code-quality)
Explanation
The quality score for this function is below the quality threshold of 25%. This score is a combination of the method length, cognitive complexity and working memory.
How can you solve this?
It might be worth refactoring this function to make it shorter and more readable.
- Reduce the function length by extracting pieces of functionality out into their own functions. This is the most important thing you can do; ideally a function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts sits together within the function rather than being scattered.
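As a concrete illustration of the guard-clause and extraction advice, a hedged sketch of the resulting shape is below; the _mount_* helper names are invented for illustration and are not part of the ramalama code.

```python
import os

# Purely illustrative shape for a shorter setup_mounts: guard clauses first,
# then one small helper per mount source. The _mount_* helpers are hypothetical.
def setup_mounts(self, model_path, args):
    if getattr(args, "runtime", None) != "vllm":
        return self._mount_default(model_path, args)  # early exit keeps nesting flat
    if os.path.isdir(model_path):
        return self._mount_model_directory(model_path, args)
    return self._mount_model_file(model_path, args)
```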
exec_model_path = os.path.dirname(exec_model_path)
# Left out "vllm", "serve" the image entrypoint already starts it
exec_args = ["--port", args.port, "--model", MNT_FILE, "--max_model_len", "2048"]
container_model_path = ""
issue (code-quality): We've found these issues:

- Extract code out into method (`extract-method`)
- Move setting of default value for variable into `else` branch (`introduce-default-else`)
- Replace if statement with if expression [×2] (`assign-if-exp`)
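Applied to the vllm_max_model_len default shown in the earlier diff context, the last two suggestions collapse to a single expression; this sketch also keeps the explicit None check from the earlier Sourcery comment, and the 2048 default comes from the quoted code.

```python
# Sketch: default-plus-override collapsed into one conditional expression,
# checking for None explicitly so a deliberate 0 is not silently replaced.
vllm_max_model_len = args.vllm_max_model_len if args.vllm_max_model_len is not None else 2048
```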
Code Review
This pull request introduces support for the vLLM runtime, including logic for mounting model directories, handling model paths, and a new CLI flag --max-model-len. The changes appear well-structured and cover store-backed models as well as direct file/directory paths. Documentation and tests have been updated accordingly.
Overall LGTM. Please respond to @afazekas and to the Sourcery suggestions.
ericcurtin left a comment
LGTM, let's alias --ctx-size and --max-model-len if they are the same
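If the two flags really are the same setting, argparse can register both option strings on one destination, as in the sketch below; the dest name, default, and help text are placeholders, not values from the ramalama CLI.

```python
import argparse

parser = argparse.ArgumentParser()
# Sketch: both spellings map to the same destination, so either flag sets it.
parser.add_argument(
    "--ctx-size", "--max-model-len",
    dest="context",
    type=int,
    default=2048,  # placeholder default
    help="Context window / maximum model length in tokens",
)
```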
Signed-off-by: Kush Gupta <[email protected]>
LGTM
This pull request effectively introduces support for the vLLM runtime. The changes include logic for mounting safetensor directories, handling model paths correctly whether they are store-backed or direct file paths (both individual files and directories), and exposing vLLM's maximum model length via a new --max-model-len CLI flag. The documentation has been updated accordingly, and test adjustments were made to reflect vLLM's current inability to serve GGUFs.

Once the ramalama-vllm images are building, I can submit tests for vllm serving, but I also have a series of bugs to submit on kube generation in general.
Example output with Mac ARM:
Summary by Sourcery
Enable full support for vLLM runtime by resolving mount points for store-backed and direct model paths, configuring serve arguments including a new max model length flag, and updating documentation and system tests to reflect vLLM limitations.
New Features:
Enhancements:
Documentation:
Tests: