Add basic vllm support #97
Conversation
I think we may need to make a breaking change here: "vllm serve" only serves a .gguf file correctly when it has a .gguf extension, so instead of storing files like ~/.local/share/ramalama/models/ollama/granite-code:latest we might be better off storing them like ~/.local/share/ramalama/models/ollama/granite-code:latest.gguf |
Can we play games with symbolic links? I am not that concerned about breaking changes at this point, since we have not released anything yet. |
The above are symlinks; the cleanest way is to make a breaking change and rename this symlink: ~/.local/share/ramalama/models/ollama/granite-code:latest to this one: ~/.local/share/ramalama/models/ollama/granite-code:latest.gguf Also logged a bug with vllm: it's supposed to be possible to do this without the extension by using flags like "--quantization gguf" and "--load-format gguf", but it doesn't work. GGUF inference is only about 3 weeks old in vllm; they were using other formats before. |
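As a rough sketch of the proposed change (the path and model name are the ones quoted above; the exact store layout is an assumption), all that changes is the symlink name:

```
# Hypothetical sketch: rename the existing symlink so it carries a .gguf
# extension; the blob it points at stays where it is.
mv ~/.local/share/ramalama/models/ollama/granite-code:latest \
   ~/.local/share/ramalama/models/ollama/granite-code:latest.gguf

# With the extension in place, "vllm serve" can detect the GGUF format from
# the filename, without the (currently broken) --quantization/--load-format
# gguf flags.
vllm serve ~/.local/share/ramalama/models/ollama/granite-code:latest.gguf
```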
ramalama would solve this problem for many people: this is one of the most commented issues in vllm... vllm focuses more on enterprise deployments, so it frustrates people that it doesn't support Apple devices... Whereas with ramalama, one could test locally via ramalama --runtime llama.cpp serve, etc., and if they want to use vllm on a deployment server, they could potentially do ramalama --runtime vllm serve, etc. (or they could just use llama.cpp in both environments, of course)... |
Some of these things are bug fixes; I'm gonna split those out into another PR |
Excited for this! ollama/ollama#3953 There was some interest in this feature in that issue, so we should make sure to highlight it as a feature. |
vllm support will probably be completed once there is a new release of vllm, to pick up some PRs like: vllm releases pretty frequently though, so it shouldn't be too far away... |
I'm not sure how feasible it is for this project, but a FUSE filesystem for dynamically managing models, instead of symlinks, could be an option |
Not sure we wanna go here right now. For one, we currently have compatibility for macOS native. This problem in particular will go away in the next version of vllm when the above PR gets merged. But... I'd still be interested in hearing a more detailed example/explanation of why a FUSE filesystem would help us here... Although it would be preferable to have an fs-agnostic solution if at all possible... |
This is very basic support for the --nocontainer case, but may unblock other efforts like: |
This will only work with --nocontainer for now. Requires vllm > 6.0. Signed-off-by: Eric Curtin <[email protected]>
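For reference, a quick way to check the installed version against that requirement, assuming vllm was installed via pip:

```
# Print the installed vllm version so it can be compared against the
# minimum version quoted in the commit message above.
pip show vllm | grep '^Version'
```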
LGTM |
@kannon92 basic support for this was merged |
@ericcurtin From the ilab side of things, I have been trying to see if we can get operations like generate working on a Mac using vllm. Could this help with that? It seems like the discourse here is saying that vllm works on the Mac via ramalama? |
What this PR enables is the option to do:
ramalama --runtime llama.cpp serve
or:
ramalama --runtime vllm serve
llama.cpp and vllm have different primary design goals. llama.cpp was written primarily with commodity-grade systems in mind, like a MacBook, Mac mini, etc. vllm was written with almost the opposite priority: to run on huge enterprise-grade systems with lots of VRAM and powerful GPUs. So with this change, one could test on macOS with llama.cpp as a runtime but deploy to some enterprise server with vllm as a runtime. vllm does not support macOS today. llama.cpp is the best macOS inferencing runtime today; pretty much all projects doing local inference on macOS use llama.cpp as a library. |
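A sketch of that workflow, using only the flags mentioned in this thread; granite-code is a placeholder model name, and the flag placement and serve defaults are assumptions:

```
# Local development on a MacBook: llama.cpp is the runtime that works on
# macOS today (vllm does not support macOS).
ramalama --runtime llama.cpp serve granite-code

# On an enterprise GPU server: switch the runtime to vllm. Container
# integration is not wired up yet, hence --nocontainer.
ramalama --nocontainer --runtime vllm serve granite-code
```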