
Add basic vllm support #97

Merged
rhatdan merged 1 commit into main from vllm on Sep 23, 2024

Conversation

@ericcurtin (Collaborator) commented Aug 29, 2024

This will only work with --nocontainer for now. Requires vllm > 0.6.0.

ericcurtin marked this pull request as draft on August 29, 2024 at 10:30
@ericcurtin (Collaborator, Author) commented Aug 29, 2024

I think we may need to make a breaking change here: "vllm serve" only serves a .gguf file correctly when the file has a .gguf extension, so instead of storing files like this:

~/.local/share/ramalama/models/ollama/granite-code:latest

we might be better off storing them like this:

~/.local/share/ramalama/models/ollama/granite-code:latest.gguf
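
To illustrate the change being proposed (a rough sketch only; the store path below is the one quoted above, everything else is a placeholder and not the actual ramalama layout), only the name of the symlink would change, not the downloaded blob it points at:

$ cd ~/.local/share/ramalama/models/ollama
$ # rename the existing symlink so the .gguf extension is visible to "vllm serve"
$ mv granite-code:latest granite-code:latest.gguf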

@rhatdan (Member) commented Aug 29, 2024

Can we play games with symbolic links? I am not that concerned about breaking changes at this point, since we have not released anything yet.
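
One non-breaking way to play that game (purely illustrative; not what the PR ends up doing) would be to keep the existing name and add a second, extension-suffixed symlink that resolves to the same blob:

$ cd ~/.local/share/ramalama/models/ollama
$ # hypothetical: create an extra .gguf-suffixed name pointing at the same target
$ ln -s "$(readlink granite-code:latest)" granite-code:latest.gguf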

@ericcurtin (Collaborator, Author) commented Aug 29, 2024

The paths above are already symlinks; the cleanest way is to make a breaking change and rename this symlink:

~/.local/share/ramalama/models/ollama/granite-code:latest

to this one:

~/.local/share/ramalama/models/ollama/granite-code:latest.gguf

I also logged a bug with vllm:

vllm-project/vllm#7993

It's supposed to be possible to do this without the extension by using flags like "--quantization gguf" and "--load-format gguf", but that doesn't work. GGUF inference is only about three weeks old in vllm; they were using other formats before.
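
For reference, the extension-less invocation under discussion would look roughly like this (the flags are the ones quoted above; this is a sketch of the path that was reported broken, not a verified command line):

$ vllm serve ~/.local/share/ramalama/models/ollama/granite-code:latest \
    --quantization gguf --load-format gguf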

@ericcurtin (Collaborator, Author) commented Aug 29, 2024

ramalama would solve this problem for many:

vllm-project/vllm#1441

This is one of the most commented issues in vllm. vllm focuses more on enterprise deployments, so it frustrates people that it doesn't support Apple devices.

With ramalama, one could test locally via:

ramalama --runtime llama.cpp serve ...

and, if they want to use vllm on a deployment server, run:

ramalama --runtime vllm serve ...

(or they could just use llama.cpp in both environments, of course).

@ericcurtin (Collaborator, Author)

Some of these changes are bug fixes; I'm going to split those out into a separate PR.

@kannon92

Excited for this! ollama/ollama#3953

There was some interest in this feature in that issue, so we should make sure to highlight it as a feature.

@ericcurtin (Collaborator, Author)

vllm support will probably be completed once there is a new release of vllm that picks up PRs like:

vllm-project/vllm#8056

vllm releases pretty frequently though, so it shouldn't be too far away...

@matbee-eth

> I think we may need to make a breaking change here: "vllm serve" only serves a .gguf file correctly when the file has a .gguf extension, so instead of storing files like this:
>
> ~/.local/share/ramalama/models/ollama/granite-code:latest
>
> we might be better off storing them like this:
>
> ~/.local/share/ramalama/models/ollama/granite-code:latest.gguf

I'm not sure how feasible it is for this project, but a FUSE filesystem for dynamically managing models, instead of symlinks, could be an option.

@ericcurtin (Collaborator, Author) commented Sep 2, 2024

>> I think we may need to make a breaking change here: "vllm serve" only serves a .gguf file correctly when the file has a .gguf extension, so instead of storing files like this:
>>
>> ~/.local/share/ramalama/models/ollama/granite-code:latest
>>
>> we might be better off storing them like this:
>>
>> ~/.local/share/ramalama/models/ollama/granite-code:latest.gguf
>
> I'm not sure how feasible it is for this project, but a FUSE filesystem for dynamically managing models, instead of symlinks, could be an option.

Not sure we want to go there right now. For one, we currently have native macOS compatibility. This particular problem will go away in the next version of vllm, when the PR above gets merged.

But I'd still be interested in hearing a more detailed example/explanation of why a FUSE filesystem would help us here...

Although it would be preferable to have a filesystem-agnostic solution if at all possible...


We were not able to find or create Copr project packit/containers-ramalama-97 specified in the config with the following error:

Packit received HTTP 500 Internal Server Error from Copr Service. Check the Copr status page: https://copr.fedorainfracloud.org/status/stats/, or ask for help in Fedora Build System matrix channel: https://matrix.to/#/#buildsys:fedoraproject.org.

Unless the HTTP status code above is >= 500, please check your configuration for:

  1. typos in the owner and project name (groups need to be prefixed with @)
  2. whether the project name contains disallowed characters (only letters, digits, underscores, dashes and dots may be used)
  3. whether the project itself exists (Packit creates projects only in its own namespace)
  4. whether Packit is allowed to build in your Copr project
  5. whether your Copr project/group is private (it must not be)

ericcurtin changed the title from "Add runtime flag so we can alternatively serve via vllm" to "Add basic vllm support" on Sep 21, 2024
ericcurtin marked this pull request as ready for review on September 21, 2024 at 14:15
@ericcurtin (Collaborator, Author)

This is very basic support for the --nocontainer case, but may unblock other efforts like:

#150

ericcurtin force-pushed the vllm branch 4 times, most recently from e7d33e1 to 39216e7 on September 23, 2024 at 15:39

ramalama/cli.py: review comment (outdated, resolved)

This will only work with --nocontainer for now. Requires vllm > 0.6.0.

Signed-off-by: Eric Curtin <[email protected]>
@rhatdan (Member) commented Sep 23, 2024

LGTM

rhatdan merged commit 8c48bbd into main on Sep 23, 2024
13 checks passed
ericcurtin deleted the vllm branch on September 23, 2024 at 20:19
@ericcurtin (Collaborator, Author)

@kannon92 basic support for this was merged

@cdoern commented Sep 26, 2024

@ericcurtin from the ilab side of things, I have been trying to see if we can get operations like generate working on a Mac using vllm. Could this help with that? It seems like the discussion here is saying that vllm works on the Mac via ramalama?

@ericcurtin (Collaborator, Author) commented Sep 26, 2024

What this PR enables is the option to do:

$ ramalama --runtime vllm serve some-model

or

$ ramalama --runtime llama.cpp serve some-model

llama.cpp and vllm have different primary design goals. llama.cpp was written primarily with commodity-grade systems in mind, like a MacBook, Mac mini, etc. vllm was written with almost the opposite priority: to run on huge enterprise-grade systems with lots of VRAM and powerful GPUs.

So with this change, one could test on macOS with llama.cpp as a runtime but deploy to some enterprise server with vllm as a runtime.

vllm does not support macOS today. llama.cpp is the best macOS inferencing runtime today; pretty much all projects running on macOS use llama.cpp as a library.
