
Add basic vllm support #97

Merged
rhatdan merged 1 commit into main from vllm on Sep 23, 2024

Conversation

@ericcurtin (Collaborator) commented Aug 29, 2024

This will only work with --nocontainer for now. Requires vllm > 0.6.0.

ericcurtin marked this pull request as draft on August 29, 2024 at 10:30
@ericcurtin (Collaborator, Author) commented Aug 29, 2024

I think we may need to make a breaking change here: "vllm serve" only serves a .gguf file correctly when the file has a .gguf extension, so instead of storing files like this:

~/.local/share/ramalama/models/ollama/granite-code:latest

we might be better off storing them like this:

~/.local/share/ramalama/models/ollama/granite-code:latest.gguf
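
To illustrate the change being proposed (a rough sketch only; the store path below is the one quoted above, everything else is a placeholder and not the actual ramalama layout), only the name of the symlink would change, not the downloaded blob it points at:

$ cd ~/.local/share/ramalama/models/ollama
$ # rename the existing symlink so the .gguf extension is visible to "vllm serve"
$ mv granite-code:latest granite-code:latest.gguf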

@rhatdan (Member) commented Aug 29, 2024

Can we play games with symbolic links? I am not that concerned about breaking changes at this point, since we have not released anything yet.
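
One non-breaking way to play that game (purely illustrative; not what the PR ends up doing) would be to keep the existing name and add a second, extension-suffixed symlink that resolves to the same blob:

$ cd ~/.local/share/ramalama/models/ollama
$ # hypothetical: create an extra .gguf-suffixed name pointing at the same target
$ ln -s "$(readlink granite-code:latest)" granite-code:latest.gguf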

@ericcurtin (Collaborator, Author) commented Aug 29, 2024

The paths above are already symlinks; the cleanest way is to make a breaking change and rename this symlink:

~/.local/share/ramalama/models/ollama/granite-code:latest

to this one:

~/.local/share/ramalama/models/ollama/granite-code:latest.gguf

I also logged a bug with vllm:

vllm-project/vllm#7993

It's supposed to be possible to do this without the extension by using flags like "--quantization gguf" and "--load-format gguf", but that doesn't work. GGUF inference is only about three weeks old in vllm; they were using other formats before.
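
For reference, the extension-less invocation under discussion would look roughly like this (the flags are the ones quoted above; this is a sketch of the path that was reported broken, not a verified command line):

$ vllm serve ~/.local/share/ramalama/models/ollama/granite-code:latest \
    --quantization gguf --load-format gguf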

@ericcurtin (Collaborator, Author) commented Aug 29, 2024

ramalama would solve this problem for many:

vllm-project/vllm#1441

This is one of the most commented issues in vllm. vllm focuses more on enterprise deployments, so it frustrates people that it doesn't support Apple devices.

With ramalama, one could test locally via:

ramalama --runtime llama.cpp serve ...

and, if they want to use vllm on a deployment server, run:

ramalama --runtime vllm serve ...

(or they could just use llama.cpp in both environments, of course).

@ericcurtin (Collaborator, Author)

Some of these changes are bug fixes; I'm going to split those out into a separate PR.

@kannon92

Excited for this! ollama/ollama#3953

There was some interest in this feature in that issue, so we should make sure to highlight it as a feature.

@ericcurtin (Collaborator, Author)

vllm support will probably be completed once there is a new release of vllm that picks up PRs like:

vllm-project/vllm#8056

vllm releases pretty frequently though, so it shouldn't be too far away...

@matbee-eth

> I think we may need to make a breaking change here: "vllm serve" only serves a .gguf file correctly when the file has a .gguf extension, so instead of storing files like this:
>
> ~/.local/share/ramalama/models/ollama/granite-code:latest
>
> we might be better off storing them like this:
>
> ~/.local/share/ramalama/models/ollama/granite-code:latest.gguf

I'm not sure how feasible it is for this project, but a FUSE filesystem for dynamically managing models, instead of symlinks, could be an option.

@ericcurtin (Collaborator, Author) commented Sep 2, 2024

>> I think we may need to make a breaking change here: "vllm serve" only serves a .gguf file correctly when the file has a .gguf extension, so instead of storing files like this:
>>
>> ~/.local/share/ramalama/models/ollama/granite-code:latest
>>
>> we might be better off storing them like this:
>>
>> ~/.local/share/ramalama/models/ollama/granite-code:latest.gguf
>
> I'm not sure how feasible it is for this project, but a FUSE filesystem for dynamically managing models, instead of symlinks, could be an option.

Not sure we want to go there right now. For one, we currently have native macOS compatibility. This particular problem will go away in the next version of vllm, when the PR above gets merged.

But I'd still be interested in hearing a more detailed example/explanation of why a FUSE filesystem would help us here...

Although it would be preferable to have a filesystem-agnostic solution if at all possible...


We were not able to find or create Copr project packit/containers-ramalama-97 specified in the config with the following error:

Packit received HTTP 500 Internal Server Error from Copr Service. Check the Copr status page: https://copr.fedorainfracloud.org/status/stats/, or ask for help in Fedora Build System matrix channel: https://matrix.to/#/#buildsys:fedoraproject.org.

Unless the HTTP status code above is >= 500, please check your configuration for:

  1. typos in the owner and project name (groups need to be prefixed with @)
  2. whether the project name contains disallowed characters (only letters, digits, underscores, dashes and dots may be used)
  3. whether the project itself exists (Packit creates projects only in its own namespace)
  4. whether Packit is allowed to build in your Copr project
  5. whether your Copr project/group is private (it must not be)

ericcurtin changed the title from "Add runtime flag so we can alternatively serve via vllm" to "Add basic vllm support" on Sep 21, 2024
ericcurtin marked this pull request as ready for review on September 21, 2024 at 14:15
@ericcurtin (Collaborator, Author)

This is very basic support for the --nocontainer case, but may unblock other efforts like:

#150

ericcurtin force-pushed the vllm branch 4 times, most recently from e7d33e1 to 39216e7 on September 23, 2024 at 15:39

ramalama/cli.py: review comment (outdated, resolved)

This will only work with --nocontainer for now. Requires vllm > 0.6.0.

Signed-off-by: Eric Curtin <[email protected]>
@rhatdan (Member) commented Sep 23, 2024

LGTM

rhatdan merged commit 8c48bbd into main on Sep 23, 2024
13 checks passed
ericcurtin deleted the vllm branch on September 23, 2024 at 20:19
@ericcurtin (Collaborator, Author)

@kannon92 basic support for this was merged

@cdoern commented Sep 26, 2024

@ericcurtin from the ilab side of things, I have been trying to see if we can get operations like generate working on a Mac using vllm. Could this help with that? It seems like the discussion here is saying that vllm works on the Mac via ramalama?

@ericcurtin (Collaborator, Author) commented Sep 26, 2024

What this PR enables is the option to do:

$ ramalama --runtime vllm serve some-model

or

$ ramalama --runtime llama.cpp serve some-model

llama.cpp and vllm have different primary design goals. llama.cpp was written primarily with commodity-grade systems in mind, like a MacBook, Mac mini, etc. vllm was written with almost the opposite priority: to run on huge enterprise-grade systems with lots of VRAM and powerful GPUs.

So with this change, one could test on macOS with llama.cpp as a runtime but deploy to some enterprise server with vllm as a runtime.

vllm does not support macOS today. llama.cpp is the best macOS inferencing runtime today; pretty much all projects running on macOS use llama.cpp as a library.
