Bump llama-cpp-python to 0.2.23 (NVIDIA & CPU-only, no AMD, no Metal) #4924

Merged: oobabooga merged 1 commit into main from bump-llamacpp-mixtral on Dec 14, 2023
Conversation

@oobabooga (Owner) commented Dec 14, 2023

Adds Mixtral support.

Compiled using GitHub Actions workflows at https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels

The AMD and Metal workflows are failing, so I only have the NVIDIA and CPU wheels for now.

@mjameson commented
Awesome, many thanks!!

@oobabooga oobabooga deleted the bump-llamacpp-mixtral branch December 15, 2023 01:07
@Fastmedic commented
After this update, my token generation speed seems to be about 10x slower on my 3090 when running regular LLaMA models.

Also: https://www.reddit.com/r/Oobabooga/s/XqGCaA1Rtm

@Ph0rk0z (Contributor) commented Dec 15, 2023

It has a problem offloading the KV cache. I forced it on and speed is back to normal, but I don't think most users can do that. All other models will run at half speed.

@oobabooga (Owner, Author) commented

That's a bit of a conundrum, because the previous version does not support Mixtral. @Ph0rk0z, is it necessary to recompile llama-cpp-python to apply this fix, or can it be monkeypatched?

@Ph0rk0z (Contributor) commented Dec 15, 2023

It's not the lib, just the Python files. You can edit them under site-packages, I think. It's not a big patch; it's up in the Reddit thread.
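For readers hitting the slowdown, the workaround described above can be sketched roughly as follows. This is a minimal sketch, assuming llama-cpp-python 0.2.x exposes an `offload_kqv` flag on the `Llama` constructor (the setting that controls whether the KV cache lives on the GPU); the model path is a placeholder:

```python
# Sketch: force KV-cache offload to the GPU when loading a GGUF model
# with llama-cpp-python. Assumes the 0.2.x `Llama` API with `offload_kqv`.
from llama_cpp import Llama

llm = Llama(
    model_path="model.gguf",  # placeholder: path to your GGUF file
    n_gpu_layers=-1,          # offload all transformer layers to the GPU
    offload_kqv=True,         # keep the KV cache on the GPU as well
)
```

If the installed version does not accept the flag, the alternative mentioned in the thread is editing the library's Python files directly under site-packages, per the patch linked in the Reddit thread.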
