
Not capping thread count when MoE inference is running on CPU #5419

Merged: ggerganov merged 2 commits into master from moe-cpu-thread-cap on Feb 9, 2024

Conversation

ptsochantaris (Collaborator)

This seems to fix a performance ceiling when running MoE inference on CPU, as described in #5417.
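For anyone skimming: the underlying issue is a thread-count heuristic that also fired for MoE graphs actually running on the CPU. Below is a minimal sketch of the general pattern, with hypothetical names and values; it is not the actual llama.cpp diff (see commit e5ca393 for that):

```cpp
#include <algorithm>
#include <cstdio>

// Hypothetical sketch of the kind of heuristic involved -- names, values,
// and structure are illustrative, not the actual llama.cpp code.
//
// The idea: when the heavy work in a graph is assumed to happen elsewhere
// (e.g. fully offloaded, or handled by a BLAS backend), extra CPU threads
// mostly add synchronization overhead, so the thread count gets capped.
// The bug was that the cap also applied when an MoE model was in fact
// running on the CPU, leaving most cores idle.
static int pick_n_threads(int n_threads_requested,
                          bool offload_heuristic_hit,
                          bool moe_on_cpu) {
    if (offload_heuristic_hit && !moe_on_cpu) {
        // capped path: assume the CPU is not doing the heavy lifting
        return std::min(n_threads_requested, 1); // hypothetical cap
    }
    // fixed behaviour: MoE on CPU keeps the full requested thread count
    return n_threads_requested;
}

int main() {
    // Before the fix this scenario would have been capped to 1 thread;
    // after the fix, all 16 requested threads are used.
    printf("MoE on CPU, 16 threads requested -> %d threads used\n",
           pick_n_threads(16, /*offload_heuristic_hit=*/true,
                          /*moe_on_cpu=*/true));
}
```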

@slaren slaren requested a review from ggerganov February 8, 2024 18:32
@kalomaze (Contributor) commented Feb 8, 2024

Will see if this fixes my Mixtral prompt-processing regression on my Intel i5.

@kalomaze (Contributor) commented Feb 8, 2024

Before: [screenshot: per-core thread usage before the fix]

After: [screenshot: per-core thread usage after the fix]

Thread usage looks a lot more balanced now (as it used to be), and prompt processing for pure-CPU q8_0 Mixtral drops from 125 ms per token to 100 ms per token.

More specifically, I get 9-10 t/s prompt processing instead of 7-8 t/s on pure CPU for q8_0.
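(For reference, the latency and throughput figures above are the same measurement in different units:

$$\frac{1000\ \text{ms/s}}{125\ \text{ms/token}} = 8\ \text{tokens/s} \quad\longrightarrow\quad \frac{1000\ \text{ms/s}}{100\ \text{ms/token}} = 10\ \text{tokens/s}$$

)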

@JohannesGaessler it seems the regression for partial offloading that I mentioned to you was in fact CPU related, and this PR should fix it :)

@ggerganov ggerganov merged commit e5ca393 into master Feb 9, 2024
57 checks passed
@ptsochantaris ptsochantaris deleted the moe-cpu-thread-cap branch February 12, 2024 19:28
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
* Not capping thread count when MoE inference is running on CPU

* Whitespace
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* Not capping thread count when MoE inference is running on CPU

* Whitespace