
Not capping thread count when MoE inference is running on CPU #5419

Merged: ggerganov merged 2 commits into master from moe-cpu-thread-cap on Feb 9, 2024

Conversation

ptsochantaris (Collaborator)

This seems to fix a performance ceiling when running MoE inference on CPU, as described in #5417.
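For anyone skimming: the underlying issue is a thread-count heuristic that also fired for MoE graphs actually running on the CPU. Below is a minimal sketch of the general pattern, with hypothetical names and values; it is not the actual llama.cpp diff (see commit e5ca393 for that):

```cpp
#include <algorithm>
#include <cstdio>

// Hypothetical sketch of the kind of heuristic involved -- names, values,
// and structure are illustrative, not the actual llama.cpp code.
//
// The idea: when the heavy work in a graph is assumed to happen elsewhere
// (e.g. fully offloaded, or handled by a BLAS backend), extra CPU threads
// mostly add synchronization overhead, so the thread count gets capped.
// The bug was that the cap also applied when an MoE model was in fact
// running on the CPU, leaving most cores idle.
static int pick_n_threads(int n_threads_requested,
                          bool offload_heuristic_hit,
                          bool moe_on_cpu) {
    if (offload_heuristic_hit && !moe_on_cpu) {
        // capped path: assume the CPU is not doing the heavy lifting
        return std::min(n_threads_requested, 1); // hypothetical cap
    }
    // fixed behaviour: MoE on CPU keeps the full requested thread count
    return n_threads_requested;
}

int main() {
    // Before the fix this scenario would have been capped to 1 thread;
    // after the fix, all 16 requested threads are used.
    printf("MoE on CPU, 16 threads requested -> %d threads used\n",
           pick_n_threads(16, /*offload_heuristic_hit=*/true,
                          /*moe_on_cpu=*/true));
}
```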

@slaren slaren requested a review from ggerganov February 8, 2024 18:32
@kalomaze (Contributor) commented Feb 8, 2024

Will see if this fixes my Mixtral prompt-processing regression on my Intel i5.

@kalomaze (Contributor) commented Feb 8, 2024

Before: [screenshot: per-core thread usage before the fix]

After: [screenshot: per-core thread usage after the fix]

Thread usage looks a lot more balanced now (as it used to be), and prompt processing for pure-CPU q8_0 Mixtral drops from 125 ms per token to 100 ms per token.

More specifically, I get 9-10 t/s prompt processing instead of 7-8 t/s on pure CPU for q8_0.
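(For reference, the latency and throughput figures above are the same measurement in different units:

$$\frac{1000\ \text{ms/s}}{125\ \text{ms/token}} = 8\ \text{tokens/s} \quad\longrightarrow\quad \frac{1000\ \text{ms/s}}{100\ \text{ms/token}} = 10\ \text{tokens/s}$$

)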

@JohannesGaessler it seems the regression for partial offloading that I mentioned to you was in fact CPU related, and this PR should fix it :)

@ggerganov ggerganov merged commit e5ca393 into master Feb 9, 2024
57 checks passed
@ptsochantaris ptsochantaris deleted the moe-cpu-thread-cap branch February 12, 2024 19:28
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
* Not capping thread count when MoE inference is running on CPU

* Whitespace
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* Not capping thread count when MoE inference is running on CPU

* Whitespace