
MoE loading time regression #6798

Closed
jart opened this issue Apr 20, 2024 · 2 comments
Comments

jart (Contributor) commented Apr 20, 2024

Three weeks ago, #6387 removed mmap() support for MoE models. As a result, Mixtral 8x7B F16 takes about 30x longer to load on my Threadripper with 5200 MT/s RAM: it used to load in 2 seconds, and now it takes 56 seconds.

[screenshot: timing output]

Can we reconsider this? I would rather have 3D tensor creation be a one-time cost in the conversion script than a cost paid every time the llama.cpp process spawns.
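For context on why removing mmap() support makes such a dramatic difference, here is a minimal sketch (in Python for illustration; llama.cpp itself is C/C++, and the file here is a zero-filled stand-in for a real GGUF model) of why mmap()-based loading returns almost immediately: the kernel maps the file into the address space without copying it, and pages are only faulted in lazily on first access.

```python
import mmap
import os
import tempfile

# Create a dummy "weights" file (a stand-in for a GGUF model; size is illustrative).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * (16 * 1024 * 1024))  # 16 MiB of zeroed "tensor data"
    path = f.name

# mmap-based "load": no read() copy happens here. The call returns almost
# immediately regardless of file size, because pages are faulted in on demand.
fd = os.open(path, os.O_RDONLY)
size = os.fstat(fd).st_size
weights = mmap.mmap(fd, size, access=mmap.ACCESS_READ)

# Tensor data can be sliced directly out of the mapping, with no upfront copy.
first_bytes = weights[:8]
n = len(weights)

weights.close()
os.close(fd)
os.unlink(path)
```

By contrast, an eager load (read the whole file, then reshape or copy tensors at startup) pays the full I/O and memory-copy cost on every process launch, which is the regression described above.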

jart (Contributor, Author) commented Apr 20, 2024

Please disregard. On closer analysis, assuming I'm understanding things correctly, all I need to do is re-convert and re-quantize my weights.

jart closed this as completed Apr 20, 2024
ggerganov (Member) commented:

Yes, simply re-converting and re-quantizing should resolve the issue.
