
MoE loading time regression #6798

Closed
jart opened this issue Apr 20, 2024 · 2 comments
Comments

jart (Contributor) commented Apr 20, 2024

Three weeks ago, #6387 removed mmap() support for MoE models. As a result, Mixtral 8x7B F16 takes about 30x longer to load on my Threadripper with 5200 MT/s RAM: it used to load in 2 seconds, and now it takes 56 seconds.

[screenshot: timing output]

Can we reconsider this? I would rather have 3D tensor creation be a one-time cost in the conversion script than a cost paid every time the llama.cpp process spawns.
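For context on why removing mmap() support makes such a dramatic difference, here is a minimal sketch (in Python for illustration; llama.cpp itself is C/C++, and the file here is a zero-filled stand-in for a real GGUF model) of why mmap()-based loading returns almost immediately: the kernel maps the file into the address space without copying it, and pages are only faulted in lazily on first access.

```python
import mmap
import os
import tempfile

# Create a dummy "weights" file (a stand-in for a GGUF model; size is illustrative).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * (16 * 1024 * 1024))  # 16 MiB of zeroed "tensor data"
    path = f.name

# mmap-based "load": no read() copy happens here. The call returns almost
# immediately regardless of file size, because pages are faulted in on demand.
fd = os.open(path, os.O_RDONLY)
size = os.fstat(fd).st_size
weights = mmap.mmap(fd, size, access=mmap.ACCESS_READ)

# Tensor data can be sliced directly out of the mapping, with no upfront copy.
first_bytes = weights[:8]
n = len(weights)

weights.close()
os.close(fd)
os.unlink(path)
```

By contrast, an eager load (read the whole file, then reshape or copy tensors at startup) pays the full I/O and memory-copy cost on every process launch, which is the regression described above.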

jart (Contributor, Author) commented Apr 20, 2024

Please disregard. On closer analysis, assuming I'm understanding things correctly, all I need to do is re-convert and re-quantize my weights.

jart closed this as completed Apr 20, 2024
ggerganov (Member) commented:

Yes, simply re-converting and re-quantizing should resolve the issue.
