I am not sure if we can implement this change while maintaining compatibility with existing models without breaking mmap, since we need to modify the layout of the tensors. I think that maintaining backwards compatibility with models with split experts is important, we should not ask people to re-download 50GB models, but we may have to disable mmap with old models.
Currently, we store separate tensors for each expert:
https://github.com/ggerganov/llama.cpp/blob/3020327f6cd6d2ce50528dd65f4b199d2ea8b1ae/ggml.c#L4442-L4455
This leads to a large number of possible "source" tensors for the `_id` ops, which significantly increases the size of `struct ggml_tensor` on the stack:
https://github.com/ggerganov/llama.cpp/blob/3020327f6cd6d2ce50528dd65f4b199d2ea8b1ae/ggml.h#L573-L576
Additionally, the Metal implementation is currently hacked to support up to 8 experts, and extending it beyond that is not entirely obvious:
https://github.com/ggerganov/llama.cpp/blob/3020327f6cd6d2ce50528dd65f4b199d2ea8b1ae/ggml-metal.m#L1750-L1759
We should improve this. One possible way is to store the data for all experts in a single tensor and address it with appropriate offsets.