GGML philosophy question... #16707
Replies: 1 comment
-
This is mostly true today, but I think with time we can improve. Technically, there are almost no limitations on what you can implement with ggml.
It's not exponential - it is a linear function of the number of operators.
I'm not familiar with Executorch, but I have thought about the "graph export" feature and I don't really see any arguments for it. Writing a graph in code has no disadvantages, and may even have various advantages compared to exporting it to some intermediate format.
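For example, building and running a small forward graph in code looks roughly like this (a minimal sketch assuming the CPU-only compute path; the exact headers and entry points have moved around between ggml versions, and the shapes and thread count here are arbitrary):

```c
#include "ggml.h"
// in newer ggml trees the CPU compute entry point is declared in ggml-cpu.h

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // weights [64 x 32] and an input vector [64] (data left unset for brevity)
    struct ggml_tensor * w = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 32);
    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 64);

    // the "graph" is just this chain of calls: y = relu(w @ x)
    struct ggml_tensor * y = ggml_relu(ctx, ggml_mul_mat(ctx, w, x));

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, y);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 4);

    ggml_free(ctx);
    return 0;
}
```

Since the graph is rebuilt from plain C calls on every run, there is no separate format to export, version, or keep in sync with the runtime.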
I think we have a very good software architecture that can be extended to any hardware and model. I don't have deep enough knowledge of the architecture of other frameworks to comment, but I would not be surprised if our approach has the best ratio of capabilities to complexity. We do need to pay attention to the architecture and the engineering process, and there are certainly many things we can improve and do better. Hopefully, as the project continues to grow and becomes more widely adopted, we will attract good engineers to help us in this regard.
-
Hey, so this is a bit of an odd question, but I am starting a project for edge inference.
I had looked at ggml/llama.cpp in the past, but it always seemed focused on the "happy path": decoder-only models running on common devices and not doing anything exceptionally weird. I want to do things like KV cache manipulation and cross-attention cache handling with encoder/decoder models. Maybe I am missing something, but what bothered me is that you seem to have to hand-write a new runtime inference implementation for each new model you support.
Combined with multiple backends for CPU, MLX, NVIDIA, AMD, and sometimes custom NPUs, you end up with an exponential explosion of combinations and edge cases to support.
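For concreteness, by cache manipulation I mean roughly the per-sequence operations below; a sketch using function names from older llama.cpp C API releases (they have been renamed a couple of times since, so the current llama.h is the reference):

```c
#include "llama.h"

// rewind one sequence and fork another from it, purely as an illustration
static void rewind_and_fork(struct llama_context * ctx) {
    // drop everything past position 128 in sequence 0 (p1 = -1 means "to the end")
    llama_kv_cache_seq_rm(ctx, /*seq_id =*/ 0, /*p0 =*/ 128, /*p1 =*/ -1);

    // copy sequence 0's cache into sequence 1 to branch the decoding
    llama_kv_cache_seq_cp(ctx, /*seq_id_src =*/ 0, /*seq_id_dst =*/ 1,
                          /*p0 =*/ 0, /*p1 =*/ -1);
}
```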
I guess I am comparing this to Executorch, which has an annoyingly complex export process but at the end of the day produces a graph that can be run on a given backend. Maybe I am wrong, but it seems like a cleaner path from an HF server-side model to an edge-deployed model (easier to test, with fewer areas of divergence in behavior).
I am not 100% sure what I am asking, but how maintainable/extensible is the software architecture as new model types come out? Is this something you have thought about in designing the current architecture? (I think I saw some of this referenced with regard to the GGUF model format.) Or am I missing the plot entirely?
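As an aside on GGUF: my understanding is that the weights and metadata live in the .gguf file while the compute graph stays in code. A minimal sketch of just reading the metadata (assuming the gguf_* C API and a placeholder "model.gguf" path; in older ggml trees this API is declared in ggml.h rather than gguf.h):

```c
#include <stdio.h>
#include "ggml.h"
#include "gguf.h"

int main(void) {
    struct ggml_context * meta = NULL;
    struct gguf_init_params params = {
        /*.no_alloc =*/ true,    // read metadata only, do not allocate tensor data
        /*.ctx      =*/ &meta,
    };
    struct gguf_context * gctx = gguf_init_from_file("model.gguf", params);
    if (!gctx) {
        return 1;
    }
    printf("kv pairs: %lld, tensors: %lld\n",
           (long long) gguf_get_n_kv(gctx),
           (long long) gguf_get_n_tensors(gctx));
    gguf_free(gctx);
    ggml_free(meta);
    return 0;
}
```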