compute types per platform #16730
okuvshynov asked this question in Q&A (unanswered, 0 replies)
If I take a model in GGUF format (say, glm-air-q8), use the same type for the KV cache (f16), and run it:
Will the types used for compute and activations be the same? Will the weights and the cache first be de-quantized to a common type (f16/f32/whatever), and will the same types be used for the operations themselves (e.g. operate on f16, accumulate into f32)?
Or does this depend on the platform and the kernel implementation for that platform? If it differs, what's the right place (pointers to code?) to learn about these differences?
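To make the question concrete, here is a small sketch of the mixed-precision pattern being asked about ("operate on f16, accumulate into f32"). This is illustrative only, not llama.cpp code; it simulates f16/f32 rounding with Python's `struct` module:

```python
import struct

def to_f16(x: float) -> float:
    """Round a Python float to the nearest IEEE-754 half-precision (f16) value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def to_f32(x: float) -> float:
    """Round a Python float to the nearest IEEE-754 single-precision (f32) value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

# Hypothetical weights after de-quantization to f16, and f16 activations.
w = [to_f16(v) for v in (0.1, 0.2, 0.3)]
x = [to_f16(v) for v in (1.0, 2.0, 3.0)]

# Dot product: each multiply is rounded to f16, but the running sum
# is kept in the wider f32 format, which limits accumulation error.
acc = 0.0
for a, b in zip(w, x):
    acc = to_f32(acc + to_f16(a * b))

print(acc)  # close to the exact value 1.4, up to f16 rounding of the inputs
```

Whether a real backend follows this exact pattern (vs. de-quantizing to f32, or accumulating in f16) is precisely what the question is about.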
Thank you!