Load lower quant to VRAM and higher to RAM as a reference, for performance #10287
AncientMystic started this conversation in Ideas
Would it be possible to load two quantisations of the same model: a low quant such as Q1-Q4 (possibly even something like q0.5 for extremely large models?) into VRAM, and a high quant such as Q5-Q8 or even F16 into RAM?

Possibly even a setting that loads a model into RAM and then creates a reduced copy of it in VRAM.

The low-quant (or even extremely-low-quant) model, which would never really be useful on its own, could then run in VRAM for the initial processing, with the higher-quant model in RAM referenced afterwards to refine the tokens: restoring anything missing and bringing their quality back to, or near, the original before they go into the generated response.
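What's described here is close to speculative decoding: a cheap draft model proposes tokens and an expensive model verifies them, keeping the longest agreeing prefix. A minimal sketch of that accept/reject loop, with hypothetical stand-in functions in place of the low-quant (VRAM) and high-quant (RAM) models, and greedy acceptance assumed:

```python
import random

VOCAB = list(range(100))

def draft_next(ctx):
    # Stand-in for the low-quant model in VRAM: fast, sometimes wrong.
    return random.choice(VOCAB)

def verify_next(ctx):
    # Stand-in for the high-quant model in RAM: slow, authoritative.
    return random.choice(VOCAB)

def speculative_step(ctx, n_draft=4):
    """Draft n_draft tokens cheaply, then check them against the expensive
    model; keep the longest agreeing prefix plus one corrected token."""
    drafted, c = [], list(ctx)
    for _ in range(n_draft):
        t = draft_next(c)
        drafted.append(t)
        c.append(t)
    accepted, c = [], list(ctx)
    for t in drafted:
        v = verify_next(c)
        if v == t:              # verifier agrees: drafted token kept as-is
            accepted.append(t)
            c.append(t)
        else:                   # disagreement: take the verifier's token, stop
            accepted.append(v)
            break
    return accepted
```

In a real implementation the verifier scores the whole drafted run in one batched forward pass, which is where the speed-up comes from. llama.cpp's `llama-speculative` example already works along these lines, and the draft and target model there can be given different quantisations and different offload settings.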
This could be a way to run the model fast on the GPU, cherry-pick from its output, and then go back and refine the quality of the data. The aim would be a response at least somewhat similar to what the model in RAM would produce, but with a much lower VRAM impact and a lower burden on RAM throughput, since the GPU would be doing all the hard work. That would make it possible to run much larger models without the obvious quality loss that aggressive quantisation normally brings.

It would increase overall RAM usage, since nearly every model used would effectively be loaded into both VRAM and RAM, but it might still benefit performance: it is simply a way to use the absolute minimum quantisation on the GPU without actually sacrificing quite as much quality.
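For a sense of that footprint, a rough back-of-the-envelope calculation, assuming approximate bits-per-weight figures for some common llama.cpp quant types and a hypothetical 70B-parameter model (both the figures and the model size are illustrative):

```python
# Approximate bits-per-weight for some llama.cpp quant types (rough figures).
BPW = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q8_0": 8.5, "F16": 16.0}

params = 70e9  # hypothetical 70B-parameter model
for name, bpw in BPW.items():
    gib = params * bpw / 8 / 2**30
    print(f"{name}: ~{gib:.0f} GiB")

# Output: Q2_K ~21 GiB, Q4_K_M ~39 GiB, Q8_0 ~69 GiB, F16 ~130 GiB.
# So a Q2_K draft in VRAM plus a Q8_0 reference in RAM is ~90 GiB total,
# versus ~69 GiB for running the Q8_0 model alone.
```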
A way to have the best of both worlds, if you have enough RAM.

(Assuming you wouldn't have to reprocess the entire model a second time in RAM just to reference specific tokens.)