llama: automatically set runtime parameters such as --n-gpu-layers to fit VRAM #14067
Conversation
Looking forward to this. I've been setting this to 999 when I wanted GPU acceleration, and users have complained it's too high.
Since the VRAM required to run a model depends mainly on 1) model size and 2) allocated context, please consider adding the following flags:
Maybe the following flags could also be useful; I'm not sure.
In the past I've been using https://github.com/3Simplex/Llama.Cpp-Toolbox, which features automatic determination of context and layers, with an Nvidia GeForce 1060 3GB, but the solution there is imperfect, so I've had my fair share of pondering how to get this right. At the time of writing I am not proficient at C++, so please excuse me; all of these problems can likely be solved in a better way.
My intent is to make the targeted VRAM margin and the minimum context size configurable (and to only adjust runtime parameters not explicitly set by the user). That should cover most use cases. In my opinion the logic for optimizing runtime parameters should be kept simple, since it's not feasible to cover all possible use cases and hardware setups anyway. If someone wants to squeeze out the last few percent of performance for their setup, they should determine the optimal parameters manually and save them somewhere.
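A minimal sketch of the "only touch what the user left unset" policy, assuming a hypothetical option struct (`cli_options`, `apply_auto_params`, and the field names are illustrative, not the PR's actual code):

```cpp
#include <cstdint>
#include <optional>

// Hypothetical option struct: std::nullopt means "not set by the user".
struct cli_options {
    std::optional<int32_t> n_gpu_layers;
    std::optional<int32_t> n_ctx;
};

// Fill in only the values the user left unset; explicit user choices are never overridden.
void apply_auto_params(cli_options & opts, int32_t auto_ngl, int32_t auto_ctx) {
    if (!opts.n_gpu_layers) {
        opts.n_gpu_layers = auto_ngl;
    }
    if (!opts.n_ctx) {
        opts.n_ctx = auto_ctx;
    }
}
```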
Will this logic be backend-agnostic? Is it possible that different backends would require different amounts of VRAM (e.g. Vulkan vs. CUDA) even with the exact same layers and generation params?
It will work for all GPU backends, but the required VRAM margin will not be the same. One problem is that right now, for example, the CUDA backend allocates temporary buffers for data conversion that are not part of the compute graph and are therefore not considered for the expected VRAM use. Long-term I think we should move these conversions to the compute graph anyway, since that would also have the benefit of lower memory use and reusability when the compute graph splits.
(I have not tried or checked the code)
For multiple GPUs and …
A somewhat related issue is being explored here: sometimes users set -ngl 999 to enable acceleration and -ngl 0 to disable it, very boolean, kind of a big-hammer approach. It is also being claimed that ggml-cpu (no Vulkan built in) performs significantly faster than ggml-vulkan with -ngl 0; is this correct? (I think I assumed in the past they should be the same without looking into the details.) I also wonder whether we should auto-select -ngl 0 by default when a CPU device is detected in Vulkan:
I was worried about this, but it's only supposed to affect things that aren't explicitly set, and using -ngl 99 -ot exps=CPU creates an expected, desirable effect, so if this somehow broke that, it would be a bug to fix.
Would --auto-max-context default to the model config's full context size? |
My idea was letting the user set a value there. Even if the model supports 32768 context, if the user sets it to 8192, then VRAM should first be filled to accommodate as many layers of the model as possible, then filled up to 8192 context, and then stop there. If there is not enough space left in VRAM to fill up to 8192, then go as high as possible and max out VRAM, but not context. Since users don't know how many layers of the model will fit into VRAM for a given context, allocating the layers automatically is nice. The number and thickness of layers differ between models, and so does the VRAM required for the context (e.g. flash attention requires less VRAM than the default llama.cpp setting), so IMHO both context and layers are important variables to account for, and whichever of the two is maxed out first determines the value of the other when VRAM is limited. The order is important here. I am also operating under the assumption that automatically set runtime parameters are imperfect and will max out VRAM in cases when that is not desired, so I argue for a tiny "empty" margin to prevent fully maxing out VRAM, hence the …
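A sketch of that fill order, under stated assumptions: a hypothetical `estimate_vram_bytes(layers, ctx)` helper stands in for whatever dry-run estimate is available, and none of the names below come from the PR. Layers are maxed out first at the minimum context, then the context grows toward the user's requested maximum, always keeping a free margin.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>

struct fit_result {
    int32_t n_gpu_layers;
    int32_t n_ctx;
};

// Greedy fit: maximize offloaded layers at the minimum context first,
// then grow the context toward ctx_max, keeping margin_bytes of VRAM free.
fit_result fit_layers_then_context(
        const std::function<int64_t(int32_t layers, int32_t ctx)> & estimate_vram_bytes,
        int64_t free_vram_bytes, int64_t margin_bytes,
        int32_t n_layers_total, int32_t ctx_min, int32_t ctx_max) {
    const int64_t budget = free_vram_bytes - margin_bytes;

    int32_t layers = 0;
    while (layers < n_layers_total && estimate_vram_bytes(layers + 1, ctx_min) <= budget) {
        layers++;
    }

    int32_t ctx = ctx_min;
    while (ctx < ctx_max) {
        const int32_t next = std::min(ctx + 512, ctx_max); // grow in 512-token steps
        if (estimate_vram_bytes(layers, next) > budget) {
            break;
        }
        ctx = next;
    }

    return {layers, ctx};
}
```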
You have a cool idea, but you're trying to juggle too many variables when designing the user experience and it's turning into an inconsistent mess. Set sane defaults assuming the user doesn't even know the switches exist, then let people fine-tune from there. (Though at some point, what's the effort difference between what you're making and just manually setting the context and testing VRAM use? Now we have to set two or three different context numbers to 'shoot the gap' and have the dry runner figure out the difference.) Also, shouldn't there just be a way to estimate cache use mathematically? I know it'll vary if you use flash attention or SWA or such, but theoretically there should be no reason to need to estimate to that degree. In exllamav3 for instance (ref), you can get in the ballpark with
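For reference, a back-of-the-envelope estimate for a plain f16 KV cache (no SWA, no cache quantization); this is a rough sketch of the arithmetic, not llama.cpp's exact accounting:

```cpp
#include <cstdint>
#include <cstdio>

// Rough KV cache size for an unquantized f16 cache without SWA:
// per layer, K and V each store n_ctx * n_head_kv * head_dim elements.
int64_t kv_cache_bytes(int64_t n_layer, int64_t n_ctx, int64_t n_head_kv, int64_t head_dim,
                       int64_t bytes_per_element /* 2 for f16 */) {
    return 2 /* K and V */ * n_layer * n_ctx * n_head_kv * head_dim * bytes_per_element;
}

int main() {
    // Example: a Llama-3-8B-like config (32 layers, 8 KV heads, head_dim 128) at 8192 context.
    const double gib = 1024.0 * 1024.0 * 1024.0;
    printf("%.2f GiB\n", kv_cache_bytes(32, 8192, 8, 128, 2) / gib);
    // ~1 GiB here; the full budget also includes weights, compute buffers, and backend overhead.
    return 0;
}
```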
Just so there is no misunderstanding: @ThiloteE is not writing any of the code in this PR; he merely made suggestions regarding the interface. I am the one working on this feature, and my response was:
Tested your branch and it seems to work. Any chance for a switch to manually define the VRAM target in case auto-detection gets the wrong value? (Example: mobile Nvidia GPUs sometimes under-report their VRAM on Windows, despite that VRAM being usable just fine when set manually, even with swap-mem forced off.)
I intend to make the targeted VRAM margin configurable (one value per GPU). If a signed integer is used and you specify a negative value, the code will try to allocate more memory per GPU than is assumed to be available.
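As a rough illustration of how a signed per-GPU margin could feed into a layer-fitting loop (everything below, including the `expected_use_fn` callback and struct names, is a hypothetical sketch and not this PR's actual API):

```cpp
#include <cstdint>
#include <vector>

struct dev_mem {
    int64_t free_bytes;   // free memory reported for the device
    int64_t margin_bytes; // signed margin; negative values target more than is reported free
};

// Hypothetical callback: expected per-device memory use for a given number of
// GPU layers, obtained via a dry run (no real allocations).
using expected_use_fn = std::vector<int64_t> (*)(int32_t n_gpu_layers);

// Decrease the number of GPU layers until the expected use fits on every device.
int32_t fit_n_gpu_layers(expected_use_fn expected_use, const std::vector<dev_mem> & devs,
                         int32_t n_layers_max) {
    for (int32_t ngl = n_layers_max; ngl > 0; ngl--) {
        const std::vector<int64_t> use = expected_use(ngl);
        bool fits = true;
        for (size_t i = 0; i < devs.size(); i++) {
            if (use[i] > devs[i].free_bytes - devs[i].margin_bytes) {
                fits = false;
                break;
            }
        }
        if (fits) {
            return ngl;
        }
    }
    return 0; // nothing fits; run fully on CPU
}
```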
Yes, I just made a few proposals and I think it was good having them discussed and evaluated. Such is the beauty of open source. Thank you @JohannesGaessler for working on this. All credit to you.
See #13860.
This PR aims to add code for setting runtime parameters such as the number of GPU layers automatically given the available memory. As of right now only the number of GPU layers is being adjusted and this is done unconditionally. Implementation details:
- A new function `llama_expected_memory_use` retrieves the expected memory use per device without doing any actual allocations. A function `common_fit_to_free_memory` in `common.cpp` then repeatedly tries configurations of runtime parameters until the allocation would fit in free memory (plus some margin). I think the code for determining the optimal parameters should work in such a way that only parameters that the user does not explicitly set are modified. To me the most natural way to do this would be in `common.cpp`, though it could also be done in `llama.cpp`.
- `llama_model` and `llama_context` are extended with a flag `dry_run` which, when set, prevents the allocation of memory during initialization. `dry_run` cannot be set by user code.
- `llama_model` has been extended with a method `total_size` that returns the size of the weights.
- `llama_context` has been extended with a method `total_size` that internally calls the same method on the memory, thus returning the size of the KV cache.
- A new function `ggml_backend_alloc_ctx_tensors_from_buft_size` returns the amount of memory that would be needed for a call to `ggml_backend_alloc_ctx_tensors_from_buft`. Both functions internally use the same code, but a new flag `dry_run` controls whether the memory is actually allocated.
- Because the `dry_run` flag in `llama_model` and `llama_context` results in the creation of dummy backend buffers with size 0, `ggml_backend_buffer_get_size` cannot be used to retrieve the expected memory use in `total_size`. Instead `ggml_backend_alloc_ctx_tensors_from_buft_size` is used. This makes the corresponding methods for the memory awkward: right now they retrieve the expected memory use of the KV cache even if actual, physical buffers have been allocated. I'm not sure what the best course of action here is; maybe use the expected size with `dry_run` and the actually allocated size without `dry_run`, and assert consistency in debug mode?
- `llama_context` has a new vector `backends_exp_max_size` to store the expected max. memory use given the worst-case compute graphs that are already being pre-allocated on master. If `dry_run` is set, a new function `ggml_backend_sched_reserve_size` is used to retrieve the expected allocation size of the scheduler instead of an actual call to `ggml_backend_sched_reserve`. Independently of the automatic determination of runtime parameters, I think it would be useful to track the max. expected memory use of the scheduler and to assert that it was not exceeded in the destructor of `llama_context`.
- `ggml_backend_sched_reserve_size` calls `ggml_gallocr_reserve_n` with a new flag `dry_run` and afterwards calls a new function `ggml_gallocr_get_max_size` to retrieve the max. sizes of the internally stored `ggml_dyn_tallocr`s.
- The expected memory use is filtered by a `ggml_backend_dev_t`. I'm not sure whether I should be filtering by `ggml_backend_buffer_type_t` instead.
- `llama_context::backends` vs. `llama_context::backend_ptrs`.
- The existing flag `vocab_only` does very similar things to `dry_run`. Maybe we should unify the logic using an enum?

For the finished PR I envision the following behavior: