@JohannesGaessler
Collaborator

See #13860.

This PR aims to add code for setting runtime parameters such as the number of GPU layers automatically given the available memory. As of right now only the number of GPU layers is being adjusted and this is done unconditionally. Implementation details:

  • The llama.cpp API is extended with a function llama_expected_memory_use to retrieve the expected memory use per device without doing any actual allocations. A function common_fit_to_free_memory in common.cpp then repeatedly tries configurations of runtime parameters until the allocation would fit in free memory (plus some margin). I think the code for determining the optimal parameters should work in such a way that only parameters that the user does not explicitly set are modified. To me the most natural way to do this would be in common.cpp though it could also be done in llama.cpp.
  • The constructors for llama_model and llama_context are extended with a flag dry_run which, when set, prevents the allocation of memory during initialization. dry_run cannot be set by user code.
  • llama_model has been extended with a method total_size that returns the size of the weights.
  • llama_context has been extended with a method total_size that internally calls the same method on the memory, thus returning the size of the KV cache.
  • Added a new function ggml_backend_alloc_ctx_tensors_from_buft_size which returns the amount of memory that would be needed for a call to ggml_backend_alloc_ctx_tensors_from_buft. Both functions internally use the same code but a new flag dry_run controls whether the memory is actually being allocated.
  • Because the dry_run flag in llama_model and llama_context results in the creation of dummy backend buffers with size 0, ggml_backend_buffer_get_size cannot be used to retrieve the expected memory use in total_size. Instead ggml_backend_alloc_ctx_tensors_from_buft_size is used. This makes the corresponding methods for the memory awkward: right now they retrieve the expected memory use of the KV cache even if actual, physical buffers have been allocated. I'm not sure what the best course of action here is; maybe use the expected size with dry_run and the actually allocated size without dry_run and assert consistency in debug mode?
  • llama_context has a new vector backends_exp_max_size to store the expected max. memory use given the worst-case compute graphs that are already being pre-allocated on master. If dry_run is set a new function ggml_backend_sched_reserve_size is used to retrieve the expected allocation size of the scheduler instead of an actual call to ggml_backend_sched_reserve. Independently of the automatic determination of runtime parameters, I think it would be useful to track the max. expected memory use of the scheduler and to assert that it was not exceeded in the destructor of llama_context.
  • ggml_backend_sched_reserve_size calls ggml_gallocr_reserve_n with a new flag dry_run and afterwards calls a new function ggml_gallocr_get_max_size to retrieve the max. sizes of the internally stored ggml_dyn_tallocrs.
  • The functions/methods to retrieve the expected memory use generally take an argument of type ggml_backend_dev_t to filter the memory use by. I'm not sure whether I should be filtering by ggml_backend_buffer_type_t instead.
  • I'm not sure about the correct use of llama_context::backends vs. llama_context::backend_ptrs.
  • I'm not sure about the naming of the new functionality in general (e.g. "dry_run" vs. "dry_alloc" or whether the same name with "_size" appended is intuitive).
  • After I did the implementation I noticed that there is a flag vocab_only which does very similar things to dry_run. Maybe we should unify the logic using an enum?
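
The dry-run idea from the list above can be sketched as follows. This is a toy illustration under stated assumptions, not the code in this PR: toy_buffer, pad_to, toy_alloc_tensors and toy_alloc_tensors_size are hypothetical names, and std::malloc stands in for a real backend allocation. The point is only that one shared code path computes the size, and a flag decides whether memory is actually touched.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for a backend buffer; this is not the ggml type.
struct toy_buffer {
    void * data = nullptr;
    size_t size = 0;
};

// Round n up to a multiple of align (align must be a power of two).
static size_t pad_to(size_t n, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

// Shared code path: always computes the required size, but only
// allocates memory when dry_run is false.
static size_t toy_alloc_tensors(const std::vector<size_t> & tensor_sizes,
                                size_t align, bool dry_run, toy_buffer * out) {
    size_t total = 0;
    for (size_t s : tensor_sizes) {
        total += pad_to(s, align);
    }
    if (!dry_run && out != nullptr) {
        out->data = std::malloc(total);
        out->size = out->data ? total : 0;
    }
    return total;
}

// Size-only wrapper, analogous in spirit to a "_size" variant of the allocator.
static size_t toy_alloc_tensors_size(const std::vector<size_t> & tensor_sizes, size_t align) {
    return toy_alloc_tensors(tensor_sizes, align, /*dry_run=*/true, nullptr);
}
```

Because both entry points share the size computation, the dry-run estimate cannot drift from what a real allocation would request, which is the property the fitting logic relies on.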

For the finished PR I envision the following behavior:

  • Check whether the model with max. GPU layers fits in VRAM, simply return if yes.
  • Try cutting down the context to leave room for at least 32k or the length of the prompt, whichever is higher. Keep the reduced context size.
  • Try reducing the physical batch size. Reset to original batch size if unsuccessful.
  • For a non-MoE model, try reducing the number of GPU layers.
  • For a MoE model, try only dense weights in VRAM. If successful, try adding MoE weights to VRAM. If not, try reducing dense weights in VRAM.
  • For a multi GPU setup the above logic should also work by employing standard numerical optimization techniques. It will require additional test allocations but those seem to be sufficiently fast.
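
A reduced version of the envisioned order (check the full configuration, then shrink the context, then drop GPU layers) might look like the sketch below. Everything here is an assumption for illustration: fit_params, mem_estimator and fit_to_free_memory are hypothetical names, the memory estimate is supplied as a callback rather than by the real API, the context is halved as an arbitrary step rule, and the batch-size, MoE and multi-GPU steps are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical subset of the runtime parameters being tuned.
struct fit_params {
    int     n_gpu_layers;
    int64_t n_ctx;
};

// Callback returning the expected device memory use in bytes for a configuration.
// In the PR this role would be played by a dry-run estimate; here it is an assumption.
using mem_estimator = std::function<int64_t(const fit_params &)>;

static fit_params fit_to_free_memory(fit_params p, int64_t free_bytes, int64_t margin_bytes,
                                     int64_t n_ctx_min, const mem_estimator & expected_use) {
    assert(n_ctx_min > 0);
    const int64_t budget = free_bytes - margin_bytes;

    // Step 1: if the requested configuration fits, change nothing.
    if (expected_use(p) <= budget) {
        return p;
    }

    // Step 2: halve the context, never going below n_ctx_min; keep the reduced value.
    while (p.n_ctx > n_ctx_min && expected_use(p) > budget) {
        p.n_ctx = std::max(n_ctx_min, p.n_ctx / 2);
    }

    // Step 3: drop GPU layers one at a time until the estimate fits.
    while (p.n_gpu_layers > 0 && expected_use(p) > budget) {
        p.n_gpu_layers--;
    }
    return p;
}
```

Each loop iteration costs one estimate, so the approach is only practical if the dry-run estimate is cheap, as the description above suggests it is.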

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jun 8, 2025

@ericcurtin
Collaborator

Looking forward to this, I've been setting this as 999 when I wanted GPU acceleration and users have complained it's too high.

@ThiloteE
Contributor

ThiloteE commented Jun 9, 2025

Since the VRAM required to run the model mainly depends on 1) model size and 2) allocated context, please consider adding the following flags:

  • --auto-max-context: The maximum context to be considered for automatic fitting/mapping to VRAM. Reason: Just because one CAN automatically allocate 1 million context, it doesn't mean one always WANTS to automatically allocate 1 million context if it causes a drop in tokens/second or if VRAM is needed for other processes on the user's system.
  • --auto-min-context: The minimum context to be considered for automatic fitting/mapping to VRAM. Reason: If the model does not fit completely into VRAM, a tradeoff is required. Either decrease context or decrease --n-gpu-layers. --auto-min-context would ensure that a minimum amount of context is accounted for and if push comes to shove, the automatically determined number of gpu layers will have to drop.
  • --auto-shave-off-n-gpu-layers: Shave off this number of layers from the automatically determined number of layers. Reason: If the model just barely fits into VRAM, stopping the layers from maxing out VRAM can prevent complications with other apps and avoid performance drops that stem from imprecise calculations and rough estimations. Basically, this option allows the user to automatically max out the number of layers while still retaining a certain amount of VRAM that is free to be allocated to other processes.
  • --auto-shave-off-max-context: Shave off this amount of context from the automatically determined size of context. Reason: If the model just barely fits into VRAM, stopping the context from maxing out VRAM can prevent complications with other apps and avoid performance drops that stem from imprecise calculations and rough estimations. Basically, this option allows the user to automatically max out the number of layers while still retaining a certain amount of VRAM that is free to be allocated to other processes.

Maybe even the following flags can be useful. Not sure.

  • --auto-min-n-gpu-layers
  • --auto-max-n-gpu-layers

In the past, I've been using https://github.com/3Simplex/Llama.Cpp-Toolbox, which features automatic determination of context and layers, with an NVIDIA GeForce 1060 3GB, but the solution there is imperfect, so I've had my fair share of pondering how to get this right. At the time of writing, I am not proficient at coding in C++, so please excuse me if all these problems can be solved in a better way.

@JohannesGaessler
Collaborator Author

My intent is to make the targeted VRAM margin and the min. context size configurable (and to only adjust runtime parameters not explicitly set by the user). That should cover most use cases. It is my opinion that the logic for optimizing runtime parameters should be kept simple since it's not feasible to cover all possible use cases and hardware setups anyways. If someone wants to squeeze out the last few % of performance for their setup they should determine the optimal parameters manually and save them somewhere.

@LostRuins
Collaborator

Will this logic be backend-agnostic? Is it possible that different backends would require different amounts of VRAM (e.g. Vulkan vs. CUDA) even with the exact same layers and generation params?

@JohannesGaessler
Collaborator Author

It will work for all GPU backends, but the required VRAM margin will not be the same. One problem is that right now the CUDA backend, for example, allocates temporary buffers for data conversion that are not part of the compute graph and are therefore not considered in the expected VRAM use. Long-term I think we should move these conversions to the compute graph anyway, since that would also have the benefit of lower memory use and reusability when the compute graph splits.

@jacekpoplawski
Contributor

jacekpoplawski commented Jun 9, 2025

(I have not tried or checked the code)
I use 2x3090+2x3060 and I discovered that to fit a model into memory I often must also modify -ts in unexpected ways.
When I use large MoE (like Llama 4 Scout or Qwen 3 235B) I am able to use full -ngl with -ot trick.
Are these scenarios affected somehow?

@JohannesGaessler
Collaborator Author

For multiple GPUs and --split-mode layer I intend to set --override-tensor in such a way that --n-gpu-layers is effectively being set on a per-GPU basis. For --split-mode row I'll implement support after backend-agnostic tensor parallelism has been implemented.

@ericcurtin
Collaborator

ericcurtin commented Jun 9, 2025

A somewhat related issue is being explored here:

containers/ramalama#1479

Sometimes users set -ngl 999 to enable acceleration and -ngl 0 to disable it, which is very boolean. Kind of a big-hammer approach.

It is also being claimed that ggml-cpu (no Vulkan built in) performs significantly faster than ggml-vulkan with -ngl 0. Is this correct? (I think I assumed in the past that they should be the same, without looking into the details.)

I also wonder whether we should by default auto-select -ngl 0 when a CPU is detected in Vulkan:

Warning: Device type is CPU. This is probably not the device you want. 

@Rotatingxenomorph

Rotatingxenomorph commented Jun 9, 2025

When I use large MoE (like Llama 4 Scout or Qwen 3 235B) I am able to use full -ngl with -ot trick. Are these scenarios affected somehow?

I was worried about this but it's only supposed to affect things that aren't explicitly set, and using -ngl 99 -ot exps=CPU is creating an expected desirable effect so if this broke it somehow it would be a bug to fix.

@Midaychi

Would --auto-max-context default to the model config's full context size?

@ThiloteE
Contributor

ThiloteE commented Jun 10, 2025

Would --auto-max-context default to the model config's full context size?

My idea was to let the user set a value there. Even if the model supports 32768 context, if the user sets it to 8192, then allocated VRAM should first be maxed out to accommodate as many layers of the model as possible, then max out to 8192 context, and then stop there. If there is not enough space left in VRAM to fill up to 8192, then go as high as possible and max out VRAM, but not context.
In other words: between --auto-min-context and --auto-max-context, first allocate the --auto-min-context, then automatically adjust --n-gpu-layers, and once that is maxed out, go as high for context as possible (OPTIONAL: but not beyond the value of --auto-max-context). Then shave off some context from the automatically determined value for context with --auto-shave-off-max-context. If I understand correctly, the proposed VRAM "margin" is an alternative to --auto-shave-off-max-context.

Since users don't know how many layers of the model will fit into VRAM for a given context, allocating the layers automatically is nice. Number and thickness of layers differs between models. So does the VRAM required for the context (e.g. flash attention does require less VRAM than the default llama.cpp setting), so IMHO, both context and layers are important variables to account for and whatever is maxed out first of the two will determine the values for the other variable in case of limited VRAM. The order is important here.

I am also operating under the assumption that automatically set runtime parameters are imperfect and max out VRAM in cases where that is not desired, so I argue for a tiny "empty" margin to prevent fully maxing out VRAM, hence the --auto-shave-off-max-context or --auto-max-context flags. I've had the case of llama.cpp crashing because I maxed out VRAM fully. Just offloading one layer less or allocating a little less context avoided the crashes. It's been a few months since, and in the meantime I haven't encountered those crashes anymore, but the argument stands. Automatic context and layer allocation comes in handy when you have multiple models at your disposal that all differ slightly in size or are trained on various context lengths, so I am thankful for your effort in this PR.

@Midaychi

Midaychi commented Jun 11, 2025

You have a cool idea but you're trying to juggle too many variables when designing the user experience and it's turning into an inconsistent mess.

Set sane defaults assuming the user doesn't even know the switches exist, then let people fine-tune it from there. (Though at some point, what's the difference in effort between what you're making and just manually setting context and testing VRAM use? Now we have to set two or three different context numbers to 'shoot the gap' and have the dry runner figure out the difference.)

Also, shouldn't there just be a way to estimate cache use mathematically? I know it'll vary if you use flash attention or SWA or such, but theoretically there should be no reason to need to estimate to that degree.

In exllamav3, for instance (ref), you can get in the ballpark with num_hidden_layers * max_num_tokens * num_key_value_heads * head_dim * (k_bits + v_bits + 1) / 8 = vram
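
As a worked instance of that formula, here is a small sketch. The function name kv_cache_bytes_estimate is hypothetical, the result is assumed to be in bytes (bits divided by 8), and the example model shape in the note below is made up for illustration rather than taken from a real model.

```cpp
#include <cstdint>

// Ballpark KV-cache size, following the formula quoted above term by term.
// k_bits / v_bits are the bits stored per cached key / value element.
// Assumption: dividing the bit count by 8 yields bytes.
constexpr int64_t kv_cache_bytes_estimate(int64_t num_hidden_layers, int64_t max_num_tokens,
                                          int64_t num_key_value_heads, int64_t head_dim,
                                          int64_t k_bits, int64_t v_bits) {
    return num_hidden_layers * max_num_tokens * num_key_value_heads * head_dim
         * (k_bits + v_bits + 1) / 8;
}
```

For a hypothetical shape of 32 layers, 8192 tokens, 8 KV heads, head_dim 128 and 8-bit keys and values, this gives 32 * 8192 * 8 * 128 * 17 / 8 = 570425344 bytes, i.e. 544 MiB. It does not account for compute buffers or backend-specific temporary allocations.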

@JohannesGaessler
Collaborator Author

Just so there is no misunderstanding: @ThiloteE is not writing any of the code in this PR, he merely made suggestions regarding the interface. I am the one working on this feature and my response was:

My intent is to make the targeted VRAM margin and the min. context size configurable (and to only adjust runtime parameters not explicitly set by the user). That should cover most use cases. It is my opinion that the logic for optimizing runtime parameters should be kept simple since it's not feasible to cover all possible use cases and hardware setups anyways. If someone wants to squeeze out the last few % of performance for their setup they should determine the optimal parameters manually and save them somewhere.

@Midaychi

Tested your branch and it seems to work. Any chance of a switch to manually define the VRAM target if auto-detect gets the wrong value? (Example: mobile NVIDIA GPUs sometimes under-report their VRAM on Windows, even though that VRAM can be used manually just fine with swap-mem forced off.)

@JohannesGaessler
Collaborator Author

I intend to make the targeted VRAM margin configurable (one value per GPU). If a signed integer is used and you specify a negative value the code will try to allocate more memory per GPU than is assumed to be available.
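
A minimal sketch of why a signed margin covers the under-reporting case: the target is simply the reported free memory minus the margin, so a negative margin raises the target above what the device reports. The name target_bytes is hypothetical and the arithmetic is an assumption about how such a margin would be applied, not code from this PR.

```cpp
#include <cstdint>

// Bytes the fitting logic may aim for on one GPU.
// A positive margin leaves headroom; a negative margin targets more
// memory than the device reports as free.
constexpr int64_t target_bytes(int64_t free_bytes, int64_t margin_bytes) {
    return free_bytes - margin_bytes;
}
```

With one margin value per GPU, the same expression would be evaluated once per device.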

@ThiloteE
Contributor

ThiloteE commented Jun 13, 2025

Just so there is no misunderstanding: @ThiloteE is not writing any of the code in this PR, he merely made suggestions regarding the interface. I am the one working on this feature [...]

Yes. I just made a few proposals and I think it was good having them discussed and evaluated. Such is the beauty of Open Source. Thank you @JohannesGaessler for working on this. All credit to you.
