ggml-alloc : make gallocr prefer chunks that allow memory reuse #16788
Small improvement to graph allocation with multiple buffers/chunks:
When a tensor is allocated and no existing free block fits, the current implementation allocates additional memory in the first chunk whose max size can still accommodate the tensor. The last block of a chunk can contain both reusable memory (previously allocated and then freed) and memory that has not been allocated yet. This PR prioritizes chunks whose reusable memory already fits the tensor, which reduces the total allocation size.
See #16759 for an example.
Vulkan compute buffer size was measured with `llama-bench --model llama-2-7b.Q4_0.gguf --n-gpu-layers 19 --ubatch-size 512` at `--n-prompt` values of 12200, 12500, 13500, 14500, and 15500. I tested some other models and they show similar behavior around the 1024 MiB threshold.