opencl: allow large buffer for adreno #20997
Merged. Approved by max-krasnyansky (Mar 25, 2026) and ggerganov (Mar 26, 2026).
Referenced by commits in slartibardfast/llama.cpp (Apr 12, 2026), Seunghhon/llama.cpp (Apr 26, 2026), rsenthilkumar6/llama.cpp (May 1, 2026), ljubomirj/llama.cpp (May 6, 2026), a-ghorbani/pocketpal-ai (May 11, 2026), my-other-github-account/llama.cpp (two commits, May 15, 2026) and fewtarius/llama.cpp (May 30, 2026). The pocketpal-ai commit message reads:

Set LM_GGML_OPENCL_ADRENO_USE_LARGE_BUFFER=1 before SoLoader.init so the llama.rn OpenCL backend enables cl_qcom_large_buffer on supported Adreno devices. Non-Adreno devices and drivers without the extension no-op. Closes #657. Upstream: ggml-org/llama.cpp#20997
Overview
OpenCL limits the size of any single buffer allocation; the limit can be queried with `CL_DEVICE_MAX_MEM_ALLOC_SIZE`. Some Adreno GPUs can allocate buffers beyond this limit through an extension, although the extension does not guarantee that a buffer as large as the entire DRAM can be allocated. This allows larger compute buffers and longer contexts.

This PR adds an environment variable, `GGML_OPENCL_ADRENO_USE_LARGE_BUFFER`, to enable this extension. If the variable is set, the GPU is an Adreno, and the extension is supported, the extension is used to allocate buffers that exceed `CL_DEVICE_MAX_MEM_ALLOC_SIZE`.

Additional information
The extension is `cl_qcom_large_buffer`. It is documented in the Adreno OpenCL SDK documentation (the SDK can be downloaded from https://softwarecenter.qualcomm.com/catalog/item/Adreno_OpenCL_SDK).

Android devices with A7x and A8x GPUs should support it. X Elite (Windows) does not support it at the moment; the upcoming X2 Elite (Windows) is expected to support it.
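The gating described above can be sketched in C roughly as follows. This is a minimal, hypothetical illustration, not the backend's actual code: it assumes the extension string and the max allocation size have already been queried (e.g. via `clGetDeviceInfo` with `CL_DEVICE_EXTENSIONS` and `CL_DEVICE_MAX_MEM_ALLOC_SIZE`), and that Adreno detection happens elsewhere. The helper names `has_extension`, `large_buffer_requested` and `use_large_buffer` are invented for this sketch, and the actual allocation call that uses the extension is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Extension and variable names come from the PR; the helpers are hypothetical. */
#define LARGE_BUFFER_EXT "cl_qcom_large_buffer"
#define LARGE_BUFFER_ENV "GGML_OPENCL_ADRENO_USE_LARGE_BUFFER"

/* Returns true if `name` occurs as a whole, space-delimited token in
 * `ext_list` (the format of the CL_DEVICE_EXTENSIONS string). A plain
 * strstr() would also match longer names that merely start with `name`. */
static bool has_extension(const char *ext_list, const char *name) {
    assert(ext_list != NULL && name != NULL);
    size_t n = strlen(name);
    if (n == 0) {
        return false;
    }
    for (const char *p = strstr(ext_list, name); p != NULL; p = strstr(p + n, name)) {
        bool starts = (p == ext_list) || (p[-1] == ' ');
        bool ends = (p[n] == '\0') || (p[n] == ' ');
        if (starts && ends) {
            return true;
        }
    }
    return false;
}

/* The PR keys on the variable existing, so only its presence is checked. */
static bool large_buffer_requested(void) {
    return getenv(LARGE_BUFFER_ENV) != NULL;
}

/* Hypothetical gate: use the extension only when the user opted in, the
 * device is an Adreno GPU, the driver advertises the extension, and the
 * request exceeds CL_DEVICE_MAX_MEM_ALLOC_SIZE (passed in as `max_alloc`). */
static bool use_large_buffer(bool opted_in, bool is_adreno, const char *ext_list,
                             uint64_t max_alloc, uint64_t size) {
    return opted_in && is_adreno && has_extension(ext_list, LARGE_BUFFER_EXT) &&
           size > max_alloc;
}
```

A caller would evaluate `use_large_buffer(large_buffer_requested(), is_adreno, ext_list, max_alloc, size)` per allocation. Matching whole tokens rather than substrings avoids false positives from any similarly prefixed extension name, and the short-circuit order means non-Adreno devices or drivers without the extension fall back to the normal allocation path.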
For example, setting `GGML_OPENCL_ADRENO_USE_LARGE_BUFFER=1` allows Qwen3-0.6B to run on an Android device with an A740 GPU at a context length of 40960.

Requirements