opencl: allow large buffer for adreno #20997

Merged: max-krasnyansky merged 1 commit into ggml-org:master from qualcomm:lh/adreno-large-buffer on Mar 26, 2026
@lhez commented Mar 25, 2026

Overview

OpenCL limits the maximum allocation size of a single buffer (the limit can be queried with CL_DEVICE_MAX_MEM_ALLOC_SIZE). Some Adreno GPUs allow buffers to be allocated beyond this limit through an extension, although the extension does not guarantee that a buffer as large as the entire DRAM can be allocated. This allows larger compute buffers and larger contexts.

This PR adds an env var, GGML_OPENCL_ADRENO_USE_LARGE_BUFFER, to enable this extension. If the env var exists, the GPU is Adreno, and the extension is supported, the extension is used to allocate buffers that exceed the limit defined by CL_DEVICE_MAX_MEM_ALLOC_SIZE.
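The three conditions above can be sketched as a small gating check. This is a minimal illustration, not the code from this PR: the helper names `env_var_exists`, `has_extension`, and `use_large_buffer` are hypothetical, the Adreno check is passed in as a plain flag, and the extension list is assumed to be the space-separated string that OpenCL reports for CL_DEVICE_EXTENSIONS.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: true if the env var is present at all.
// The PR description says the var only needs to exist; its value is not inspected here.
static bool env_var_exists(const char * name) {
    return std::getenv(name) != nullptr;
}

// Hypothetical helper: whole-token match in a space-separated extension list,
// so that "cl_qcom_large_buffer_foo" does not count as "cl_qcom_large_buffer".
static bool has_extension(const std::string & ext_list, const std::string & ext) {
    std::size_t pos = 0;
    while (pos < ext_list.size()) {
        std::size_t end = ext_list.find(' ', pos);
        if (end == std::string::npos) {
            end = ext_list.size();
        }
        if (ext_list.compare(pos, end - pos, ext) == 0) {
            return true;
        }
        pos = end + 1;
    }
    return false;
}

// Hypothetical gate: all three conditions from the description must hold.
static bool use_large_buffer(bool env_set, bool is_adreno, const std::string & ext_list) {
    return env_set && is_adreno && has_extension(ext_list, "cl_qcom_large_buffer");
}
```

In a real backend, `env_set` would come from something like `env_var_exists("GGML_OPENCL_ADRENO_USE_LARGE_BUFFER")`, and `ext_list` from a device query; when any condition fails, allocation would stay within the standard limit.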

Additional information

The extension is cl_qcom_large_buffer. Relevant documentation can be found in the Adreno OpenCL SDK documentation (the SDK can be downloaded from https://softwarecenter.qualcomm.com/catalog/item/Adreno_OpenCL_SDK).

Android platforms with A7x and A8x GPUs should support it. X Elite (Windows) does not support it at the moment. The upcoming X2 Elite (Windows) should also support it.

For example, GGML_OPENCL_ADRENO_USE_LARGE_BUFFER=1 allows Qwen3-0.6B to run on an A740 Android device with a context length of 40960.
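To see why a long context can run into the per-buffer limit, a rough estimate of KV cache size helps. The sketch below is a back-of-the-envelope calculation under stated assumptions: the function names are hypothetical, the shape values in the usage note are illustrative and not taken from this PR, and it assumes the K and V caches for all layers end up in a single allocation, which may not match how the backend actually lays out its buffers.

```cpp
#include <cstdint>

// Hypothetical estimate: bytes needed for K and V caches across all layers.
// The factor of 2 accounts for storing both K and V.
static uint64_t kv_cache_bytes(uint64_t n_layers, uint64_t n_kv_heads, uint64_t head_dim,
                               uint64_t n_ctx, uint64_t bytes_per_elem) {
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem;
}

// Hypothetical check: does a request fit under the per-buffer limit,
// i.e. the value a device reports for CL_DEVICE_MAX_MEM_ALLOC_SIZE?
static bool fits_in_single_alloc(uint64_t bytes, uint64_t max_alloc) {
    return bytes <= max_alloc;
}
```

With illustrative values of 28 layers, 8 KV heads, a head dimension of 128, 16-bit elements, and a context of 40960, `kv_cache_bytes(28, 8, 128, 40960, 2)` gives 4697620480 bytes (about 4.4 GiB). Against an assumed limit of 1 GiB, `fits_in_single_alloc` returns false, which is the kind of case the extension is meant to help with.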

Requirements

@github-actions (bot) added the ggml and OpenCL labels Mar 25, 2026
@lhez marked this pull request as ready for review March 26, 2026 06:01
@lhez requested a review from a team as a code owner March 26, 2026 06:02
@lhez requested review from CISC and ggerganov March 26, 2026 06:26
@max-krasnyansky merged commit ded446b into ggml-org:master Mar 26, 2026
49 of 50 checks passed
slartibardfast pushed a commit to slartibardfast/llama.cpp that referenced this pull request Apr 12, 2026
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
a-ghorbani added a commit to a-ghorbani/pocketpal-ai that referenced this pull request May 11, 2026
Set LM_GGML_OPENCL_ADRENO_USE_LARGE_BUFFER=1 before SoLoader.init so the
llama.rn OpenCL backend enables cl_qcom_large_buffer on supported Adreno
devices. Non-Adreno devices and drivers without the extension no-op.

Closes #657.
Upstream: ggml-org/llama.cpp#20997
my-other-github-account pushed a commit to my-other-github-account/llama.cpp that referenced this pull request May 15, 2026
my-other-github-account pushed a commit to my-other-github-account/llama.cpp that referenced this pull request May 15, 2026
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026