Qualcomm AI Engine Direct - Quantization Recipe for LLM #15807
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15807
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit f5b3916 with merge base 3bbe173.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "release notes: qualcomm"
Hi @cccclai, This PR includes the Quantization Recipe we went over in today's meeting. cc: @haowhsu-quic
Force-pushed 2d4c061 to e09726d
- add a fine-grained quantization annotation mechanism – quantization recipe
- applied to Llama3-1B/3B with fine-grained quantization configs
Force-pushed e09726d to f0f016e
Pull Request Overview
This PR introduces a new fine-grained quantization annotation mechanism called "quantization recipe" for LLM models in the Qualcomm AI Engine Direct backend. The new approach replaces the previous custom annotation system with a more flexible and maintainable recipe-based pattern.
Key Changes
- Added `QuantRecipe` infrastructure providing a builder pattern for defining quantization strategies
- Implemented model-specific quantization recipes for 14 LLM variants (Llama, Gemma, Qwen, Phi, etc.)
- Migrated LLM model configurations from `custom_annotation` tuples to `quant_recipe` class references
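For readers skimming the diff, a minimal sketch of the builder idea follows. All class and method names below are hypothetical illustrations; the real API lives in `backends/qualcomm/quantizer/quant_recipe.py`.

```python
# Illustrative sketch only: names here are hypothetical, not the actual
# API from backends/qualcomm/quantizer/quant_recipe.py.
from enum import Enum
from typing import List, Optional, Tuple


class QuantGranularity(Enum):
    PER_TENSOR = "per_tensor"
    PER_CHANNEL = "per_channel"
    PER_GROUP = "per_group"


class QuantRecipe:
    """Accumulates (module-pattern -> quant config) rules via a builder."""

    def __init__(self) -> None:
        self.rules: List[Tuple[str, int, int, QuantGranularity, Optional[int]]] = []

    def add(
        self,
        pattern: str,
        act_bits: int,
        weight_bits: int,
        granularity: QuantGranularity = QuantGranularity.PER_CHANNEL,
        group_size: Optional[int] = None,
    ) -> "QuantRecipe":
        # Rules added later can target specific submodules and override
        # the broad defaults added earlier.
        self.rules.append((pattern, act_bits, weight_bits, granularity, group_size))
        return self  # returning self is what makes this a builder


# Hypothetical usage: 4-bit grouped weights everywhere, but keep the
# output head at higher precision.
llama_recipe = (
    QuantRecipe()
    .add("*", act_bits=8, weight_bits=4,
         granularity=QuantGranularity.PER_GROUP, group_size=64)
    .add("output", act_bits=16, weight_bits=8)
)
print(llama_recipe.rules)
```

The appeal of this pattern over free-form annotation callbacks is that a recipe is declarative data: per-model recipes can subclass or extend a shared base instead of duplicating annotation logic.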
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
Summary per file:
| File | Description |
|---|---|
| `examples/qualcomm/oss_scripts/llama/static_llm_quant_recipe.py` | New file defining StaticLLMQuantRecipe base class and 14 model-specific recipe implementations |
| `examples/qualcomm/oss_scripts/llama/llama.py` | Updated to use quant_recipe instead of custom_annotations, simplified quantization flow |
| `examples/qualcomm/oss_scripts/llama/__init__.py` | Removed custom annotation imports/configs, added quant_recipe imports, updated LLMModelConfig to use quant_recipe field |
| `backends/qualcomm/quantizer/quant_recipe.py` | New core infrastructure with QuantRecipe builder, QuantizationStrategy patterns, and QuantGranularity enum |
| `backends/qualcomm/quantizer/quantizer.py` | Added recipe support to QnnQuantizer.annotate(), added new use_8a4w QuantDtype |
| `backends/qualcomm/quantizer/qconfig.py` | Added get_8a4w_qnn_ptq_config() for 8-bit activation, 4-bit weight quantization |
| `backends/qualcomm/quantizer/custom_annotation.py` | Removed obsolete annotation functions (annotate_down_proj, annotate_output_16a8w, annotate_qkv_proj_sha, StaticLLMQuantConfig) |
| `docs/source/llm/build-run-llama3-qualcomm-ai-engine-direct-backend.md` | Updated documentation to reference quant_recipe instead of ptq/group_size configs |
| `backends/qualcomm/utils/utils.py` | Added show_nn_module_stack_for_quant_recipe() helper for debugging module stacks |
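To make the `use_8a4w` entries concrete: 8a4w means 8-bit quantized activations paired with 4-bit quantized weights. Below is a self-contained sketch of the fake-quant arithmetic those bit widths imply; it is illustrative only, since the actual `get_8a4w_qnn_ptq_config()` in this PR wires such ranges into PT2E observers rather than computing them by hand.

```python
# Illustrative 8a4w quantize/dequantize round trip, not the PR's code path.
import torch


def symmetric_scale(t: torch.Tensor, qmin: int, qmax: int) -> torch.Tensor:
    # Symmetric scale from the observed absolute max (a common PTQ choice).
    return t.abs().max().clamp(min=1e-8) / max(abs(qmin), qmax)


def fake_quant(t: torch.Tensor, qmin: int, qmax: int) -> torch.Tensor:
    # Quantize to integers in [qmin, qmax], then dequantize back to float.
    scale = symmetric_scale(t, qmin, qmax)
    return (t / scale).round().clamp(qmin, qmax) * scale


w = torch.randn(256, 256)
x = torch.randn(8, 256)
w_q = fake_quant(w, qmin=-8, qmax=7)       # 4-bit signed weights
x_q = fake_quant(x, qmin=-128, qmax=127)   # 8-bit signed activations
# Mean error of the quantized matmul vs. the float reference:
print((x_q @ w_q.t() - x @ w.t()).abs().mean())
```

The 4-bit weight range is where granularity matters most, which is why the recipe infrastructure exposes per-channel and per-group options.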
@DannyYuyang-quic thanks for the PR. We have a native executorch.export infra and ExportRecipes (https://github.com/pytorch/executorch/blob/main/export/export.py#L38) for users to easily consume configurations such as these; for example, I added a recipe for QNN FP16 (https://github.com/pytorch/executorch/blob/main/backends/qualcomm/recipes/qnn_recipe_types.py#L24). It would be great if we could expose these quant configs as well for everyone to use; this would significantly lower the friction to onboard to QNN. Also note that if you use ExportRecipes, you don't have to use … CC: @cccclai
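For context on this suggestion, here is a rough sketch of the recipe-based export flow being referenced. The import path follows the linked `export/export.py`, but the recipe lookup key and the exact `export()` signature are assumptions inferred from the linked files, so treat this as pseudocode rather than verified API.

```python
# Approximate sketch of the ExportRecipe flow described above.
# The recipe key "qnn_fp16" and the export() keyword are assumptions.
import torch
from executorch.export import ExportRecipe, export  # per export/export.py


class TinyModel(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.relu(x)


model = TinyModel().eval()
example_inputs = (torch.randn(1, 16),)

# A named backend recipe (e.g. the QNN FP16 one in
# backends/qualcomm/recipes/qnn_recipe_types.py) bundles the partitioner,
# quantizer, and compile specs behind a single identifier.
recipe = ExportRecipe.get_recipe("qnn_fp16")  # hypothetical key

session = export(model, example_inputs, export_recipe=recipe)
```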
@abhinaykukkadapu this PR is different from the export recipe you added. It's about how to add more customization when quantizing a model. The current recipes for different backends don't offer this level of customization, and we need to either expose some API or leave it for advanced users only.
Hi @abhinaykukkadapu @cccclai, for now this PR does not use `ExportRecipe`.
Thanks for your work and for letting me know. Yes, it would be great if we expose these complex configs as `ExportRecipe`s.
Summary: Forward fix for the test failure in pytorch#15807. The main reason is that this API is called internally. In this PR, I recovered some of the functions deleted in the previous PRs. Reviewed By: abhinaykukkadapu Differential Revision: D87566729
Summary
Qualcomm AI Engine Direct - Quantization Recipe for LLM
Test plan
All LLM CI under TestExampleLLMScript: