Qualcomm AI Engine Direct - Quantization Recipe for LLM #15807
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15807
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit f5b3916 with merge base 3bbe173.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "release notes: qualcomm"
Hi @cccclai, This PR includes the Quantization Recipe we went over in today's meeting. cc: @haowhsu-quic
Force-pushed 2d4c061 to e09726d
- add a fine-grained quantization annotation mechanism – quantization recipe
- applied to Llama3-1B/3B with fine-grained quantization configs
Force-pushed e09726d to f0f016e
Pull Request Overview
This PR introduces a new fine-grained quantization annotation mechanism called "quantization recipe" for LLM models in the Qualcomm AI Engine Direct backend. The new approach replaces the previous custom annotation system with a more flexible and maintainable recipe-based pattern.
Key Changes
- Added `QuantRecipe` infrastructure providing a builder pattern for defining quantization strategies
- Implemented model-specific quantization recipes for 14 LLM variants (Llama, Gemma, Qwen, Phi, etc.)
- Migrated LLM model configurations from `custom_annotation` tuples to `quant_recipe` class references
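For readers skimming the diff, a minimal sketch of the builder idea follows. All class and method names below are hypothetical illustrations; the real API lives in `backends/qualcomm/quantizer/quant_recipe.py`.

```python
# Illustrative sketch only: names here are hypothetical, not the actual
# API from backends/qualcomm/quantizer/quant_recipe.py.
from enum import Enum
from typing import List, Optional, Tuple


class QuantGranularity(Enum):
    PER_TENSOR = "per_tensor"
    PER_CHANNEL = "per_channel"
    PER_GROUP = "per_group"


class QuantRecipe:
    """Accumulates (module-pattern -> quant config) rules via a builder."""

    def __init__(self) -> None:
        self.rules: List[Tuple[str, int, int, QuantGranularity, Optional[int]]] = []

    def add(
        self,
        pattern: str,
        act_bits: int,
        weight_bits: int,
        granularity: QuantGranularity = QuantGranularity.PER_CHANNEL,
        group_size: Optional[int] = None,
    ) -> "QuantRecipe":
        # Rules added later can target specific submodules and override
        # the broad defaults added earlier.
        self.rules.append((pattern, act_bits, weight_bits, granularity, group_size))
        return self  # returning self is what makes this a builder


# Hypothetical usage: 4-bit grouped weights everywhere, but keep the
# output head at higher precision.
llama_recipe = (
    QuantRecipe()
    .add("*", act_bits=8, weight_bits=4,
         granularity=QuantGranularity.PER_GROUP, group_size=64)
    .add("output", act_bits=16, weight_bits=8)
)
print(llama_recipe.rules)
```

The appeal of this pattern over free-form annotation callbacks is that a recipe is declarative data: per-model recipes can subclass or extend a shared base instead of duplicating annotation logic.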
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
Summary per file:
| File | Description |
|---|---|
| `examples/qualcomm/oss_scripts/llama/static_llm_quant_recipe.py` | New file defining StaticLLMQuantRecipe base class and 14 model-specific recipe implementations |
| `examples/qualcomm/oss_scripts/llama/llama.py` | Updated to use quant_recipe instead of custom_annotations, simplified quantization flow |
| `examples/qualcomm/oss_scripts/llama/__init__.py` | Removed custom annotation imports/configs, added quant_recipe imports, updated LLMModelConfig to use quant_recipe field |
| `backends/qualcomm/quantizer/quant_recipe.py` | New core infrastructure with QuantRecipe builder, QuantizationStrategy patterns, and QuantGranularity enum |
| `backends/qualcomm/quantizer/quantizer.py` | Added recipe support to QnnQuantizer.annotate(), added new use_8a4w QuantDtype |
| `backends/qualcomm/quantizer/qconfig.py` | Added get_8a4w_qnn_ptq_config() for 8-bit activation, 4-bit weight quantization |
| `backends/qualcomm/quantizer/custom_annotation.py` | Removed obsolete annotation functions (annotate_down_proj, annotate_output_16a8w, annotate_qkv_proj_sha, StaticLLMQuantConfig) |
| `docs/source/llm/build-run-llama3-qualcomm-ai-engine-direct-backend.md` | Updated documentation to reference quant_recipe instead of ptq/group_size configs |
| `backends/qualcomm/utils/utils.py` | Added show_nn_module_stack_for_quant_recipe() helper for debugging module stacks |
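To make the `use_8a4w` entries concrete: 8a4w means 8-bit quantized activations paired with 4-bit quantized weights. Below is a self-contained sketch of the fake-quant arithmetic those bit widths imply; it is illustrative only, since the actual `get_8a4w_qnn_ptq_config()` in this PR wires such ranges into PT2E observers rather than computing them by hand.

```python
# Illustrative 8a4w quantize/dequantize round trip, not the PR's code path.
import torch


def symmetric_scale(t: torch.Tensor, qmin: int, qmax: int) -> torch.Tensor:
    # Symmetric scale from the observed absolute max (a common PTQ choice).
    return t.abs().max().clamp(min=1e-8) / max(abs(qmin), qmax)


def fake_quant(t: torch.Tensor, qmin: int, qmax: int) -> torch.Tensor:
    # Quantize to integers in [qmin, qmax], then dequantize back to float.
    scale = symmetric_scale(t, qmin, qmax)
    return (t / scale).round().clamp(qmin, qmax) * scale


w = torch.randn(256, 256)
x = torch.randn(8, 256)
w_q = fake_quant(w, qmin=-8, qmax=7)       # 4-bit signed weights
x_q = fake_quant(x, qmin=-128, qmax=127)   # 8-bit signed activations
# Mean error of the quantized matmul vs. the float reference:
print((x_q @ w_q.t() - x @ w.t()).abs().mean())
```

The 4-bit weight range is where granularity matters most, which is why the recipe infrastructure exposes per-channel and per-group options.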
@DannyYuyang-quic thanks for the PR. We have a native executorch.export infra and ExportRecipes (https://github.com/pytorch/executorch/blob/main/export/export.py#L38) for users to easily consume configurations such as these; for example, I added a recipe for QNN FP16 (https://github.com/pytorch/executorch/blob/main/backends/qualcomm/recipes/qnn_recipe_types.py#L24). It would be great if we could expose these quant configs as well for everyone to use; this would significantly lower the friction to onboard to QNN. Also note that if you use ExportRecipes, you don't have to use … CC: @cccclai
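For context on this suggestion, here is a rough sketch of the recipe-based export flow being referenced. The import path follows the linked `export/export.py`, but the recipe lookup key and the exact `export()` signature are assumptions inferred from the linked files, so treat this as pseudocode rather than verified API.

```python
# Approximate sketch of the ExportRecipe flow described above.
# The recipe key "qnn_fp16" and the export() keyword are assumptions.
import torch
from executorch.export import ExportRecipe, export  # per export/export.py


class TinyModel(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.relu(x)


model = TinyModel().eval()
example_inputs = (torch.randn(1, 16),)

# A named backend recipe (e.g. the QNN FP16 one in
# backends/qualcomm/recipes/qnn_recipe_types.py) bundles the partitioner,
# quantizer, and compile specs behind a single identifier.
recipe = ExportRecipe.get_recipe("qnn_fp16")  # hypothetical key

session = export(model, example_inputs, export_recipe=recipe)
```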
@abhinaykukkadapu this PR is different from the export recipe you added. It's about how to add more customization when quantizing a model. The current recipes for different backends don't offer this level of customization, and we need to either expose some API or leave it for advanced users only.
Hi @abhinaykukkadapu @cccclai, for now this PR does not use `ExportRecipe`.
Thanks for your work and for letting me know. Yes, it would be great if we expose these complex configs as `ExportRecipe`s.
Summary: Forward fix for the test failure in pytorch#15807. The main reason is that this API is called internally. In this PR, I recovered some of the functions deleted in the previous PRs. Reviewed By: abhinaykukkadapu Differential Revision: D87566729
Summary
Qualcomm AI Engine Direct - Quantization Recipe for LLM
Test plan
All LLM CI under TestExampleLLMScript: