forked from pytorch/executorch
Openvino llama support #6
Draft

cavusmustafa wants to merge 288 commits into main from openvino_llama_support
Conversation
cavusmustafa pushed a commit that referenced this pull request on Aug 19, 2025:
BNNS copy crashes the process when the dtypes differ (pytorch#11714). With the example in this PR (pytorch#11714), we crash the process on main. Here is the stack trace from LLDB:

```
Process 19234 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x0000000190ac9388 libsystem_kernel.dylib`__pthread_kill + 8
libsystem_kernel.dylib`__pthread_kill:
->  0x190ac9388 <+8>:  b.lo   0x190ac93a8  ; <+40>
    0x190ac938c <+12>: pacibsp
    0x190ac9390 <+16>: stp    x29, x30, [sp, #-0x10]!
    0x190ac9394 <+20>: mov    x29, sp
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x0000000190ac9388 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x0000000190b0288c libsystem_pthread.dylib`pthread_kill + 296
    frame #2: 0x0000000190a0bc60 libsystem_c.dylib`abort + 124
    frame #3: 0x0000000190910174 libsystem_malloc.dylib`malloc_vreport + 892
    frame #4: 0x0000000190913c90 libsystem_malloc.dylib`malloc_report + 64
    frame #5: 0x000000019091821c libsystem_malloc.dylib`___BUG_IN_CLIENT_OF_LIBMALLOC_POINTER_BEING_FREED_WAS_NOT_ALLOCATED + 32
    frame #6: 0x000000019d2f4084 libBNNS.dylib`___lldb_unnamed_symbol1620 + 564
    frame #7: 0x000000019d2f5bac libBNNS.dylib`___lldb_unnamed_symbol1628 + 680
    frame #8: 0x000000019d69ce48 libBNNS.dylib`BNNSCopy + 616
    frame #9: 0x000000030c74d950 _portable_lib.cpython-310-darwin.so`(anonymous namespace)::copy_using_bnns(executorchcoreml::MultiArray const&, executorchcoreml::MultiArray&) + 188
    frame #10: 0x000000030c74cfdc _portable_lib.cpython-310-darwin.so`(anonymous namespace)::copy(executorchcoreml::MultiArray const&, executorchcoreml::MultiArray&, executorchcoreml::MultiArray::CopyOptions) + 72
    frame #11: 0x000000030c74ceec _portable_lib.cpython-310-darwin.so`executorchcoreml::MultiArray::copy(executorchcoreml::MultiArray&, executorchcoreml::MultiArray::CopyOptions) const + 148
    frame #12: 0x000000030c7488d4 _portable_lib.cpython-310-darwin.so`invocation function for block in (anonymous namespace)::copy(MLMultiArray*, executorchcoreml::MultiArray&) + 376
    frame #13: 0x000000030c748ac8 _portable_lib.cpython-310-darwin.so`invocation function for block in (anonymous namespace)::copy(MLMultiArray*, executorchcoreml::MultiArray&) + 52
    frame #14: 0x000000019ad33f4c CoreML`CoreML::MultiArrayBuffer::getBytesWithHandler(void (void const*, unsigned long) block_pointer) const + 340
    frame #15: 0x000000019ad34138 CoreML`-[MLMultiArray(ScopedBufferAccess) getBytesWithHandler:] + 152
    frame #16: 0x000000030c7485ec _portable_lib.cpython-310-darwin.so`(anonymous namespace)::copy(MLMultiArray*, executorchcoreml::MultiArray&) + 296
    frame #17: 0x000000030c744f68 _portable_lib.cpython-310-darwin.so`(anonymous namespace)::set_outputs(std::__1::vector<executorchcoreml::MultiArray, std::__1::allocator<executorchcoreml::MultiArray>>&, NSArray<MLMultiArray*>*) + 180
```

With this PR, the process succeeds.
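The crash above happens because a raw, byte-for-byte copy is only valid when source and destination share an element type; with mismatched dtypes the element sizes disagree and the copy runs past the destination buffer. As an illustrative sketch of the guard (hypothetical, not the actual `MultiArray::copy` implementation), the fix amounts to gating the fast path on matching dtypes and falling back to a converting copy:

```python
import numpy as np

def safe_copy(src: np.ndarray, dst: np.ndarray) -> None:
    """Copy src into dst, converting elements when the dtypes differ.

    A raw byte copy (the BNNSCopy-style fast path) is only safe when both
    buffers share a dtype; otherwise element sizes disagree and the copy
    reads or writes out of bounds. Hypothetical sketch only.
    """
    if src.shape != dst.shape:
        raise ValueError("shape mismatch")
    if src.dtype == dst.dtype:
        dst[...] = src                    # fast path: identical layout
    else:
        dst[...] = src.astype(dst.dtype)  # converting fallback
```

The same shape check and dtype gate are what a native implementation would perform before handing the buffers to a low-level copy routine.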
Co-authored-by: Daniil Lyakhov <[email protected]>
[OVQuantizer] Apply Fixes and Integrate into the Llama Example Workflow
Will land this PR and cherry-pick to the release/1.0 branch as we approach the 1.0 release.
This PR was created by the merge bot to help merge the original PR into the main branch.

ghstack PR number: pytorch#15004 by @Gasoonjia ^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/gasoonjia/54/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/gasoonjia/54/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/gasoonjia/54/orig
Differential Revision: [D84367515](https://our.internmc.facebook.com/intern/diff/D84367515/)

@diff-train-skip-merge

Co-authored-by: gasoonjia <[email protected]>
This pull request introduces changes to the CUDA workflow, model artifact handling, and multimodal runner logic. The main changes include restructuring the GitHub Actions workflow to separate model export, benchmarking, and end-to-end testing for the Voxtral CUDA pipeline, improving artifact management and reproducibility. Additionally, the multimodal runner now supports automatic conversion of audio tensors to bfloat16, ensuring compatibility with expected input types. There are also enhancements to caching and symbol registration in the CUDA backend, and build system updates to support linking the CUDA backend.

**Workflow and Artifact Management Improvements:**

* Refactored `.github/workflows/cuda.yml` to split the Voxtral CUDA pipeline into three jobs: `export-voxtral-cuda-artifact` (exports and stores model artifacts), `benchmark-voxtral-cuda` (benchmarks using exported artifacts), and `test-voxtral-cuda-e2e` (runs full end-to-end tests with artifact download and audio input). Improved artifact handling and reproducibility, and added explicit checks for required files. [[1]](diffhunk://#diff-29abea04e0613c2569973e5c8e3c89e04846d408c855eeb1f3efcfae7cfa6f89L90-R91) [[2]](diffhunk://#diff-29abea04e0613c2569973e5c8e3c89e04846d408c855eeb1f3efcfae7cfa6f89R107) [[3]](diffhunk://#diff-29abea04e0613c2569973e5c8e3c89e04846d408c855eeb1f3efcfae7cfa6f89R134-R185) [[4]](diffhunk://#diff-29abea04e0613c2569973e5c8e3c89e04846d408c855eeb1f3efcfae7cfa6f89R196-R267) [[5]](diffhunk://#diff-29abea04e0613c2569973e5c8e3c89e04846d408c855eeb1f3efcfae7cfa6f89R122)

**Multimodal Runner Logic:**

* Added automatic conversion of audio tensors to bfloat16 in `MultimodalPrefiller::prefill` and implemented a helper function `convert_to_bfloat16` in `util.h` to support this. This ensures that audio inputs match the expected dtype for the encoder, improving robustness for multimodal inference. [[1]](diffhunk://#diff-ad4fcb32ffc5f1f7b4f87b5ee58927cb948a8c0976295befd10e3de445913ae4L96-R136) [[2]](diffhunk://#diff-db4801445eaa3bb4f1370fe41d3a00ae2e3ef354a23ad4d5ace141ecc3c6f413R144-R180)

**CUDA Backend and Caching Enhancements:**

* Improved caching logic in `common_shims.cpp` for tensor strides and sizes by validating cached values and updating them when necessary. This prevents stale cache issues and ensures correct tensor metadata. [[1]](diffhunk://#diff-1e7c9d572d434c9a85c9d466e7f406877bc974a373c370fe7ddb3fe32852c1f2R54-R81) [[2]](diffhunk://#diff-1e7c9d572d434c9a85c9d466e7f406877bc974a373c370fe7ddb3fe32852c1f2R104-R130)
* Added dynamic symbol re-registration in `CudaBackend` to handle multiple shared objects in the same process, ensuring correct execution when switching between models.
* Removed redundant logging statements in the CUDA backend for cleaner output. [[1]](diffhunk://#diff-a4b17eccf1aa933837671c5184e02bc815d934a362344bb2b17b789cdfaa5375L226) [[2]](diffhunk://#diff-a4b17eccf1aa933837671c5184e02bc815d934a362344bb2b17b789cdfaa5375L256)

**Build System Updates:**

* Updated `CMakeLists.txt` and `executorch-config.cmake` to include and link the CUDA backend (`aoti_cuda`) when building Voxtral and other components, improving build flexibility and CUDA support. [[1]](diffhunk://#diff-606feb24310595f592d98d021a2c90618346977d94decb80b35b7e26ed8ccc1eR89-R95) [[2]](diffhunk://#diff-6a78a155992483ff6f35d595ff6cef63b477d1c853f6482e77acae6ef443f0e4R56)

**Debugging and Tuning Options:**

* Added support for enabling debug compilation in `cuda_backend.py` via the `DEBUG` environment variable, allowing easier troubleshooting and development.
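For context on the bfloat16 conversion mentioned above: bfloat16 keeps float32's sign bit and 8-bit exponent and truncates the mantissa to 7 bits, so a float32 value converts by keeping the top 16 bits of its bit pattern. A minimal numpy sketch under that assumption (simple truncation; the actual `convert_to_bfloat16` helper in `util.h` may round rather than truncate):

```python
import numpy as np

def float32_to_bfloat16_bits(x: np.ndarray) -> np.ndarray:
    """Truncate float32 values to their bfloat16 bit patterns (uint16).

    Conversion is just dropping the low 16 bits of each 32-bit word.
    Round-to-nearest-even is omitted here; production converters
    usually round before truncating.
    """
    u32 = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (u32 >> 16).astype(np.uint16)

def bfloat16_bits_to_float32(bits: np.ndarray) -> np.ndarray:
    """Widen bfloat16 bit patterns back to float32 by zero-filling the low 16 bits."""
    return (bits.astype(np.uint32) << 16).view(np.float32)
```

Because the exponent range matches float32, this conversion never overflows; it only loses mantissa precision, which is why it is a cheap way to satisfy an encoder that expects bfloat16 inputs.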
…ementwiseOps to the common section. Differential Revision: D83793229 Pull Request resolved: pytorch#14780
Differential Revision: D84357937 Pull Request resolved: pytorch#14890
Differential Revision: D84187909 Pull Request resolved: pytorch#14958
…ch#14993) Signed-off-by: Ryan O'Shea <[email protected]>
### Summary
- refactor a bit & add more test cases

### Test plan
```bash
python backends/qualcomm/tests/test_qnn_delegate.py TestQNNQuantizedOperator.test_qnn_backend_index_put -b build-android -s $SN -m SM8750
python backends/qualcomm/tests/test_qnn_delegate.py TestQNNQuantizedOperator.test_qnn_backend_index_put_suite -b build-android -s $SN -m SM8750
```
Summary: Updating the TOSA, U55 & U85 tests to remove xfails. These ops are now supported, so the tests are updated to no longer expect failure. Differential Revision: D84262200
Differential Revision: D81703253 Pull Request resolved: pytorch#15011
Differential Revision: D84279595 Pull Request resolved: pytorch#14956
This PR was created by the merge bot to help merge the original PR into the main branch.

ghstack PR number: pytorch#15016 by @Gasoonjia ^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/gasoonjia/56/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/gasoonjia/56/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/gasoonjia/56/orig
Differential Revision: [D84280496](https://our.internmc.facebook.com/intern/diff/D84280496/)

@diff-train-skip-merge

Co-authored-by: gasoonjia <[email protected]>
…s._clone_dim_order.default (pytorch#14535)

### Summary
- Adds support for conversion and quantization of the `dim_order_ops._clone_dim_order.default` operator and fixes problems with some variations of `nn.Dropout`.
- Adds more robust test cases for clone operators.

### Test plan
All changes should be covered by unit tests.

cc @robert-kalmar @JakeStevens @digantdesai
fix unexpanded VGF term use.
Summary: As stated in the title

Reviewed By: bingcy

Differential Revision: D83859440

---------

Co-authored-by: Jacob Szwejbka <[email protected]>
Updated link to Core ATen operator set documentation.
Summary: Wire up the unary sine operator in xnnpack for fp32 and fp16. Differential Revision: D83623086
Summary: Fix up flags. Differential Revision: D84296634
This PR was created by the merge bot to help merge the original PR into the main branch.

ghstack PR number: pytorch#14666 by @lucylq ^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/lucylq/114/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/lucylq/114/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/lucylq/114/orig
Differential Revision: [D83504588](https://our.internmc.facebook.com/intern/diff/D83504588/)

@diff-train-skip-merge

Co-authored-by: lucylq <[email protected]>
Summary: . Differential Revision: D84516559
Summary: Copied assets from https://github.com/dbort/executorch-logos/
Summary: TensorPtr view created with TensorPtr should keep it alive to match ATen behavior. Differential Revision: D84512176
Differential Revision: [D83777195](https://our.internmc.facebook.com/intern/diff/D83777195/)

[ghstack-poisoned]
pytorch#15066) … Clamp/Clamp (pytorch#14415)"

This reverts commit a5d7e5c, which broke internal builds. @SS-JIA is trying to fix this in pytorch#15058; will leave relanding to him.