@cavusmustafa (Owner)

cavusmustafa pushed a commit that referenced this pull request Aug 19, 2025
BNNS copy crashes the process when the dtypes differ
(pytorch#11714).

With the example in this PR
(pytorch#11714), we crash the
process on main. Here is the stack trace from LLDB:

```
Process 19234 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x0000000190ac9388 libsystem_kernel.dylib`__pthread_kill + 8
libsystem_kernel.dylib`__pthread_kill:
->  0x190ac9388 <+8>:  b.lo   0x190ac93a8    ; <+40>
    0x190ac938c <+12>: pacibsp 
    0x190ac9390 <+16>: stp    x29, x30, [sp, #-0x10]!
    0x190ac9394 <+20>: mov    x29, sp
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x0000000190ac9388 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x0000000190b0288c libsystem_pthread.dylib`pthread_kill + 296
    frame #2: 0x0000000190a0bc60 libsystem_c.dylib`abort + 124
    frame #3: 0x0000000190910174 libsystem_malloc.dylib`malloc_vreport + 892
    frame #4: 0x0000000190913c90 libsystem_malloc.dylib`malloc_report + 64
    frame #5: 0x000000019091821c libsystem_malloc.dylib`___BUG_IN_CLIENT_OF_LIBMALLOC_POINTER_BEING_FREED_WAS_NOT_ALLOCATED + 32
    frame #6: 0x000000019d2f4084 libBNNS.dylib`___lldb_unnamed_symbol1620 + 564
    frame #7: 0x000000019d2f5bac libBNNS.dylib`___lldb_unnamed_symbol1628 + 680
    frame #8: 0x000000019d69ce48 libBNNS.dylib`BNNSCopy + 616
    frame #9: 0x000000030c74d950 _portable_lib.cpython-310-darwin.so`(anonymous namespace)::copy_using_bnns(executorchcoreml::MultiArray const&, executorchcoreml::MultiArray&) + 188
    frame #10: 0x000000030c74cfdc _portable_lib.cpython-310-darwin.so`(anonymous namespace)::copy(executorchcoreml::MultiArray const&, executorchcoreml::MultiArray&, executorchcoreml::MultiArray::CopyOptions) + 72
    frame #11: 0x000000030c74ceec _portable_lib.cpython-310-darwin.so`executorchcoreml::MultiArray::copy(executorchcoreml::MultiArray&, executorchcoreml::MultiArray::CopyOptions) const + 148
    frame #12: 0x000000030c7488d4 _portable_lib.cpython-310-darwin.so`invocation function for block in (anonymous namespace)::copy(MLMultiArray*, executorchcoreml::MultiArray&) + 376
    frame #13: 0x000000030c748ac8 _portable_lib.cpython-310-darwin.so`invocation function for block in (anonymous namespace)::copy(MLMultiArray*, executorchcoreml::MultiArray&) + 52
    frame #14: 0x000000019ad33f4c CoreML`CoreML::MultiArrayBuffer::getBytesWithHandler(void (void const*, unsigned long) block_pointer) const + 340
    frame #15: 0x000000019ad34138 CoreML`-[MLMultiArray(ScopedBufferAccess) getBytesWithHandler:] + 152
    frame #16: 0x000000030c7485ec _portable_lib.cpython-310-darwin.so`(anonymous namespace)::copy(MLMultiArray*, executorchcoreml::MultiArray&) + 296
    frame #17: 0x000000030c744f68 _portable_lib.cpython-310-darwin.so`(anonymous namespace)::set_outputs(std::__1::vector<executorchcoreml::MultiArray, std::__1::allocator<executorchcoreml::MultiArray>>&, NSArray<MLMultiArray*>*) + 180
```


With this PR, the process succeeds.
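
The crash happens because the raw BNNS copy assumes both buffers share a dtype; the SIGABRT in the trace above is raised from libmalloc when that assumption is violated. The shape of the fix can be sketched in Python (illustrative only — the real change is in the Core ML backend's C++ copy path, and `safe_copy` is a made-up name):

```python
import numpy as np

def safe_copy(src: np.ndarray, dst: np.ndarray) -> None:
    """Copy src into dst, taking the raw fast path only when dtypes match.

    A BNNS-style raw copy is only valid when source and destination share
    a dtype; when they differ, fall back to a converting elementwise copy
    instead of crashing.
    """
    if src.shape != dst.shape:
        raise ValueError("shape mismatch")
    if src.dtype == dst.dtype:
        # Fast path: byte-for-byte copy (stands in for BNNSCopy).
        np.copyto(dst, src)
    else:
        # Slow path: copy with explicit dtype conversion.
        np.copyto(dst, src.astype(dst.dtype))
```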
mergennachin and others added 30 commits October 10, 2025 17:11
Will land this PR and cherry-pick to the release/1.0 branch as we approach
the 1.0 release.
This PR was created by the merge bot to help merge the original PR into
the main branch.
ghstack PR number: pytorch#15004 by
@Gasoonjia
^ Please use this as the source of truth for the PR details, comments,
and reviews
ghstack PR base:
https://github.com/pytorch/executorch/tree/gh/gasoonjia/54/base
ghstack PR head:
https://github.com/pytorch/executorch/tree/gh/gasoonjia/54/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head:
https://github.com/pytorch/executorch/tree/gh/gasoonjia/54/orig
Differential Revision:
[D84367515](https://our.internmc.facebook.com/intern/diff/D84367515/)
@diff-train-skip-merge

Co-authored-by: gasoonjia <[email protected]>
This pull request introduces changes to the CUDA workflow, model
artifact handling, and multimodal runner logic. The main changes include
restructuring the GitHub Actions workflow to separate model export,
benchmarking, and end-to-end testing for the Voxtral CUDA pipeline,
improving artifact management and reproducibility. Additionally, the
multimodal runner now supports automatic conversion of audio tensors to
bfloat16, ensuring compatibility with expected input types. There are
also enhancements to caching and symbol registration in the CUDA
backend, and build system updates to support linking the CUDA backend.

**Workflow and Artifact Management Improvements:**

* Refactored `.github/workflows/cuda.yml` to split the Voxtral CUDA
pipeline into three jobs: `export-voxtral-cuda-artifact` (exports and
stores model artifacts), `benchmark-voxtral-cuda` (benchmarks using
exported artifacts), and `test-voxtral-cuda-e2e` (runs full end-to-end
tests with artifact download and audio input). Improved artifact
handling, reproducibility, and added explicit checks for required files.
[[1]](diffhunk://#diff-29abea04e0613c2569973e5c8e3c89e04846d408c855eeb1f3efcfae7cfa6f89L90-R91)
[[2]](diffhunk://#diff-29abea04e0613c2569973e5c8e3c89e04846d408c855eeb1f3efcfae7cfa6f89R107)
[[3]](diffhunk://#diff-29abea04e0613c2569973e5c8e3c89e04846d408c855eeb1f3efcfae7cfa6f89R134-R185)
[[4]](diffhunk://#diff-29abea04e0613c2569973e5c8e3c89e04846d408c855eeb1f3efcfae7cfa6f89R196-R267)
[[5]](diffhunk://#diff-29abea04e0613c2569973e5c8e3c89e04846d408c855eeb1f3efcfae7cfa6f89R122)

**Multimodal Runner Logic:**

* Added automatic conversion of audio tensors to bfloat16 in
`MultimodalPrefiller::prefill` and implemented a helper function
`convert_to_bfloat16` in `util.h` to support this. This ensures that
audio inputs match the expected dtype for the encoder, improving
robustness for multimodal inference.
[[1]](diffhunk://#diff-ad4fcb32ffc5f1f7b4f87b5ee58927cb948a8c0976295befd10e3de445913ae4L96-R136)
[[2]](diffhunk://#diff-db4801445eaa3bb4f1370fe41d3a00ae2e3ef354a23ad4d5ace141ecc3c6f413R144-R180)
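
The conversion itself is simple at the bit level: bfloat16 is the high 16 bits of an IEEE float32 (same sign and 8-bit exponent, mantissa truncated to 7 bits). A hedged numpy sketch of that layout (the actual helper is the C++ `convert_to_bfloat16` in `util.h`; this version uses plain truncation rather than round-to-nearest-even, so it illustrates the format, not the exact rounding behavior):

```python
import numpy as np

def float32_to_bfloat16_bits(x: np.ndarray) -> np.ndarray:
    """Convert float32 values to bfloat16 bit patterns (uint16).

    bfloat16 is the high half of the float32 word, so dropping the low
    16 bits is the simplest conversion. Production code typically rounds
    to nearest even instead of truncating.
    """
    as_u32 = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (as_u32 >> 16).astype(np.uint16)
```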

**CUDA Backend and Caching Enhancements:**

* Improved caching logic in `common_shims.cpp` for tensor strides and
sizes by validating cached values and updating them when necessary. This
prevents stale cache issues and ensures correct tensor metadata.
[[1]](diffhunk://#diff-1e7c9d572d434c9a85c9d466e7f406877bc974a373c370fe7ddb3fe32852c1f2R54-R81)
[[2]](diffhunk://#diff-1e7c9d572d434c9a85c9d466e7f406877bc974a373c370fe7ddb3fe32852c1f2R104-R130)
* Added dynamic symbol re-registration in `CudaBackend` to handle
multiple shared objects in the same process, ensuring correct execution
when switching between models.
* Removed redundant logging statements in CUDA backend for cleaner
output.
[[1]](diffhunk://#diff-a4b17eccf1aa933837671c5184e02bc815d934a362344bb2b17b789cdfaa5375L226)
[[2]](diffhunk://#diff-a4b17eccf1aa933837671c5184e02bc815d934a362344bb2b17b789cdfaa5375L256)
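
The stale-cache fix in the first bullet can be sketched as: rather than trusting whatever was cached on first access, compare the cached metadata against the tensor's current values on every lookup and refresh the cached container in place when they diverge (in place matters because callers may hold references to it). A Python sketch of the pattern (the real code is C++ in `common_shims.cpp`; the class name here is made up):

```python
class TensorMetadataCache:
    """Cache per-tensor sizes, revalidating against the tensor on each lookup."""

    def __init__(self):
        self._sizes = {}  # id(tensor) -> list of dims handed out to callers

    def sizes(self, tensor):
        key = id(tensor)
        current = list(tensor.shape)
        cached = self._sizes.get(key)
        if cached is None:
            self._sizes[key] = current
            return current
        if cached != current:
            # Stale entry (e.g. the tensor was resized): refresh in place
            # so callers holding a reference see the updated metadata.
            cached[:] = current
        return cached
```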

**Build System Updates:**

* Updated `CMakeLists.txt` and `executorch-config.cmake` to include and
link the CUDA backend (`aoti_cuda`) when building Voxtral and other
components, improving build flexibility and CUDA support.
[[1]](diffhunk://#diff-606feb24310595f592d98d021a2c90618346977d94decb80b35b7e26ed8ccc1eR89-R95)
[[2]](diffhunk://#diff-6a78a155992483ff6f35d595ff6cef63b477d1c853f6482e77acae6ef443f0e4R56)

**Debugging and Tuning Options:**

* Added support for enabling debug compilation in `cuda_backend.py` via
the `DEBUG` environment variable, allowing easier troubleshooting and
development.
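
The `DEBUG` toggle follows the usual environment-variable switch pattern; a minimal sketch (the concrete flags chosen by `cuda_backend.py` are an assumption here, not taken from the source):

```python
import os

def compile_flags():
    """Select compilation flags based on the DEBUG environment variable.

    DEBUG=1 swaps optimized flags for debug-friendly ones; the exact flag
    names below are illustrative, not necessarily what cuda_backend.py uses.
    """
    if os.environ.get("DEBUG", "0") == "1":
        return ["-O0", "-g", "-lineinfo"]
    return ["-O3"]
```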
…ementwiseOps to the common section.

Differential Revision: D83793229

Pull Request resolved: pytorch#14780
Differential Revision: D84357937

Pull Request resolved: pytorch#14890
Differential Revision: D84187909

Pull Request resolved: pytorch#14958
### Summary
- refactor a bit & add more test cases


### Test plan
```bash
python backends/qualcomm/tests/test_qnn_delegate.py TestQNNQuantizedOperator.test_qnn_backend_index_put -b build-android -s $SN -m SM8750
python backends/qualcomm/tests/test_qnn_delegate.py TestQNNQuantizedOperator.test_qnn_backend_index_put_suite -b build-android -s $SN -m SM8750
```
Summary:

Update the TOSA, U55, and U85 tests to remove xfails; these ops are now supported, so the tests should no longer expect failure.

Differential Revision: D84262200
Differential Revision: D81703253

Pull Request resolved: pytorch#15011
Differential Revision: D84279595

Pull Request resolved: pytorch#14956
This PR was created by the merge bot to help merge the original PR into
the main branch.
ghstack PR number: pytorch#15016 by
@Gasoonjia
^ Please use this as the source of truth for the PR details, comments,
and reviews
ghstack PR base:
https://github.com/pytorch/executorch/tree/gh/gasoonjia/56/base
ghstack PR head:
https://github.com/pytorch/executorch/tree/gh/gasoonjia/56/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head:
https://github.com/pytorch/executorch/tree/gh/gasoonjia/56/orig
Differential Revision:
[D84280496](https://our.internmc.facebook.com/intern/diff/D84280496/)
@diff-train-skip-merge

Co-authored-by: gasoonjia <[email protected]>
…s._clone_dim_order.default (pytorch#14535)

### Summary
- Adds support for conversion and quantization of
`dim_order_ops._clone_dim_order.default` operator and fixes problems
with some variations of `nn.Dropout`.
- Adds more robust test cases for clone operators.

### Test plan
All changes should be covered by unit tests.

cc @robert-kalmar @JakeStevens @digantdesai
Summary: As stated in the title

Reviewed By: bingcy

Differential Revision: D83859440

---------

Co-authored-by: Jacob Szwejbka <[email protected]>
Updated link to Core ATen operator set documentation.

Summary: Wire up the unary sine operator in xnnpack for fp32 and fp16.

Differential Revision: D83623086
Summary: Fix up flags.

Differential Revision: D84296634
This PR was created by the merge bot to help merge the original PR into
the main branch.
ghstack PR number: pytorch#14666 by
@lucylq
^ Please use this as the source of truth for the PR details, comments,
and reviews
ghstack PR base:
https://github.com/pytorch/executorch/tree/gh/lucylq/114/base
ghstack PR head:
https://github.com/pytorch/executorch/tree/gh/lucylq/114/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head:
https://github.com/pytorch/executorch/tree/gh/lucylq/114/orig
Differential Revision:
[D83504588](https://our.internmc.facebook.com/intern/diff/D83504588/)
@diff-train-skip-merge

Co-authored-by: lucylq <[email protected]>
Summary: .

Differential Revision: D84516559
Summary: A TensorPtr view created from another TensorPtr should keep the
source TensorPtr alive, to match ATen behavior.

Differential Revision: D84512176
Differential Revision:
[D83777195](https://our.internmc.facebook.com/intern/diff/D83777195/)

[ghstack-poisoned]

pytorch#15066)

… Clamp/Clamp (pytorch#14415)"

This reverts commit a5d7e5c.

Broke internal builds. @SS-JIA is trying to fix this in
pytorch#15058; will leave relanding to him.
