kirklandsign and others added 30 commits May 7, 2025 18:24
Differential Revision: D74365586

Pull Request resolved: pytorch#10765
Differential Revision: D74117402

Pull Request resolved: pytorch#10697
Notably, the pinned prelude version includes
facebook/buck2-prelude@958af4f. Also, we're able to simplify our Buck
versioning logic now that Buck has consistent versions across platforms
(facebook/buck2#828 (comment)).
Differential Revision: D74369346

Pull Request resolved: pytorch#10764
…10771)

## Context

When quantizing models with the PT2E quantization flow, quantize/dequantize nodes will be inserted into the graph. However, these quantize/dequantize nodes must be fused with operators such as `aten.linear.default` to produce nodes corresponding to quantized operators (e.g. `weight_int8pack_mm`) in order for quantized operator implementations to be called at runtime.

Currently, the op fusion is done by the `fuse_dequant_linear.py` pass; however, it only handles one specific fusion pattern to generate a `weight_int8pack_mm` operator. As more quantized operators are supported in ET-VK via the PT2E quantization flow, a more generic fusion pass is needed that can handle a variety of fusion patterns.
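
For illustration, the sketch below shows the kind of pattern matching such a fusion pass performs: find an `aten.linear` node whose weight argument is produced by a dequantize node and swap the pair for a fused quantized op. This is not the actual `FuseQuantizedOpsTransform` implementation; the fused target, the argument mapping, and the omission of bias and shape handling are simplifying assumptions.

```python
import torch

def fuse_dequant_linear_sketch(gm: torch.fx.GraphModule) -> bool:
    """Replace (dequantize -> aten.linear) pairs with a fused quantized op."""
    modified = False
    for node in list(gm.graph.nodes):
        if node.op != "call_function" or node.target != torch.ops.aten.linear.default:
            continue
        weight = node.args[1]
        # Only fuse when the weight is produced by a dequantize node.
        if not isinstance(weight, torch.fx.Node) or "dequantize" not in str(weight.target):
            continue
        with gm.graph.inserting_before(node):
            fused = gm.graph.call_function(
                torch.ops.aten._weight_int8pack_mm.default,  # illustrative fused target
                args=(node.args[0], weight.args[0], weight.args[1]),  # (input, int8 weight, scales)
            )
        node.replace_all_uses_with(fused)
        gm.graph.erase_node(node)
        modified = True
    if modified:
        gm.graph.eliminate_dead_code()
        gm.recompile()
    return modified
```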

## Changes

Introduce the `FuseQuantizedOpsTransform()` pass. I elected to introduce a new pass under the `backends/vulkan/_passes` directory, as opposed to modifying the existing pass, because I anticipate that the majority of the fusion patterns will be specific to ET-VK.

Remove the existing `FuseDequantLinearPass()` and switch to using the `FuseQuantizedOpsTransform` pass in its place.

Add a `test_vulkan_passes` Python test to cover the export passes.

Some small refactors to the `test_vulkan_delegate` Python test to improve code organization.

Differential Revision: [D73794042](https://our.internmc.facebook.com/intern/diff/D73794042/)
## Context

Title says it all!

## Changes

Extended the implementation of `linear_qcsnw` to support packed 4-bit weight tensors.
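
As a rough illustration of what "packed 4-bit weights" means (a sketch of the general idea, not the `linear_qcsnw` packing code; the helper names are hypothetical), two signed 4-bit values are stored per byte:

```python
import torch

def pack_int4(weights: torch.Tensor) -> torch.Tensor:
    """weights: int8 tensor with values in [-8, 7]; last dim must be even."""
    assert weights.shape[-1] % 2 == 0
    nibbles = (weights.to(torch.int32) & 0x0F).to(torch.uint8)  # 4-bit two's complement
    nibbles = nibbles.reshape(*weights.shape[:-1], -1, 2)
    return (nibbles[..., 0] | (nibbles[..., 1] << 4)).contiguous()

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    lo = (packed & 0x0F).to(torch.int8)
    hi = ((packed >> 4) & 0x0F).to(torch.int8)
    # Sign-extend the 4-bit values back to int8.
    lo = torch.where(lo > 7, lo - 16, lo)
    hi = torch.where(hi > 7, hi - 16, hi)
    return torch.stack([lo, hi], dim=-1).flatten(start_dim=-2)
```

Round-tripping an int8 tensor with values in [-8, 7] through `pack_int4` and `unpack_int4` returns the original values.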

Differential Revision: [D73941991](https://our.internmc.facebook.com/intern/diff/D73941991/)
This way, Llama variants other than stories110m can be run.
### Summary
Refactors unit tests to allow testing of TOSA 1.0.
Adds a command-line argument `--arm_run_tosa_version` to run tests against a
particular version.
### Summary

Instead of manually printing all the options in
`tools/cmake/Utils.cmake`, let's just "automatically" print all the
configured options.

### Test plan

```
$ ./scripts/build_apple_frameworks.sh --Debug

-- --- Configurated Options ---

-- EXECUTORCH_ENABLE_LOGGING : ON
-- ---------------------------

```

```
$ ./scripts/build_apple_frameworks.sh --Release

-- --- Configurated Options ---

-- EXECUTORCH_ENABLE_LOGGING : OFF
-- ---------------------------

```


cc @larryliu0820
…orch#10774)

Refactor assertion statements to raise ValueErrors for better error
handling in permutation matrix and vector transformations. Ensure that
conditions are checked and appropriate exceptions are raised to enhance
code robustness and readability.
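
An illustrative before/after of the kind of change described, using a hypothetical helper (the actual function names in the Arm backend differ):

```python
# Before: failures surface as bare AssertionErrors.
def permute_vector(vec, perm):
    assert len(vec) == len(perm), "Permutation length mismatch"
    return [vec[i] for i in perm]

# After: callers get a descriptive, catchable ValueError instead.
def permute_vector(vec, perm):  # redefined here only for comparison
    if len(vec) != len(perm):
        raise ValueError(
            f"Permutation length {len(perm)} does not match vector length {len(vec)}"
        )
    return [vec[i] for i in perm]
```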

Signed-off-by: Sebastian Larsson <[email protected]>
Summary: Minor change to reserve size for VkWriteDescriptorSet and
VkDescriptorSetLayoutBinding vectors.

Differential Revision: D74335276
### Summary

In this diff we create a helper that allows presets to set options.
Again, this is mostly a helper that checks whether the option has already
been defined and, if so, no-ops.

To test it, I also create the first preset `macos-arm64`. I will test it
in upcoming diffs.

### Test plan

pytest for now, manual test in future diffs


cc @larryliu0820
### Summary
This change converts the unit test from Java to Kotlin.

### Test plan
./gradlew :executorch_android:testDebugUnitTest

---------

Co-authored-by: Haiting Pu <[email protected]>
### Summary

* Create the base for a macos-arm64 preset — bigger migration in future
diffs
* Create an Apple CI job to test builds

### Test plan

CI +

```
$ cmake --preset macos-arm64

-- Loading build preset: /Users/jathu/executorch/tools/cmake/preset/macos-arm64.cmake
-- --- Configurated Options ---

-- EXECUTORCH_BUILD_PRESET_FILE : /Users/jathu/executorch/tools/cmake/preset/macos-arm64.cmake
-- EXECUTORCH_ENABLE_LOGGING    : ON
-- EXECUTORCH_BUILD_COREML      : ON
-- ---------------------------

$ cmake --build cmake-out --parallel
```

cc @larryliu0820
Differential Revision: D73440517

Pull Request resolved: pytorch#10493
Differential Revision: D74349918

Pull Request resolved: pytorch#10760
Differential Revision: D74350331

Pull Request resolved: pytorch#10762
…nv op instead of cpu op for shapes not supported by the TIE kernel.

Differential Revision: D74337713

Pull Request resolved: pytorch#10770
Differential Revision: D74420616

Pull Request resolved: pytorch#10778
Differential Revision: D74041198

Pull Request resolved: pytorch#10660
Differential Revision: D74447383

Pull Request resolved: pytorch#10780
…10783)

Don't try to print with colors in the pre-push script if the script is
non-interactive. This avoids broken output in CI, which doesn't support
colors.
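
The pre-push script itself is shell, but the underlying check is the standard one: only emit ANSI color codes when stdout is an interactive terminal. A minimal Python sketch of the same idea:

```python
import sys

def use_color() -> bool:
    # Colorize only when stdout is an interactive terminal; CI pipes are not.
    return sys.stdout.isatty()

GREEN = "\033[32m" if use_color() else ""
RESET = "\033[0m" if use_color() else ""
print(f"{GREEN}pre-push checks passed{RESET}")
```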

Signed-off-by: [email protected]
bloaty told me that we were paying a noticeable size cost for the
::value members of these structs (at least after the PR in this stack
that reapplies pytorch#9841) and now we're not.

Test Plan: bash test/build_optimized_size_test.sh

```
before:
adopt functionref
==========
ExecuTorch with no ops binary size, unstripped:
-rwxr-xr-x  1 swolchok  staff  153928 Apr 25 11:08 cmake-out/test/size_test
ExecuTorch with portable ops binary size, unstripped:
-rwxr-xr-x  1 swolchok  staff  2150960 Apr 25 11:08 cmake-out/test/size_test_all_ops
ExecuTorch with optimized ops binary size, unstripped:
-rwxr-xr-x  1 swolchok  staff  5927336 Apr 25 11:08 cmake-out/test/size_test_all_optimized_ops
(.venv) swolchok@swolchok-mac ~/src/executorch> size cmake-out/test/size_test*
__TEXT	__DATA	__OBJC	others	dec	hex
81920	81920	0	4295049216	4295213056	10003c000	cmake-out/test/size_test
1474560	81920	0	4295655424	4297211904	100224000	cmake-out/test/size_test_all_ops
4505600	98304	0	4296376320	4300980224	1005bc000	cmake-out/test/size_test_all_optimized_ops

after:
ExecuTorch with no ops binary size, unstripped:
-rwxr-xr-x  1 swolchok  staff  153928 Apr 25 12:24 cmake-out/test/size_test
ExecuTorch with portable ops binary size, unstripped:
-rwxr-xr-x  1 swolchok  staff  2150960 Apr 25 12:24 cmake-out/test/size_test_all_ops
ExecuTorch with optimized ops binary size, unstripped:
-rwxr-xr-x  1 swolchok  staff  5887368 Apr 25 12:24 cmake-out/test/size_test_all_optimized_ops
(.venv) swolchok@swolchok-mac ~/src/executorch> size cmake-out/test/size_test*
__TEXT	__DATA	__OBJC	others	dec	hex
81920	81920	0	4295049216	4295213056	10003c000	cmake-out/test/size_test
1474560	81920	0	4295655424	4297211904	100224000	cmake-out/test/size_test_all_ops
4489216	98304	0	4296359936	4300947456	1005b4000	cmake-out/test/size_test_all_optimized_ops
```

(yes it's neutral; improves size results for further diffs)
…ve build is not in use (pytorch#10490)

We duplicate a lot of functions depending on the operator name so that
dtype selective build will work. We can just detect if dtype selective
build is in use and, if not, stop duplicating.

Test Plan: compared results of bash test/build_optimized_size_test.sh
before/after this rev.

Before:
```
ExecuTorch with no ops binary size, unstripped:
-rwxr-xr-x  1 swolchok  staff  153928 Apr 25 12:24 cmake-out/test/size_test
ExecuTorch with portable ops binary size, unstripped:
-rwxr-xr-x  1 swolchok  staff  2150960 Apr 25 12:24 cmake-out/test/size_test_all_ops
ExecuTorch with optimized ops binary size, unstripped:
-rwxr-xr-x  1 swolchok  staff  5887368 Apr 25 12:24 cmake-out/test/size_test_all_optimized_ops
(.venv) swolchok@swolchok-mac ~/src/executorch> size cmake-out/test/size_test*
__TEXT	__DATA	__OBJC	others	dec	hex
81920	81920	0	4295049216	4295213056	10003c000	cmake-out/test/size_test
1474560	81920	0	4295655424	4297211904	100224000	cmake-out/test/size_test_all_ops
4489216	98304	0	4296359936	4300947456	1005b4000	cmake-out/test/size_test_all_optimized_ops
```

After:
```
ExecuTorch with no ops binary size, unstripped:
-rwxr-xr-x  1 swolchok  staff  153928 Apr 25 12:51 cmake-out/test/size_test
ExecuTorch with portable ops binary size, unstripped:
-rwxr-xr-x  1 swolchok  staff  1796928 Apr 25 12:51 cmake-out/test/size_test_all_ops
ExecuTorch with optimized ops binary size, unstripped:
-rwxr-xr-x  1 swolchok  staff  5605176 Apr 25 12:51 cmake-out/test/size_test_all_optimized_ops
(.venv) swolchok@swolchok-mac ~/src/executorch> size cmake-out/test/size_test*
__TEXT	__DATA	__OBJC	others	dec	hex
81920	81920	0	4295049216	4295213056	10003c000	cmake-out/test/size_test
1310720	81920	0	4295458816	4296851456	1001cc000	cmake-out/test/size_test_all_ops
4358144	98304	0	4296212480	4300668928	100570000	cmake-out/test/size_test_all_optimized_ops
```

(This was reverted because the diff it was stacked on was a size
regression. This time around the order is reversed, and the part of the
change that was actually regressing size has been reverted.)
…s with out_dtypes in template arguments (pytorch#10491)

This is necessary to take advantage of pytorch#9388, which
creates dtype-specialized implementations for the non-mixed dtype case.

Measured the size cost of this approach with
test/build_optimized_size_test.sh. It does cost us some size:

```
Before:
ExecuTorch with no ops binary size, unstripped:
-rwxr-xr-x  1 swolchok  staff  153928 Apr 25 12:51 cmake-out/test/size_test
ExecuTorch with portable ops binary size, unstripped:
-rwxr-xr-x  1 swolchok  staff  1796928 Apr 25 12:51 cmake-out/test/size_test_all_ops
ExecuTorch with optimized ops binary size, unstripped:
-rwxr-xr-x  1 swolchok  staff  5605176 Apr 25 12:51 cmake-out/test/size_test_all_optimized_ops
(.venv) swolchok@swolchok-mac ~/src/executorch> size cmake-out/test/size_test*
__TEXT	__DATA	__OBJC	others	dec	hex
81920	81920	0	4295049216	4295213056	10003c000	cmake-out/test/size_test
1310720	81920	0	4295458816	4296851456	1001cc000	cmake-out/test/size_test_all_ops
4358144	98304	0	4296212480	4300668928	100570000	cmake-out/test/size_test_all_optimized_ops

After:
ExecuTorch with no ops binary size, unstripped:
-rwxr-xr-x  1 swolchok  staff  153928 Apr 25 12:57 cmake-out/test/size_test
ExecuTorch with portable ops binary size, unstripped:
-rwxr-xr-x  1 swolchok  staff  1889792 Apr 25 12:57 cmake-out/test/size_test_all_ops
ExecuTorch with optimized ops binary size, unstripped:
-rwxr-xr-x  1 swolchok  staff  5799704 Apr 25 12:57 cmake-out/test/size_test_all_optimized_ops
(.venv) swolchok@swolchok-mac ~/src/executorch> size cmake-out/test/size_test*
__TEXT	__DATA	__OBJC	others	dec	hex
81920	81920	0	4295049216	4295213056	10003c000	cmake-out/test/size_test
1376256	81920	0	4295491584	4296949760	1001e4000	cmake-out/test/size_test_all_ops
4423680	98304	0	4296327168	4300849152	10059c000	cmake-out/test/size_test_all_optimized_ops
```

However, on an absolute basis, size is still below where we were two
PRs ago, which was:

```
ExecuTorch with no ops binary size, unstripped:
-rwxr-xr-x  1 swolchok  staff  153928 Apr 25 12:24 cmake-out/test/size_test
ExecuTorch with portable ops binary size, unstripped:
-rwxr-xr-x  1 swolchok  staff  2150960 Apr 25 12:24 cmake-out/test/size_test_all_ops
ExecuTorch with optimized ops binary size, unstripped:
-rwxr-xr-x  1 swolchok  staff  5887368 Apr 25 12:24 cmake-out/test/size_test_all_optimized_ops
(.venv) swolchok@swolchok-mac ~/src/executorch> size cmake-out/test/size_test*
__TEXT	__DATA	__OBJC	others	dec	hex
81920	81920	0	4295049216	4295213056	10003c000	cmake-out/test/size_test
1474560	81920	0	4295655424	4297211904	100224000	cmake-out/test/size_test_all_ops
4489216	98304	0	4296359936	4300947456	1005b4000	cmake-out/test/size_test_all_optimized_ops
```
Differential Revision: D74495058

Pull Request resolved: pytorch#10793
Differential Revision: D74226258

Pull Request resolved: pytorch#10708
anzr299 and others added 22 commits May 19, 2025 22:43
Differential Revision: D74833331

Pull Request resolved: pytorch#10921
### Summary
Adds input size validation to `Module.execute` to prevent possible
silent memory corruption when too many EValue inputs are passed.

Fixes pytorch#10510 

### Test plan
- Added unit test `TestExecuteWithTooManyInputs`
- Verified by successfully running all `module_test.cpp` tests, except
`TestPTD` (did not have access to `ModuleLinear.ptd`)
- To run locally:
  - Bypass the `is_fbcode` guard in `targets.bzl` and redirect test file paths
    to use a locally exported `ModuleAdd.pte` file
  - Build and run tests via:

    ```
    buck2 build //extension/module/test:test
    buck2 run //extension/module/test:test
    ```

---------

Co-authored-by: Anthony Shoumikhin <[email protected]>
Differential Revision: D75006941

Pull Request resolved: pytorch#10974
Differential Revision: D74967760

Pull Request resolved: pytorch#10962
…Ethos-U85 (pytorch#10973)

Temporary solution to the problem in
pytorch#10958. The arm_executor_runner.cpp needs to declare the
ethosu_fast_scratch array and pass it on to EthosUBackend.cpp. It is
important that for Shared_Sram, ethosu_fast_scratch is nullptr, and for
Dedicated_Sram it points to the fast memory array.
Summary:
## Context

Fix the third-party `CMakeLists.txt` to allow `flatcc` to build for Windows.
Some CMake configuration settings need to be adjusted for Windows
platforms.

Test Plan:
## Test Plan

```
python install_executorch.py
```
### Summary
- Use `fold_quantize=False` in `convert_pt2e` to prevent overwriting the
state_dict during lowering (see the sketch below)
- Change `_get_updated_graph_signature` so that the signature is detected
correctly
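
For context, here is a minimal sketch of where the flag applies in the PT2E flow. The toy model, the XNNPACKQuantizer stand-in, and the export entry point are illustrative assumptions (the exact export API varies across recent PyTorch releases):

```python
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

model = torch.nn.Sequential(torch.nn.Linear(8, 8)).eval()
example_inputs = (torch.randn(1, 8),)

exported = torch.export.export_for_training(model, example_inputs).module()
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
prepared = prepare_pt2e(exported, quantizer)
prepared(*example_inputs)  # calibrate
# fold_quantize=False keeps quantize ops on the weights in the graph instead of
# folding them into frozen quantized weight tensors, so the original state_dict
# is not overwritten during lowering.
converted = convert_pt2e(prepared, fold_quantize=False)
```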
Differential Revision: D75024936

Pull Request resolved: pytorch#10889
Pull Request resolved: pytorch#10877

So we can use them in codegen.bzl later (can't pull in definitions from targets.bzl files).
ghstack-source-id: 284862879

Differential Revision: [D74741846](https://our.internmc.facebook.com/intern/diff/D74741846/)
Differential Revision: D74865527

Pull Request resolved: pytorch#10938
Pull Request resolved: pytorch#10878

Add dtype selective build for optimized ops. Follows the same process as portable, where we copy the source files and rebuild the library.

1. Generalize copy genrule for portable/optimized/source/header.
2. Copy optimized source files + headers.
3. Build optimized ops using source files, dependencies, portable header.
4. Add a test to confirm that we can run addmul with float dtypes (when we remove them, the test fails).
ghstack-source-id: 284862896
@exported-using-ghexport

Differential Revision: [D74688554](https://our.internmc.facebook.com/intern/diff/D74688554/)
Makes it possible to annotate patterns with more than two operators.
This allows us to annotate the patterns conv -> bn and conv -> bn -> relu,
so that BN can be folded away after training in QAT. Also adds support for
QAT in the Tester class.
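
As a loose illustration of what multi-operator pattern matching involves (not the Arm quantizer's code; the aten targets are assumptions about how the pattern appears after export), one can walk the FX graph for conv -> bn -> optional relu chains and treat each chain as a single unit to annotate:

```python
import torch

def find_conv_bn_relu(gm: torch.fx.GraphModule):
    """Collect [conv, bn] and [conv, bn, relu] chains for joint annotation."""
    chains = []
    for conv in gm.graph.nodes:
        if conv.op != "call_function" or conv.target != torch.ops.aten.conv2d.default:
            continue
        users = list(conv.users)
        if len(users) != 1 or users[0].target != torch.ops.aten.batch_norm.default:
            continue
        chain = [conv, users[0]]
        bn_users = list(users[0].users)
        # relu is optional: both conv -> bn and conv -> bn -> relu match.
        if len(bn_users) == 1 and bn_users[0].target == torch.ops.aten.relu.default:
            chain.append(bn_users[0])
        chains.append(chain)
    return chains
```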

Signed-off-by: Oscar Andersson <[email protected]>
### Summary
Update model unit tests to use the new test infrastructure pipeline.
… stride (pytorch#10972)

* AvgPool2dVisitor will adjust the padding so the pooling window is
divisible by the stride
* Improve tests in test_max_pool.py

Signed-off-by: Tom Allsop <[email protected]>
- Removes duplicated matmul tests.
- Replaces pytest.mark_flaky with qtol for quantized test cases of
mm/bmm.

Signed-off-by: Oscar Andersson <[email protected]>
ortExport llama executorch
cavusmustafa pushed a commit that referenced this pull request Jun 20, 2025
Differential Revision: D75104487

Pull Request resolved: pytorch#11021
cavusmustafa pushed a commit that referenced this pull request Jun 20, 2025
Differential Revision: D75718888

Pull Request resolved: pytorch#11444
cavusmustafa pushed a commit that referenced this pull request Jun 20, 2025
Differential Revision: D76157744

Pull Request resolved: pytorch#11501
cavusmustafa pushed a commit that referenced this pull request Aug 19, 2025
BNNS copy crashes the process when the dtypes differ
(pytorch#11714).

With the example in this PR
(pytorch#11714), we crash the
process on main. Here is the stack trace from LLDB:

```
Process 19234 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x0000000190ac9388 libsystem_kernel.dylib`__pthread_kill + 8
libsystem_kernel.dylib`__pthread_kill:
->  0x190ac9388 <+8>:  b.lo   0x190ac93a8    ; <+40>
    0x190ac938c <+12>: pacibsp 
    0x190ac9390 <+16>: stp    x29, x30, [sp, #-0x10]!
    0x190ac9394 <+20>: mov    x29, sp
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x0000000190ac9388 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x0000000190b0288c libsystem_pthread.dylib`pthread_kill + 296
    frame #2: 0x0000000190a0bc60 libsystem_c.dylib`abort + 124
    frame #3: 0x0000000190910174 libsystem_malloc.dylib`malloc_vreport + 892
    frame #4: 0x0000000190913c90 libsystem_malloc.dylib`malloc_report + 64
    frame #5: 0x000000019091821c libsystem_malloc.dylib`___BUG_IN_CLIENT_OF_LIBMALLOC_POINTER_BEING_FREED_WAS_NOT_ALLOCATED + 32
    frame #6: 0x000000019d2f4084 libBNNS.dylib`___lldb_unnamed_symbol1620 + 564
    frame #7: 0x000000019d2f5bac libBNNS.dylib`___lldb_unnamed_symbol1628 + 680
    frame #8: 0x000000019d69ce48 libBNNS.dylib`BNNSCopy + 616
    frame #9: 0x000000030c74d950 _portable_lib.cpython-310-darwin.so`(anonymous namespace)::copy_using_bnns(executorchcoreml::MultiArray const&, executorchcoreml::MultiArray&) + 188
    frame #10: 0x000000030c74cfdc _portable_lib.cpython-310-darwin.so`(anonymous namespace)::copy(executorchcoreml::MultiArray const&, executorchcoreml::MultiArray&, executorchcoreml::MultiArray::CopyOptions) + 72
    frame #11: 0x000000030c74ceec _portable_lib.cpython-310-darwin.so`executorchcoreml::MultiArray::copy(executorchcoreml::MultiArray&, executorchcoreml::MultiArray::CopyOptions) const + 148
    frame #12: 0x000000030c7488d4 _portable_lib.cpython-310-darwin.so`invocation function for block in (anonymous namespace)::copy(MLMultiArray*, executorchcoreml::MultiArray&) + 376
    frame #13: 0x000000030c748ac8 _portable_lib.cpython-310-darwin.so`invocation function for block in (anonymous namespace)::copy(MLMultiArray*, executorchcoreml::MultiArray&) + 52
    frame #14: 0x000000019ad33f4c CoreML`CoreML::MultiArrayBuffer::getBytesWithHandler(void (void const*, unsigned long) block_pointer) const + 340
    frame #15: 0x000000019ad34138 CoreML`-[MLMultiArray(ScopedBufferAccess) getBytesWithHandler:] + 152
    frame #16: 0x000000030c7485ec _portable_lib.cpython-310-darwin.so`(anonymous namespace)::copy(MLMultiArray*, executorchcoreml::MultiArray&) + 296
    frame #17: 0x000000030c744f68 _portable_lib.cpython-310-darwin.so`(anonymous namespace)::set_outputs(std::__1::vector<executorchcoreml::MultiArray, std::__1::allocator<executorchcoreml::MultiArray>>&, NSArray<MLMultiArray*>*) + 180
```


With this PR, the process succeeds.