Issues with deployment on RP2040 #7177

Open
AIWintermuteAI opened this issue Dec 4, 2024 · 9 comments
Labels: actionable (items in the backlog waiting for an appropriate impl/fix), module: build/install (issues related to the CMake and buck2 builds, and to installing ExecuTorch), partner: arm (for backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm), triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)


AIWintermuteAI commented Dec 4, 2024

🐛 Describe the bug

I attempted (with mixed success) to deploy the Arm examples to a regular MCU, without any special DSP or NN accelerator. I chose the RP2040 because its build system is centered around CMake, which made it easier to modify an existing example.
I uploaded my code to https://github.com/AIWintermuteAI/executorch/tree/port-to-rp2040; it should be easy enough to reproduce by following the instructions.

For convenience, I'm also copying the results and the issues encountered here.

Softmax builds and runs normally

cd examples/arm
./run.sh --build_only --scratch-dir=build-dir --model_name=softmax --aot_arm_compiler_flags=""

I [executorch:arm_executor_runner.cpp:325] BLINK
I [executorch:arm_executor_runner.cpp:325] BLINK
I [executorch:arm_executor_runner.cpp:325] BLINK
I [executorch:arm_executor_runner.cpp:386] Model in 200014B0
I [executorch:arm_executor_runner.cpp:388] Model PTE file loaded. Size: 960 bytes.
I [executorch:arm_executor_runner.cpp:398] Model buffer loaded, has 1 methods
I [executorch:arm_executor_runner.cpp:406] Running method forward
I [executorch:arm_executor_runner.cpp:417] Setup Method allocator pool. Size: 1024 bytes.
I [executorch:arm_executor_runner.cpp:434] Setting up planned buffer 0, size 32.
I [executorch:arm_executor_runner.cpp:467] Method loaded.
I [executorch:arm_executor_runner.cpp:469] Preparing inputs...
I [executorch:arm_executor_runner.cpp:483] Input prepared.
I [executorch:arm_executor_runner.cpp:485] Starting the model execution...
I [executorch:arm_executor_runner.cpp:492] model_pte_loaded_size:     960 bytes.
I [executorch:arm_executor_runner.cpp:506] method_allocator_used:     342 / 1024  free: 682 ( used: 33 % )
I [executorch:arm_executor_runner.cpp:513] method_allocator_planned:  32 bytes
I [executorch:arm_executor_runner.cpp:515] method_allocator_loaded:   290 bytes
I [executorch:arm_executor_runner.cpp:516] method_allocator_input:    20 bytes
I [executorch:arm_executor_runner.cpp:517] method_allocator_executor: 0 bytes
I [executorch:arm_executor_runner.cpp:520] temp_allocator_used:       0 / 1024 free: 1024 ( used: 0 % )
I [executorch:arm_executor_runner.cpp:536] Model executed successfully.
I [executorch:arm_executor_runner.cpp:540] 1 outputs:
Output[0][0]: 0.500000
Output[0][1]: 0.500000
Output[0][2]: 0.500000
Output[0][3]: 0.500000
I [executorch:arm_executor_runner.cpp:577] Program complete, exiting.
I [executorch:arm_executor_runner.cpp:581]

Linear and add hang at "Starting the model execution".

cd examples/arm
./run.sh --build_only --scratch-dir=build-dir --model_name=linear --aot_arm_compiler_flags=""

I [executorch:arm_executor_runner.cpp:325] BLINK
I [executorch:arm_executor_runner.cpp:325] BLINK
I [executorch:arm_executor_runner.cpp:325] BLINK
I [executorch:arm_executor_runner.cpp:325] BLINK
I [executorch:arm_executor_runner.cpp:325] BLINK
I [executorch:arm_executor_runner.cpp:386] Model in 200014B0 <
I [executorch:arm_executor_runner.cpp:388] Model PTE file loaded. Size: 1596 bytes.
I [executorch:arm_executor_runner.cpp:398] Model buffer loaded, has 1 methods
I [executorch:arm_executor_runner.cpp:406] Running method forward
I [executorch:arm_executor_runner.cpp:417] Setup Method allocator pool. Size: 1024 bytes.
I [executorch:arm_executor_runner.cpp:434] Setting up planned buffer 0, size 144.
I [executorch:arm_executor_runner.cpp:467] Method loaded.
I [executorch:arm_executor_runner.cpp:469] Preparing inputs...
I [executorch:arm_executor_runner.cpp:483] Input prepared.
I [executorch:arm_executor_runner.cpp:485] Starting the model execution...

Quantized MobileNetV2 (alpha 0.05, 96x96x3) requires allocation of 1.45 MB of RAM.

cd examples/arm
./run.sh --build_only --scratch-dir=build-dir --model_name=mv2_untrained --aot_arm_compiler_flags="--quantize"

I [executorch:arm_executor_runner.cpp:325] BLINK
I [executorch:arm_executor_runner.cpp:386] Model in 200014B0 <
I [executorch:arm_executor_runner.cpp:388] Model PTE file loaded. Size: 175008 bytes.
I [executorch:arm_executor_runner.cpp:398] Model buffer loaded, has 1 methods
I [executorch:arm_executor_runner.cpp:406] Running method forward
I [executorch:arm_executor_runner.cpp:417] Setup Method allocator pool. Size: 1024 bytes.
I [executorch:arm_executor_runner.cpp:434] Setting up planned buffer 0, size 1785600.
E [executorch:memory_allocator.h:88] Memory allocation failed: 1785600B requested (adjusted for alignment), 1024B available
E [executorch:memory_allocator.h:88] Memory allocation failed: 68208B requested (adjusted for alignment), 1024B available
I [executorch:arm_executor_runner.cpp:459] Loading of method forward failed with status 0x21
I [executorch:arm_executor_runner.cpp:467] Method loaded.
I [executorch:arm_executor_runner.cpp:469] Preparing inputs...
F [executorch:result.h:165] In function CheckOk(), assert failed: hasValue_
  1. I do not clearly understand why the linear and add models fail to run on the hardware, while softmax succeeds.
  2. Also, the 1.45 MB allocation for quantized MobileNetV2 (alpha 0.05, 96x96x3) seems excessive... Is that indeed a current limitation due to ExecuTorch engine overhead, or have I made a mistake?

Related issue:
#3585

Some work being done here (thanks for the support, @ChristophKarlHeck!)
https://github.com/ChristophKarlHeck/mbed-torch-fusion-os/tree/main

But I'm also only seeing the softmax example. @ChristophKarlHeck, were you able to make other models work on the M4?

CC @zingo, as I think you also worked on the Arm example?

Versions

executorch % python collect_env.py
Collecting environment information...
PyTorch version: 2.6.0.dev20241112
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.6.1 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.1.0.2.5)
CMake version: version 3.30.5
Libc version: N/A

Python version: 3.10.15 (main, Oct 3 2024, 02:24:49) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-14.6.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M1 Pro

Versions of relevant libraries:
[pip3] executorch==0.5.0a0+d243ffe
[pip3] numpy==1.21.3
[pip3] torch==2.6.0.dev20241112
[pip3] torchaudio==2.5.0.dev20241112
[pip3] torchsr==1.0.4
[pip3] torchvision==0.20.0.dev20241112
[conda] executorch 0.5.0a0+d243ffe pypi_0 pypi
[conda] numpy 1.21.3 pypi_0 pypi
[conda] torch 2.6.0.dev20241112 pypi_0 pypi
[conda] torchaudio 2.5.0.dev20241112 pypi_0 pypi
[conda] torchsr 1.0.4 pypi_0 pypi
[conda] torchvision 0.20.0.dev20241112 pypi_0 pypi

cc @larryliu0820 @lucylq @digantdesai @freddan80 @per @zingo @oscarandersson8218

@ChristophKarlHeck

@AIWintermuteAI
I ran the add model on a Cortex-M4, but sometimes I get a weird hard fault related to the combination with Mbed OS. I will keep you posted!


zingo commented Dec 4, 2024

Hi, without looking too much into your patch yet (I'll check more tomorrow when I get to my computer), here are some quick notes. Currently, Arm Baremetal lowers to an NPU only (e.g. Ethos-U55 or U85), so for anything else the standard ExecuTorch C++ versions of the ops have to be used. The plan is to change this over time, so your work is super valuable in getting this started.

The current setup's naming is a bit misleading, as the delegate lowers to TOSA and the TOSA output is then given to the Ethos-U Vela compiler. So the current "Arm Baremetal" is really more of an "Arm TOSA Baremetal", and we lack a proper Cortex-M "Arm Baremetal", as you discovered.

One fast way around this could be to report that we don't support any ops at all when asked: is_node_supported() in
backends/arm/operator_support/tosa_supported_operators.py
should probably always return false.

Then the CPU ops will be selected instead. They are not Cortex-M optimized, just pure generic versions for now, but they should hopefully work. This is a bit awkward, I know, but it allows you to get the rest of the build system and setup for free, as I see you have done in the patch.
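
The workaround above can be sketched as follows. This is only an illustrative Python sketch, not the real file: the actual hook is is_node_supported() in backends/arm/operator_support/tosa_supported_operators.py, and the class name here is made up.

```python
# Illustrative sketch only: the real hook is is_node_supported() in
# backends/arm/operator_support/tosa_supported_operators.py. Returning
# False for every node means nothing is delegated to the NPU, so every
# op falls back to the generic portable CPU kernels.

class ForceCpuOperatorSupport:
    """Hypothetical stand-in for the TOSA operator-support checker."""

    def is_node_supported(self, submodules, node) -> bool:
        # Claim no node is supported -> nothing is lowered to the NPU.
        return False


checker = ForceCpuOperatorSupport()
# Any node object would do; the decision ignores its contents entirely.
assert checker.is_node_supported({}, object()) is False
```

With a check like this in place, the partitioner delegates nothing and the whole graph runs on the portable CPU kernels.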

With run.sh you can pass a comma-separated list of the ops you need via --portable_kernels. If you try to run it, the serial logs will print the string name of any operator (e.g. atan) that you still need to add.

In the long run, some (u)int8 quantized operators for Cortex-M should be added, properly replacing the generic portable ops, so it's great that you are getting this started. One thing to figure out is whether it's best to make a special Cortex-M delegate, or whether a delegate is too heavyweight and something along the lines of a plain library, like the current portable ops, should be used. There are pros (a delegate can have state/a quantizer) and cons (it's bigger to get going, and may add more than needed).

  1. I do not clearly understand why the linear and add models fail to run on the hardware, while softmax succeeds.

I would guess it's because softmax is kind of just hardcoded as a CPU-only test in the examples/arm/aot_arm_compiler.py code run by the run.sh script, so that is probably why it works out of the box; linear, on the other hand, is supported by the NPU, so is_node_supported() tells ExecuTorch that the NPU can handle it when asked, and a CPU version is never tried.

  2. Also, the 1.45 MB allocation for quantized MobileNetV2 (alpha 0.05, 96x96x3) seems excessive... Is that indeed a current limitation due to ExecuTorch engine overhead, or have I made a mistake?

We see big sizes here too and have not investigated yet, so I don't think you made any mistake; there could be some simple issue here.

Thanks for starting this!


zingo commented Dec 5, 2024

Having gotten some time with your change, I see you already found --portable_kernels and set --aot_arm_compiler_flags="". That works to trick it into not doing any delegation to the NPU; neat! So forget most of the stuff I wrote yesterday evening, or see it as general info for other people looking into the problem.

If you add --aot_arm_compiler_flags="--debug" you will get a table of delegated/non-delegated ops, to double-check that no NPU code was added.

Running linear as you did, I see that it is all lowered to non-delegated ops, e.g. aten_addmm_default and aten_permute_copy_default; there are no NPU-delegated ops, so that code path should not kick in. It seems fine from that point of view, and I don't know what is causing the crash. I also see that this matches your list of portable_kernels, so that seems fine/correct.


AIWintermuteAI commented Dec 5, 2024

Thank you for the fast reply!
Yeah, then there is likely a memory issue or a hard fault at the model execution stage. Do I understand correctly that you are able to build the code from my branch, but have no hardware to run it on?


zingo commented Dec 5, 2024

I have only checked your branch in the web browser, and did some tests of my own on my own branch/code using the "default" Corstone-300 FVP simulator flow.
I'm in the middle of some other, unrelated work, but could not resist a quick check of your problem. :)


zingo commented Jan 20, 2025

Hi, sorry for the late check-in. I'm a bit interested in how it's going and whether you have found anything out?

@AIWintermuteAI

@zingo
Not really; I'm actually waiting for someone from ExecuTorch to take a look :)
I could continue digging, but I have other tasks at the moment, so I just moved it to Blocked. If nothing happens, I can probably give it another try in a few months; maybe the issue will be solved by then.

@jackzhxng

cc @digantdesai

@digantdesai

I do not clearly understand why the linear and add models fail to run on the hardware, while softmax succeeds.

I ran the add model on a Cortex-M4, but sometimes I get a weird hard fault related to the combination with Mbed OS.

Thanks @zingo for reproducing this. I suspect softmax is in the CI and tested e2e with M4 + FVP, so it is expected to run OK. For linear, did you check the size of the portable op lib linked with the ExecuTorch runtime? I suspect it might be bringing in logic for all the dtypes, when you might need to run addmm with just one dtype. This is something I can reproduce on my end as well.

Also the 1.45 Mb allocation for quantized MobileNetv2 alpha 0.05 96x96x3 seems excessive... Is that indeed current limitation due to executorch engine overhead or have I made a mistake?

I am assuming you are referring to the 1.45 MiB PTE size? That seems reasonable for int8. Or are you talking about the runtime planned memory size, which you should be able to see in the runner logs when running?

If the latter, I am not surprised, given that your repo might be a bit old; we recently landed some improvements in the memory planner, so try with #7926. That said, I admit it can be improved; for example, the Cadence backend has a different memory planner, arguably better suited for small systems.
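
For what it's worth, a back-of-envelope estimate of the live int8 activations suggests the 1,785,600-byte planned buffer from the log above is far larger than the feature maps alone would need. This is only a rough sketch: the channel counts assume MobileNetV2's usual width-multiplier rounding (up to a multiple of 8) and are illustrative, not taken from the actual exported model.

```python
# Back-of-envelope check (assumptions: int8 activations, MobileNetV2
# width multiplier 0.05 with channel counts rounded up to a multiple
# of 8, 96x96x3 input). Numbers are illustrative, not from the model.

def tensor_bytes(h: int, w: int, c: int, dtype_bytes: int = 1) -> int:
    return h * w * c * dtype_bytes

# 3x3 stride-2 stem: 32 * 0.05 = 1.6 channels, rounded up to 8.
stem_out = tensor_bytes(48, 48, 8)      # 18,432 bytes

# An expanded bottleneck at the same resolution (expansion factor 6
# on 8 channels -> 48 channels) is among the largest activations.
expanded = tensor_bytes(48, 48, 48)     # 110,592 bytes

# Even two such buffers alive at once (ping-pong style) stay far
# below the 1,785,600-byte planned buffer reported in the log.
peak_estimate = stem_out + expanded
print(peak_estimate)                    # 129024
assert peak_estimate < 1_785_600
```

So the planned buffer looks roughly an order of magnitude larger than a naive activation estimate, which is consistent with the memory planner having room for improvement on small systems.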
