Issues with deployment on RP2040 #7177
Comments
@AIWintermuteAI
Hi, without looking too much into your patch yet (I will check more tomorrow when I get to my computer), here are some quick notes.

Currently "Arm Baremetal" lowers to an NPU only (e.g. Ethos-U55 or U85), so for everything else the standard ExecuTorch C++ versions of the ops have to be used. The plan is to change this over time, so your work is super valuable for getting this started. The current naming is a bit misleading: the delegate lowers to TOSA, and the TOSA output is then given to the Ethos-U Vela compiler. So the current "Arm Baremetal" is right now more of an "Arm TOSA Baremetal", and we lack a proper Cortex-M "Arm Baremetal", as you discovered.

One fast way around this could be to say that we don't support any ops at all, i.e. the support check should probably always return false when asked. Then the CPU ops will be selected instead. They are not Cortex-M optimized, just pure generic versions for now, but they should hopefully work. This is a bit awkward, I know, but it lets you get the rest of the build system and setup for free, as I see you have done in the patch. With run.sh you can pass a comma-separated list of the ops you need via --portable_kernels; if you run without one of them, you will get serial logs from the system printing the "aten" string name of the operator that you need to add.

In the long run, some (u)int8 quantized operators for Cortex-M should be added in a good way, replacing the generic portable ops, so it's great that you are getting this going. One thing to figure out is whether it's best to make a special Cortex-M delegate, or whether a delegate is too heavy and something along the lines of a plain lib, like the current portable ops, should be used. There are pros (a delegate can have state/a quantizer) and cons (it's bigger to get going and may add more stuff than needed).
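To make the "always return false" idea concrete, here is a minimal sketch assuming the support check follows torch.fx's OperatorSupportBase convention; the actual class name and location in the Arm backend may differ:

```python
from torch.fx.passes.operator_support import OperatorSupportBase


class RejectAllOperatorSupport(OperatorSupportBase):
    """Hypothetical support check that claims no op is NPU-supported,
    so the partitioner delegates nothing and every operator falls back
    to the generic CPU (portable) kernels."""

    def is_node_supported(self, submodules, node) -> bool:
        return False
```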
I would guess it's because Softmax is kind of just hardcoded in the examples/arm/aot_arm_compiler.py code run by the run.sh script as a CPU-only test, so that is probably why it works out of the box. Linear is supported by the NPU, so a CPU version is never tried, since is_node_supported() tells ExecuTorch that the NPU can handle it when asked.
We see big sizes here as well and have not investigated it yet, so I don't think you made any mistake; there could be some simple issue here. Thanks for starting this!
Having spent some time with your change, I see you already found --portable_kernels and set --aot_arm_compiler_flags="". That works to trick it into not doing any delegation to the NPU, which was neat! So forget most of the stuff I wrote yesterday evening, or see it as general info for other people looking into the problem.

If you add --aot_arm_compiler_flags="--debug" you will get a table of delegated/non-delegated ops, so you can double-check that no NPU code was added. Running it with linear as you did, I see that everything lowered to non-delegated ops, e.g. aten_addmm_default and aten_permute_copy_default, and there are no NPU-delegated ops, so that code should not kick in. It seems fine from that point of view, and I don't know what is causing it to crash. I also see that this matches your list for --portable_kernels, so that seems fine/correct.
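For anyone else following along, skipping NPU delegation at export time boils down to never calling to_backend() on the edge program. A rough sketch, assuming the exir API of this ExecuTorch version (the model and shapes are made up):

```python
import torch
from executorch.exir import to_edge


class TinyLinear(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)


model = TinyLinear().eval()
example_inputs = (torch.randn(1, 4),)

# Export to the edge dialect; since to_backend() is never called,
# no ops are delegated and everything stays on the portable kernels.
edge = to_edge(torch.export.export(model, example_inputs))
prog = edge.to_executorch()

with open("linear.pte", "wb") as f:
    f.write(prog.buffer)
```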
Thank you for the fast reply!
I have only checked your branch in the web browser and did some tests of my own on my own branch/code using the "default" Corstone-300 FVP simulator flow.
Hi, sorry for the late check-in. I'm curious how it's going and whether you found anything out?
@zingo |
cc @digantdesai |
Thanks @zingo for reproducing this. I suspect Softmax is in the CI and tested e2e with M4 + FVP, so it is expected to run OK. For Linear, did you check the size of the portable op lib linked with the ExecuTorch runtime? I suspect it might be bringing in logic for all dtypes when you only need to run addmm with one dtype. This is something I can reproduce on my end as well.
I am assuming you are referring to the 1.45 MiB PTE size? That seems reasonable for int8. Or are you talking about the runtime planned memory size, which you should be able to get from here when running? If the latter, I am not surprised, given your repo might be a bit old; we recently landed some improvements in the memory planner, so try with #7926. That said, I admit it can be improved; for example, the Cadence backend has a different memory planner, arguably more suited to small systems.
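If the planned memory size is the issue, the memory planning pass can also be swapped when calling to_executorch(). A sketch, assuming the ExecutorchBackendConfig/MemoryPlanningPass API of this ExecuTorch version (argument forms may differ between releases); `edge` is an edge program as produced by to_edge():

```python
from executorch.exir import ExecutorchBackendConfig
from executorch.exir.passes import MemoryPlanningPass

# "greedy" is the stock planner; a backend such as Cadence can plug in
# its own memory planning pass here instead.
prog = edge.to_executorch(
    ExecutorchBackendConfig(memory_planning_pass=MemoryPlanningPass("greedy"))
)
```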
🐛 Describe the bug
I attempted (with mixed success) to deploy the ARM examples to a regular MCU without any special DSP or NN accelerator. I chose the RP2040 because its build system is centered around CMake, so it was easier to modify the existing example.
I uploaded my code to https://github.com/AIWintermuteAI/executorch/tree/port-to-rp2040; it should be easy enough to reproduce by following the instructions.
For convenience, I'm also copying the results and the issues encountered here:
- Softmax builds and runs normally.
- Linear and add hang at "Starting the model execution."
- Quantized MobileNetV2 (alpha 0.05, 96x96x3 input) requires allocation of 1.45 MB of RAM.
Related issue:
#3585
Some work is being done here (thanks for the support, @ChristophKarlHeck!):
https://github.com/ChristophKarlHeck/mbed-torch-fusion-os/tree/main
But I'm also seeing only the softmax example there - @ChristophKarlHeck were you able to make other models work on M4?
CC @zingo as I think you also worked on the ARM example?
Versions
executorch % python collect_env.py
Collecting environment information...
PyTorch version: 2.6.0.dev20241112
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 14.6.1 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.1.0.2.5)
CMake version: version 3.30.5
Libc version: N/A
Python version: 3.10.15 (main, Oct 3 2024, 02:24:49) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-14.6.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Apple M1 Pro
Versions of relevant libraries:
[pip3] executorch==0.5.0a0+d243ffe
[pip3] numpy==1.21.3
[pip3] torch==2.6.0.dev20241112
[pip3] torchaudio==2.5.0.dev20241112
[pip3] torchsr==1.0.4
[pip3] torchvision==0.20.0.dev20241112
[conda] executorch 0.5.0a0+d243ffe pypi_0 pypi
[conda] numpy 1.21.3 pypi_0 pypi
[conda] torch 2.6.0.dev20241112 pypi_0 pypi
[conda] torchaudio 2.5.0.dev20241112 pypi_0 pypi
[conda] torchsr 1.0.4 pypi_0 pypi
[conda] torchvision 0.20.0.dev20241112 pypi_0 pypi
cc @larryliu0820 @lucylq @digantdesai @freddan80 @per @zingo @oscarandersson8218