
LSTM: ARM SIMD support #519

Closed
amitdo opened this issue Dec 1, 2016 · 9 comments

@amitdo
Collaborator

amitdo commented Dec 1, 2016

https://github.com/tesseract-ocr/tesseract/wiki/NeuralNetsInTesseract4.00#for-open-source-contributors

There is a C++ implementation that is used if the hardware does not have SSE and/or AVX, but the code could benefit from SIMD implementations for other hardware, such as ARM. See the new arch directory for where to insert the code.

@zamazan4ik
Contributor

Should we write this code manually nowadays? Modern compilers can optimize SIMD instructions very well without any manual work with intrinsics. Users just need to compile with -O2/-O3 and -march=<required_arch>.

I think writing a lot of manual assembly/intrinsics isn't a good idea.

@stweil
Contributor

stweil commented Jun 3, 2018

Yes, that's correct. It is already possible to do that by providing additional compiler flags when running configure (CXXFLAGS=...). But of course that should happen automatically, and we must take care that the resulting binary can still be used on different hardware. This still has to be implemented.
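For example, a manual build along those lines might look like this (the flags are target-specific illustrations, not project defaults):

./configure CXXFLAGS="-O3 -mfpu=neon-vfpv4 -mfloat-abi=hard"  # 32-bit ARM with NEON
./configure CXXFLAGS="-O3 -march=armv8-a"                     # 64-bit ARM (NEON is mandatory there)
make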

PS: I have recently ordered a small ARM-based cluster for Tesseract OCR, so I'm highly motivated to work on this issue. :-)

@drothlis

Enabling NEON optimisations does result in vectorised NEON instructions for WeightMatrix::DotProduct: https://godbolt.org/z/YCUgcb
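For context, the scalar loop that gets vectorised there has roughly this shape (simplified, with illustrative names and types; the real code lives under the arch directory):

double DotProduct(const double* u, const double* v, int n) {
  double total = 0.0;
  for (int k = 0; k < n; ++k) {
    total += u[k] * v[k];  // the loop the compiler turns into vector instructions
  }
  return total;
}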

I'm not sure about IntSimdMatrix::MatrixDotVector -- the code (and assembly) is much harder to follow.

On my ARM device (NVidia Tegra K1) compiling tesseract with NEON optimisations (-mfpu=neon-vfpv4 -mfloat-abi=hard -mcpu=cortex-a15) gave a 10-15% speedup, but the LSTM engine is still 3-10 times slower than the legacy engine: 3-30 seconds (depending on the image size) compared to 1-4 seconds for the legacy engine.

These compiler flags had no measurable effect on the legacy engine.

Adding -O3 (versus the default -O2) resulted in a further 0-20% speedup (depending on image size). In other words, a total speedup of 10-30% over -O2 without NEON. (Still many times slower than the legacy engine.)

For the legacy engine, -O3 gave me a 1-8% speedup.

I used Ubuntu's tesseract package version 4.00~git2288-10f4998a-2 plus the English data files from https://github.com/tesseract-ocr/tessdata/tree/590567f2

How I built it, in case it helps anyone:

sudo apt install build-essential devscripts
sudo apt build-dep tesseract-ocr
mkdir /tmp/tesseract
cd /tmp/tesseract
apt source tesseract-ocr
cd tesseract-4.00~git2288-10f4998a
debchange -R "Rebuild with NEON optimisations"
export DEB_CFLAGS_APPEND="-mfpu=neon-vfpv4 -mfloat-abi=hard -mcpu=cortex-a15"
debuild -i -us -uc -b  # creates ../*.deb

@stweil
Contributor

stweil commented Feb 12, 2019

I suggest using data files from tessdata_fast instead of those from tessdata. In addition, you could try -c dotproduct=native which should use Neon if you compiled on a Neon machine.
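For example (the file names are placeholders):

tesseract image.png output -c dotproduct=native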

@s6ch13

s6ch13 commented Jan 9, 2020

Below you can find code which adds ARM NEON integer support: a native implementation of intsimdmatrixneon.cpp, along with changes in other files to support it. Once I get my hands on a 64-bit ARM platform, I will work on ARM NEON float support (for dotproductneon.cpp). It gives about a 20% improvement in performance. Please review the code and let me know your comments.

https://github.com/s6ch13/tesseract/tree/arm_neon_support

Cheers, Sriram
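For readers who want a feel for what such an implementation involves, here is a minimal sketch of a NEON int8 dot product written with intrinsics. It is illustrative only, not the code from the branch above, and the function name is made up; Tesseract's real integer matrix code also handles tiling, scaling and bias terms.

#include <arm_neon.h>
#include <cstdint>

// Multiply-accumulate int8 weights against int8 inputs, 8 lanes at a time.
int32_t DotProductNeonInt8(const int8_t* w, const int8_t* u, int n) {
  int32x4_t acc = vdupq_n_s32(0);
  int k = 0;
  for (; k + 8 <= n; k += 8) {
    int16x8_t prod = vmull_s8(vld1_s8(w + k), vld1_s8(u + k));  // widening multiply to int16
    acc = vpadalq_s16(acc, prod);  // pairwise add-accumulate into the int32 lanes
  }
  // Horizontal sum of the four int32 lanes.
  int32x2_t sum2 = vadd_s32(vget_low_s32(acc), vget_high_s32(acc));
  sum2 = vpadd_s32(sum2, sum2);
  int32_t total = vget_lane_s32(sum2, 0);
  for (; k < n; ++k) total += w[k] * u[k];  // scalar tail
  return total;
}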

@amitdo amitdo added the SIMD label May 14, 2020
@amitdo
Collaborator Author

amitdo commented May 27, 2020

Dot product acceleration using Neon was implemented in f79e52a.
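For illustration, a handwritten NEON float dot product typically looks like the minimal sketch below. This is not the code from f79e52a, and the name is made up; unlike the int8 version above, the float path can use a multiply-accumulate directly instead of a widening multiply.

#include <arm_neon.h>

// Multiply-accumulate over four float lanes per iteration.
float DotProductNeonFloat(const float* u, const float* v, int n) {
  float32x4_t acc = vdupq_n_f32(0.0f);
  int k = 0;
  for (; k + 4 <= n; k += 4) {
    acc = vmlaq_f32(acc, vld1q_f32(u + k), vld1q_f32(v + k));  // acc += u * v
  }
  // Horizontal sum of the four lanes.
  float32x2_t sum2 = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
  sum2 = vpadd_f32(sum2, sum2);
  float total = vget_lane_f32(sum2, 0);
  for (; k < n; ++k) total += u[k] * v[k];  // scalar tail
  return total;
}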

@stweil
Contributor

stweil commented May 27, 2020

I'll try to compare the performance of both implementations later. This is an interesting example because the one here simply relies on the compiler while the other one uses handwritten NEON code.

@Shreeshrii
Collaborator

@stweil Do you have a result for the comparison?
What are the recommended settings to use for Neon?

@stweil
Contributor

stweil commented Nov 21, 2020

Neon is automatically detected and used with the latest code, so no special settings should be required.

And no, sorry, I don't have a comparison result.
