
LSTM: Big-endian support #518

Closed
amitdo opened this issue Dec 1, 2016 · 30 comments
@amitdo
Collaborator

amitdo commented Dec 1, 2016

https://github.com/tesseract-ocr/tesseract/wiki/NeuralNetsInTesseract4.00#for-open-source-contributors

The swaps are missing from the (De)Serialize methods of the new neural network code.

@stweil
Member

stweil commented Feb 5, 2017

There are different approaches possible to get support for big endian machines:

  1. Write training data files in native endian byte order. When reading that data, Tesseract must automatically detect the endianness used and convert it to native byte order if necessary, that is, whenever the training (writing) machine and the OCR (reading) machine use different byte orders.
  2. Write training data files with fixed endianness. When reading that data Tesseract must only convert the data if it uses a different byte order.

The current code obviously tries to implement the first variant: it uses swap parameters for the functions which read data and sets the value of swap based on the training data.
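That detection step might be sketched like this (illustrative only; `NeedsByteSwap` is a hypothetical name, not Tesseract's actual check):

```cpp
#include <cassert>
#include <cstdint>

// If a known header field only matches its expected value after a
// byte swap, the file was written on a machine of the opposite
// endianness, and every subsequent read must swap as well.
bool NeedsByteSwap(uint32_t field_read, uint32_t field_expected) {
    return field_read != field_expected &&
           __builtin_bswap32(field_read) == field_expected;
}
```

A version field works well for this because its byte-swapped form is never a plausible version number.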

I prefer the second variant and suggest always using little endian training data files. Then the most common little endian platforms can use the data without any byte swaps. Big endian hosts can use fixed code to convert the data when reading or writing it. This results in less complex code: the swap parameters are no longer needed.

My endian branch includes experimental changes which remove (most) swap parameters and implement the read part for big endian machines. I could successfully run api/tesseract testing/phototest.tif stdout on a big endian S390X (QEMU emulation).

@theraysmith, if you agree to switch to that different solution for the endianness problem, I'd continue with the write part. In a first PR I'd remove the current swap code.

@theraysmith
Contributor

How does the big-endian machine know which values are of which sizes and types, and therefore how to swap? Please explain how that works if the swaps are all removed from the deserialize methods.
I couldn't find any replacement swapping code in your pull request.

@stweil
Member

stweil commented Feb 6, 2017

That's done by the function convert2le (see PR #703) which is overloaded for the relevant data types. Depending on the macro WORDS_BIGENDIAN that function either swaps bytes or does nothing.
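A rough reconstruction of that scheme (the actual implementation is in ccutil/tessio.cpp; everything here beyond the names convert2le and WORDS_BIGENDIAN is my own sketch):

```cpp
#include <cassert>
#include <cstdint>

// Unconditional 32-bit byte swap.
inline uint32_t Swap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

// Overloaded per data type; when WORDS_BIGENDIAN is not defined
// (the common little endian case) this compiles to a no-op, so
// those hosts pay nothing at runtime.
inline void convert2le(uint32_t& v) {
#ifdef WORDS_BIGENDIAN
    v = Swap32(v);
#endif
}

inline void convert2le(uint16_t& v) {
#ifdef WORDS_BIGENDIAN
    v = static_cast<uint16_t>((v >> 8) | (v << 8));
#endif
}
```

WORDS_BIGENDIAN is the macro conventionally defined by autoconf's AC_C_BIGENDIAN check on big endian targets.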

@theraysmith
Contributor

You said you successfully ran on phototest.tif on a big-endian machine.
Is that using the LSTM implementation?
I don't see how that could possibly succeed with deserialization and swapping separated, since matrix.h has to swap the sizes that it deserializes in order to know how big a matrix to deserialize.
I don't think this is a good idea. Most of the swap code required for the LSTM implementation is already in place, so supporting big-endian from here isn't really difficult. Certainly a lot easier than making this work.
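The matrix.h point can be illustrated in isolation (a hypothetical sketch, not the real class): the dimensions read from the header determine how much data must be read next, so they have to be swapped before they are used.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Decodes the two int32 dimensions at the start of a serialized
// matrix and returns dim1 * dim2 (the number of elements the caller
// must read next), or -1 if the header is truncated. The swap must
// happen *before* the sizes are trusted, which is why deserialization
// and byte swapping cannot be cleanly separated into two passes.
int64_t DecodeMatrixElemCount(const std::vector<unsigned char>& buf,
                              bool swap) {
    if (buf.size() < 8) return -1;
    uint32_t d1, d2;
    std::memcpy(&d1, buf.data(), 4);
    std::memcpy(&d2, buf.data() + 4, 4);
    if (swap) {  // sizes swapped before first use
        d1 = __builtin_bswap32(d1);
        d2 = __builtin_bswap32(d2);
    }
    return static_cast<int64_t>(d1) * static_cast<int64_t>(d2);
}
```

With the wrong swap decision, a 2x3 header decodes to a huge bogus element count, which is exactly why the swap flag cannot be applied as an afterthought.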

@stweil
Member

stweil commented Feb 7, 2017

My test used the default (Tesseract + LSTM). My code includes changes to matrix.h which do the swapping needed. Did you read my comments above about the drawbacks of the current implementation?

@jbreiden
Contributor

jbreiden commented Feb 7, 2017

Three small points.

From a software distribution standpoint, it is pretty burdensome to require
different data files for big endian architectures. If that's true for 3.04 then
I've been shipping broken software.

Microsoft and Torvalds have almost entirely killed off big endian machines
by now. For example, Solaris development was killed off in December
according to the press.

Any efforts on this topic get cut in half if the non-LSTM recognizer is removed.

@stweil
Member

stweil commented Feb 7, 2017

I strongly vote against removing non-LSTM as we currently still get better results with it in some cases.

Technically it is possible to have BE and LE machines create different training data as long as both are able to fix that when reading. But this is one of the drawbacks: each Tesseract build must be able to read both variants of training files (which results in less efficient code). In addition, it is more difficult to compare the training output from LE and BE machines.

@stweil
Member

stweil commented Feb 7, 2017

@theraysmith, a new test with explicit --oem 1 just passed successfully on my BE machine.

@amitdo
Collaborator Author

amitdo commented Feb 7, 2017

Ray, you can see online Stefan's suggested changes here.

@stweil
Member

stweil commented Feb 7, 2017

Ray, you can see online Stefan's suggested changes here.

or here. In the meantime I have improved that experimental code further (based on PR #706) and will send an update later. It still only addresses reading, but adding write support to enforce little endian training data files is rather easy following the same scheme.

@jbreiden, did you ever test 3.05 or older versions with big endian machines? If not, I can run a test on my s390x emulation.

@amitdo
Collaborator Author

amitdo commented Feb 7, 2017

Protocol Buffers uses Stefan's approach:
https://groups.google.com/forum/#!topic/protobuf/XbzBwCj4yL8

@stweil
Member

stweil commented Feb 7, 2017

I assume that most binary file formats do. See also https://en.wikipedia.org/wiki/Endianness#Files_and_byte_swap.

@amitdo
Collaborator Author

amitdo commented Feb 7, 2017

True. I mentioned PB because it is written by Google and they use it extensively.

@amitdo
Collaborator Author

amitdo commented Feb 7, 2017

Microsoft and Torvalds have almost entirely killed off big endian machines
by now.

Another victim is IBM POWER.

https://www.ibm.com/developerworks/community/blogs/fe313521-2e95-46f2-817d-44a4f27eba32/entry/just_the_faqs_about_little_endian?lang=en

Why is Linux on Power transitioning from big endian to little endian?

The Power architecture is bi-endian in that it supports accessing data in both little endian and big endian modes. Although Power already has Linux distributions and supporting applications that run in big endian mode, the Linux application ecosystem for x86 platforms is much larger and Linux on x86 uses little endian mode. Numerous clients, software partners, and IBM’s own software developers have told us that porting their software to Power becomes simpler if the Linux environment on Power supports little endian mode, more closely matching the environment provided by Linux on x86. This new level of support will lower the barrier to entry for porting Linux on x86 software to Linux on Power.

@theraysmith
Contributor

Please provide examples of where you get better results with the old engine.
Right now I'm trying to work on getting rid of redundant code, rather than spending time fighting needless changes that generate a lot of work. I have recently tested an LSTM-based OSD, and it works a lot better than the old, so that is one more use of the old classifier that can go. AFAICT, apart from the equation detector, the old classifier is now redundant.

I disagree with the assessment that protocol buffers use Stefan's method, as I still haven't had an explanation of, nor seen code showing, how the reading of a little-endian file on a big-endian machine works. This comment on protocol buffers, "Yep. On the wire, things are encoded little-endian, but the encoding and decoding routines will convert to and from your machine's format themselves, so you don't need to worry about it.", sounds exactly like what the code currently does.

I don't think having a collection of big-endian data files works if that is the proposal. That would be very ugly. What exactly is the proposal for classes like Matrix? Basically the swaps have to stay in place, but could be predicated on an #ifdef instead of runtime data. Even that would make the code ugly, and I don't see a huge amount of CPU being burnt testing if (swap), so I don't really see what all the fuss is about over code efficiency compared to the wasted effort messing about with the code.

In summary, I haven't seen a coherent, convincing argument that there is anything wrong with the current solution, in which the code only swaps the data if it needs to, which most of the time it doesn't, because all data files are little-endian and almost all machines are little-endian.

@stweil
Member

stweil commented Feb 7, 2017

Please provide examples of where you get better results with the old engine.

I'll do that in the discussion of the new issue #707.

What exactly is the proposal for classes like Matrix?

The current implementation uses DeSerialize functions with swap parameters and calls ReverseN if that parameter is true, see example.

My experimental implementation removes the swap parameters and the code based on them in ccstruct/matrix.h, which IMHO makes the code much prettier. Of course we still need swapping for big endian machines (otherwise my OCR tests on such machines would fail). That is done by replacing all fread function calls with different ones, which I also called fread in my experimental code but which should be renamed to DeSerialize, see example. Those new functions both read (using the normal fread function) and convert to little endian (using new functions convert2le). They are implemented in the new file ccutil/tessio.cpp.

The new code simply uses new functions for all reads from file, so getting all cases where swapping is needed is much simpler than in the current implementation. The same can be done on writing to achieve little endian data files, no matter what endianness the host is using.
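A minimal sketch of one such typed read helper (my illustration of the scheme, not the code from the branch): it reads raw bytes and decodes them as little endian in one step, so no caller carries a swap parameter.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Reads a 32-bit value stored little endian in the file and returns
// it in host order. Shift-based decoding is endian-agnostic, so the
// same code is correct on LE and BE hosts with no swap flag.
bool DeSerialize(FILE* fp, uint32_t* value) {
    unsigned char b[4];
    if (std::fread(b, 1, sizeof(b), fp) != sizeof(b)) return false;
    *value = static_cast<uint32_t>(b[0]) |
             static_cast<uint32_t>(b[1]) << 8 |
             static_cast<uint32_t>(b[2]) << 16 |
             static_cast<uint32_t>(b[3]) << 24;
    return true;
}

// Demo helper: write v as little endian bytes, then read it back.
uint32_t RoundTrip(uint32_t v) {
    FILE* fp = std::tmpfile();
    if (fp == nullptr) return 0;
    unsigned char b[4] = {static_cast<unsigned char>(v),
                          static_cast<unsigned char>(v >> 8),
                          static_cast<unsigned char>(v >> 16),
                          static_cast<unsigned char>(v >> 24)};
    std::fwrite(b, 1, sizeof(b), fp);
    std::rewind(fp);
    uint32_t out = 0;
    DeSerialize(fp, &out);
    std::fclose(fp);
    return out;
}
```

A mirror-image Serialize that writes byte-by-byte gives little endian output files on any host, which is the write part mentioned above.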

@jbreiden
Contributor

jbreiden commented Feb 7, 2017

did you ever test 3.05 or older versions with big endian machines?

No, I don't currently have my hands on a big endian machine. I'm pretty sure that I could get one, but so far I have not made the effort. As a side note, I'm helping ship Tesseract on 23 different architectures, some of them big endian. They all share the same data files. However, the big endian platforms are very rarely used, and there may not be bug reports even if they are totally broken.

https://buildd.debian.org/status/package.php?p=tesseract&suite=unstable

@stweil
Member

stweil commented Feb 7, 2017

I now have run a test (tesseract testing/phototest.tif stdout) with Tesseract 3.05 on big endian s390x (QEMU emulation) successfully.

@theraysmith
Contributor

OK, now I finally understand your proposal. I like some aspects of it, but I have some suggestions for making it better:

  1. Since all data files are de-facto little-endian, the overloads are only needed on a big-endian machine. I haven't yet found the source code for convert2le, but I get the gist of what you are trying to do.
  2. Move all the overloads of fread to overloads of TFile::FRead. (See ccutil/serialis.h).
  3. Hide all the overloads behind an #ifdef BIG_ENDIAN so the little-endian machines just call the generic (void*) version of FRead. (I haven't researched it thoroughly, but this posting might be better: http://esr.ibiblio.org/?p=5095)
  4. Anything needed by the new engine (should be just the Network and subclasses, Dict/Dawg and UNICHARSET) make sure it loads from a TFile and convert it if it doesn't.
  5. Instead of modifying the old engine parts, concentrate on convincing me why it should stay first, but if you really want to do it, map all the I/O in the old engine to TFile and get rid of the use of FILE.
  6. Does convert2le have any practical use at that point? Surely just call ReverseN directly from the overloads? (Like I said I haven't found the code for convert2le, and I missed the link for it if you provided it.)

Then I agree it would be cleaner, smaller, and more efficient, as well as future-proofed.
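Suggestions 2 and 3 might look roughly like this (a hypothetical sketch, not the actual serialis.h interface): the generic void* path stays as-is, and a typed overload swaps only in big endian builds.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <initializer_list>
#include <vector>

class TFile {
 public:
  TFile(const unsigned char* data, size_t size)
      : data_(data), size_(size), pos_(0) {}

  // Generic path, identical on every platform.
  bool FRead(void* buffer, size_t count) {
    if (pos_ + count > size_) return false;
    std::memcpy(buffer, data_ + pos_, count);
    pos_ += count;
    return true;
  }

  // Typed overload: data files are de-facto little endian, so only
  // big endian builds need to swap after the raw read.
  bool FRead(uint32_t* value) {
    if (!FRead(static_cast<void*>(value), sizeof(*value))) return false;
#ifdef WORDS_BIGENDIAN
    *value = __builtin_bswap32(*value);  // file is LE, host is BE
#endif
    return true;
  }

 private:
  const unsigned char* data_;
  size_t size_;
  size_t pos_;
};

// Demo helper: decode four little endian bytes via the typed overload.
inline uint32_t ReadU32(std::initializer_list<unsigned char> bytes) {
  std::vector<unsigned char> v(bytes);
  TFile f(v.data(), v.size());
  uint32_t out = 0;
  f.FRead(&out);
  return out;
}
```

Overload resolution picks the typed version for `uint32_t*` arguments automatically, so call sites need no change when a type gains swapping support.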

@theraysmith theraysmith reopened this Feb 8, 2017
@amitdo
Collaborator Author

amitdo commented Feb 8, 2017

He did provide a link for 'convert2le' (convert2le was clickable). Here is the link in a more visible form:
https://github.com/stweil/tesseract/blob/endian/ccutil/tessio.h#L15
A link to tessio.cpp:
https://github.com/stweil/tesseract/blob/endian/ccutil/tessio.cpp

@rfschtkt
Contributor

Little-endian is an abomination. It hurts the brain if you haven't grown up with it (like apparently in Arabic, which still pronounces numbers big to little except for units before tens). UTF-8 is necessarily big-endian by design so that lexicographical order is the same whether processing considers bytes (fast, using memcmp) or code points (slow). Packed bitmap representations like BMP and PNG (I only verified those) store the leftmost pixel in the most significant bits of a byte, so loading a range of pixels in a 64-bit integer, shifting them inside the processor, and writing them back only makes sense if integers are big-endian. Intelligent future civilisations are bound to reconsider the current accidental infatuation with little-endian, and will curse their predecessors for having to modify all those inflexible formats and software.
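One pair suffices to check the UTF-8 ordering claim above (my own demo, not code from the thread): "z" is U+007A (one byte, 0x7A) and "é" is U+00E9 (two bytes, 0xC3 0xA9).

```cpp
#include <cassert>
#include <string>

// Byte-wise comparison (std::string compares via char_traits<char>,
// i.e. as unsigned bytes, like memcmp) gives the same order as
// comparing code points, because UTF-8 stores the most significant
// code-point bits in the first byte.
bool SameOrderAsCodePoints() {
    const std::string z = "z";               // U+007A
    const std::string e_acute = "\xC3\xA9";  // U+00E9, "é" in UTF-8
    const bool byte_order = z < e_acute;     // 0x7A < 0xC3
    const bool codepoint_order = 0x7A < 0xE9;
    return byte_order == codepoint_order;
}
```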

With C++, the byte order, or the pre-evaluated need to reverse it ("bool swap"), can easily be hidden in a file object member variable: it does not have to be passed around as a function argument, if that is considered too cumbersome. Has anybody even made a performance analysis before postulating that little-endian is necessary for optimal or near-optimal performance on little-endian machines, considering that relatively little time is spent on I/O and that data files will also tend to be little-endian?

@rfschtkt
Contributor

rfschtkt commented May 1, 2017

I've had a look at stweil's proposal (endian branch), and I'm "not sure" that it will work... unless on a big-endian machine the file is saved twice, or read in again, perhaps because serialisation was the last thing the program did before it was restarted, or otherwise explicitly. The problem seems to be that the byte order is reversed in place, for serialisation as well as deserialisation. If you want to use a fixed endianness, you have to use a separate buffer, or do one of the workarounds described above. And that's if you're sure that each data item is visited exactly once, otherwise the reversal has to occur inside the lowest-level serialisation functions!
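The in-place pitfall being described can be demonstrated in isolation (a contrived sketch of the pattern, not code from the branch):

```cpp
#include <cassert>
#include <cstdint>

// Buggy pattern: swapping the member in place before writing leaves
// the live object byte-reversed, so serializing the same object a
// second time emits a corrupt value.
struct InPlaceWriter {
    uint32_t value;
    uint32_t SerializeOnce() {             // returns the bytes "written"
        value = __builtin_bswap32(value);  // corrupts the live object
        return value;
    }
};

// Demonstrates the bug: two consecutive writes disagree.
inline bool InPlaceWritesDiffer(uint32_t v) {
    InPlaceWriter w{v};
    uint32_t first = w.SerializeOnce();
    uint32_t second = w.SerializeOnce();
    return first != second;
}

// Safe pattern: swap into a temporary; the object is untouched and
// can be serialized any number of times.
inline uint32_t SerializeCopy(uint32_t value) {
    return __builtin_bswap32(value);
}
```

The bug is harmless only if the object is never used again after serialization, which matches the "last thing the program did" caveat above.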

I would propose a different solution, where serialisation and deserialisation are unified into a single method, and instead of a FILE pointer you get a handle to an object that can do anything you want. This could easily simplify the existing code, without breaking compatibility with existing data files.

Another thing is to hide endianness handling in a library that can be reused and is maintained by people who do care about this issue. For example, a 64-bit swap in source code is either recognised by the compiler and optimised into a single instruction, or there is a better alternative with fewer operations. It would also be possible to use vector instructions to convert endianness.
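The "recognised by the compiler" point in practice: GCC and Clang provide byte-swap builtins that compile to a single BSWAP instruction on x86-64, and C++23 additionally standardises this as std::byteswap in <bit>.

```cpp
#include <cassert>
#include <cstdint>

// A full 64-bit reversal in one call; no hand-rolled shift chain
// needed, and the compiler emits a single instruction where the
// target supports it.
inline uint64_t Swap64(uint64_t v) {
    return __builtin_bswap64(v);
}
```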

@stweil
Member

stweil commented May 1, 2017

Well, it works as far as it was implemented and tested. Writing little endian trained data on a big endian host was not implemented. All other use cases (reading little endian on a little or big endian host, writing little endian on a little endian host) work according to my tests.

Nevertheless pull request #703 is obsolete, as Ray is currently working on improving endianness support.

@rfschtkt
Contributor

rfschtkt commented May 1, 2017

I would use a subset of boost serialisation: just one serialize() method template exactly as specified in boost (instead of largely redundant separate code for writing and reading), and minimal format-compatible ad-hoc Archive implementations, without creating any dependency on boost itself (or not yet, anyway) except for reusing the idea. You can then hide the endianness policy in the Archive implementation: write host endianness (as it is now), write fixed endianness (as proposed above), write configured endianness (host endianness by default, unless perhaps publishing for reuse, or a default the other way around), write JSON, ... Better than reinventing the wheel with probably more code!

(2017-05-02 Added) Maybe it's a bit late for exactly that, I don't know... But if you want to use TFile, you should also have it do all the endianness handling, so it's never forgotten. If somebody else weren't already on it, I'd give that a try myself.
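A minimal sketch of that boost-style idea (no boost dependency; all names hypothetical): one serialize() template describes the fields once, and the direction and byte-order policy live entirely in the Archive type.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Writing archive: encodes each value in fixed little endian order.
struct OutArchive {
  std::vector<uint8_t> bytes;
  OutArchive& operator&(uint32_t& v) {
    for (int i = 0; i < 4; ++i)
      bytes.push_back(static_cast<uint8_t>(v >> (8 * i)));
    return *this;
  }
};

// Reading archive: decodes from the same fixed little endian layout.
struct InArchive {
  const uint8_t* p;
  InArchive& operator&(uint32_t& v) {
    v = 0;
    for (int i = 0; i < 4; ++i)
      v |= static_cast<uint32_t>(*p++) << (8 * i);
    return *this;
  }
};

struct Header {
  uint32_t dim1, dim2;
  template <class Archive>
  void serialize(Archive& ar) { ar & dim1 & dim2; }  // one method, both ways
};

// Round-trip helper for demonstration.
inline bool RoundTrips(uint32_t d1, uint32_t d2) {
  Header h{d1, d2};
  OutArchive oa;
  h.serialize(oa);
  Header r{0, 0};
  InArchive ia{oa.bytes.data()};
  r.serialize(ia);
  return r.dim1 == d1 && r.dim2 == d2;
}
```

Because the field list exists only once, a class can never be written and read inconsistently, which is the main maintenance win over paired Serialize/DeSerialize methods.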

@theraysmith
Contributor

Fixed in commit 8e79297

@stweil
Member

stweil commented May 4, 2017

Ray, the new code still uses a dynamic detection in TessdataManager::LoadMemBuffer to decide whether swapping is needed or not. This implies that the code supports both big and little endian data files. The drawback is additional runtime code on all kinds of machines.

Are you planning more changes? I'd drop support for big endian data files in 4.0 and add code to always write little endian ones. Then static swapping code would only be needed on big endian machines, and the large majority of machines would not need any swap code at all.

@theraysmith
Contributor

theraysmith commented May 4, 2017 via email

@stweil
Member

stweil commented Apr 29, 2018

The latest version of Tesseract crashes on all big endian machines (tested on Debian s390x and mips with the official Debian packages and with git master). See also #1525.

@stweil
Member

stweil commented May 4, 2018

See commit 21d5ce5 which fixes the crash when running OCR on a big endian machine.
