
LSTM: Big-endian support #518

Closed
amitdo opened this issue Dec 1, 2016 · 30 comments
@amitdo
Collaborator

amitdo commented Dec 1, 2016

https://github.com/tesseract-ocr/tesseract/wiki/NeuralNetsInTesseract4.00#for-open-source-contributors

The swaps are missing from the (De)Serialize methods of the new neural network code.

@stweil
Member

stweil commented Feb 5, 2017

There are different approaches possible to get support for big endian machines:

  1. Write training data files in native endian byte order. When reading that data, Tesseract must automatically detect the endianness used and convert it to native byte order if necessary, that is, whenever the training (writing) machine and the OCR (reading) machine use different byte orders.
  2. Write training data files with fixed endianness. When reading that data Tesseract must only convert the data if it uses a different byte order.

The current code obviously tries to implement the first variant: it uses swap parameters for the functions which read data and sets the value of swap based on the training data.
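That detection step might be sketched like this (illustrative only; `NeedsByteSwap` is a hypothetical name, not Tesseract's actual check):

```cpp
#include <cassert>
#include <cstdint>

// If a known header field only matches its expected value after a
// byte swap, the file was written on a machine of the opposite
// endianness, and every subsequent read must swap as well.
bool NeedsByteSwap(uint32_t field_read, uint32_t field_expected) {
    return field_read != field_expected &&
           __builtin_bswap32(field_read) == field_expected;
}
```

A version field works well for this because its byte-swapped form is never a plausible version number.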

I prefer the second variant and suggest always using little endian training data files. Then the most common little endian platforms can use the data without any byte swaps. Big endian hosts can use fixed code to convert the data when reading or writing it. This results in less complex code: the swap parameters are no longer needed.

My endian branch includes experimental changes which remove (most) swap parameters and implement the read part for big endian machines. I could successfully run api/tesseract testing/phototest.tif stdout on a big endian S390X (QEMU emulation).

@theraysmith, if you agree to switch to that different solution for the endianness problem, I'd continue with the write part. In a first PR I'd remove the current swap code.

@theraysmith
Contributor

How does the big-endian machine know which values are of which sizes and types, and therefore how to swap? Please explain how that works if the swaps are all removed from the deserialize methods.
I couldn't find any replacement swapping code in your pull request.

@stweil
Member

stweil commented Feb 6, 2017

That's done by the function convert2le (see PR #703) which is overloaded for the relevant data types. Depending on the macro WORDS_BIGENDIAN that function either swaps bytes or does nothing.
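A rough reconstruction of that scheme (the actual implementation is in ccutil/tessio.cpp; everything here beyond the names convert2le and WORDS_BIGENDIAN is my own sketch):

```cpp
#include <cassert>
#include <cstdint>

// Unconditional 32-bit byte swap.
inline uint32_t Swap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

// Overloaded per data type; when WORDS_BIGENDIAN is not defined
// (the common little endian case) this compiles to a no-op, so
// those hosts pay nothing at runtime.
inline void convert2le(uint32_t& v) {
#ifdef WORDS_BIGENDIAN
    v = Swap32(v);
#endif
}

inline void convert2le(uint16_t& v) {
#ifdef WORDS_BIGENDIAN
    v = static_cast<uint16_t>((v >> 8) | (v << 8));
#endif
}
```

WORDS_BIGENDIAN is the macro conventionally defined by autoconf's AC_C_BIGENDIAN check on big endian targets.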

@theraysmith
Contributor

You said you successfully ran on phototest.tif on a big-endian machine.
Is that using the LSTM implementation?
I don't see how that could possibly succeed with deserialization and swapping separated, since matrix.h has to swap the sizes that it deserializes in order to know how big a matrix to deserialize.
I don't think this is a good idea. Most of the swap code required for the LSTM implementation is already in place, so supporting big-endian from here isn't really difficult. Certainly a lot easier than making this work.
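The matrix.h point can be illustrated in isolation (a hypothetical sketch, not the real class): the dimensions read from the header determine how much data must be read next, so they have to be swapped before they are used.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Decodes the two int32 dimensions at the start of a serialized
// matrix and returns dim1 * dim2 (the number of elements the caller
// must read next), or -1 if the header is truncated. The swap must
// happen *before* the sizes are trusted, which is why deserialization
// and byte swapping cannot be cleanly separated into two passes.
int64_t DecodeMatrixElemCount(const std::vector<unsigned char>& buf,
                              bool swap) {
    if (buf.size() < 8) return -1;
    uint32_t d1, d2;
    std::memcpy(&d1, buf.data(), 4);
    std::memcpy(&d2, buf.data() + 4, 4);
    if (swap) {  // sizes swapped before first use
        d1 = __builtin_bswap32(d1);
        d2 = __builtin_bswap32(d2);
    }
    return static_cast<int64_t>(d1) * static_cast<int64_t>(d2);
}
```

With the wrong swap decision, a 2x3 header decodes to a huge bogus element count, which is exactly why the swap flag cannot be applied as an afterthought.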

@stweil
Member

stweil commented Feb 7, 2017

My test used the default (Tesseract + LSTM). My code includes changes to matrix.h which do the swapping needed. Did you read my comments above about the drawbacks of the current implementation?

@jbreiden
Contributor

jbreiden commented Feb 7, 2017

Three small points.

From a software distribution standpoint, it is pretty burdensome to require
different data files for big endian architectures. If that's true for 3.04 then
I've been shipping broken software.

Microsoft and Torvalds have almost entirely killed off big endian machines
by now. For example, Solaris development was killed off in December
according to the press.

Any efforts on this topic get cut in half if the non-LSTM recognizer is removed.

@stweil
Member

stweil commented Feb 7, 2017

I strongly vote against removing non-LSTM as we currently still get better results with it in some cases.

Technically it is possible to have BE and LE machines create different training data as long as both are able to fix that when reading. But this is one of the drawbacks: each Tesseract build must be able to read both variants of training files (which results in less efficient code). In addition, it is more difficult to compare the training output from LE and BE machines.

@stweil
Member

stweil commented Feb 7, 2017

@theraysmith, a new test with explicit --oem 1 just passed successfully on my BE machine.

@amitdo
Collaborator Author

amitdo commented Feb 7, 2017

Ray, you can see online Stefan's suggested changes here.

@stweil
Member

stweil commented Feb 7, 2017

Ray, you can see online Stefan's suggested changes here.

or here. In the meantime I have improved that experimental code further (based on PR #706) and will send an update later. It still only addresses reading, but adding write support to enforce little endian training data files is rather easy following the same scheme.

@jbreiden, did you ever test 3.05 or older versions with big endian machines? If not, I can run a test on my s390x emulation.

@amitdo
Collaborator Author

amitdo commented Feb 7, 2017

Protocol Buffers uses Stefan's approach:
https://groups.google.com/forum/#!topic/protobuf/XbzBwCj4yL8

@stweil
Member

stweil commented Feb 7, 2017

I assume that most binary file formats do. See also https://en.wikipedia.org/wiki/Endianness#Files_and_byte_swap.

@amitdo
Collaborator Author

amitdo commented Feb 7, 2017

True. I mentioned PB because it is written by Google and they use it extensively.

@amitdo
Collaborator Author

amitdo commented Feb 7, 2017

Microsoft and Torvalds have almost entirely killed off big endian machines
by now.

Another victim is IBM POWER.

https://www.ibm.com/developerworks/community/blogs/fe313521-2e95-46f2-817d-44a4f27eba32/entry/just_the_faqs_about_little_endian?lang=en

Why is Linux on Power transitioning from big endian to little endian?

The Power architecture is bi-endian in that it supports accessing data in both little endian and big endian modes. Although Power already has Linux distributions and supporting applications that run in big endian mode, the Linux application ecosystem for x86 platforms is much larger and Linux on x86 uses little endian mode. Numerous clients, software partners, and IBM’s own software developers have told us that porting their software to Power becomes simpler if the Linux environment on Power supports little endian mode, more closely matching the environment provided by Linux on x86. This new level of support will lower the barrier to entry for porting Linux on x86 software to Linux on Power.

@theraysmith
Contributor

Please provide examples of where you get better results with the old engine.
Right now I'm trying to work on getting rid of redundant code, rather than spending time fighting needless changes that generate a lot of work. I have recently tested an LSTM-based OSD, and it works a lot better than the old, so that is one more use of the old classifier that can go. AFAICT, apart from the equation detector, the old classifier is now redundant.

I disagree with the assessment that protocol buffers use Stefan's method, as I still haven't had an explanation of, nor seen code showing, how the reading of a little-endian file on a big-endian machine works. This comment on protocol buffers, "Yep. On the wire, things are encoded little-endian, but the encoding and decoding routines will convert to and from your machine's format themselves, so you don't need to worry about it.", sounds exactly like what the code currently does.

I don't think having a collection of big-endian data files works if that is the proposal. That would be very ugly. What exactly is the proposal for classes like Matrix? Basically the swaps have to stay in place, but could be predicated on an #ifdef instead of runtime data. Even that would make the code ugly, and I don't see a huge amount of CPU being burnt testing if (swap), so I don't really see what all the fuss is about over code efficiency compared to the wasted effort messing about with the code.

In summary, I haven't seen a coherent, convincing argument that there is anything wrong with the current solution, in which the code only swaps the data if it needs to, which most of the time it doesn't, because all data files are little-endian and almost all machines are little-endian.

@stweil
Member

stweil commented Feb 7, 2017

Please provide examples of where you get better results with the old engine.

I'll do that in the discussion of the new issue #707.

What exactly is the proposal for classes like Matrix?

The current implementation uses DeSerialize functions with swap parameters and calls ReverseN if that parameter is true, see example.

My experimental implementation removes the swap parameters and the code based on them in ccstruct/matrix.h, which IMHO makes the code much prettier. Of course we still need swapping for big endian machines (otherwise my OCR tests on such machines would fail). That is done by replacing all fread function calls with different ones, which I also called fread in my experimental code but which should be renamed to DeSerialize, see example. Those new functions both read (using the normal fread function) and convert to little endian (using new functions convert2le). They are implemented in the new file ccutil/tessio.cpp.

The new code simply uses new functions for all reads from file, so getting all cases where swapping is needed is much simpler than in the current implementation. The same can be done on writing to achieve little endian data files, no matter what endianness the host is using.
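A minimal sketch of one such typed read helper (my illustration of the scheme, not the code from the branch): it reads raw bytes and decodes them as little endian in one step, so no caller carries a swap parameter.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Reads a 32-bit value stored little endian in the file and returns
// it in host order. Shift-based decoding is endian-agnostic, so the
// same code is correct on LE and BE hosts with no swap flag.
bool DeSerialize(FILE* fp, uint32_t* value) {
    unsigned char b[4];
    if (std::fread(b, 1, sizeof(b), fp) != sizeof(b)) return false;
    *value = static_cast<uint32_t>(b[0]) |
             static_cast<uint32_t>(b[1]) << 8 |
             static_cast<uint32_t>(b[2]) << 16 |
             static_cast<uint32_t>(b[3]) << 24;
    return true;
}

// Demo helper: write v as little endian bytes, then read it back.
uint32_t RoundTrip(uint32_t v) {
    FILE* fp = std::tmpfile();
    if (fp == nullptr) return 0;
    unsigned char b[4] = {static_cast<unsigned char>(v),
                          static_cast<unsigned char>(v >> 8),
                          static_cast<unsigned char>(v >> 16),
                          static_cast<unsigned char>(v >> 24)};
    std::fwrite(b, 1, sizeof(b), fp);
    std::rewind(fp);
    uint32_t out = 0;
    DeSerialize(fp, &out);
    std::fclose(fp);
    return out;
}
```

A mirror-image Serialize that writes byte-by-byte gives little endian output files on any host, which is the write part mentioned above.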

@jbreiden
Contributor

jbreiden commented Feb 7, 2017

did you ever test 3.05 or older versions with big endian machines?

No, I don't currently have my hands on a big endian machine. I'm pretty sure that I could get one, but so far I have not made the effort. As a side note, I'm helping ship Tesseract on 23 different architectures, some of them big endian. They all share the same data files. However, the big endian platforms are very rarely used, and there may not be bug reports even if they are totally broken.

https://buildd.debian.org/status/package.php?p=tesseract&suite=unstable

@stweil
Member

stweil commented Feb 7, 2017

I now have run a test (tesseract testing/phototest.tif stdout) with Tesseract 3.05 on big endian s390x (QEMU emulation) successfully.

@theraysmith
Contributor

OK, now I finally understand your proposal. I like some aspects of it, but I have some suggestions for making it better:

  1. Since all data files are de-facto little-endian, the overloads are only needed on a big-endian machine. I haven't yet found the source code for convert2le, but I get the gist of what you are trying to do.
  2. Move all the overloads of fread to overloads of TFile::FRead. (See ccutil/serialis.h).
  3. Hide all the overloads behind an #ifdef BIG_ENDIAN so the little-endian machines just call the generic (void*) version of FRead. (I haven't researched it thoroughly, but this posting might be better: http://esr.ibiblio.org/?p=5095)
  4. Anything needed by the new engine (should be just the Network and subclasses, Dict/Dawg and UNICHARSET) make sure it loads from a TFile and convert it if it doesn't.
  5. Instead of modifying the old engine parts, concentrate on convincing me why it should stay first, but if you really want to do it, map all the I/O in the old engine to TFile and get rid of the use of FILE.
  6. Does convert2le have any practical use at that point? Surely just call ReverseN directly from the overloads? (Like I said I haven't found the code for convert2le, and I missed the link for it if you provided it.)

Then I agree it would be cleaner, smaller, and more efficient, as well as future-proofed.
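Suggestions 2 and 3 might look roughly like this (a hypothetical sketch, not the actual serialis.h interface): the generic void* path stays as-is, and a typed overload swaps only in big endian builds.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <initializer_list>
#include <vector>

class TFile {
 public:
  TFile(const unsigned char* data, size_t size)
      : data_(data), size_(size), pos_(0) {}

  // Generic path, identical on every platform.
  bool FRead(void* buffer, size_t count) {
    if (pos_ + count > size_) return false;
    std::memcpy(buffer, data_ + pos_, count);
    pos_ += count;
    return true;
  }

  // Typed overload: data files are de-facto little endian, so only
  // big endian builds need to swap after the raw read.
  bool FRead(uint32_t* value) {
    if (!FRead(static_cast<void*>(value), sizeof(*value))) return false;
#ifdef WORDS_BIGENDIAN
    *value = __builtin_bswap32(*value);  // file is LE, host is BE
#endif
    return true;
  }

 private:
  const unsigned char* data_;
  size_t size_;
  size_t pos_;
};

// Demo helper: decode four little endian bytes via the typed overload.
inline uint32_t ReadU32(std::initializer_list<unsigned char> bytes) {
  std::vector<unsigned char> v(bytes);
  TFile f(v.data(), v.size());
  uint32_t out = 0;
  f.FRead(&out);
  return out;
}
```

Overload resolution picks the typed version for `uint32_t*` arguments automatically, so call sites need no change when a type gains swapping support.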

@theraysmith theraysmith reopened this Feb 8, 2017
@amitdo
Collaborator Author

amitdo commented Feb 8, 2017

He did provide a link for 'convert2le' (convert2le was clickable). Here is the link in a more visible form:
https://github.com/stweil/tesseract/blob/endian/ccutil/tessio.h#L15
A link to tessio.cpp:
https://github.com/stweil/tesseract/blob/endian/ccutil/tessio.cpp

@rfschtkt
Contributor

Little-endian is an abomination. It hurts the brain if you haven't grown up with it (like apparently in Arabic, which still pronounces numbers big to little except for units before tens). UTF-8 is necessarily big-endian by design so that lexicographical order is the same whether processing considers bytes (fast, using memcmp) or code points (slow). Packed bitmap representations like BMP and PNG (I only verified those) store the leftmost pixel in the most significant bits of a byte, so loading a range of pixels in a 64-bit integer, shifting them inside the processor, and writing them back only makes sense if integers are big-endian. Intelligent future civilisations are bound to reconsider the current accidental infatuation with little-endian, and will curse their predecessors for having to modify all those inflexible formats and software.
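One pair suffices to check the UTF-8 ordering claim above (my own demo, not code from the thread): "z" is U+007A (one byte, 0x7A) and "é" is U+00E9 (two bytes, 0xC3 0xA9).

```cpp
#include <cassert>
#include <string>

// Byte-wise comparison (std::string compares via char_traits<char>,
// i.e. as unsigned bytes, like memcmp) gives the same order as
// comparing code points, because UTF-8 stores the most significant
// code-point bits in the first byte.
bool SameOrderAsCodePoints() {
    const std::string z = "z";               // U+007A
    const std::string e_acute = "\xC3\xA9";  // U+00E9, "é" in UTF-8
    const bool byte_order = z < e_acute;     // 0x7A < 0xC3
    const bool codepoint_order = 0x7A < 0xE9;
    return byte_order == codepoint_order;
}
```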

With C++, the byte order, or the pre-evaluated need to reverse it ("bool swap"), can easily be hidden in a file object member variable: it does not have to be passed around as a function argument, if that is considered too cumbersome. Has anybody even made a performance analysis before postulating that little-endian is necessary for optimal or near-optimal performance on little-endian machines, considering that relatively little time is spent on I/O and that data files will also tend to be little-endian?

@rfschtkt
Contributor

rfschtkt commented May 1, 2017

I've had a look at stweil's proposal (endian branch), and I'm "not sure" that it will work... unless on a big-endian machine the file is saved twice, or read in again, perhaps because serialisation was the last thing the program did before it was restarted, or otherwise explicitly. The problem seems to be that the byte order is reversed in place, for serialisation as well as deserialisation. If you want to use a fixed endianness, you have to use a separate buffer, or do one of the workarounds described above. And that's if you're sure that each data item is visited exactly once, otherwise the reversal has to occur inside the lowest-level serialisation functions!
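The in-place pitfall being described can be demonstrated in isolation (a contrived sketch of the pattern, not code from the branch):

```cpp
#include <cassert>
#include <cstdint>

// Buggy pattern: swapping the member in place before writing leaves
// the live object byte-reversed, so serializing the same object a
// second time emits a corrupt value.
struct InPlaceWriter {
    uint32_t value;
    uint32_t SerializeOnce() {             // returns the bytes "written"
        value = __builtin_bswap32(value);  // corrupts the live object
        return value;
    }
};

// Demonstrates the bug: two consecutive writes disagree.
inline bool InPlaceWritesDiffer(uint32_t v) {
    InPlaceWriter w{v};
    uint32_t first = w.SerializeOnce();
    uint32_t second = w.SerializeOnce();
    return first != second;
}

// Safe pattern: swap into a temporary; the object is untouched and
// can be serialized any number of times.
inline uint32_t SerializeCopy(uint32_t value) {
    return __builtin_bswap32(value);
}
```

The bug is harmless only if the object is never used again after serialization, which matches the "last thing the program did" caveat above.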

I would propose a different solution, where serialisation and deserialisation are unified into a single method, and instead of a FILE pointer you get a handle to an object that can do anything you want. This could easily simplify the existing code, without breaking compatibility with existing data files.

Another thing is to hide endianness handling in a library that can be reused and is maintained by people who do care about this issue. For example, a 64-bit swap in source code is either recognised by the compiler and optimised into a single instruction, or there is a better alternative with fewer operations. It would also be possible to use vector instructions to convert endianness.
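The "recognised by the compiler" point in practice: GCC and Clang provide byte-swap builtins that compile to a single BSWAP instruction on x86-64, and C++23 additionally standardises this as std::byteswap in <bit>.

```cpp
#include <cassert>
#include <cstdint>

// A full 64-bit reversal in one call; no hand-rolled shift chain
// needed, and the compiler emits a single instruction where the
// target supports it.
inline uint64_t Swap64(uint64_t v) {
    return __builtin_bswap64(v);
}
```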

@stweil
Member

stweil commented May 1, 2017

Well, it works as far as it was implemented and tested. Writing little endian trained data on a big endian host was not implemented. All other use cases (reading little endian on a little or big endian host, writing little endian on a little endian host) work according to my tests.

Nevertheless pull request #703 is obsolete, as Ray is currently working on improving endianness support.

@rfschtkt
Contributor

rfschtkt commented May 1, 2017

I would use a subset of boost serialisation: just one serialize() method template exactly as specified in boost (instead of largely redundant separate code for writing and reading), and minimal format-compatible ad-hoc Archive implementations, without creating any dependency on boost itself (or not yet, anyway) except for reusing the idea. You can then hide the endianness policy in the Archive implementation: write host endianness (as it is now), write fixed endianness (as proposed above), write configured endianness (host endianness by default, unless perhaps publishing for reuse, or a default the other way around), write JSON, ... Better than reinventing the wheel with probably more code!

(2017-05-02 Added) Maybe it's a bit late for exactly that, I don't know... But if you want to use TFile, you should also have it do all the endianness handling, so it's never forgotten. If somebody else weren't already on it, I'd give that a try myself.
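A minimal sketch of that boost-style idea (no boost dependency; all names hypothetical): one serialize() template describes the fields once, and the direction and byte-order policy live entirely in the Archive type.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Writing archive: encodes each value in fixed little endian order.
struct OutArchive {
  std::vector<uint8_t> bytes;
  OutArchive& operator&(uint32_t& v) {
    for (int i = 0; i < 4; ++i)
      bytes.push_back(static_cast<uint8_t>(v >> (8 * i)));
    return *this;
  }
};

// Reading archive: decodes from the same fixed little endian layout.
struct InArchive {
  const uint8_t* p;
  InArchive& operator&(uint32_t& v) {
    v = 0;
    for (int i = 0; i < 4; ++i)
      v |= static_cast<uint32_t>(*p++) << (8 * i);
    return *this;
  }
};

struct Header {
  uint32_t dim1, dim2;
  template <class Archive>
  void serialize(Archive& ar) { ar & dim1 & dim2; }  // one method, both ways
};

// Round-trip helper for demonstration.
inline bool RoundTrips(uint32_t d1, uint32_t d2) {
  Header h{d1, d2};
  OutArchive oa;
  h.serialize(oa);
  Header r{0, 0};
  InArchive ia{oa.bytes.data()};
  r.serialize(ia);
  return r.dim1 == d1 && r.dim2 == d2;
}
```

Because the field list exists only once, a class can never be written and read inconsistently, which is the main maintenance win over paired Serialize/DeSerialize methods.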

@theraysmith
Contributor

Fixed in commit 8e79297

@stweil
Member

stweil commented May 4, 2017

Ray, the new code still uses a dynamic detection in TessdataManager::LoadMemBuffer to decide whether swapping is needed or not. This implies that the code supports both big and little endian data files. The drawback is additional runtime code on all kinds of machines.

Are you planning more changes? I'd drop support for big endian data files in 4.0 and add code to always write little endian ones. Then static swapping code would only be needed on big endian machines, and the large majority of machines would not need any swap code at all.

@theraysmith
Contributor

theraysmith commented May 4, 2017 via email

@stweil
Member

stweil commented Apr 29, 2018

The latest version of Tesseract crashes on all big endian machines (tested on Debian s390x and mips with the official Debian packages and with git master). See also #1525.

@stweil
Member

stweil commented May 4, 2018

See commit 21d5ce5 which fixes the crash when running OCR on a big endian machine.
