LSTM: Big-endian support #518
There are different approaches possible to get support for big endian machines:

- support data files in either byte order and detect at runtime whether the reader has to swap, or
- fix the data file format to a single byte order (little endian) and convert only on big endian hosts.

The current code obviously tries to implement the first variant: it uses […]. I prefer the second variant and suggest always using little endian training data files. Then the most common little endian platforms can use the data without any byte swaps, while big endian hosts use fixed code to convert the data when reading or writing it. This results in less complex code: the […]. @theraysmith, if you agree to switch to that solution for the endianness problem, I'd continue with the write part. In a first PR I'd remove the current […].
How does the big-endian machine know which values are of which sizes and types, and therefore how to swap? Please explain how that works if the swaps are all removed from the deserialize methods.
That's done by the function […].
You said you successfully ran it on phototest.tif on a big-endian machine.
My test used the default mode (Tesseract + LSTM). My code includes changes to […].
Three small points:

- From a software distribution standpoint, it is pretty burdensome to require […].
- Microsoft and Torvalds have almost entirely killed off big endian machines.
- Any efforts on this topic get cut in half if the non-LSTM recognizer is removed.
I strongly vote against removing non-LSTM, as we currently still get better results with it in some cases. Technically it is possible to have BE and LE machines create different training data, as long as both are able to fix that during read. But this is one of the drawbacks: each Tesseract must be able to read both variants of training files (which results in less efficient code). In addition, it is more difficult to compare the training output from LE and BE machines.
@theraysmith, a new test with explicit […].
Ray, you can see Stefan's suggested changes online here.
or here. In the meantime I have improved that experimental code further (based on PR #706) and will send an update later. It still only addresses reading, but adding write support to enforce little endian training data files is rather easy following the same scheme. @jbreiden, did you ever test 3.05 or older versions on big endian machines? If not, I can run a test on my s390x emulation.
Protocol Buffers uses Stefan's approach: […]
I assume that most binary file formats do. See also https://en.wikipedia.org/wiki/Endianness#Files_and_byte_swap. |
True. I mentioned PB because it is written by Google and they use it extensively. |
Another victim is IBM POWER.
Please provide examples of where you get better results with the old engine.

I disagree with the assessment that protocol buffers use Stefan's method, as I still haven't had an explanation of, nor seen code to show, how the reading of a little-endian file on a big-endian machine works. This comment on proto buffers: "Yep. On the wire, things are encoded little-endian, but the encoding and decoding routines will convert to and from your machine's format themselves, so you don't need to worry about it." sounds exactly like what the code currently does. I don't think having a collection of big-endian data files works, if that is the proposal. That would be very ugly.

What exactly is the proposal for classes like Matrix? Basically the swaps have to stay in place, but could be predicated on an #ifdef instead of runtime data. Even that would make the code ugly, and I don't see a huge amount of CPU being burnt testing if (swap), so I don't really see what all the fuss is about over code efficiency compared to the wasted effort messing about with the code.

In summary, I haven't seen a coherent, convincing argument that there is anything wrong with the current solution: the code only swaps the data if it needs to, which most of the time it doesn't, because all data files are little-endian and almost all machines are little-endian.
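[Editor's note] For readers following the thread: the runtime approach defended above (swap only when the file's byte order differs from the host's) can be sketched roughly as below. The names `Reader`, `DetectSwap`, `Fix`, and the magic value are illustrative assumptions, not Tesseract's actual API.

```cpp
#include <cstdint>

// Sketch of the runtime approach: a known magic number written by the
// producer tells the reader whether the file's byte order matches its
// own, and a per-file "swap" flag predicates every multi-byte read.
constexpr uint32_t kMagic = 0xDEADBEEFu;

struct Reader {
  bool swap = false;

  // Decide once, from the magic number as read from the file, whether
  // all following multi-byte values need their bytes reversed.
  void DetectSwap(uint32_t magic_as_read) {
    swap = (magic_as_read != kMagic);
  }

  // Conditionally byte-swap a 32-bit value. The "if (swap)" test per
  // value is the small runtime cost discussed in this thread.
  uint32_t Fix(uint32_t v) const {
    if (!swap) return v;
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
  }
};
```

The key property is that the same data file works on both kinds of host, at the price of carrying the runtime check everywhere.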
I'll do that in the discussion of the new issue #707.
The current implementation uses […]. My experimental implementation removes the […].

The new code simply uses new functions for all reads from file, so catching all cases where swapping is needed is much simpler than in the current implementation. The same can be done on writing, to produce little endian data files no matter what endianness the host uses.
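[Editor's note] The scheme described above (dedicated read/write functions that fix the byte order on the fly) might look roughly like this. `ReadLE32` and `WriteLE32` are illustrative names, not the actual functions from the pull request.

```cpp
#include <cstdint>
#include <cstdio>

// Sketch of "new functions for all reads from file": every multi-byte
// value is assembled byte by byte from its little endian encoding, so
// the same code works unchanged on little and big endian hosts.
inline uint32_t ReadLE32(FILE *f) {
  uint8_t b[4] = {0, 0, 0, 0};
  if (fread(b, 1, 4, f) != 4) return 0;  // real code would report the error
  return static_cast<uint32_t>(b[0]) |
         (static_cast<uint32_t>(b[1]) << 8) |
         (static_cast<uint32_t>(b[2]) << 16) |
         (static_cast<uint32_t>(b[3]) << 24);
}

// The mirror image for writing: always emit little endian bytes,
// no matter what the host's byte order is.
inline void WriteLE32(uint32_t v, FILE *f) {
  const uint8_t b[4] = {static_cast<uint8_t>(v),
                        static_cast<uint8_t>(v >> 8),
                        static_cast<uint8_t>(v >> 16),
                        static_cast<uint8_t>(v >> 24)};
  fwrite(b, 1, 4, f);
}
```

Because the byte order is fixed in the functions themselves, no `swap` flag has to be threaded through the deserialize methods.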
No, I don't currently have my hands on a big endian machine. I'm pretty sure that I could get one, but so far I have not made the effort. As a side note, I'm helping ship Tesseract on 23 different architectures, some of them big endian. They all share the same data files. However, the big endian platforms are very rarely used, and there may not be bug reports even if they are totally broken. https://buildd.debian.org/status/package.php?p=tesseract&suite=unstable
You can use QEMU for testing. |
I have now run a test ([…]).
OK, now I finally understand your proposal. I like some aspects of it, but I have some suggestions for making it better: […]
Then I agree it would be cleaner, smaller, and more efficient, as well as future-proofed.
He did provide a link for 'convert2le' ([…]).
Little-endian is an abomination. It hurts the brain if you haven't grown up with it (like apparently in Arabic, which still pronounces numbers big to little, except for units before tens). UTF-8 is necessarily big-endian by design, so that lexicographical order is the same whether processing considers bytes (fast, using memcmp) or code points (slow). Packed bitmap representations like BMP and PNG (I only verified those) store the leftmost pixel in the most significant bits of a byte, so loading a range of pixels into a 64-bit integer, shifting them inside the processor, and writing them back only makes sense if integers are big-endian. Intelligent future civilisations are bound to reconsider the current accidental infatuation with little-endian, and will curse their predecessors for having to modify all those inflexible formats and software.

With C++, the byte order, or the pre-evaluated need to reverse order ("bool swap"), can easily be hidden in a member variable of a file object: it does not have to be passed around as a function argument, if that is considered too cumbersome.

Has anybody even made a performance analysis before postulating that little-endian is necessary for optimal or near-optimal performance on little-endian machines, considering that relatively little time is spent on I/O, and that data files will also tend to be little-endian?
I've had a look at stweil's proposal (endian branch), and I'm "not sure" that it will work... unless on a big-endian machine the file is saved twice, or read in again, perhaps because serialisation was the last thing the program did before it was restarted, or otherwise explicitly. The problem seems to be that the byte order is reversed in place, for serialisation as well as deserialisation. If you want to use a fixed endianness, you have to use a separate buffer, or do one of the workarounds described above. And that's if you're sure that each data item is visited exactly once; otherwise the reversal has to occur inside the lowest-level serialisation functions!

I would propose a different solution, where serialisation and deserialisation are unified into a single method, and instead of a FILE pointer you get a handle to an object that can do anything you want. This could easily simplify the existing code, without breaking compatibility with existing data files.

Another thing is to hide endianness handling in a library that can be reused and is maintained by people who do care about this issue. For example, a 64-bit swap in source code is either recognised by the compiler and optimised into a single instruction, or there is a better alternative with fewer operations. It would also be possible to use vector instructions to convert endianness.
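[Editor's note] On the point about a 64-bit swap compiling down to a single instruction: the usual portable shift-and-mask idiom, which GCC and Clang pattern-match into a `bswap` instruction on x86, looks like this (a sketch, not code from any branch in this thread):

```cpp
#include <cstdint>

// Portable 64-bit byte swap: swap adjacent bytes, then adjacent 16-bit
// units, then the two 32-bit halves. Modern compilers recognise this
// pattern and emit a single byte-swap instruction, so wrapping it in a
// small helper costs nothing at runtime.
inline uint64_t Swap64(uint64_t v) {
  v = ((v & 0x00FF00FF00FF00FFull) << 8) | ((v >> 8) & 0x00FF00FF00FF00FFull);
  v = ((v & 0x0000FFFF0000FFFFull) << 16) | ((v >> 16) & 0x0000FFFF0000FFFFull);
  return (v << 32) | (v >> 32);
}
```

Compiler builtins like `__builtin_bswap64` achieve the same thing non-portably.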
Well, it works as far as it was implemented and tested. Writing little endian trained data on a big endian host was not implemented. All other use cases (reading little endian on a little or big endian host, writing little endian on a little endian host) work according to my tests. Nevertheless, pull request #703 is now obsolete, as Ray is currently working on improved endianness support.
I would use a subset of boost serialisation: just one serialize() method template, exactly as specified in boost (instead of largely redundant separate code for writing and reading), and minimal format-compatible ad-hoc Archive implementations, without creating any dependency on boost itself (or not yet, anyway), except for reusing the idea. You can then hide the endianness policy in the Archive implementation: write host endianness (as it is now), write fixed endianness (as proposed above), write configured endianness (host endianness by default, unless perhaps publishing for reuse, or a default the other way around), write JSON, ... Better than reinventing the wheel with probably more code!

(2017-05-02 Added) Maybe it's a bit late for exactly that, I don't know... But if you want to use TFile, you should also have it do all the endianness handling, so it's never forgotten. If somebody else weren't already on it, I'd give it a try myself.
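[Editor's note] The boost-style idea sketched above (one `serialize()` method, with direction and byte-order policy hidden in the Archive type) could look roughly like this. `LEWriter`, `LEReader`, and `Params` are made-up illustrative names, not code from boost or Tesseract.

```cpp
#include <cstdint>
#include <vector>

// A writing archive with a fixed little endian policy.
struct LEWriter {
  std::vector<uint8_t> out;
  void io(uint32_t &v) {
    for (int i = 0; i < 4; ++i) out.push_back(static_cast<uint8_t>(v >> (8 * i)));
  }
};

// The matching reading archive: same byte-order policy, opposite direction.
struct LEReader {
  const uint8_t *p;
  void io(uint32_t &v) {
    v = 0;
    for (int i = 0; i < 4; ++i) v |= static_cast<uint32_t>(*p++) << (8 * i);
  }
};

// A hypothetical data class: one serialize() template describes the
// layout once, and works for both reading and writing (the boost idea,
// replacing largely redundant separate Serialize/DeSerialize methods).
struct Params {
  uint32_t width = 0, height = 0;
  template <typename Archive>
  void serialize(Archive &ar) {
    ar.io(width);
    ar.io(height);
  }
};
```

Swapping in a host-endianness or JSON archive would then not touch the data classes at all.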
Fixed in commit 8e79297 |
Ray, the new code still uses a dynamic detection in TessdataManager::LoadMemBuffer to decide whether swapping is needed or not. This implies that the code supports both big and little endian data files. The drawback is additional runtime code on all kinds of machines.

Are you planning more changes? I'd drop support of big endian data files in 4.0 and add code to always write little endian ones. Then static swapping code would only be needed on big endian machines, and the large majority of machines would not need any swap code at all.
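[Editor's note] The "static swapping" proposed here (decide at compile time, so little endian hosts carry no swap code at all) can be sketched with the GCC/Clang predefined byte-order macros. The names below are illustrative, not from any Tesseract branch.

```cpp
#include <cstdint>
#include <cstring>  // memcpy

// Compile-time host byte order via GCC/Clang predefined macros
// (MSVC targets are little endian in practice, so the fallback is fine).
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
constexpr bool kHostBigEndian = true;
#else
constexpr bool kHostBigEndian = false;
#endif

// Convert a 32-bit value from the little endian file format to host
// order. On little endian hosts this is a no-op the compiler removes,
// which is exactly the point of dropping big endian data files.
inline uint32_t FromLE32(uint32_t v) {
  if (!kHostBigEndian) return v;
  return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
         ((v << 8) & 0x00FF0000u) | (v << 24);
}

// Load a 32-bit value from raw little endian file bytes: copy into a
// host-order integer, then fix it up only on big endian hosts.
inline uint32_t LoadLE32(const uint8_t *bytes) {
  uint32_t raw;
  std::memcpy(&raw, bytes, sizeof raw);
  return FromLE32(raw);
}
```

The trade-off versus the runtime `if (swap)` check is that old big endian data files can no longer be read at all.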
The code that is there now is far simpler and cleaner than anything that was there before. While there is still minor overhead in deserializing, it is small, as there is only one check for each array.

To write only little-endian files would be a lot more work on the dead code, so it isn't worth it. This was already a lot of work on the dead code as it is.
The latest version of Tesseract crashes on all big endian machines (tested on Debian s390x and mips with the official Debian packages and with git master). See also #1525. |
See commit 21d5ce5 which fixes the crash when running OCR on a big endian machine. |
https://github.com/tesseract-ocr/tesseract/wiki/NeuralNetsInTesseract4.00#for-open-source-contributors