Possible small bug in SaxPageHandler_Hocr.java
#17
Comments
That sounds very reasonable. @sadra-barikbin, do you want to provide a pull request with your fix for prima-core-libs?
@chris1010010, is this repository still maintained?
Currently I don't have time, unfortunately.
C.
On Tue, 6 Feb 2024, 17:14 Stefan Weil wrote:
> @chris1010010, is this repository still maintained?
This patch fixes an issue in PrimaDla.jar, which is used by JPageConverter, see PRImA-Research-Lab#17.
Reported-by: Sadra Barikbin
Signed-off-by: Stefan Weil
@sadra-barikbin, I applied your fix in https://github.com/UB-Mannheim/prima-core-libs/releases/tag/1.5.02 with commit UB-Mannheim@3868892. Based on that, I created https://github.com/UB-Mannheim/prima-page-converter/releases/tag/1.5.06 with a fixed JPageConverter. Our latest ocr-fileformat now uses that fixed JPageConverter.
Good idea. Maybe I can do that tomorrow.
Done, see new release 1.5.03. Please note that I have not tested it yet. As soon as testing is done, new releases for dependent repositories can be made.
For testing, the simplest approach would be to create examples by "breaking" files drawn from some public GT, like the two causes I described. But you can throw in other errors for good measure (like empty …)
Hi there!
I was using UB-Mannheim's ocr-fileformat to convert hOCR to PAGE XML. It internally uses PRImA.PageConverter to do the job, which itself depends on this Java library. It successfully converted the file, except that it put an extra quotation mark next to the image file name in the PAGE XML output, causing a mismatch between the XML file and the image (note the extra quotation mark below):
I delved into the problem and it turned out that when there's no separator in the file name, the hOCR handler extracts it incorrectly.
prima-core-libs/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandler_Hocr.java
Lines 311 to 325 in 1bdcc57
Above, line 319 should become
to fix the issue, because `part.indexOf(" \"")` returns the index of the space character, not of the `"`.

@stweil @bertsky
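
To illustrate the `indexOf` behaviour described above, here is a minimal, self-contained sketch. It is not the code from `SaxPageHandler_Hocr.java`: the class `HocrImageNameDemo`, the helper `extractQuoted`, and the sample property string `image "page.png"` are assumptions made for the demonstration only.

```java
// Hypothetical demo, not the library's code.
class HocrImageNameDemo {

    // Assumed helper: returns the text between the first ` "` and the next `"`,
    // or null if no opening ` "` is found.
    static String extractQuoted(String part) {
        int start = part.indexOf(" \""); // index of the SPACE, not of the quote
        if (start < 0) {
            return null;
        }
        start += 2; // skip both the space and the opening quotation mark
        int end = part.indexOf('"', start);
        return end < 0 ? part.substring(start) : part.substring(start, end);
    }

    public static void main(String[] args) {
        String part = "image \"page.png\""; // assumed sample hOCR title property

        int pos = part.indexOf(" \"");
        System.out.println(pos);                         // 5
        System.out.println(part.charAt(pos) == ' ');     // true
        System.out.println(part.charAt(pos + 1) == '"'); // true

        // Skipping only one character keeps the opening quotation mark.
        System.out.println(part.substring(pos + 1));     // "page.png"

        // Skipping two characters yields the bare file name.
        System.out.println(extractQuoted(part));         // page.png
    }
}
```

The point of the sketch is only that `String.indexOf(String)` returns the position where the match starts, so any offset arithmetic has to account for every character of the search string that precedes the text of interest.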