Possible small bug in SaxPageHandler_Hocr.java #17

Open
sadra-barikbin opened this issue Jan 14, 2024 · 8 comments

sadra-barikbin commented Jan 14, 2024

Hi there!
I was using UB-Mannheim's ocr-fileformat to convert hOCR to PAGE XML. Internally it uses PRImA.PageConverter to do the job, which in turn depends on this Java library. It successfully converted the file, except that it put an extra quotation mark next to the image file name in the PAGE XML output, causing a mismatch between the XML file and the image:
image
(Note the extra quotation mark below)
image

I delved into the problem, and it turned out that when there's no separator in the file name, the hOCR handler extracts it incorrectly.

    if (part.startsWith("image")) {
        String image = null;
        //Filename
        // Path
        if (part.contains(File.separator))
            image = part.substring(part.lastIndexOf(File.separator)+1);
        // No path
        else if (part.contains(" \""))
            image = part.substring(part.indexOf(" \"")+1);
        if (image != null) {
            //Remove quotation mark
            if (image.endsWith("\""))
                image = image.substring(0, image.length()-1);
            page.setImageFilename(image);

Above, line 319 of the file (the assignment in the "No path" branch) should become

            image = part.substring(part.indexOf(" \"")+2); 

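to fix the issue, because part.indexOf(" \"") returns the index of the space character, not of the quotation mark. With +1 the substring therefore still starts at the opening quotation mark; with +2 it starts at the first character of the file name.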
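
To make the off-by-one concrete, here is a minimal standalone sketch of the "No path" branch. It is not the library code: the class name, the method name, the `offset` parameter and the sample input are all made up for illustration, and it assumes the handler sees a `part` shaped like `image "page_0001.png"`.

```java
public class HocrImageNameDemo {

    // Mirrors only the "No path" branch of the excerpt.
    // offset = 1 is the current behaviour, offset = 2 is the proposed fix.
    // (Hypothetical helper; the real handler calls page.setImageFilename instead.)
    static String extractImageName(String part, int offset) {
        String image = null;
        if (part.contains(" \""))
            image = part.substring(part.indexOf(" \"") + offset);
        // Remove the trailing quotation mark, as the original code does
        if (image != null && image.endsWith("\""))
            image = image.substring(0, image.length() - 1);
        return image;
    }

    public static void main(String[] args) {
        // Assumed shape of one hOCR title property without a path separator
        String part = "image \"page_0001.png\"";

        System.out.println(extractImageName(part, 1)); // prints "page_0001.png  (leading quote kept)
        System.out.println(extractImageName(part, 2)); // prints page_0001.png
    }
}
```

Here indexOf returns 5 (the space), so +1 points at the opening quotation mark and +2 at the `p` of the file name.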

@stweil @bertsky


stweil commented Feb 6, 2024

That sounds very reasonable. @sadra-barikbin, do you want to provide a pull request with your fix for prima-core-libs?


stweil commented Feb 6, 2024

@chris1010010, is this repository still maintained?


chris1010010 commented Feb 6, 2024 via email

stweil added a commit to UB-Mannheim/prima-core-libs that referenced this issue Feb 7, 2024
This patch fixes an issue in PrimaDla.jar which is used by JPageConverter,
see PRImA-Research-Lab#17.

Reported-by: Sadra Barikbin <[email protected]>
Signed-off-by: Stefan Weil <[email protected]>

stweil commented Feb 7, 2024

@sadra-barikbin, I applied your fix in https://github.com/UB-Mannheim/prima-core-libs/releases/tag/1.5.02 with commit UB-Mannheim@3868892. Based on that, I created https://github.com/UB-Mannheim/prima-page-converter/releases/tag/1.5.06 with a fixed JPageConverter.

Our latest ocr-fileformat now uses that fixed JPageConverter.


bertsky commented Feb 7, 2024

@stweil 👍 on shipping your own build.

You may want to consider also merging #16 into your prima-core-libs build and making another release. It addresses the frequent problem that parsing fails for some reason, but that reason is not shown to the user.


stweil commented Feb 7, 2024

Good idea. Maybe I can do that tomorrow.


stweil commented Feb 8, 2024

Done, see new release 1.5.03. Please note that I have not tested it yet. As soon as testing is done, new releases for dependent repositories can be made.


bertsky commented Feb 8, 2024

For testing, the simplest approach would be to create examples by "breaking" files drawn from some public GT, like the two causes I described. But you can throw in other errors for good measure (like an empty ReadingOrder, an empty Unicode, a conflicting segment @id or a conflicting TextEquiv/@index), or even plain schema invalidities or XML invalidities.
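
As a rough sketch of what "breaking" a file could look like, the snippet below produces a variant of a PAGE XML string in which every TextRegion shares the first region's @id. It uses only the JDK's DOM and transformer APIs. The class and method names are invented for illustration, the inline sample is a stripped-down stand-in for a real GT file, and it covers only the conflicting-@id case.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class BreakPageXml {

    // Hypothetical helper: returns a copy of the input in which all
    // TextRegion elements carry the id of the first one (conflicting @id).
    static String withConflictingIds(String pageXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(pageXml)));

        // "*" matches any namespace, so the PAGE schema version does not matter
        NodeList regions = doc.getElementsByTagNameNS("*", "TextRegion");
        if (regions.getLength() > 1) {
            String firstId = ((Element) regions.item(0)).getAttribute("id");
            for (int i = 1; i < regions.getLength(); i++)
                ((Element) regions.item(i)).setAttribute("id", firstId);
        }

        // Serialise the modified tree back to a string
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Minimal stand-in for a real ground-truth file (not schema-complete)
        String sample = "<PcGts><Page>"
                + "<TextRegion id=\"r1\"/><TextRegion id=\"r2\"/>"
                + "</Page></PcGts>";
        System.out.println(withConflictingIds(sample));
    }
}
```

Feeding the result to the converter should then show whether the conflicting @id is reported to the user. The other error classes (empty ReadingOrder, schema or XML invalidities) would need their own mutations.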
