Possible small bug in `SaxPageHandler_Hocr.java` #17

sadra-barikbin · 2024-01-14T13:56:50Z

Hi there!
I was using UB-Mannheim's ocr-fileformat to convert hOCR to PAGE XML. Internally it uses PRImA.PageConverter to do the job, which in turn depends on this Java library. It successfully converted the file, except that it put an extra quotation mark next to the image file name in the PAGE XML output, causing a mismatch between the XML file and the image:

(Note the extra quotation mark below)

I delved into the problem and it turned out that when there's no path separator in the file name, the hOCR handler extracts it incorrectly.

prima-core-libs/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandler_Hocr.java

Lines 311 to 325 in 1bdcc57

    
           if (part.startsWith("image")) { 
        
           	String image = null; 
        
           	//Filename 
        
           	// Path 
        
           	if (part.contains(File.separator)) 
        
           		image = part.substring(part.lastIndexOf(File.separator)+1); 
        
           	// No path 
        
           	else if (part.contains(" \"")) 
        
           		image = part.substring(part.indexOf(" \"")+1); 
        
           	if (image != null) { 
        
           		//Remove quotation mark 
        
           		if (image.endsWith("\"")) 
        
           			image = image.substring(0, image.length()-1); 
        
           		page.setImageFilename(image);

Line 319 above should become

            image = part.substring(part.indexOf(" \"")+2);

to fix the issue, because `part.indexOf(" \"")` returns the index of the space character, not of the quotation mark that follows it.
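To make the off-by-one concrete, here is a minimal standalone sketch of the "no path" branch with both offsets. The class name, method names and the sample `part` string are made up for illustration and are not taken from the library; only the `indexOf`/`substring` logic mirrors the snippet quoted above.

```java
// Hypothetical demo class; it only mirrors the "no path" branch quoted above.
final class HocrImageNameDemo {

    // Current behaviour: +1 skips the space but keeps the opening quotation mark.
    static String extractOriginal(String part) {
        String image = part.substring(part.indexOf(" \"") + 1);
        return stripTrailingQuote(image);
    }

    // Proposed behaviour: +2 skips both the space and the opening quotation mark.
    static String extractFixed(String part) {
        String image = part.substring(part.indexOf(" \"") + 2);
        return stripTrailingQuote(image);
    }

    // Same trailing-quote removal as in the quoted handler code.
    private static String stripTrailingQuote(String image) {
        if (image.endsWith("\""))
            image = image.substring(0, image.length() - 1);
        return image;
    }

    public static void main(String[] args) {
        // Assumed shape of one hOCR title property without a path separator.
        String part = "image \"page_0001.png\"";
        System.out.println(extractOriginal(part)); // prints "page_0001.png (leading quote kept)
        System.out.println(extractFixed(part));    // prints page_0001.png
    }
}
```

With the original offset the leading quotation mark survives, which matches the stray quotation mark I saw in the PAGE XML output.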

@stweil @bertsky

stweil · 2024-02-06T17:07:33Z

That sounds very reasonable. @sadra-barikbin, do you want to provide a pull request with your fix for the prima-core-libs?

stweil · 2024-02-06T17:14:04Z

@chris1010010, is this repository still maintained?

chris1010010 · 2024-02-06T17:17:08Z

Currently I don't have time, unfortunately. C.

This patch fixes an issue in PrimaDla.jar which is used by JPageConverter, see PRImA-Research-Lab#17. Reported-by: Sadra Barikbin <[email protected]> Signed-off-by: Stefan Weil <[email protected]>

stweil · 2024-02-07T20:02:36Z

@sadra-barikbin, I applied your fix in https://github.com/UB-Mannheim/prima-core-libs/releases/tag/1.5.02 with commit UB-Mannheim@3868892. Based on that, I created https://github.com/UB-Mannheim/prima-page-converter/releases/tag/1.5.06 with a fixed JPageConverter.

Our latest ocr-fileformat now uses that fixed JPageConverter.

bertsky · 2024-02-07T20:10:39Z

@stweil 👍 on shipping your own build.

You may want to consider also merging #16 into your prima-core-libs build and making another release. It addresses the frequent problem that parsing fails for some reason, but that reason is not shown to the user.

stweil · 2024-02-07T21:17:32Z

Good idea. Maybe I can do that tomorrow.

stweil · 2024-02-08T21:27:27Z

Done, see new release 1.5.03. Please note that I have not tested it yet. As soon as testing is done, new releases for dependent repositories can be made.

bertsky · 2024-02-08T22:16:18Z

For testing, the simplest would be to create examples by "breaking" files drawn from some public GT, like the two causes I described. But you can throw in other errors for good measure (like empty ReadingOrder or empty Unicode or conflicting segment @id or conflicting TextEquiv/@index, or even plain schema invalidities or even XML invalidities).
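As a rough sketch of what such "breaking" could look like for the empty-Unicode case: the helper below blanks every `Unicode` element in a PAGE XML string using only the JDK's DOM and transformer APIs. The class and method names are hypothetical, and the approach is just one assumed way to derive a broken fixture from a GT file, not something that exists in prima-core-libs.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical fixture helper; not part of any PRImA library.
final class BrokenPageFixture {

    // Returns a copy of the given PAGE XML in which every Unicode element is empty.
    static String emptyUnicode(String pageXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // match Unicode in whatever PAGE namespace version is used
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(pageXml)));

            NodeList nodes = doc.getElementsByTagNameNS("*", "Unicode");
            for (int i = 0; i < nodes.getLength(); i++)
                nodes.item(i).setTextContent(""); // keep the element, drop its text

            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not rewrite PAGE XML", e);
        }
    }
}
```

Writing the returned string to a file would give one broken input per GT page; the other error types (conflicting @id, empty ReadingOrder) could be produced with similar small DOM edits.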
