No support to UNLV files #10

Closed
brnrc opened this issue Feb 11, 2015 · 10 comments

Comments

@brnrc

brnrc commented Feb 11, 2015

When I run tesseract from the CLI, it automatically uses any .uzn file that has the same base name as the target image.

Example:
tesseract workingimage004.png output.txt -psm 4

If a workingimage004.uzn file sits in the same directory as workingimage004.png, tesseract uses it.

Tess4J ignores the uzn file.

Here is my code:

        Tesseract instance = Tesseract.getInstance();
        //In case you don't have your own tessdata, let it also be extracted for you
        File tessDataFolder = LoadLibs.extractTessResources("tessdata");

        //Set the tessdata path
        instance.setDatapath(tessDataFolder.getAbsolutePath());

        instance.setPageSegMode(4);
        try {
            String content = instance.doOCR(page);
        } catch (TesseractException e) {
            e.printStackTrace();
        }

Additional info:
I found the function that reads a UNLV zone file and segments the image accordingly at ccstruct\blread.cpp:36 (in the tesseract source).

@brnrc changed the title from "No support to UZN files" to "No support to UNLV files" Feb 11, 2015
@4F2E4A2E
Collaborator

Please provide the exact MIME type and a test file, thank you.

@brnrc
Author

brnrc commented Feb 11, 2015

image: https://www.dropbox.com/s/dtb1ra7jqdk9shw/workingimage004.png
uzn: https://www.dropbox.com/s/o2j6atqnwqx7kir/workingimage004.uzn

Run with: tesseract workingimage004.png manual_ocr.txt -psm 4
The output should be:

ANNEX TO THE EUROPEAN SEARCH REPORT
ON EUROPEAN PATENT APPLICATION NO.

@brnrc
Author

brnrc commented Feb 11, 2015

Hey @4F2E4A2E,
Just a friendly question: are you looking into this?
I just want to know whether I should wait or try a different solution! Thanks

@4F2E4A2E
Collaborator

Yes, I am. Please get online on Skype, thanks.

@4F2E4A2E
Collaborator

Something is not right:

  • the png file provided has a lot more content than two lines
  • the uzn file provided seems to be empty, or holds just 26 bytes
  • you did not provide the MIME type

Help me out, this format is new to me. To help, I need to be 100% sure of the data integrity of those files, and I need to be able to produce uzn files myself. Thanks.

@4F2E4A2E
Collaborator

Thanks brucardoso2 for introducing me to uzn.
Our lib does support it, just not automatically; that will come in our next release. Until then, parse your uzn file yourself and populate a java.awt.Rectangle from it. Here is an example that hard-codes the rectangle instead of reading and parsing the uzn file:

File imageFile = new File("workingimage004.png");
Tesseract instance = Tesseract.getInstance();  // JNA Interface Mapping
File tessDataFolder = LoadLibs.extractTessResources("tessdata");
instance.setDatapath(tessDataFolder.getAbsolutePath());

// Region to recognize (x, y, width, height); in real code, parse these values from the uzn file
Rectangle rect = new Rectangle(600, 213, 1142, 240);
try {
    // Run OCR on the given region of the image only
    String result = instance.doOCR(imageFile, rect);
    System.out.println(result);
} catch (TesseractException e) {
    System.err.println(e.getMessage());
}
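
For anyone who needs the parsing step in the meantime, here is a minimal sketch of it. It assumes each line of the uzn file holds four whitespace-separated integers (left, top, width, height) followed by an optional label, which matches how I read blread.cpp; please verify that against your own files. The class name UznZones and both method names are made up for this example and are not part of Tess4J.

```java
import java.awt.Rectangle;
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tess4J.
final class UznZones {

    // Returns the .uzn file that shares the image's directory and base name,
    // mirroring the convention the tesseract CLI uses.
    static File siblingUzn(File image) {
        String name = image.getName();
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        return new File(image.getParentFile(), base + ".uzn");
    }

    // Parses zone lines of the assumed form "left top width height [label]".
    // Blank or malformed lines are skipped rather than treated as errors.
    static List<Rectangle> parse(Reader source) throws IOException {
        List<Rectangle> zones = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.trim().split("\\s+");
                if (fields.length < 4) {
                    continue; // blank line or too few columns
                }
                try {
                    int x = Integer.parseInt(fields[0]);
                    int y = Integer.parseInt(fields[1]);
                    int width = Integer.parseInt(fields[2]);
                    int height = Integer.parseInt(fields[3]);
                    zones.add(new Rectangle(x, y, width, height));
                } catch (NumberFormatException e) {
                    // Non-numeric coordinates: ignore this line.
                }
            }
        }
        return zones;
    }
}
```

Each rectangle returned by parse can then be passed to instance.doOCR(imageFile, rect) as in the snippet above, for example by opening UznZones.siblingUzn(imageFile) with a FileReader and concatenating the per-zone results.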

@4F2E4A2E
Collaborator

A proposal has been created; you can read it here: #11
@brucardoso2: can I close this issue?

@brnrc
Author

brnrc commented Feb 11, 2015

@4F2E4A2E sure

@4F2E4A2E
Collaborator

Closing issue.
