Highlight OCR #1580

Open
wgilling opened this issue Aug 12, 2020 · 5 comments
Labels
Type: enhancement Identifies work on an enhancement to the Islandora codebase

Comments

@wgilling
Contributor

Tesseract can be used to generate HOCR again, but displaying it poses several challenges.

This likely does not apply to objects that would be viewed using the PDF.js viewer, because that viewer already handles search highlighting.

  1. Editing OCR would be much trickier, since the HOCR file would potentially need to be updated as well.
  2. Displaying the rectangles for each search term was a much easier concept in HTML, via the CSS classes in the HOCR file; the challenge is how to use these rectangles to make overlays in the OpenSeadragon viewer.
  3. A corresponding action would be needed to generate HOCR for objects that have already been ingested (similar to the "Index node in Fedora" action).
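
For item 2, here is a minimal sketch of how HOCR word rectangles might be turned into overlay rectangles. It assumes Tesseract-style HOCR, where each word is a `span` of class `ocrx_word` whose `title` attribute carries `bbox x0 y0 x1 y1` in image pixels, and it assumes OpenSeadragon's default viewport coordinates, where the image width is 1.0 and y uses the same scale. The names `HocrWordParser`, `overlays_for_term` and `SAMPLE_HOCR` are hypothetical, matching is exact per word, and none of this is an existing Islandora implementation.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" property inside an HOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

# Hypothetical sample page, 1000 px wide, with two recognised words.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 1000 1500'>
  <span class='ocr_line' title='bbox 100 200 600 250'>
    <span class='ocrx_word' title='bbox 100 200 300 250; x_wconf 95'>Fraktur</span>
    <span class='ocrx_word' title='bbox 320 200 600 250; x_wconf 91'>Schrift,</span>
  </span>
</div>
"""


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) pairs from ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            self._bbox = tuple(int(g) for g in match.groups()) if match else None

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None


def overlays_for_term(hocr, term, image_width):
    """Return viewport-space rectangles for words equal to `term`."""
    parser = HocrWordParser()
    parser.feed(hocr)
    wanted = term.casefold()
    rects = []
    for text, (x0, y0, x1, y1) in parser.words:
        # Ignore case and surrounding punctuation when comparing.
        if text.casefold().strip(".,;:!?\"'()") == wanted:
            rects.append({
                "x": x0 / image_width,
                "y": y0 / image_width,  # y is also scaled by the width
                "width": (x1 - x0) / image_width,
                "height": (y1 - y0) / image_width,
            })
    return rects
```

Each returned dictionary could be serialised to JSON and, on the client side, passed to `viewer.addOverlay()` as an `OpenSeadragon.Rect(x, y, width, height)`.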
@jasonhildebrand

I understand that Islandora 8 does not support highlighting search results when using OpenSeadragon. I'm contributing our use case in the hope that this feature will be prioritized soon.

In our case, we are digitizing PDF files using ABBYY FineReader, which supports OCR of German Gothic script (Fraktur). It produces PDF files containing the scanned image, as well as the OCR'd text in a separate layer. You can open one of these files in a PDF reader and search it, and it will correctly highlight the location of the matching text.

When we import into Islandora 8, the PDF is converted to a service image, and this is displayed using OpenSeadragon.
To support our use case, I suppose that Islandora would need to determine the location of matched text using the uploaded PDF (since this information is not contained in the JPG service file), then produce overlay information for OpenSeadragon.
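
As a rough illustration of the last step, the sketch below converts a text bounding box taken from a PDF page into an OpenSeadragon overlay rectangle. It assumes the box is given in default PDF user space (points, origin at the bottom-left, y pointing up), that the service image shows the whole page without cropping or rotation, and that OpenSeadragon uses its default viewport coordinates (image width 1.0, origin at the top-left). The function name `pdf_box_to_viewport` is hypothetical.

```python
def pdf_box_to_viewport(box, page_width, page_height):
    """Convert a PDF-space box (x0, y0, x1, y1) to a viewport rectangle.

    PDF y grows upward from the bottom edge, while the viewer's y grows
    downward from the top edge, so the top of the box (y1) is flipped.
    All values are divided by the page width because the viewport
    treats the image width as 1.0.
    """
    x0, y0, x1, y1 = box
    return {
        "x": x0 / page_width,
        "y": (page_height - y1) / page_width,
        "width": (x1 - x0) / page_width,
        "height": (y1 - y0) / page_width,
    }


# Hypothetical example: a word near the top of a 600 x 800 point page.
overlay = pdf_box_to_viewport((60, 700, 180, 720), 600, 800)
```

Because only ratios are used, the result does not depend on the pixel resolution of the JPG service file, only on its matching the page's aspect ratio.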

@seth-shaw-asu
Member

The Islandora-Lite folks @ the University of Toronto Scarborough (tagging @kstapelfeldt and @Natkeeran) did a demonstration of their setup during IslandoraCon 2022 which included improvements in viewer-supported OCR. I believe they were using annotations served via IIIF, but I don't recall details. I look forward to watching their presentation again when it gets posted.

@Natkeeran
Contributor

To clarify, it is an early prototype. Please see additional info here: https://github.com/digitalutsc/islandora_lite_docs/wiki/Mirador-Search-and-Annotations-(Prototype)

@alxp (UPEI) is also looking into this feature.

@wgilling
Contributor Author

wgilling commented Aug 24, 2022

I'd love to first explore the Mirador Search and Annotations (Prototype) and work with @alxp on this solution, since it seems that anybody who is already using Mirador would be able to use it.

Jordan Dukart referenced the mirador-textoverlay code here https://github.com/dbmdz/mirador-textoverlay and said that this was what UTSC and CMU were using, though Mirador likely takes an intermediate format rather than an HOCR file per page.

Also, Don Richards mentioned this https://dbmdz.github.io/solr-ocrhighlighting/0.8.1/ while he was researching the topic.

@jasonhildebrand

jasonhildebrand commented Oct 31, 2022

FYI, we have implemented a solution to our use case which I noted earlier. Here is our approach at a high level:

  • implemented a microservice which accepts a PDF URL (of a PDF in Fedora) and search terms as input, then extracts text from the PDF file and returns bounding boxes of matching terms. We implement some fuzziness in order to match words which don't match 100% (because Solr may be doing word stemming, etc.).
  • customized islandora/openseadragon to query the microservice, obtain locations of matching terms, and create overlays to highlight the terms
  • implemented a REST view in Islandora that, given a node of model = Page, allows us to fetch the URL of the original PDF file of that node. Our openseadragon customization uses this info.
  • small customization to Islandora to retain search terms in the query string when clicking on a search result
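
The matching step of the first bullet could look roughly like the sketch below. This is not the code described above: the thread does not say how the fuzziness is implemented, so the sketch substitutes the similarity ratio from Python's standard `difflib` module as an assumption. It also assumes the words and their bounding boxes have already been extracted from the PDF by some library (not shown). The function name `fuzzy_match_boxes` and the `threshold` default are hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_match_boxes(words, terms, threshold=0.8):
    """Return bounding boxes of words that approximately match a term.

    words: list of (text, (x0, y0, x1, y1)) pairs extracted from a page.
    terms: list of search terms as typed by the user.
    threshold: minimum similarity ratio (0.0 to 1.0) to count as a match.
    """
    folded_terms = [term.casefold() for term in terms]
    matches = []
    for text, bbox in words:
        candidate = text.casefold()
        for term in folded_terms:
            score = SequenceMatcher(None, candidate, term).ratio()
            if score >= threshold:
                matches.append({"text": text, "term": term,
                                "bbox": bbox, "score": score})
                break  # one hit per word is enough for highlighting
    return matches


# Hypothetical extracted words: an inflected form, a stop word, an exact hit.
page_words = [
    ("Zeitungen", (10, 10, 90, 30)),
    ("und", (95, 10, 120, 30)),
    ("Zeitung", (10, 40, 70, 60)),
]
hits = fuzzy_match_boxes(page_words, ["zeitung"])
```

A similarity threshold is a crude stand-in for the stemming a search index performs; running the search engine's own analysis chain over the extracted words would align more closely with what it actually matched.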

This approach was driven largely by the format of our source PDFs (and the need to complete our project on budget). I don't know whether it is of interest to the Islandora community, but I thought I would post it here in case anyone is interested.
