Fixes Extracted Text not indexing in solr #767

Merged Apr 1, 2020 (1 commit)

@dannylamb dannylamb commented Mar 30, 2020

JIRA Ticket: Resolves Islandora/documentation#1476

What does this Pull Request do?

Manually triggers search re-indexing for nodes when their extracted text media are inserted or updated.

What's new?

Brute-forces re-indexing in the media insert and update hooks.
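The idea can be sketched outside Drupal. This is a minimal Python model, not the code in this PR: `Media`, `SearchTracker`, `reindex_parent` and the two `on_media_*` functions are hypothetical stand-ins, and the `extracted_text` bundle name is an assumption. It only shows the shape of the fix: both the insert hook and the update hook queue the parent node for re-indexing when the media holds extracted text.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# Assumed bundle name for extracted text media; adjust to the real one.
EXTRACTED_TEXT_BUNDLE = "extracted_text"


@dataclass
class Media:
    """Hypothetical stand-in for a Drupal media entity."""
    media_id: int
    bundle: str
    parent_node_id: Optional[int]  # the node this media belongs to, if any


@dataclass
class SearchTracker:
    """Hypothetical stand-in for whatever tracks items awaiting indexing."""
    pending: Set[int] = field(default_factory=set)

    def mark_for_reindex(self, node_id: int) -> None:
        self.pending.add(node_id)


def reindex_parent(media: Media, tracker: SearchTracker) -> bool:
    """Queue the parent node for re-indexing if this is extracted text media.

    Returns True when a node was queued, False when the media was ignored.
    """
    if media.bundle != EXTRACTED_TEXT_BUNDLE or media.parent_node_id is None:
        return False
    tracker.mark_for_reindex(media.parent_node_id)
    return True


def on_media_insert(media: Media, tracker: SearchTracker) -> bool:
    """Modelled insert hook: a new media was saved."""
    return reindex_parent(media, tracker)


def on_media_update(media: Media, tracker: SearchTracker) -> bool:
    """Modelled update hook: an existing media was saved again."""
    return reindex_parent(media, tracker)
```

Routing both hooks through one helper keeps insert and update behaviour identical, and media of other bundles never touch the tracker.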

How should this be tested?

To confirm the bug (without this PR):

  • Ingest some paged content
  • Go to a page and look at its extracted text file
  • Search for something in the text
  • Should get no results

Now apply this PR

  • Ingest a new page
  • Go to the page and look at its extracted text file
  • Search for something in the text
  • Should get results!
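The search step could also be checked against Solr directly rather than through the site's search page. The sketch below is only an illustration under assumptions: the host, port and core name are placeholders, and `build_solr_query_url` and `count_hits` are hypothetical helpers, not part of Islandora. It builds a request for Solr's standard `select` handler and reads `numFound` from a JSON response.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url: str, core: str, term: str, rows: int = 10) -> str:
    """Build a URL for Solr's select handler that searches for `term`.

    `base_url` and `core` are placeholders for the local Solr instance.
    """
    params = urlencode({"q": term, "rows": rows, "wt": "json"})
    return f"{base_url.rstrip('/')}/solr/{core}/select?{params}"


def count_hits(solr_response: dict) -> int:
    """Return the number of matching documents from a parsed Solr JSON response."""
    return int(solr_response.get("response", {}).get("numFound", 0))


if __name__ == "__main__":
    # Placeholder values; substitute the real Solr host and core.
    print(build_solr_query_url("http://localhost:8983", "example_core", "whale"))
```

Before the PR is applied, a term taken from the extracted text would be expected to give a count of zero; after ingesting a new page with the PR applied, it should be greater than zero.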

Interested parties

@Islandora/8-x-committers

kayakr commented Mar 30, 2020

@dannylamb I noticed this issue last night also, so it's great to have a patch to try. I've applied the patch via composer and it works for a fresh new Repository Item, but when I tried a media update (e.g. by removing the PDF and adding a different PDF) I got a new Thumbnail Image while the extracted_text was still the text from the old file...

dannylamb (Author) commented

@kayakr Thanks so much for looking at this. I've pushed up a tidier version of what I had before. But hold up on testing it. I'll walk through those steps and see why the extracted text isn't updating when replacing a file. If I can fix it here I will.

@dannylamb dannylamb changed the title Update islandora_text_extraction.module Fixes Extracted Text not indexing in solr Mar 31, 2020

dannylamb (Author) commented

So I followed your steps and uploaded a new file to the "Original File" media. I got a new thumbnail but no extracted text or technical metadata. Extracted text left nothing in the logs, but I did get this for the FITS data:

Message History
---------------------------------------------------------------------------------------------------------------------------------------
RouteId              ProcessorId          Processor                                                                        Elapsed (ms)
[IslandoraConnector] [IslandoraConnector] [activemq://queue:islandora-connector-fits                                     ] [     41196]
[IslandoraConnector] [unmarshal9        ] [unmarshal[org.apache.camel.model.dataformat.JsonDataFormat@76ae54a5]          ] [         1]
[IslandoraConnector] [setProperty24     ] [setProperty[event]                                                            ] [         0]
[IslandoraConnector] [removeHeaders14   ] [removeHeaders[*]                                                              ] [         0]
[IslandoraConnector] [setHeader37       ] [setHeader[CamelHttpMethod]                                                    ] [         0]
[IslandoraConnector] [setHeader38       ] [setHeader[Accept]                                                             ] [         0]
[IslandoraConnector] [setHeader39       ] [setHeader[X-Islandora-Args]                                                   ] [         1]
[IslandoraConnector] [setHeader40       ] [setHeader[Apix-Ldp-Resource]                                                  ] [         0]
[IslandoraConnector] [setBody10         ] [setBody[simple{Simple: ${null}}]                                              ] [         0]
[IslandoraConnector] [to18              ] [{{derivative.service.url}}?connectionClose=true                               ] [     31165]
[IslandoraConnector] [removeHeaders15   ] [removeHeaders[*]                                                              ] [         0]
[IslandoraConnector] [setHeader41       ] [setHeader[Content-Location]                                                   ] [         0]
[IslandoraConnector] [setHeader42       ] [setHeader[CamelHttpMethod]                                                    ] [         0]
[IslandoraConnector] [toD11             ] [                                                                              ] [     10029]
[IslandoraConnector] [log20             ] [log                                                                           ] [         2]

Stacktrace
---------------------------------------------------------------------------------------------------------------------------------------
org.apache.camel.http.common.HttpOperationFailedException: HTTP operation failed invoking http://localhost:8000/node/13/media/fits_technical_metadata/1 with statusCode: 400

The Drupal logs are showing:

Symfony\Component\HttpKernel\Exception\HttpException: 'field_config' entity with ID 'media.fits_technical_metadata.fits_jhove_creating_application_' already exists. in Drupal\islandora\Controller\MediaSourceController->putToNode() (line 191 of /var/www/html/drupal/web/modules/contrib/islandora/src/Controller/MediaSourceController.php).

dannylamb (Author) commented

I've been debugging this pretty hard and have run into everything from file permissions issues (www-data couldn't write to sites/default/files) to transaction issues like this one:

Symfony\Component\HttpKernel\Exception\HttpException: SQLSTATE[42000]: Syntax error or access violation: 1305 SAVEPOINT savepoint_1 does not exist: ROLLBACK TO SAVEPOINT savepoint_1; Array ( ) in Drupal\islandora\Controller\MediaSourceController->putToNode() (line 191 of /var/www/html/drupal/web/modules/contrib/islandora/src/Controller/MediaSourceController.php).

At this point I'm going to rebuild my environment and try again to see if I can isolate the issues. But one thing's for sure: you upload a new media and things go haywire.

It would be nice if another of the @Islandora/8-x-committers could try to recreate this as well, just as a basic sanity check.

@dannylamb dannylamb modified the milestone: 1.1.0 Apr 1, 2020
seth-shaw-unlv commented

I'll give it a spin.

@seth-shaw-unlv seth-shaw-unlv left a comment

Ok, so I pulled in the PR and it works! Granted, I hit the FITS WSOD issue, and the logs are full of errors from the JsonldTypeAlterReaction, but this PR works as advertised and isn't the source of those errors.

@seth-shaw-unlv seth-shaw-unlv merged commit 029bbcf into 8.x-1.x Apr 1, 2020
@seth-shaw-unlv seth-shaw-unlv deleted the issue-1476 branch April 1, 2020 16:26