Issue #941: Only add <br/> tags to plain text extracted text fields. #942

Merged
rosiel merged 4 commits into 2.x from issue-941-text-extract-br-tags on May 3, 2023

Conversation

alxp

@alxp alxp commented May 2, 2023

GitHub Issue: [BUG] islandora_text_extraction inserting <br> tags even when it shouldn't

What does this Pull Request do?

Checks the text format of the incoming extracted text field value, and adds <br /> tags only if it is plain text.
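
As a rough illustration of that check (not the module's actual code, which is PHP running inside Drupal), the logic can be sketched in Python. The function name `prepare_extracted_text` and the format identifiers `plain_text` and `hocr` are assumptions made for this example only.

```python
# Hypothetical format identifiers; the real machine names depend on the site.
PLAIN_TEXT_FORMATS = {"plain_text"}


def prepare_extracted_text(text: str, text_format: str) -> str:
    """Return the value to store in the extracted text field.

    Line-break tags are added only when the incoming value is plain text.
    Markup such as hOCR is returned unchanged, because injecting <br />
    tags into it would corrupt its structure.
    """
    if text_format in PLAIN_TEXT_FORMATS:
        # Normalise Windows line endings, then put a <br /> before each newline.
        normalised = text.replace("\r\n", "\n")
        return normalised.replace("\n", "<br />\n")
    return text
```

With these assumed names, `prepare_extracted_text("a\nb", "plain_text")` gains a `<br />` before the newline, while the same call with `"hocr"` returns its input untouched.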

What's new?

  • Does this change add any new dependencies? No
  • Does this change require any other modifications to be made to the repository
    (i.e. Regeneration activity, etc.)? No
  • Could this change impact execution of existing code? No

How should this be tested?

Set up a media type, such as File, that accepts TIFF uploads.

Add a field with type Text (Long) (not plain).

Create a custom action based on the text extraction action type. Pick the field you created just now, and select hOCR as the text format.

Add a context action so this action is fired when the media is uploaded.

Create a Page node and add a media of the type you created above. Upload a TIFF file with embedded text.

Observe that no extra br tags are inserted into the field when the action is fired after saving the media.

Documentation Status

  • Does this change existing behaviour that's currently documented? No
  • Does this change require new pages or sections of documentation? No
  • Who does this need to be documented for? N/A
  • Associated documentation pull request(s): ___ or documentation issue ___

Additional Notes:

Any additional information that you think would be helpful when reviewing this
PR.

Interested parties

Tag (@ mention) interested parties or, if unsure, @Islandora/committers
@ajstanley

@ajstanley

The <br> tags are also added here to run when the transcript is uploaded manually. They were added to make the text a bit more readable for editing, OCR often just being a good guess, with tons of donkey-work to get it all cleaned up.

If we still want this feature it would be pretty easy to create a second hidden text field with tags removed for the sole purpose of indexing and discovery, and update it with the presave_media_hook.
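
The tag-stripping step behind such an indexing-only copy could look roughly like the following Python sketch. It is an illustration under assumptions, not code from the module: the helper name `strip_tags_for_index` is hypothetical, and how the result would be written to a second field in a presave hook is not shown.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect only the character data, discarding every tag."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_tags_for_index(markup: str) -> str:
    """Return a tag-free, whitespace-normalised copy suitable for indexing."""
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    # Collapse runs of whitespace (including newlines) to single spaces.
    return " ".join("".join(parser.parts).split())
```

The editable field would keep its <br> tags for readability, and only this derived copy would be handed to the search index.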

@alxp
Author

alxp commented May 3, 2023

> The <br> tags are also added here to run when the transcript is uploaded manually. They were added to make the text a bit more readable for editing, OCR often just being a good guess, with tons of donkey-work to get it all cleaned up.
>
> If we still want this feature it would be pretty easy to create a second hidden text field with tags removed for the sole purpose of indexing and discovery, and update it with the presave_media_hook.

Since that hook only inserts tags into a field with the specific field name, I think it's OK for now to leave it in along with this PR change. Perhaps a text input filter would be a more elegant way to put them in for user-editable fields.

@rosiel rosiel self-assigned this May 3, 2023
@rosiel
Member

rosiel commented May 3, 2023

Tested, and it works as advertised. Note that this only affects "as Media Attachment" (i.e. "multifile media" derivatives, those stored on the original file's media itself).

During testing I found an unrelated bug: a "Generate Extracted Text For Media Attachment" action only works when you select hOCR as its text format. However, this PR does not affect that.

@rosiel rosiel merged commit e4fbbb3 into 2.x May 3, 2023
@rosiel rosiel deleted the issue-941-text-extract-br-tags branch May 3, 2023 19:42
rosiel pushed a commit that referenced this pull request Jul 6, 2023
…942)

* Issue #941: Only add <br/> tags to plain text extracted text fields.

* Fix PHPCS errors.

* Don't add <br /> tags to edited OCR text field if it looks like hOCR.

* Respond to PHPCS errors.