[6.1] Smart Search: Allow to parse different formats#43397

Open
Hackwar wants to merge 1 commit into joomla:6.1-dev from Hackwar:5.2-finder-format

Conversation

@Hackwar
Member

@Hackwar Hackwar commented Apr 28, 2024

Summary of Changes

Smart Search was donated to the project at some point by a commercial company, but I'm pretty sure that company wasn't the original developer of the system. Parts of it feel like the work of at least two completely different developers, which led to inconsistencies, some of which remain unresolved even today. One of them is the support for different parsers for the content to index.

The indexer class supports a format parameter on its index() method to select different parsers for the content, so that you could, for example, parse plain text, HTML, RTF documents or basically anything else you can think of. However, this parameter applies to all properties of a Result object, which is a problem when you have, for example, HTML content in a description and a PDF (or RTF) in another property. (Think of a document manager.)

This PR adds a new parameter to the Result::addInstruction() method to select the parser used to read a property. Right now this parameter supports txt, html and rtf, but additional parsers, for example for PDF or docx, are possible. (Especially for docx, it should be considered whether this has to be part of Joomla core. I would be happy with just PDF for now.) This PR also fixes an issue where memory_table_limit seems to have been reverted to a wrong value during an upmerge, and it raises the chunk size for reading data from 2 KiB to 32 KiB. I would question whether 2 KiB was even the right value in 2012, when this was added to Joomla, and going to 32 KiB today is still playing it VERY safe. Cutting the data into such small chunks also means that all the rest of the code runs more often than necessary, which reduces performance.
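To make the chunk-size argument concrete, here is a minimal Python sketch (not Joomla code) that counts how many read iterations a chunked reader needs at 2 KiB versus 32 KiB. The 3 MiB document size and the `count_chunks` helper are assumptions chosen purely for illustration.

```python
import io


def count_chunks(stream, chunk_size):
    """Read the stream in fixed-size chunks and count the non-empty reads."""
    chunks = 0
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        # In a real indexer, tokenizing and bookkeeping would run here,
        # once per chunk.
        chunks += 1
    return chunks


# Hypothetical 3 MiB document, used only to compare the two chunk sizes.
document = b"x" * (3 * 1024 * 1024)

small = count_chunks(io.BytesIO(document), 2 * 1024)    # 2 KiB chunks
large = count_chunks(io.BytesIO(document), 32 * 1024)   # 32 KiB chunks

print(small, large)  # 1536 96
```

Whatever work is done per chunk runs 16 times less often at 32 KiB, while the buffer held in memory stays small.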

The code is backwards compatible: when the index() method is called with a $format parameter, that parameter takes precedence over the set instructions, on the assumption that the caller is legacy code that is unaware of this new feature.
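The precedence rule can be sketched as follows. This is a language-neutral illustration in Python, not the PR's PHP implementation: the `PARSERS` table, `collect_text` and the `legacy_format` argument are hypothetical names, and the HTML parser is a deliberately crude tag stripper.

```python
import html
import re


def parse_txt(value):
    # Plain text is passed through unchanged.
    return value


def parse_html(value):
    # Crude tag stripping, for illustration only.
    return html.unescape(re.sub(r"<[^>]+>", " ", value))


# Hypothetical parser registry keyed by format name.
PARSERS = {"txt": parse_txt, "html": parse_html}


def collect_text(item, instructions, legacy_format=None):
    """Parse each property with its own parser.

    `instructions` maps a property name to a parser key. If `legacy_format`
    is given, it overrides every per-property choice, mirroring the
    backwards-compatible behaviour described above.
    """
    parts = []
    for prop, fmt in instructions.items():
        parser = PARSERS[legacy_format or fmt]
        # Normalise whitespace so the results are easy to compare.
        parts.append(" ".join(parser(item[prop]).split()))
    return parts


item = {"title": "Annual report", "summary": "<p>Figures &amp; notes</p>"}
instructions = {"title": "txt", "summary": "html"}

per_property = collect_text(item, instructions)
legacy = collect_text(item, instructions, legacy_format="txt")

print(per_property)  # ['Annual report', 'Figures & notes']
print(legacy)        # ['Annual report', '<p>Figures &amp; notes</p>']
```

With per-property instructions, each field is read by the parser that suits it. With a legacy format supplied, every field is forced through the same parser, which is the old behaviour.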

Testing Instructions

Please find attached a testing plugin for Smart Search, which adds one entry to the index and reads an RTF file into the system while doing so. You need to provide your own RTF sample file. Extract the attached ZIP to your /plugins/finder folder and discover the plugin in the backend. Make sure the plugin is enabled. Edit /plugins/finder/test/src/Extension/Test.php and, on line 141, set the path to the demo RTF file you want to index. The file path is resolved relative to the root of the site. Then click Index in the Smart Search backend. Afterwards you can search for the content of the RTF in the frontend and should get an entry named Test RTF when it matches.
test.zip

Documentation will be added soon.

Link to documentations

Please select:

  • Documentation link for docs.joomla.org:

  • No documentation changes for docs.joomla.org needed

  • Pull Request link for manual.joomla.org:

  • No documentation changes for manual.joomla.org needed

@brianteeman
Contributor

A good source for a sample rtf file is https://file-examples.com/index.php/sample-documents-download/sample-rtf-download/

@Hackwar
Member Author

Hackwar commented Apr 28, 2024

Not exactly. That site does contain RTF files, but they are all just lorem ipsum text. I looked for some public domain books to parse here, came across (obscure versions of) the Bible and finally settled on War and Peace by Tolstoy. I didn't list a source for RTF files because I didn't want several people to test this with just one specific file.

@brianteeman
Contributor

Struggling to see why this should be added to the core

@Hackwar
Member Author

Hackwar commented Apr 28, 2024

Because it has been part of core since 2.5.0; it has just been broken the whole time.

@brianteeman
Contributor

Surely that's an indicator that it should be removed, if anything at all

@HLeithner HLeithner changed the base branch from 5.2-dev to 5.3-dev September 2, 2024 08:51
@HLeithner
Member

This pull request has been automatically rebased to 5.3-dev.

@HLeithner HLeithner changed the title [5.2] Smart Search: Allow to parse different formats [5.3] Smart Search: Allow to parse different formats Sep 2, 2024
@Hackwar Hackwar removed the PR-5.2-dev label Sep 3, 2024
@shrutidhole123

The finder_test plugin is showing an error.


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/43397.

@shrutidhole123

I have tested this item 🔴 unsuccessfully on d8bb755



@VaishnaviSidral

I have tested this item 🔴 unsuccessfully on d8bb755

The RTF file is currently being downloaded when clicked instead of being displayed inline in the frontend.
This is due to the URL handling and needs to be adjusted for proper content display.



@exlemor

exlemor commented Feb 22, 2025

Hi @shrutidhole123 and @VaishnaviSidral - thank you for your contributions in testing - it would be great / best if you could join Mattermost and the OR PBF channel because we would love to provide guidelines and testing instructions...

If you need help in that process you can follow this video: https://www.youtube.com/watch?v=FOTluVIHcag

@gacompa

gacompa commented Feb 22, 2025

I have tested this item ✅ successfully on d8bb755

I successfully performed the actions described, but I'm not sure this is the full expected behaviour.

I tested with an external RTF document (a novel in Spanish by the Chilean writer Francisco Coloane) and had problems with the wording Los Conquistadores de la Antártida, which I happened to keep. I could find most of the words but not Antártida, and I suspected this was due to the accent. I tried installing the Spanish language and searching in it, without success: no result.
I pasted the same text into an internal article (Spanish language) and that did work: I found everything when searching in Spanish, and got no result in other languages, as expected.
Just for confirmation, I created a text with various accented Italian words, saved it as RTF and also pasted it into an article (Italian language), then repeated the test. The words are found in the article but not in the RTF file indexed with the instructions and the test plugin.

Then I checked the actual RTF file and found (as expected) that the accent is encoded: e.g. Antártida -> Ant'e1rtida
Searching for Ant'e1rtida as plain text did work and it was found.
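For illustration, the mapping described above can be undone with a few lines of Python. This is only a sketch, not the PR's RTF parser. It assumes the file uses the standard RTF hex escape `\'hh` (a backslash, an apostrophe and two hex digits) and that the document's code page is Windows-1252; `decode_hex_escapes` is a hypothetical helper name.

```python
import re

# Matches the RTF hex escape \'hh (backslash, apostrophe, two hex digits).
HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")


def decode_hex_escapes(rtf_text, codepage="cp1252"):
    """Replace \\'hh escapes with the character they encode.

    Assumption: every escape is a single byte in `codepage`. Bytes that are
    undefined in that code page become U+FFFD instead of raising an error.
    """
    def replace(match):
        byte = bytes([int(match.group(1), 16)])
        return byte.decode(codepage, errors="replace")

    return HEX_ESCAPE.sub(replace, rtf_text)


raw = r"Los Conquistadores de la Ant\'e1rtida"
decoded = decode_hex_escapes(raw)

print(decoded)  # Los Conquistadores de la Antártida
```

If the indexed tokens are built from the undecoded text, a search for Antártida cannot match, which would be consistent with the behaviour reported here.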

Finally
When the text is found in an article the link gets to the article.
When the text is found in an external file (the one used for testing, according to instructions) the link provides a download action.

Thanks



@HLeithner HLeithner changed the base branch from 5.3-dev to 6.0-dev March 4, 2025 17:20
@HLeithner
Member

This pull request has been automatically rebased to 6.0-dev.

@HLeithner HLeithner changed the title [5.3] Smart Search: Allow to parse different formats [6.0] Smart Search: Allow to parse different formats Mar 4, 2025
@rdeutz rdeutz removed the PR-5.3-dev label Mar 5, 2025
@exlemor

exlemor commented Mar 8, 2025

I have tested this item 🔴 unsuccessfully on d8bb755

I tested this with partial success, which is why I had to label it as unsuccessful ;(

The mechanism works (the .RTF file is read). I had to label it unsuccessful because, while many words were detected, many were NOT, such as:

puzzles, (even tried puzzles)
fac-simile, (also tried fac-simile)
Congress— (also tried Congress)
etc...



This is the file that I used for the test:

pg2148.rtf.txt

(it's actually a .TXT file but it seems GitHub doesn't allow .RTF files so I added .txt extension for it to upload)

@HLeithner HLeithner changed the base branch from 6.0-dev to 6.1-dev August 31, 2025 11:59
@HLeithner
Member

This pull request has been automatically rebased to 6.1-dev.

@HLeithner HLeithner changed the title [6.0] Smart Search: Allow to parse different formats [6.1] Smart Search: Allow to parse different formats Aug 31, 2025
@PranavAgarkar07

I have tested this item 🔴 unsuccessfully on d8bb755

Testing confirms previous reports: test plugin installation errors and RTF file display issues. Files download instead of content being indexed/displayed properly.


