Search has issues with words adjacent to punctuation characters #2095

Closed
Knaui opened this issue May 5, 2020 · 9 comments

Comments

@Knaui

Knaui commented May 5, 2020

For example, it won't find "house" in "big-house",

but it will find "big".

This is the case for book or page titles and for page content.

Tested with BookStack v0.29.0.

@kshitijsharma97

kshitijsharma97 commented Aug 4, 2020

I tried the same thing in my dev instance and had the same issue.
If I search for the first word, the page comes up, but if I search for the whole name with the hyphen, nothing appears in the search results.

However, if you pull the changes and update to BookStack v0.29.3,
the issue with hyphen-separated search is resolved.

ssddanbrown changed the title from "search does not find words after a hyphen" to "Search does not find words adjacent to punctuation characters" on Jul 12, 2021
@ssddanbrown
Member

Updating the title to be more generic in the interest of merging down some issues.

Related to #1037

ssddanbrown changed the title from "Search does not find words adjacent to punctuation characters" to "Search has issues with words adjacent to punctuation characters" on Jul 12, 2021
@Wookbert

Wookbert commented Jul 23, 2021

@ssddanbrown

I’ve just realized that searching for word parts that are combined through hyphens doesn't work either.

Example: Searching for historian does not find the page on CCU-Historian, while searching for ccu does. Note that hyphens are a very common element in, for instance, the German language. You often have word combinations that are connected through 2 or even 3 hyphens.

An English-language example would be Remote-robot-assisted, which IMO should be retrieved when searching for any of the three words individually, but also for e.g. robot-assisted, robot assisted or robotassisted. (The same applies to any spelling of the Remote robot combination.)

@dweinerATL

@ssddanbrown we are running into something similar. We're running BookStack v21.05.4 for a science fiction author's book series. One of her races is called Ke!endarian. If you search for Ke!endarian, you get no results. If you search for Kel, you get the expected response. However, we have found that the search does work if you search for "Ke!endarian" in quotes.

@ssddanbrown
Member

As part of #3043 I've made a change to auto-convert any search terms that would experience this issue into exact-match terms, which run a direct, although less efficient, content match. This doesn't directly solve the issue but should provide a much better user experience in such situations. It will be part of the next feature release.
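
A minimal sketch of the conversion described above, written in Python rather than BookStack's PHP and not taken from the project's code. The function name `to_search_terms` and the rule that any non-alphanumeric character makes a term "problematic" are assumptions for illustration only.

```python
def to_search_terms(query: str) -> dict:
    """Split a query into normal indexed terms and exact-match terms (illustrative)."""
    terms = {"indexed": [], "exact": []}
    for word in query.split():
        # Assumption: a word is problematic if it contains any non-alphanumeric character.
        if word.isalnum():
            terms["indexed"].append(word.lower())
        else:
            # Problematic words are searched as-is via a direct content match.
            terms["exact"].append(word)
    return terms


print(to_search_terms("Ke!endarian history"))
# {'indexed': ['history'], 'exact': ['Ke!endarian']}
```

The trade-off matches the comment above: exact terms avoid the tokenization mismatch but need a slower scan of the content instead of an index lookup.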

@caius-martinus

Hello @ssddanbrown,
I think the issue isn't solved, at least in 23.08.2. To reproduce: create a page with the content /abc123 on a single line. Now search for abc1 and you should observe that it doesn't match. However, /abc1 would.
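
One possible explanation for this reproduction, sketched in Python under the assumption (not confirmed in this thread) that the slash stays part of the indexed term and that search uses prefix matching against indexed terms. `prefix_match` is a hypothetical helper, not a BookStack function.

```python
def prefix_match(indexed_terms, query):
    """Return the indexed terms that start with the query (case-insensitive)."""
    q = query.lower()
    return [t for t in indexed_terms if t.lower().startswith(q)]


# Assumption: "/" is not a delimiter, so the whole string is stored as one term.
indexed = ["/abc123"]

print(prefix_match(indexed, "abc1"))   # []
print(prefix_match(indexed, "/abc1"))  # ['/abc123']
```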

@sNiXx

sNiXx commented Nov 20, 2023

I can confirm this issue is still present on 23.10.2. I also just verified on the demo instance (currently 23.10.4) and hyphenated words are not correctly found. For instance, the pages prod-linode-sparkjet or dev-internal-sparklebike on the demo instance cannot be found if the last term (i.e. sparkjet or sparklebike) is used to search.

@watschi

watschi commented Jan 8, 2025

Facing the same issue with hyphenated words, which are pretty common in German text.
Quick and dirty solution (needs to be applied after any update):

  • Edit app/Search/SearchIndex.php, add a hyphen (-) to $delimiters (at Link)
  • Run php artisan bookstack:regenerate-search
  • For the word Test-Word, Test, Word and Test-Word will return the desired content
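
The effect of the workaround above can be illustrated with a small Python sketch. It is not the code in app/Search/SearchIndex.php; the `tokenize` function and the `base` delimiter string are hypothetical stand-ins chosen only to show what adding a hyphen to the delimiters changes.

```python
import re


def tokenize(text, delimiters):
    """Split text on whitespace and on any character in `delimiters`."""
    pattern = "[" + re.escape(delimiters) + r"\s]+"
    return [t for t in re.split(pattern, text) if t]


base = ".,;:!?"  # hypothetical delimiter set, not BookStack's actual list

print(tokenize("Test-Word", base))        # ['Test-Word']
print(tokenize("Test-Word", base + "-"))  # ['Test', 'Word']
```

With the hyphen as a delimiter, both Test and Word become separate indexed terms, which is why either one then finds the page.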

@ssddanbrown Any reason to exclude - from the delimiters? Feels like this should be included by default, maybe it's an oversight, maybe I'm missing something 🙂

ssddanbrown added a commit that referenced this issue Feb 14, 2025
This changes indexing so that a.b now indexes as "a", "b" AND "a.b"
instead of just the first two, for periods and hyphens, so terms
containing those characters can be searched within.

Adds hyphens as a delimiter - #2095
ssddanbrown added this to the Next Feature Release milestone Feb 14, 2025
@ssddanbrown
Member

@watschi

> Any reason to exclude - from the delimiters? Feels like this should be included by default, maybe it's an oversight, maybe I'm missing something 🙂

Really it was because hyphens felt more like part of a term than something to split terms by, but I can see the issue that would result.

I spent some time on this today to change up the indexing a bit via #5488.
I've tried to come to a compromise to help address some of the most problematic areas, in addition to adding - as a delimiter.
For the text cat-dog, BookStack will now index that as cat, dog and cat-dog.
That way, searching for either word will work, but the full term will also work via our proper indexed term system.
The same is done for dots/periods (which I thought could be important for numbering among other things).
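
A rough Python sketch of the indexing behaviour described above, under the assumption that only hyphens and periods get the "parts plus whole term" treatment. This is an illustration, not the code from #5488, and `index_terms` is a hypothetical name.

```python
import re


def index_terms(text):
    """Index each word's hyphen/period-separated parts, plus the whole word."""
    terms = []
    for word in text.split():
        parts = [p for p in re.split(r"[-.]", word) if p]
        terms.extend(parts)
        # Keep the full term too, but only when the word was actually split.
        if len(parts) > 1:
            terms.append(word)
    return terms


print(index_terms("cat-dog"))  # ['cat', 'dog', 'cat-dog']
print(index_terms("a.b"))      # ['a', 'b', 'a.b']
```

Keeping the whole term alongside its parts means a search for cat-dog can still use the index directly rather than falling back to an exact content match.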

There will still be gaps and limitations in search due to the nature of trying to keep content indexed, using prefix matching, and the use of custom tokenization, but this should solve some of the most common issues reported here about hyphenated words.
Therefore I'm going to close this off, but new focus areas can be raised as needed (if not already open).

The mentioned changes will be part of the next feature release.
Note that you'll need to regenerate the search index after updating to gain these index improvements.

Thanks all for your input!


8 participants