Add segments of words with punctuation to index #225

Closed
bglw opened this issue Feb 16, 2023 · 6 comments
Labels
improvement Not a bug Pagefind CLI The CLI responsible for indexing content

@bglw
Contributor

bglw commented Feb 16, 2023

Discussion in #215

Perhaps Pagefind should automatically index a word like color-accent as [color-accent, color, accent]

@bglw bglw added improvement Not a bug Pagefind CLI The CLI responsible for indexing content labels Feb 16, 2023
@bglw bglw added this to the v1.0.0 milestone Mar 1, 2023
@bglw
Contributor Author

bglw commented Apr 11, 2023

Addition from #267 — we should ideally apply this to a range of punctuation characters that make sense, so pagefind.toml indexes as [pagefind.toml, pagefind, toml].

(NB: We want to avoid indexing don't as [don't, don, t])
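As a rough illustration of the proposal above, here is a minimal sketch that keeps the whole word and adds its punctuation-delimited segments, while leaving apostrophes alone. The function name `punctuation_segments` and the choice of "any ASCII punctuation except the apostrophe" as the separator set are assumptions made for this example, not Pagefind's actual implementation.

```rust
/// Returns the whole word followed by its punctuation-delimited segments.
/// Hypothetical helper: the separator set (ASCII punctuation minus the
/// apostrophe) is an assumption, chosen so that "don't" is never split.
fn punctuation_segments(word: &str) -> Vec<String> {
    // The unsplit word is always indexed first.
    let mut out = vec![word.to_string()];

    let parts: Vec<&str> = word
        .split(|c: char| c.is_ascii_punctuation() && c != '\'')
        .filter(|p| !p.is_empty())
        .collect();

    // Only add segments when the word was actually split into several parts.
    if parts.len() > 1 {
        out.extend(parts.into_iter().map(String::from));
    }
    out
}
```

With this sketch, `punctuation_segments("pagefind.toml")` yields `["pagefind.toml", "pagefind", "toml"]`, and `punctuation_segments("don't")` yields only `["don't"]`.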

@bglw bglw changed the title Add segments of hyphenated word to index Add segments of words with punctuation to index Apr 11, 2023
@lorenzolewis

Firstly, amazing project! Found it through Astro's Starlight and love it!

Do you have any thoughts around similar situations with camelCase and snake_case variants? I imagine that snake_case could use a similar strategy but camelCase might have some added complexity to it to make sure you're not mistakenly over-splitting words (thinking proper names like "McDonalds", etc.).

@bglw
Contributor Author

bglw commented Aug 9, 2023

Hey @lorenzolewis 👋

Ooh, I haven't yet given it thought but it seems fine!

I don't think I'm too worried about over-splitting. In that example it isn't a big deal if you can search for "donald" and get a result for "McDonalds" — perhaps having a minimum length on splitting words would help (3+ characters?).

I think the main thing I would like to do for segmented words is de-rank the partial matches. Currently results are ranked by how close your search word is, so if you're searching for con then cone will rank higher than conifer as it's closer in length. But with naive word splitting, myConFig would rank high, because con would be indexed as well as the other words, and it would look like an exact match.

I think this could tap into the new weighting feature, though, and make these partial matches weaker than they would otherwise be. In which case, over-indexing is a negligible problem.
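To make the de-ranking idea concrete, here is a small sketch under stated assumptions. It models "closeness" as query length divided by term length and gives segments a reduced weight. The `Entry` type, the `SEGMENT_WEIGHT` value of 0.5, and the scoring formula are all hypothetical and only meant to show the effect, not how Pagefind ranks results.

```rust
/// One index entry: a searchable term plus a ranking weight (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    term: String,
    weight: f32,
}

/// Assumed weight for terms that only exist because a word was segmented.
const SEGMENT_WEIGHT: f32 = 0.5;

/// Indexes the whole word at full weight and each segment at reduced weight.
fn index_word(word: &str, segments: &[&str]) -> Vec<Entry> {
    let mut entries = vec![Entry { term: word.to_lowercase(), weight: 1.0 }];
    for seg in segments {
        entries.push(Entry { term: seg.to_lowercase(), weight: SEGMENT_WEIGHT });
    }
    entries
}

/// Scores a prefix query against one entry: length closeness times weight.
/// Returns None when the entry does not start with the query.
fn score(query: &str, entry: &Entry) -> Option<f32> {
    if query.is_empty() || !entry.term.starts_with(query) {
        return None;
    }
    let closeness = query.chars().count() as f32 / entry.term.chars().count() as f32;
    Some(closeness * entry.weight)
}

/// Best score any entry of a word achieves for the query (0.0 if none match).
fn best_score(query: &str, entries: &[Entry]) -> f32 {
    entries.iter().filter_map(|e| score(query, e)).fold(0.0_f32, f32::max)
}
```

Under these assumptions, searching for con scores cone at 0.75, myConFig at 0.5 (via its segment con), and conifer at roughly 0.43. With a segment weight of 1.0, myConFig would score 1.0 and outrank cone, which is the naive-splitting problem described above.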

I'll keep you posted — I'll actually look at this next 👀

@lorenzolewis

RE: 3+ characters:

I know some Apple naming conventions use something like this: WKWebView (short for WebKitWebView). Would that be split into something like this: [WK, Web, View, WKWebView]? I think this is fairly common in languages that didn't handle namespaces very well, so it might be a scenario to keep in mind if you only want to treat 3+ characters as the "cutoff".

But this also opens up another question: Would this be picked up since it's WKWebView instead of WkWebView (notice the lower-case k)?

@bglw
Contributor Author

bglw commented Aug 11, 2023

I'll likely either use https://github.com/withoutboats/heck directly or base this handling on it, as it has pretty robust splitting:

t!(test8: "this-contains_ ALLKinds OfWord_Boundaries" => "This Contains All Kinds Of Word Boundaries")

Looking at their implementation, WKWebView would be Wk Web View 🎉
(Helpfully also in this example, it doesn't particularly matter whether the first word is indexed uniquely, as searching for wk will match both wk and wkwebview)

RE: RE: 3+ characters:
2+ is probably a safer bet anyway. OpenGL is a good example, where it would be nice to index Open GL.
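For readers who want to see the boundary rules spelled out, here is a stdlib-only sketch that approximates the splitting behaviour discussed above. It does not use the heck crate. The names `split_words`, `index_terms`, and `MIN_SEGMENT_LEN` are hypothetical, lowercasing the indexed terms is an assumption, and digits are not treated as boundaries here, which may differ from heck.

```rust
/// Assumed minimum length for an indexed segment (the "2+" suggestion above).
const MIN_SEGMENT_LEN: usize = 2;

/// Splits on non-alphanumeric characters (apostrophes excepted) and on case
/// boundaries: lower-to-upper ("webView"), and before the last capital of an
/// uppercase run that is followed by a lowercase letter ("WKWeb" -> "WK", "Web").
fn split_words(input: &str) -> Vec<String> {
    let mut words: Vec<String> = Vec::new();
    for chunk in input.split(|c: char| !c.is_alphanumeric() && c != '\'') {
        let chars: Vec<char> = chunk.chars().collect();
        let mut start = 0;
        for i in 1..chars.len() {
            let (prev, cur) = (chars[i - 1], chars[i]);
            let next_is_lower = chars.get(i + 1).map_or(false, |c| c.is_lowercase());
            let lower_to_upper = prev.is_lowercase() && cur.is_uppercase();
            let acronym_end = prev.is_uppercase() && cur.is_uppercase() && next_is_lower;
            if lower_to_upper || acronym_end {
                words.push(chars[start..i].iter().collect());
                start = i;
            }
        }
        // Push the trailing word; empty chunks contribute nothing.
        if start < chars.len() {
            words.push(chars[start..].iter().collect());
        }
    }
    words
}

/// Returns the lowercased whole input followed by each distinct segment
/// that meets the minimum length.
fn index_terms(input: &str) -> Vec<String> {
    let mut terms = vec![input.to_lowercase()];
    for word in split_words(input) {
        let term = word.to_lowercase();
        if term.chars().count() >= MIN_SEGMENT_LEN && !terms.contains(&term) {
            terms.push(term);
        }
    }
    terms
}
```

In this sketch, `split_words("WKWebView")` gives `["WK", "Web", "View"]`, `index_terms("OpenGL")` gives `["opengl", "open", "gl"]`, and `index_terms("don't")` gives only `["don't"]`.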

@bglw
Contributor Author

bglw commented Sep 13, 2023

Hey all! 👋

Good news — this has landed in Pagefind v1.0.0! ✨

See the full release notes here: https://github.com/CloudCannon/pagefind/releases/tag/v1.0.0 💙

Ping me here if you have any questions about the implementation!

@bglw bglw closed this as completed Sep 13, 2023
2 participants