Search index based on headers #215
Hi @kylebutts 👋 Not at the moment, but it's a great suggestion (that I have heard before, though I can't find an existing issue). This looks like something that could definitely be implemented. I'll spitball two ways one could configure this, either in some automatic way, or using an attribute.
Automatic
This would be some sort of config like (option pending):
# pagefind.yml
split_pages_on: "h2"
Which would then do some ✨ magic ✨ to produce the desired result. The main concern here is if the ✨ magic ✨ doesn't suit a particular user, and isn't customizable enough.
Attribute
This would be a new attribute like (syntax pending):
<div data-pagefind-subpage="#getting-started">
<h2 id="getting-started">Getting started</h2>
<p>. . . </p>
</div>
or (also syntax pending):
<h2 data-pagefind-subpage id="getting-started">Getting started</h2>
<p>. . . </p>
This would provide more control over the indexing behavior, but doesn't suit people who can't add attributes here (i.e. the entire page content goes through a Keen to hear your thoughts on configuration and how you would ideally set this up. There are also some corner cases I can foresee, but I'll let those simmer until a direction lands. |
Hi @bglw! Pleasure to meet you. I assume that the index looks something like: each index.html page (or If you were to look at a page and find all headers with I think it might be useful to provide additional attributes to the Your recommended |
So a reason this is a little trickier is that the index actually works the other way around. If we craft a super simple example:
Then a simplified version of the index (if it were JSON) would look something like:
{
"pages": ["a.html", "b.html"],
"words": {
"one": [0],
"page": [0, 1],
"two": [1]
}
}
This means if we're going to split the page, we need to do so at the time of indexing rather than the time of retrieval. So the index would allow for something like
"pages": ["a.html", "a.html#heading-one", "a.html#heading-two"]
with the words associated to each "page". There is a moment when the index is being built that we have it mapped the other way, though, so at that point we could split on headings and do what we need to do. It isn't outlandish, it will just need a careful refactor around the 1 file -> 1 page assumption baked in, and making it work with the way the HTML is parsed as a stream.
Quirk 1: Do we split the whole page from a heading, or do we try to match it hierarchically?
<h1>My Page</h1>
<p>My page text</p>
<div>
<h2>Next heading</h2>
<p>Inner heading text</p>
</div>
<p>Final text</p> A naïve approach would split everything at the |
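As an illustration of the inverted layout described above, here is a minimal sketch. It is not Pagefind's actual (Rust) implementation: `buildIndex` and `splitOnHeadings` are hypothetical names, and the page text ("page one", "page two") is an assumption, since the example pages did not survive in this thread. It shows why the split has to happen while the index is being built: each heading section must become its own "page" entry before the word lists are written.

```javascript
// Hypothetical helper: builds a { pages, words } structure shaped like the
// simplified JSON in this thread. Each word maps to a list of page indexes.
function buildIndex(docs) {
  const pages = [];
  const postings = new Map();
  for (const doc of docs) {
    const pageId = pages.length;
    pages.push(doc.url);
    // Naive tokenizer (assumption): lowercase, split on non-alphanumerics.
    const tokens = doc.text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
    for (const word of new Set(tokens)) {
      if (!postings.has(word)) postings.set(word, []);
      postings.get(word).push(pageId);
    }
  }
  // Sort the keys so the output order matches the example JSON.
  const words = {};
  for (const word of [...postings.keys()].sort()) {
    words[word] = postings.get(word);
  }
  return { pages, words };
}

// Hypothetical helper: turns one file into several "pages", one per heading
// section, using "file#id" URLs. Sections without an id keep the file URL.
function splitOnHeadings(url, sections) {
  return sections.map((section) => ({
    url: section.id ? `${url}#${section.id}` : url,
    text: section.text,
  }));
}
```

Calling `buildIndex(splitOnHeadings("a.html", sections))` yields a `pages` array of the `"a.html"`, `"a.html#heading-one"`, `"a.html#heading-two"` form, with each word pointing at the sections that contain it.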
In any case this has some similarities to some index weighting work that's teed up, so I'll likely wind up looking at them together in the not-too-distant future 🙂 |
Very interesting! I'll keep an eye out :-) Unrelated, but this indexing method reminds me a lot of sparse matrices where it's more space efficient to just store the index of the pages instead of a vector of 0s and 1s. Not sure how you go about it in Rust, but might save space for larger indexes! Anyways, thanks so much for this package. Love the idea of post-SSG processing utilities |
Ah, that contrived example was a little too contrived 😅 A better example would be:
{
"pages": ["a.html", "b.html", "c.html"],
"words": {
"one": [0],
"page": [0, 1, 2],
"two": [1],
"three": [2]
}
}
They are transferred as sparse indexes, and are then turned back into bitsets client-side to search quickly. |
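The sparse-to-bitset step mentioned above might look roughly like this on the client. This is only a sketch under assumptions: `toBitset`, `intersect`, and `toIndexes` are hypothetical helpers, not Pagefind functions, and the real implementation may pack or combine bits differently.

```javascript
// Expand a sparse list of page indexes into a bitset (one bit per page).
function toBitset(indexes, pageCount) {
  const bits = new Uint32Array(Math.ceil(pageCount / 32));
  for (const i of indexes) {
    bits[i >>> 5] |= 1 << (i & 31);
  }
  return bits;
}

// Pages containing both words: a word-by-word bitwise AND.
// Assumes both bitsets were built with the same pageCount.
function intersect(a, b) {
  const out = new Uint32Array(a.length);
  for (let i = 0; i < a.length; i++) {
    out[i] = a[i] & b[i];
  }
  return out;
}

// Convert a bitset back into a sorted list of page indexes.
function toIndexes(bits) {
  const out = [];
  for (let i = 0; i < bits.length * 32; i++) {
    if (bits[i >>> 5] & (1 << (i & 31))) out.push(i);
  }
  return out;
}
```

With the three-page example above, intersecting the bitsets for "page" (`[0, 1, 2]`) and "two" (`[1]`) leaves only page index 1, which is `b.html`.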
Does it have to be headers? Can we just put the closest or previous ID in the hash when present? Consider this <table>
<thead>
<tr>
<th>Token name</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr id="rh-color-accent-base-on-light">
<!-- I'm a useful search term! -->
<td data-pagefind-filter="token"><code>--rh-color-accent-base-on-light</code></td>
<td><code>#0066cc</code></td>
</tr>
<tr id="rh-color-accent-base-on-dark">
<!-- I'm a useful search term! -->
<td data-pagefind-filter="token"><code>--rh-color-accent-base-on-dark</code></td>
<td><code>#73bcf7</code></td>
</tr>
</tbody>
</table> Here, I want the pagefind results to link to Another way of saying this is that I want multiple results per page. Say the user searched for [
{
content: 'Some more info about --rh-color-accent-base-on-dark',
url: '/tokens/color/#rh-color-accent-base-on-dark',
},
{
content: 'Some more info about --rh-color-accent-base-on-light',
url: '/tokens/color/#rh-color-accent-base-on-light',
},
] |
Thanks for the samples — I'll definitely look at building this to allow multiple results per page, so there will be a way to achieve what you're after there 🙂 (NB: unrelated to this issue — looking at your code sample @bennypowers, I'll need to implement a way to index those design tokens as individual words rather than a single word. Currently that will index as a single word |
Yes, that's correct: I will need to index each of the taxa in the token name. If I have to specify each one in an attr, I don't mind that: <td data-pagefind-filter="token"
data-pagefind-thingies="color,accent,base,on-light">
<code>--rh-color-accent-base-on-light</code>
</td> |
The way to achieve that right now would be to use index-attrs: <td data-pagefind-filter="token"
data-tokens="color accent base on-light"
data-pagefind-index-attrs="data-tokens">
<code>--rh-color-accent-base-on-light</code>
</td> I'll have a think on an easier way to represent this without having to duplicate the content. Perhaps Pagefind should automatically index a word like (EDIT: Created a new issue for this discussion at #225) |
Awesome thanks. Back to OP, I'll still need to link these back to the hash for the closest/previous ID |
👍 I'll start implementing this fairly soon — my initial plan was for this to be part of a ✨ Pagefind 1.0 ✨ release, but I'll see how things track for whether this makes it out before that. I also have a couple of ideas for an alternative way to implement this, so I might give those a poke and report back. Are you using the Pagefind UI or the JS API directly? |
JS API, so given token names and my conventions, I think I can construct URLs with your snippet. Will find out next week 🙂 |
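For illustration, constructing a URL by convention could be sketched as below. The convention is an assumption inferred from the sample markup and URLs earlier in this thread (the row `id` is the token name without its leading `--`, and the page path uses the segment after the `rh` prefix); `tokenUrl` and the `/tokens` base path are hypothetical.

```javascript
// Hypothetical convention: "--rh-color-accent-base-on-light"
//   -> "/tokens/color/#rh-color-accent-base-on-light"
function tokenUrl(tokenName, basePath = "/tokens") {
  // The element id is assumed to be the token name minus the leading "--".
  const id = tokenName.replace(/^--/, "");
  // The second dash-separated segment is assumed to name the category page.
  const category = id.split("-")[1];
  return `${basePath}/${category}/#${id}`;
}
```

This only works when every token follows the same naming scheme; a token without a category segment would need separate handling.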
Just a quick update: I did index the token path parts, but those tags still apply to the whole page. I haven't found a way to associate those tags with a particular element on the page, or to forward those tags to the result so I can construct a URL. |
Roger — I'll do some investigation on this very soon. |
Quick question for your example. If I searched for In other words, are your ideal results for
Or just
|
First example. Multiple results per page |
Hi @bennypowers / @kylebutts — initial update here. I'm still planning out how to best wrap this up as a feature, but I have now implemented what I think will be the primitive backing it. It is currently sitting on a prerelease version. If you're running via npx, you can run:
or if you're pulling the binary directly you can download it from the pagefind-beta release. The feature implemented thus far is that a list of page anchors is returned in the result data. (Additionally, the word locations are easier to access.)
let search = await pagefind.search("filter");
let result = await search.results[0].data();
Returns:
{
// some fields omitted
"url": "/docs/filtering/",
"anchors": [{
"element": "h2",
"id": "tagging-an-element-as-a-filter",
"location": 18
}, {
"element": "h2",
"id": "tagging-an-attribute-as-a-filter",
"location": 87
}],
"locations": [ 3, 6, 23, 40, 51, 65, 93, 96, 107, 116 ]
} This should be the data required to build a header-based result list. No configuration is needed for the above example, as the main search indexes are unaffected, and the page fragment size is a lesser concern for Pagefind. As such, all elements with My intention is for Pagefind to implement this logic in some manner, but it needs a little more consideration for how it fits into the rest of the system. For example, in this configuration, each page is still one matched result, and the fragment data must be loaded before it could be split into sub-results. I think Pagefind will also need to try and index some text alongside the anchors if possible, so that a search result could be displayed as something along the lines of In any case, I would love it if you gave this prerelease a spin! From the sounds of your setup consuming the API directly @bennypowers, I think this would be enough to unblock you. Eager to get any feedback on this feature as it shapes up. |
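As a rough sketch of how the `anchors` and `locations` fields above could be combined into a header-based result list: each matched word location is assigned to the nearest anchor at or before it. `subResults` is a hypothetical helper written for this illustration, not part of the Pagefind API, and it assumes only the `url`, `anchors[].id`, `anchors[].location`, and `locations` fields shown in the sample data.

```javascript
// Group matched word locations under the closest preceding anchor.
// Locations before the first anchor are attributed to the page itself.
function subResults(result) {
  const anchors = [...result.anchors].sort((a, b) => a.location - b.location);
  const groups = new Map();
  for (const location of result.locations) {
    let current = null;
    for (const anchor of anchors) {
      if (anchor.location <= location) current = anchor;
      else break;
    }
    const url = current ? `${result.url}#${current.id}` : result.url;
    if (!groups.has(url)) groups.set(url, []);
    groups.get(url).push(location);
  }
  return [...groups].map(([url, locations]) => ({ url, locations }));
}
```

For the sample data above, this gives three entries: `/docs/filtering/` with locations `[3, 6]`, `/docs/filtering/#tagging-an-element-as-a-filter` with `[23, 40, 51, 65]`, and `/docs/filtering/#tagging-an-attribute-as-a-filter` with `[93, 96, 107, 116]`.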
Hi Liam! This is great; I think the search results work great. Just need to get this working into a search component now :-) |
I found I was unable to derive the kinds of results I wanted from pagefind, but having reconsidered my problem, it seemed that pagefind was not the right tool for the job. I instead opted for Fuse.js, since I already possess a data file of my complete search results, know ahead of time exactly what I'm searching for, and can build URLs for the search results by convention. I am, however, planning to adopt pagefind for its intended purpose, which is full-site offline search. |
Note from #265 — the |
The way Algolia DocSearch does this is to chunk content into pretty small pieces and index those instead of whole pages, kind of as @bglw outlined in this comment. Each piece of content has a metadata of heading hierarchy breadcrumbs, e.g. this HTML: <h1 id="page-title">Page title</h1>
<h2 id="subheading">Subheading</h2>
<h3 id="details">Details</h3>
<p>Interesting stuff.</p>
<h3 id="more-details">More details</h3>
<p>Some content in a hierarchy of heading elements.</p> Could return a result like: {
content: 'Some content in a hierarchy of heading elements.',
url: '#more-details',
hierarchy: {
1: 'Page title',
2: 'Subheading',
3: 'More details',
},
} A nice part of heading hierarchies is they also make sense when sorting: you can decide to show |
I haven't landed 100% on the implementation yet, but for now I've taken a different path than splitting the pages into separate chunks in the index. One reason is that if a search is a hit for two sections of a page, I quite like being able to show the result like:
Due to the way Pagefind hashes and chunks content, there's no way to know that any two results are related to each other until their final fragments are loaded, which is usually lazy-loaded on scroll or pagination. So the goal is to keep the results 1:1 with the input pages, but to mark them up such that a synthetic version of the heading split can be returned. This also has the benefit that you don't have to make any decision about this when indexing the site, it's entirely a runtime search config, which makes me happy. (It does mean that fully splitting one result into multiple will be tough if you're showing placeholders before it has loaded). The general idea is that Pagefind will return this shape for each result (which it is currently doing, sans-header-text):
But you won't need to interact with that directly. Instead, the Let me know if you spot any glaring issues with this plan, though! Happy to hear more, but I think this strikes the right balance for Pagefind specifically 🙂 |
Nice! Makes sense. Is |
Currently it's piping through all anchors that existed on the page, the So if you see that your search was a hit on the page at location This won't present anything regarding nesting/hierarchy, so if you're wanting to build a breadcrumb of headings you would need to reconstruct that from the list of anchors, if that makes sense. |
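Reconstructing a heading breadcrumb from the flat anchor list, as suggested above, could be sketched like this. `breadcrumb` is a hypothetical helper, and it assumes each anchor carries the `element`, `id`, and `location` fields shown earlier in this thread; anchors that are not `h1` to `h6` elements are skipped.

```javascript
// Walk the anchors in document order up to the hit location, keeping a stack
// of headings. A new heading pops any heading of the same or deeper level.
function breadcrumb(anchors, location) {
  const trail = [];
  const ordered = [...anchors].sort((a, b) => a.location - b.location);
  for (const anchor of ordered) {
    if (anchor.location > location) break;
    const match = /^h([1-6])$/.exec(anchor.element);
    if (!match) continue; // not a heading, e.g. a <tr id="...">
    const level = Number(match[1]);
    while (trail.length > 0 && trail[trail.length - 1].level >= level) {
      trail.pop();
    }
    trail.push({ level, id: anchor.id });
  }
  return trail.map((entry) => entry.id);
}
```

The returned ids run from the outermost heading to the innermost one, so the last entry is the anchor to link to and the earlier ones can be rendered as a breadcrumb.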
I am looking forward to this feature
```json
{
// some fields omitted
"url": "/docs/filtering/",
"anchors": [{
"element": "h2",
"id": "tagging-an-element-as-a-filter",
"text": null,
"location": 18
}, {
"element": "h2",
"id": "tagging-an-attribute-as-a-filter",
"text": null,
"location": 87
}],
"locations": [ 3, 6, 23, 40, 51, 65, 93, 96, 107, 116 ]
}
```
I was expecting this
```json
{
// some fields omitted
"url": "/docs/filtering/",
"anchors": [{
"element": "h2",
"id": "tagging-an-element-as-a-filter",
"text": "Tagging an element as a filter",
"location": 18
}, {
"element": "h2",
"id": "tagging-an-attribute-as-a-filter",
"text": "Tagging an attribute as a filter",
"location": 87
}],
"locations": [ 3, 6, 23, 40, 51, 65, 93, 96, 107, 116 ]
}
```
|
Hi @anoopsinghbayes 👋 The |
Hey @anoopsinghbayes / all, The Let me know if you take a look at it — automatic results for headings calculated by Pagefind will come soon. |
@bglw checked, I am able to get the text, thanks a lot |
Hello everyone ! 👋 Great news — this has landed in Pagefind v1.0.0! ✨ See the full release notes here: https://github.com/CloudCannon/pagefind/releases/tag/v1.0.0 💙 And the documentation here: https://pagefind.app/docs/sub-results/ |
Congrats @bglw! This is really exciting stuff !! Some interesting things going on in the astro discord implementing this in starlight right now too :-) |
Hi there!
It's quite common to have headers with IDs for linking to subsections of a documentation page. I'm wondering if it's possible to have the search index broken up by headers?
Here's an example of what I'm talking about. See how the search shows you sections within the page
https://pkgdown.r-lib.org/articles/search.html