Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create PagefindIndex struct for rust lib to be more sdk like #751

Merged
merged 3 commits into from
Dec 16, 2024

Conversation

cdxker
Copy link
Contributor

@cdxker cdxker commented Dec 11, 2024

Hi @CloudCannonMain love pagefind a lot, I am trying to use the pagefind rust crate in a similar way to the other wrapper libraries using the pagefind service . We created an example of using pagefind to index a simple json file https://github.com/devflowinc/Pagefind-Example-Usage-Rust , however we need these 2 structs to be exposed publicly.

Let me know if you have another questions, I would love to talk.

@bglw
Copy link
Contributor

bglw commented Dec 11, 2024

Hmm, I'm open to exposing more from the Pagefind library, though I don't think these are the best candidates for making public — these can change between versions and making them part of the public API would lock them in somewhat, thus changing them would require ticking up a major version.

Ideally we can expose this behavior using the service API methods — e.g. the endpoints available in https://pagefind.app/docs/node-api/ — which are defined in service/requests.rs and service/responses.rs.

These represent the programmatic interface to Pagefind that I'm comfortable supporting, whereas the fossick module and the DomParserResult concepts are very internal.

Does that sound feasible?

@cdxker
Copy link
Contributor Author

cdxker commented Dec 11, 2024

This makes sense, the issue with that approach is that it becomes reliant on a top level process to managed all the indexes, ideally we wouldn't have a seperate process running.

Would you be open to me exposing each of theses mappers as a function that could also interface as an api? The main thing I wanted was a way to call these functions directly.

https://github.com/CloudCannon/pagefind/blob/main/pagefind/src/service/mod.rs#L138-L282

@bglw
Copy link
Contributor

bglw commented Dec 11, 2024

Ah sorry, I wasn't super clear in my reply — yes my suggestion is to add a library interface to Pagefind that is based on the service API, going through the same final code paths, but not going through the service pipe itself. So yes exposing each of those request/responses as a library function would be ideal :)

@cdxker
Copy link
Contributor Author

cdxker commented Dec 11, 2024

Ok great, I'll spec something out really quick for this. I was so excited that I already setup a PR for this based on the api of my fork. devflowinc/trieve#2934

Thank you for fast response time.

@cdxker cdxker force-pushed the expose-more-structs branch 2 times, most recently from 83bda22 to e6ce8ef Compare December 11, 2024 20:32
@cdxker
Copy link
Contributor Author

cdxker commented Dec 11, 2024

I created a struct PagefindIndex in service/api.rs that provides a similar interface that service/mod.rs does. Additionally i refactored service/mod.rs to use the PagefindIndex struct to keep code drift minimal.

How would I test that I didn't introduce any breaking changes for the non-rust api's https://pagefind.app/docs/node-api/

I tested using cargo test locally and all unit tests passed, I wasn't able to setup toolproof for the ./test_interactive.sh command to work.

@cdxker
Copy link
Contributor Author

cdxker commented Dec 11, 2024

I also updated my example code for how to the pagefind.

https://github.com/devflowinc/Pagefind-Example-Usage-Rust/blob/main/src/main.rs

@bglw
Copy link
Contributor

bglw commented Dec 12, 2024

I have enabled the test runners on this PR — it does look like this change has broken the existing API endpoints. I don't have time to review today but can tomorrow.

Curious to hear what issues you have hit setting up Toolproof — that project is in its nascent stage but I would like to start fixing any issues people are hitting there too.

@cdxker
Copy link
Contributor Author

cdxker commented Dec 12, 2024

I have enabled the test runners on this PR — it does look like this change has broken the existing API endpoints. I don't have time to review today but can tomorrow.

Thank you for running the tests, going to see if I can reproduce the error cases based on the CI run.

Curious to hear what issues you have hit setting up Toolproof — that project is in its nascent stage but I would like to start fixing any issues people are hitting there too.

called `Result::unwrap()` on an `Err` value: "Could not auto detect a chrome executable"
thread 'tokio-runtime-worker' panicked at toolproof/src/definitions/browser/mod.rs:37:67:

After running all the setup commands I ran into this error

Fixed after installing chromium sudo pacman -S chromium I have brave on this machine but I guess it didn't pick up that binary.

@bglw
Copy link
Contributor

bglw commented Dec 12, 2024

Ah interesting — what operating system are you on, and do you have Chrome installed? You can follow the logic the underlying crate uses to find a Chrom(e/ium) here: https://github.com/mattsse/chromiumoxide/blob/main/src/detection.rs

If you can point a CHROME environment variable at a Chrome or Chromium executable that should take precedence.

Edit: Nice one! I'll look at catching this error in Toolproof itself and providing better error messaging.

@cdxker
Copy link
Contributor Author

cdxker commented Dec 13, 2024

I believe I found the fix the tests here
https://github.com/CloudCannon/pagefind/pull/751/files#diff-7f42c2ca8eebf095101e142cda9c05931d16fe6c7cc41a813f4c4acdf64887cdR139

(Accidentally removed this line of code in translation).

Locally 2 tests fail against main, same tests fail against on my branch. Can you rerun the workflow just to verify?

@cdxker cdxker force-pushed the expose-more-structs branch from f9591cc to 5f94a48 Compare December 13, 2024 04:57
@cdxker cdxker changed the title feature: expose more structs to make pagefind rust lib more sdk like Create PagefindIndex struct for rust lib to be more sdk like Dec 13, 2024
@cdxker
Copy link
Contributor Author

cdxker commented Dec 14, 2024

@bglw sorry for the double ping, when would you be free to review? Really want to get a release out so we can ship to production and iterate on pagefind search quality more, and would rather not publish a new cargo crate

@bglw
Copy link
Contributor

bglw commented Dec 15, 2024

Will look at this tomorrow — initial skim looks good though!

@bglw
Copy link
Contributor

bglw commented Dec 16, 2024

Hi @cdxker 👋

I have done some work on top of this branch — I can push it to your remote if you enable access to maintainers, otherwise if you want to cherry pick my commit from 5ecb1fd into this branch we can continue from there.

High level changes in that commit:

  • Removed a range of types that were erroneously public from the library interface. Refined this down to the types we're happy supporting long-term.
    • Theoretically this would be breaking, but the types that happen to be public as a library up until now haven't been covered in the semver historically, so I won't regard this change as breaking (though from this release onward the Rust API will be included).
  • PagefindInboundConfig is no longer public, and PagefindServiceConfig must be used. It also must be constructed through the PagefindServiceConfigBuilder, so that adding new options to the service config isn't a breaking change. You can see this usage in a test here.
  • Slightly better error handling/propagation
  • Renamed the api.rs functions and arguments to match the other external Node/Python API documentation
  • Refined doc comments and module documentation

Let me know how you see these changes, hopefully it's compatible with what you're after.

I'll still mulling over the Rust API. The Node and Python APIs have the benefit of new arguments not being breaking changes, since they go through serialization and have defaults. However with the Rust API, adding a new argument to add_custom_record for example would be a breaking change. 🤔

@cdxker
Copy link
Contributor Author

cdxker commented Dec 16, 2024

Will cherry pick it on in just a minute.

(The button to allow maintainer edits seems to have disappeared)

@cdxker
Copy link
Contributor Author

cdxker commented Dec 16, 2024

@bglw All these changes look great on my end, just pushed the cherry pick'ed commit . I wan't sure how you wanted to deal with error handling so I just deferred it to you.

As for limiting breaking changes, the best way is likely to pass sa struct type to each function instead of a list of arguments. That way we can do a few serialization hacks. Its more messy in straight rust though.

@bglw
Copy link
Contributor

bglw commented Dec 16, 2024

Yeah, I think the only true way would be to move each function into a builder pattern but that becomes pretty non-ergonomic to use so I'm happy to ship it as-is.

I am very resistant to breaking changes though, so if there's anything you're wanting to have added to the API now is the time to do it. Once we ship this I won't want to make a major release just for API changes, so we'll need to add something like a PagefindIndexV2 struct with new methods.

I'll need to look into those two failing tests — they're failing in this PR run as well, however both this branch and main are passing on my machine. Possibly a non-deterministic test, though it seems like a recent regression if so.

@cdxker
Copy link
Contributor Author

cdxker commented Dec 16, 2024

The current api interface works for me, even tested integrating with our example usecase. The only functions I would need are add_custom_record and get_files. We might do a little bit experimentation with different indexing operations, but if we do that we'll work off our fork before proposing a final draft to the upstream.

@bglw
Copy link
Contributor

bglw commented Dec 16, 2024

Sweet as — alright I'll merge and release as soon as I figure out what is happening in these tests.

@bglw
Copy link
Contributor

bglw commented Dec 16, 2024

Phew okay, located the issue. Legitimate bug that will have possibly been affecting ranking in some cases, where it isn't guaranteed how a compound word will be ranked.

I wasn't seeing it locally as the machine I'm on had an older Rust version — so I believe the culprit will be the recent(ish) Rust version that changed the implementation of the std sorting algorithms. I think it was still a bug before, these tests just happened to skate by.

Resolved in this commit, if you could cherry pick that onto this branch

Updated commit, if you could cherry pick that onto this branch :)

Context since this doesn't have its own PR:

In the "Compound words prefixes prioritise the lower weight" test, when you search three, Pagefind finds a match for both Three and ThreeAntelopes. This is because compound words are indexed as their full word, and as their parts. For each part, a partial weight is applied, so the Three it matches is weighted 0.5, and the ThreeAntelopes it matches is weighted 1.0

The section of code with the bug identifies these cases and ensures that we use the lower weighted match. Since we only searched three, we want the 0.5 weight of Three, not the 1.0 weight of ThreeAntelopes. However, it wasn't also updating the length bonus applied to search terms.

This means whether it looked at Three or ThreeAntelopes first would determine which length bonus it used. three to Three is an exact hit, three to ThreeAntelopes gets deranked due to the length delta. Prior Rust versions always happened to sort Three earlier, and that has reversed in recent Rust versions.

@cdxker
Copy link
Contributor Author

cdxker commented Dec 16, 2024

Wow pretty nasty bug, just cherry-pick'ed your commit so it should be ready to rerun the workflows

@bglw
Copy link
Contributor

bglw commented Dec 16, 2024

Excellent! Merging now.

I'd like to roll a few other little fixes in to a release so there's something more user-facing, so I'll get a couple things in there today/tomorrow and then fire a release out (~30 hours from now). Sorry to delay!

@bglw bglw merged commit 99e8618 into CloudCannon:main Dec 16, 2024
4 checks passed
@bglw
Copy link
Contributor

bglw commented Dec 18, 2024

👋 @cdxker This has landed in Pagefind v1.3.0 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants