[Request]: Custom Tokenizers #181

Closed
ospfranco opened this issue Nov 3, 2024 · 2 comments · Fixed by #184
@ospfranco
Contributor

ospfranco commented Nov 3, 2024

What do you need?

Tokenizers are pieces of C code that allow sqlite to break a sequence of characters into a series of tokens, which can then be used for more accurate full-text search (via the fts5 extension).

The basic idea here is to allow people to implement their own custom tokenizers. This is partly achieved right now.

  • Add a tokenizers key to package.json: an array of strings with the name of each tokenizer.
  • A header file is generated with the names of the functions to be implemented, basically the entry points that register each tokenizer.
  • An empty implementation file is created. The user then has to go into this tokenizer file and implement the corresponding C code to create their tokenizer.
  • Via macros the function is injected and executed when the db connection is opened.
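To give a feel for what the C code in that implementation file has to do, here is a minimal sketch of the core of a tokenizer. It is not the op-sqlite generated code: `simple_tokenize`, `token_cb`, `token_log` and `log_token` are hypothetical names made up for this example. The callback type mirrors the shape of the `xToken` callback that fts5 passes to a tokenizer's `xTokenize` method, so the same loop could be called from a real `xTokenize` implementation. It assumes ASCII-only splitting and truncates very long tokens.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Same shape as the xToken callback fts5 hands to xTokenize:
 * (context, flags, token bytes, token length, start offset, end offset).
 * Returns 0 on success, non-zero to abort. */
typedef int (*token_cb)(void *pCtx, int tflags, const char *pToken,
                        int nToken, int iStart, int iEnd);

#define MAX_TOKEN_BYTES 64

/* Hypothetical tokenizer core: split on anything that is not an ASCII
 * letter or digit, lowercase each token, and report it to xToken.
 * Bytes of multi-byte UTF-8 characters are treated as separators in the
 * default "C" locale; tokens longer than MAX_TOKEN_BYTES are truncated. */
int simple_tokenize(void *pCtx, const char *pText, int nText, token_cb xToken) {
    assert(pText != NULL && xToken != NULL);
    char buf[MAX_TOKEN_BYTES];
    int i = 0;
    while (i < nText) {
        /* Skip separator bytes. */
        while (i < nText && !isalnum((unsigned char)pText[i])) i++;
        int start = i;
        int n = 0;
        /* Copy one lowercased token into buf. */
        while (i < nText && isalnum((unsigned char)pText[i])) {
            if (n < MAX_TOKEN_BYTES) buf[n++] = (char)tolower((unsigned char)pText[i]);
            i++;
        }
        if (i > start) {
            int rc = xToken(pCtx, 0, buf, n, start, i);
            if (rc != 0) return rc;
        }
    }
    return 0;
}

/* Demo-only collector standing in for the callback fts5 would supply. */
typedef struct {
    int count;
    char tokens[8][MAX_TOKEN_BYTES + 1];
} token_log;

static int log_token(void *pCtx, int tflags, const char *pToken,
                     int nToken, int iStart, int iEnd) {
    (void)tflags; (void)iStart; (void)iEnd;
    token_log *log = (token_log *)pCtx;
    if (log->count >= 8) return 1;  /* log is full: abort tokenizing */
    memcpy(log->tokens[log->count], pToken, (size_t)nToken);
    log->tokens[log->count][nToken] = '\0';
    log->count++;
    return 0;
}
```

Calling `simple_tokenize(&log, "Hello, World", 12, log_token)` with a zeroed `token_log` collects the tokens `hello` and `world`. In a real tokenizer the collector would not exist; fts5 provides the callback and its context pointer.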

This approach gives a lot of flexibility for each app to implement their own custom tokenizers without digging into op-sqlite code. The ability to generate code and inject it into the compilation process also opens the possibility for other cool stuff like custom aggregators and functions, all implemented in C and made available through sqlite's SQL queries.
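As a rough illustration of the "generate code and inject it" idea, the sketch below uses a list macro to declare one init function per configured tokenizer and to call them all when a connection is opened. Everything here is an assumption for the sake of the example: `APP_TOKENIZER_LIST`, the `app_*_init` names, `register_app_tokenizers` and the `fake_db` stand-in are hypothetical and are not what op-sqlite actually generates.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for a database handle; real code would take a sqlite3 pointer. */
typedef struct {
    int registered;
    const char *names[8];
} fake_db;

/* --- What a generated header might contain (hypothetical) --- */
/* One entry per name listed under the tokenizers key. */
#define APP_TOKENIZER_LIST(X) \
    X(word_tokenizer)         \
    X(email_tokenizer)

/* Declare one init function per configured tokenizer. */
#define DECLARE_INIT(name) int app_##name##_init(fake_db *db);
APP_TOKENIZER_LIST(DECLARE_INIT)
#undef DECLARE_INIT

/* --- What the user would fill in inside the implementation file --- */
int app_word_tokenizer_init(fake_db *db) {
    db->names[db->registered++] = "word_tokenizer";
    return 0;
}

int app_email_tokenizer_init(fake_db *db) {
    db->names[db->registered++] = "email_tokenizer";
    return 0;
}

/* --- What the library side could run when a connection is opened --- */
int register_app_tokenizers(fake_db *db) {
    assert(db != NULL);
    /* Expands to one guarded call per tokenizer in the list. */
#define CALL_INIT(name) { int rc = app_##name##_init(db); if (rc != 0) return rc; }
    APP_TOKENIZER_LIST(CALL_INIT)
#undef CALL_INIT
    return 0;
}
```

The point of the list macro is that only the generated header changes when the configured tokenizers change; the registration code stays the same. The same pattern could plausibly extend to custom functions or aggregators.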

Currently, this is partly implemented, but I'm looking for sponsors to finish the work, as it still needs:

  • Android support. Figure out how to replicate the codegen and file inclusion in CMakeLists.
  • Polish the inclusion of the header files to make sure it is robust
  • Tests

Here is an example of a simple tokenizer that already works on the test branch and inside the sample app.

[Screenshot: CleanShot 2024-11-03 at 10 11 52@2x]

Upvote & Fund

  • We're using Polar.sh so you can upvote and help fund this issue.
  • We receive the funding once the issue is completed & confirmed by you.
  • Thank you in advance for helping prioritize & fund our backlog.
@ospfranco ospfranco added the Fund label Nov 3, 2024
@ospfranco ospfranco self-assigned this Nov 3, 2024
@ospfranco
Contributor Author

Branch with support is out. Please follow the instructions here:

https://ospfranco.notion.site/Tokenizers-Beta-13c602a7113b80a296c8d51ad710658f?pvs=4

Please also help test in clean environments without tokenizers, just to make sure the compilation process is not frocked up.

@ArindamRayMukherjee

Compilation fails on this branch.
Reproducible PR here - danceaway-app/opsqlite-trials#2
