Skip to content

Conversation

@Narsil
Copy link
Contributor

@Narsil Narsil commented Jun 6, 2022

Should get #935 started.

Ideally we should test compliance of some well known tokenizers
Maybe a small tokenizer training.

Feature is named unstable_wasm (as well as the example). Should at least make sure people understand that 1:1 parity is NOT to be expected.

The reason is that onig the current regexp engine has to be swapped out for fancy_regex which should match, but we have no guarantee (and onig is a C dependency without any wasm support that we know of).

The esaxx-rs has been modified to allow dropping out of the C dependency requirements, it uses the rust code (which is slower than it's cpp counterpart). Then all optional features are disabled for wasm cli, http and progressbar.

This PR also adds esaxx_fast feature to enable easier control of the C version of the esaxx-rs code. Since features are additive there was no way to remove the cpp feature only for wasm. So instead there's a new feature within the default feature, which gets disabled by the example default-features = false.

@Narsil Narsil requested a review from McPatate June 6, 2022 18:47
@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Jun 6, 2022

The documentation is not available anymore as the PR was closed or merged.

@Narsil Narsil mentioned this pull request Jun 6, 2022
4 tasks
@Narsil Narsil requested a review from mishig25 June 7, 2022 05:36
Copy link
Member

@McPatate McPatate left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should remove unnecessary files like licenses and CI stuff in the example.

Was the deletion of the Cargo.locks in the bindings intended ?

Also I'm curious, what does Sys in SysRegex stand for ?

Cool to bring tokenizers to wasm ! 🔥

@@ -0,0 +1,11 @@
install:
- appveyor-retry appveyor DownloadFile https://win.rustup.rs/ -FileName rustup-init.exe
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use appveyor for the CI ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's in the default template for wasm app. I don't really see a lot of downsides of keeping them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd leave it up to the user to know or not to add these kinds of files, they don't showcase anything particular to tokenizers and are readily available on wasm tutorials. But that's just my 2 cents !

@@ -0,0 +1,69 @@
language: rust
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah maybe these CI files were autogenerated ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -0,0 +1,69 @@
<div align="center">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also may want to change the README's content at some point ^^

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a simple line so it's not the pure wasm tutorial (commands explained in it are still useful I find)

Comment on lines 13 to 21
#[wasm_bindgen]
extern "C" {
fn alert(s: &str);
}

#[wasm_bindgen]
pub fn greet() {
alert("Hello, {{project-name}}!");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may want to remove this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed !

@@ -1,4 +1,4 @@
use onig::Regex;
use crate::utils::SysRegex as Regex;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SysRegex is used in other places of the code, won't this alias make the code a bit confusing ? Either work for me, SysRegex or Regex

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of Regex everywhere in code.

SysRegex tends to imply System meaning the real underlying regex engine.
Happy to put a better name.

And I should rename everything instead of using as Regex it was because of laziness :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with SysRegex, as long as it's consistent everywhere :)

Copy link
Member

@McPatate McPatate left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm !

@Narsil Narsil merged commit adf90dc into main Jun 10, 2022
@McPatate McPatate deleted the wasm_support branch June 10, 2022 13:02
Narsil added a commit that referenced this pull request Aug 23, 2022
…#1009)

* Adding `unstable_wasm` feature + example to run `tokenizers` on wasm.

Co-Authored-By: josephrocca <[email protected]>
Co-Authored-By: Matthias Brunel <[email protected]>

* Adding some serialization tests.

* Updating with comments.

Co-authored-by: josephrocca <[email protected]>
Co-authored-by: Matthias Brunel <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants