Fix/extend re replacement seq #948
@rlouf The tests are failing because we need a Hugging Face token to load the NorwAI tokeniser, which is the reason this PR is needed in the first place. Would it be possible to store a Hugging Face token as a GitHub secret to deal with this?
That's great, thanks. Can you please get access to the following models with that token?
Those are the culprits of the failing tests. Also, just to be sure, the Alternatively, if that's too much of a hassle, I can simply include the failure cases manually, rather than accessing them from "real" tokenisers. Let me know what you think.
It's probably better to do this "manually" indeed.
@rlouf Changed the test to a "manual" one now, and all tests pass 🙂
Awesome, thank you! |
This PR is an extension of dottxt-ai#763 and extends the `re_replacement_seq` regex. The new [NorwAI models](https://huggingface.co/NorwAI) use a tokenizer that has the token `�.`, which leads to the same error as described in the previous issue dottxt-ai#762. This PR extends the fix from dottxt-ai#763 to handle this case, adds a unit test covering various tokenizers, and adds a comment describing why we need the prefix and suffix in the regex.
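To illustrate the kind of change described above, here is a minimal sketch. It assumes the regex's job is to recognise tokens made of the replacement character U+FFFD, optionally surrounded by a few prefix and suffix characters. Both patterns and the helper `is_replacement_seq` are hypothetical and written for illustration only; the actual `re_replacement_seq` in the PR may differ.

```python
import re

# Hypothetical "strict" pattern: the token must consist only of U+FFFD.
STRICT_SKETCH = re.compile(r"^\ufffd+$")

# Hypothetical "extended" pattern: allow a short prefix and suffix around
# the run of U+FFFD characters (the bound of 6 is an assumption).
EXTENDED_SKETCH = re.compile(r"^.{0,6}\ufffd+.{0,6}$")


def is_replacement_seq(token: str, pattern: re.Pattern = EXTENDED_SKETCH) -> bool:
    """Return True if the token matches the given replacement-sequence pattern."""
    return pattern.match(token) is not None


if __name__ == "__main__":
    # "\ufffd." stands in for the NorwAI token mentioned in the description;
    # "\u2581\ufffd" is a made-up token with a prefix character.
    for tok in ["\ufffd.", "\u2581\ufffd", "\ufffd", "hello"]:
        print(
            repr(tok),
            is_replacement_seq(tok, STRICT_SKETCH),
            is_replacement_seq(tok, EXTENDED_SKETCH),
        )
```

In this sketch the strict pattern rejects a token such as `�.` because of the trailing character, while the extended pattern accepts it. That is the sort of case a manually written test list could cover without needing access to the real tokenisers.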