Skip to content

Conversation

@ArthurZucker
Copy link
Collaborator

What does this PR do?

Fixes #29440 by always adding the user_defined_symbols to the tokenizer, otherwise because no merges exist for them, they will be split.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker ArthurZucker requested a review from pcuenca March 6, 2024 02:42
Copy link
Member

@pcuenca pcuenca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea! There are a few additional tokens inuser_defined_symbols, I'll push PRs to the repos.

@ArthurZucker ArthurZucker marked this pull request as ready for review March 6, 2024 13:47
@ArthurZucker
Copy link
Collaborator Author

I will add a test and we can merge

Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood! cc @ydshieh that this will introduce some failing tests in the slow CI, we're aware (maybe we can skip those until we get to it? after the release)

@require_tokenizers
class LlamaTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
from_pretrained_id = "hf-internal-testing/llama-tokenizer"
from_pretrained_id = ["hf-internal-testing/llama-tokenizer", "meta-llama/Llama-2-7b-hf"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@ArthurZucker ArthurZucker changed the title [GemmaConverte] use user_defined_symbols [GemmaConverter] use user_defined_symbols Mar 19, 2024
@ydshieh
Copy link
Collaborator

ydshieh commented Mar 19, 2024

Thank you for warning me. But could you share with me the reason to merge before we resolve the failing tests?

@ArthurZucker
Copy link
Collaborator Author

We need to release !

@ydshieh
Copy link
Collaborator

ydshieh commented Mar 19, 2024

It would be very appreciated that the expected failing tests to be skipped temporarily (before they would be fixed/updated after the release).

@ArthurZucker
Copy link
Collaborator Author

We can open a PR to skip / fix them tomorrow!

@ArthurZucker
Copy link
Collaborator Author

Failing tests are unrelated, merging!

@ArthurZucker ArthurZucker merged commit 2f9a3ed into main Mar 19, 2024
@ArthurZucker ArthurZucker deleted the fix-spm-converter branch March 19, 2024 14:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

AutoTokenizer tokenization issue

5 participants