
Conversation

@mfuntowicz (Member) commented Jan 29, 2020

Integrate the BPE-based tokenizers inside transformers.

  • Bert (100% match)
  • DistilBert (100% match)
  • OpenAI GPT (100% match)
  • GPT2 (100% match if no trailing \n)
  • Roberta (100% match if no trailing \n)
  • TransformerXL
  • CTRL (No binding will be provided).

Added priority for tokenizers with a fast (Rust) implementation in AutoTokenizer. This is done through a new mapping, (name: class) -> (name: Tuple[class, class]), which holds both the Python and the Rust implementation classes. If no Rust implementation is available, it is simply set to None. AutoTokenizer will pick the Rust class if it is not None, otherwise it defaults to the Python one.
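A rough sketch of what this priority mechanism could look like (the tokenizer classes below exist in transformers, but the mapping and helper names are illustrative, not the exact code from the PR):

from transformers import (
    BertTokenizer,
    BertTokenizerFast,
    CTRLTokenizer,
    GPT2Tokenizer,
    GPT2TokenizerFast,
)

# (python_class, rust_class) pairs; None means no Rust implementation is available.
TOKENIZER_MAPPING = {
    "bert": (BertTokenizer, BertTokenizerFast),
    "gpt2": (GPT2Tokenizer, GPT2TokenizerFast),
    "ctrl": (CTRLTokenizer, None),  # no Rust binding provided
}

def pick_tokenizer_class(model_type: str):
    python_cls, rust_cls = TOKENIZER_MAPPING[model_type]
    # Prefer the Rust-backed class when one exists, otherwise fall back to Python.
    return rust_cls if rust_cls is not None else python_cls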

Added some matching tests which basically check that a very high percentage of tokens match element-wise. The threshold is arbitrarily set to 0.05 (5%), i.e. at most 5% of the tokens may differ between the Python and Rust implementations.
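A minimal sketch of this kind of tolerance check (illustrative names, not the actual test code from the PR):

MAX_DIFF_RATIO = 0.05  # at most 5% of tokens may differ

def assert_tokenizations_close(python_tokens, rust_tokens, max_diff_ratio=MAX_DIFF_RATIO):
    # Compare the two tokenizations position by position and bound the mismatch rate.
    assert len(python_tokens) == len(rust_tokens)
    mismatches = sum(p != r for p, r in zip(python_tokens, rust_tokens))
    assert mismatches / max(len(python_tokens), 1) <= max_diff_ratio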

Added a return_offsets_mapping=False parameter to the encoding methods, which returns the offset mapping when using a Rust tokenizer. When using a Python tokenizer, a warning message is displayed through the module logger and the argument is discarded.
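A hedged usage sketch with a fast tokenizer class (the offsets shown in the comment are roughly what bert-base-uncased produces; special tokens map to (0, 0)):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoding = tokenizer.encode_plus("Hello world!", return_offsets_mapping=True)
# One (start, end) character span per token,
# e.g. [(0, 0), (0, 5), (6, 11), (11, 12), (0, 0)] with [CLS]/[SEP] at (0, 0).
print(encoding["offset_mapping"])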

@julien-c (Member)

only took a superficial look, but looks very clean 👍

Excited to use fast tokenizers by default!

@mfuntowicz (Member, Author)

Current CI issues are real and "normal": we need to release the next version of the tokenizers lib, which will bring all the dependencies.


def encode_batch(self, sequences: List[Union[str, Tuple[str, str]]]) -> List[Encoding]:
    return super().encode_batch(
        [seq.strip() if isinstance(seq, str) else (seq[0].strip(), seq[1].strip()) for seq in sequences]
Contributor

This should probably be an additional Normalizer. It would also let us keep track of the offsets. What do you think?
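For illustration, a minimal sketch of that suggestion, assuming a Strip normalizer is available in the bindings (it exists in later tokenizers releases, though possibly not in the version targeted by this PR):

from tokenizers import Tokenizer, normalizers
from tokenizers.models import WordPiece
from tokenizers.normalizers import BertNormalizer, Strip

# Hypothetical backend tokenizer; the point is only the normalizer chain.
backend = Tokenizer(WordPiece(unk_token="[UNK]"))
# Stripping happens inside the normalization pipeline, so offsets stay tracked,
# instead of calling .strip() on the raw inputs before encoding.
backend.normalizer = normalizers.Sequence([Strip(), BertNormalizer(lowercase=True)])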

Member Author

Yep, makes sense 👍

Contributor

Or maybe this deserves a 0.4.3 haha

  if return_special_tokens_mask:
-     encoding_dict["special_tokens_mask"] = encoding.special_tokens_mask
+     encoding_dict["special_tokens_mask"] = [e.special_tokens_mask for e in encodings]
  if return_offsets_mapping:
Contributor

If we don't give access to the normalized string somehow, we should maybe provide offsets to the original string here. Wdyt?

Member Author

Hum, currently the offsets are given w.r.t. the normalized string? If that is the case, then yes, we may want to provide offsets into the original string, or expose a utility method doing the mapping in Python.

Is it something we can easily expose on Encoding?

Contributor

Yes, offsets are related to the normalized string. You can retrieve the original offsets by doing encoding.original_str.offsets[encoding.offsets[X]]
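Spelled out for every token, following the pattern above (this relies on the Encoding API of the tokenizers version targeted by this PR; original_str was removed in later releases, where offsets refer to the original string directly):

# `tokenizer` is assumed to be any Rust-backed tokenizer built as in this PR.
encoding = tokenizer.encode("Héllo world")
for i, token in enumerate(encoding.tokens):
    normalized_span = encoding.offsets[i]  # offsets in the normalized string
    original_span = encoding.original_str.offsets[normalized_span]  # mapped back to the original string
    print(token, normalized_span, original_span)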

Member

I haven't followed the discussion very closely, but shouldn't offsets be returned based on the original string by default?

As an end user I don't think I really care about the normalized (internal) version of my input.

Thoughts?

Member Author

I tend to agree with @julien-c: the normalized string is more of an internal representation.

Member

Same for me, we should probably default to the original string.

@mfuntowicz force-pushed the tokenizers-v2 branch 3 times, most recently from 8c70bc6 to ef42cf5 on February 10, 2020 12:57
@codecov-io commented Feb 11, 2020

Codecov Report

Merging #2674 into master will increase coverage by 0.29%.
The diff coverage is 83.01%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #2674      +/-   ##
=========================================
+ Coverage      75%   75.3%   +0.29%     
=========================================
  Files          94      94              
  Lines       15288   15424     +136     
=========================================
+ Hits        11467   11615     +148     
+ Misses       3821    3809      -12
Impacted Files Coverage Δ
src/transformers/__init__.py 98.87% <100%> (ø) ⬆️
src/transformers/tokenization_roberta.py 100% <100%> (ø) ⬆️
src/transformers/tokenization_bert.py 96.92% <100%> (+0.3%) ⬆️
src/transformers/pipelines.py 70.88% <100%> (+0.14%) ⬆️
src/transformers/tokenization_distilbert.py 100% <100%> (ø) ⬆️
src/transformers/tokenization_gpt2.py 96.85% <100%> (+0.58%) ⬆️
src/transformers/tokenization_auto.py 97.22% <100%> (+0.25%) ⬆️
src/transformers/tokenization_transfo_xl.py 37.91% <51.42%> (+5.04%) ⬆️
src/transformers/tokenization_openai.py 82.27% <81.57%> (+0.46%) ⬆️
src/transformers/tokenization_utils.py 90.08% <87.23%> (+3.98%) ⬆️
... and 30 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 20fc18f...56748e8. Read the comment docs.

@n1t0 (Contributor) left a comment

Looks really good! Great job @mfuntowicz!

@mfuntowicz force-pushed the tokenizers-v2 branch 2 times, most recently from 88ffca6 to 63660c4 on February 12, 2020 13:23
@thomwolf (Member) left a comment

Great work @mfuntowicz!

setup.py (outdated)
  install_requires=[
      "numpy",
-     "tokenizers == 0.0.11",
+     "tokenizers == 0.4.2",
Member
>= ?

Member Author

As we don't have that many unit tests on the tokenizers Python bindings for now, I would tend to stick to a specific version that will be tested on the CI. Otherwise it might introduce some flaky tests when new tokenizers versions are released.

Contributor

Well, given that we are still introducing breaking changes pretty often in tokenizers, I would strongly advise against that.

  # Prepare inputs as tensors if asked
  if return_tensors == "tf" and is_tf_available():
-     encoding_dict["input_ids"] = tf.constant([encoding_dict["input_ids"]])
+     encoding_dict["input_ids"] = tf.constant(encoding_dict["input_ids"])
Member

Ok nice, so this will be a tensor with a "batch dimension" equal to the number of encodings when they are split into overflowing tokens. I like this solution, it's clean. We should document this behavior though.
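A hedged usage sketch of that behavior, written against the current transformers API (the parameter names and required arguments may differ slightly from what this PR exposed):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoding = tokenizer(
    "a very long sentence " * 100,
    max_length=32,
    truncation=True,
    padding="max_length",
    return_overflowing_tokens=True,
    return_tensors="tf",  # requires TensorFlow to be installed
)
# One row per overflowing chunk: shape is (num_chunks, 32).
print(encoding["input_ids"].shape)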

        return vocab_file, merge_file


class _OpenAIGPTCharBPETokenizer(BaseTokenizer):
Member

Why do we have to have this class here?

Don't we have an implementation of char-level BPE in tokenizers now?
Here: https://github.com/huggingface/tokenizers/blob/master/bindings/python/tokenizers/implementations/char_level_bpe.py#L9

Member Author

We do need a special OpenAI GPT implementation because it slightly differs from the char-level BPE we have in tokenizers (see the sketch after this list):

  • Normalizer is the same as Bert (BertNormalizer)
  • PreTokenizer is not Whitespace, it's the same as Bert (BertPreTokenizer)
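A rough sketch of that wiring with the tokenizers building blocks (constructor arguments and file paths below are assumptions for illustration, not the PR's exact code):

from tokenizers import Tokenizer, decoders
from tokenizers.models import BPE
from tokenizers.normalizers import BertNormalizer
from tokenizers.pre_tokenizers import BertPreTokenizer

vocab_file, merges_file = "openai-gpt-vocab.json", "openai-gpt-merges.txt"  # placeholder paths

tokenizer = Tokenizer(BPE.from_file(vocab_file, merges_file, unk_token="<unk>"))
tokenizer.normalizer = BertNormalizer(lowercase=True)   # same normalizer as Bert
tokenizer.pre_tokenizer = BertPreTokenizer()            # same pre-tokenizer as Bert
tokenizer.decoder = decoders.BPEDecoder(suffix="</w>")  # char-level BPE end-of-word suffix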

Member Author

If we put TransformerXL into tokenizers.implementations, maybe this one can make its way into tokenizers too. cc @n1t0

Contributor

Honestly, I'm not too sure about this. I think tokenizers should stay a library with some generic implementations, with an easy way for everybody to build their own custom tokenizer when needed. So I'd like to avoid introducing specific implementations for each new model/tokenizer. Otherwise, the next thing we'll discuss is whether we should have default vocabularies downloaded automatically with each specific implementation, and then we'll have as many implementations as there are models in transformers... I think it makes more sense to have specific customization details in transformers, next to the model that actually uses the custom tokenizer.

Member

> Otherwise, the next thing we'll discuss is whether we should have default vocabularies downloaded automatically with each specific implementation

You mean have all the things that made the success of Transformers? 😜

Just kidding. Well, OK for me to keep these in Transformers then.

        return symbols


class _TransfoXLDelimiterLookupTokenizer(BaseTokenizer):
Member

Can we move this class upstream in tokenizers now that we have a word-level model?

It would be more consistent.

Member Author

cc @n1t0, what do you think? I can put the content of this into tokenizers.implementations.

Contributor

cf comment for OpenAIGPTCharBPETokenizer

@LysandreJik merged commit 3f3fa7f into master on Feb 19, 2020
@mfuntowicz deleted the tokenizers-v2 branch on February 19, 2020 19:41