
Conversation

@mfuntowicz (Member) commented Jan 29, 2020

Integrate the BPE-based tokenizers inside transformers.

  • Bert (100% match)
  • DistilBert (100% match)
  • OpenAI GPT (100% match)
  • GPT2 (100% match if no trailing \n)
  • Roberta (100% match if no trailing \n)
  • TransformerXL
  • CTRL (No binding will be provided).

Added priority for tokenizers with a fast (Rust) implementation in AutoTokenizer. This is done through a new mapping, (name: class) -> (name: Tuple[class, class]), which holds both the Python and the Rust implementation classes. If no Rust implementation is available, it is simply set to None. AutoTokenizer will pick the Rust class if it is not None, otherwise it defaults to the Python one.
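A rough sketch of what this priority mechanism could look like (the tokenizer classes below exist in transformers, but the mapping and helper names are illustrative, not the exact code from the PR):

from transformers import (
    BertTokenizer,
    BertTokenizerFast,
    CTRLTokenizer,
    GPT2Tokenizer,
    GPT2TokenizerFast,
)

# (python_class, rust_class) pairs; None means no Rust implementation is available.
TOKENIZER_MAPPING = {
    "bert": (BertTokenizer, BertTokenizerFast),
    "gpt2": (GPT2Tokenizer, GPT2TokenizerFast),
    "ctrl": (CTRLTokenizer, None),  # no Rust binding provided
}

def pick_tokenizer_class(model_type: str):
    python_cls, rust_cls = TOKENIZER_MAPPING[model_type]
    # Prefer the Rust-backed class when one exists, otherwise fall back to Python.
    return rust_cls if rust_cls is not None else python_cls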

Added some matching tests which basically check that a very high percentage of tokens match element-wise. The threshold is arbitrarily set to 0.05 (5%), i.e. at most 5% of the tokens may differ between the Python and Rust implementations.
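A minimal sketch of this kind of tolerance check (illustrative names, not the actual test code from the PR):

MAX_DIFF_RATIO = 0.05  # at most 5% of tokens may differ

def assert_tokenizations_close(python_tokens, rust_tokens, max_diff_ratio=MAX_DIFF_RATIO):
    # Compare the two tokenizations position by position and bound the mismatch rate.
    assert len(python_tokens) == len(rust_tokens)
    mismatches = sum(p != r for p, r in zip(python_tokens, rust_tokens))
    assert mismatches / max(len(python_tokens), 1) <= max_diff_ratio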

Added a return_offsets_mapping=False parameter to the encoding methods, which returns the offset mapping when using a Rust tokenizer. When using a Python tokenizer, a warning message is displayed through the module logger and the argument is discarded.
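A hedged usage sketch with a fast tokenizer class (the offsets shown in the comment are roughly what bert-base-uncased produces; special tokens map to (0, 0)):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoding = tokenizer.encode_plus("Hello world!", return_offsets_mapping=True)
# One (start, end) character span per token,
# e.g. [(0, 0), (0, 5), (6, 11), (11, 12), (0, 0)] with [CLS]/[SEP] at (0, 0).
print(encoding["offset_mapping"])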

@julien-c (Member)

only took a superficial look, but looks very clean 👍

Excited to use fast tokenizers by default!

@mfuntowicz (Member, Author)

Current CI issues are real and "normal": we need to release the next version of the tokenizers lib, which will bring all the dependencies.


def encode_batch(self, sequences: List[Union[str, Tuple[str, str]]]) -> List[Encoding]:
    return super().encode_batch(
        [seq.strip() if isinstance(seq, str) else (seq[0].strip(), seq[1].strip()) for seq in sequences]
Contributor

This should probably be an additional Normalizer. It would also let us keep track of the offsets. What do you think?
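For illustration, a minimal sketch of that suggestion, assuming a Strip normalizer is available in the bindings (it exists in later tokenizers releases, though possibly not in the version targeted by this PR):

from tokenizers import Tokenizer, normalizers
from tokenizers.models import WordPiece
from tokenizers.normalizers import BertNormalizer, Strip

# Hypothetical backend tokenizer; the point is only the normalizer chain.
backend = Tokenizer(WordPiece(unk_token="[UNK]"))
# Stripping happens inside the normalization pipeline, so offsets stay tracked,
# instead of calling .strip() on the raw inputs before encoding.
backend.normalizer = normalizers.Sequence([Strip(), BertNormalizer(lowercase=True)])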

Member Author

Yep, makes sense 👍

Contributor

Or maybe this deserves a 0.4.3 haha

  if return_special_tokens_mask:
-     encoding_dict["special_tokens_mask"] = encoding.special_tokens_mask
+     encoding_dict["special_tokens_mask"] = [e.special_tokens_mask for e in encodings]
  if return_offsets_mapping:
Contributor

If we don't give access to the normalized string somehow, we should maybe provide offsets to the original string here. Wdyt?

Member Author

Hum, currently the offsets are given w.r.t. the normalized string? If that is the case, then yes, we may want to provide offsets into the original string, or expose a utility method doing the mapping in Python.

Is it something we can easily expose on Encoding?

Contributor

Yes, offsets are related to the normalized string. You can retrieve the original offsets by doing encoding.original_str.offsets[encoding.offsets[X]]
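Spelled out for every token, following the pattern above (this relies on the Encoding API of the tokenizers version targeted by this PR; original_str was removed in later releases, where offsets refer to the original string directly):

# `tokenizer` is assumed to be any Rust-backed tokenizer built as in this PR.
encoding = tokenizer.encode("Héllo world")
for i, token in enumerate(encoding.tokens):
    normalized_span = encoding.offsets[i]  # offsets in the normalized string
    original_span = encoding.original_str.offsets[normalized_span]  # mapped back to the original string
    print(token, normalized_span, original_span)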

Member

I haven't followed the discussion very closely, but shouldn't offsets be returned based on the original string by default?

As an end user I don't think I really care about the normalized (internal) version of my input.

Thoughts?

Member Author

I tend to agree with @julien-c: the normalized string is more of an internal representation.

Member

Same for me, we should probably default to the original string.

@mfuntowicz force-pushed the tokenizers-v2 branch 3 times, most recently from 8c70bc6 to ef42cf5 on February 10, 2020 12:57
@codecov-io commented Feb 11, 2020

Codecov Report

Merging #2674 into master will increase coverage by 0.29%.
The diff coverage is 83.01%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #2674      +/-   ##
=========================================
+ Coverage      75%   75.3%   +0.29%     
=========================================
  Files          94      94              
  Lines       15288   15424     +136     
=========================================
+ Hits        11467   11615     +148     
+ Misses       3821    3809      -12
Impacted Files Coverage Δ
src/transformers/__init__.py 98.87% <100%> (ø) ⬆️
src/transformers/tokenization_roberta.py 100% <100%> (ø) ⬆️
src/transformers/tokenization_bert.py 96.92% <100%> (+0.3%) ⬆️
src/transformers/pipelines.py 70.88% <100%> (+0.14%) ⬆️
src/transformers/tokenization_distilbert.py 100% <100%> (ø) ⬆️
src/transformers/tokenization_gpt2.py 96.85% <100%> (+0.58%) ⬆️
src/transformers/tokenization_auto.py 97.22% <100%> (+0.25%) ⬆️
src/transformers/tokenization_transfo_xl.py 37.91% <51.42%> (+5.04%) ⬆️
src/transformers/tokenization_openai.py 82.27% <81.57%> (+0.46%) ⬆️
src/transformers/tokenization_utils.py 90.08% <87.23%> (+3.98%) ⬆️
... and 30 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 20fc18f...56748e8. Read the comment docs.

@n1t0 (Contributor) left a comment

Looks really good! Great job @mfuntowicz!

@mfuntowicz force-pushed the tokenizers-v2 branch 2 times, most recently from 88ffca6 to 63660c4 on February 12, 2020 13:23
@thomwolf (Member) left a comment

Great work @mfuntowicz!

setup.py (outdated)
  install_requires=[
      "numpy",
-     "tokenizers == 0.0.11",
+     "tokenizers == 0.4.2",
Member
>= ?

Member Author

As we don't have that many unit tests on the tokenizers Python bindings for now, I would tend to stick to a specific version that will be tested on the CI. Otherwise it might introduce some flaky tests when new tokenizers versions are released.

Contributor

Well, given that we are still introducing breaking changes pretty often in tokenizers, I would strongly advise against that.

  # Prepare inputs as tensors if asked
  if return_tensors == "tf" and is_tf_available():
-     encoding_dict["input_ids"] = tf.constant([encoding_dict["input_ids"]])
+     encoding_dict["input_ids"] = tf.constant(encoding_dict["input_ids"])
Member

Ok nice, so this will be a tensor with a "batch dimension" equal to the number of encodings when they are split into overflowing tokens. I like this solution, it's clean. We should document this behavior though.
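A hedged usage sketch of that behavior, written against the current transformers API (the parameter names and required arguments may differ slightly from what this PR exposed):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoding = tokenizer(
    "a very long sentence " * 100,
    max_length=32,
    truncation=True,
    padding="max_length",
    return_overflowing_tokens=True,
    return_tensors="tf",  # requires TensorFlow to be installed
)
# One row per overflowing chunk: shape is (num_chunks, 32).
print(encoding["input_ids"].shape)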

        return vocab_file, merge_file


class _OpenAIGPTCharBPETokenizer(BaseTokenizer):
Member

Why do we have to have this class here?

Don't we have an implementation of char-level BPE in tokenizers now?
Here: https://github.com/huggingface/tokenizers/blob/master/bindings/python/tokenizers/implementations/char_level_bpe.py#L9

Member Author

We do need a special OpenAI GPT implementation because it slightly differs from the char-level BPE we have in tokenizers (see the sketch after this list):

  • Normalizer is the same as Bert (BertNormalizer)
  • PreTokenizer is not Whitespace, it's the same as Bert (BertPreTokenizer)
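A rough sketch of that wiring with the tokenizers building blocks (constructor arguments and file paths below are assumptions for illustration, not the PR's exact code):

from tokenizers import Tokenizer, decoders
from tokenizers.models import BPE
from tokenizers.normalizers import BertNormalizer
from tokenizers.pre_tokenizers import BertPreTokenizer

vocab_file, merges_file = "openai-gpt-vocab.json", "openai-gpt-merges.txt"  # placeholder paths

tokenizer = Tokenizer(BPE.from_file(vocab_file, merges_file, unk_token="<unk>"))
tokenizer.normalizer = BertNormalizer(lowercase=True)   # same normalizer as Bert
tokenizer.pre_tokenizer = BertPreTokenizer()            # same pre-tokenizer as Bert
tokenizer.decoder = decoders.BPEDecoder(suffix="</w>")  # char-level BPE end-of-word suffix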

Member Author

If we put TransformerXL into tokenizers.implementations, maybe this one can make its way into tokenizers too. cc @n1t0

Contributor

Honestly, I'm not too sure about this. I think tokenizers should stay a library with some generic implementations, with an easy way for everybody to build their own custom tokenizer when needed. So I'd like to avoid introducing specific implementations for each new model/tokenizer. Otherwise, the next thing we'll discuss is whether we should have default vocabularies downloaded automatically with each specific implementation, and then we'll have as many implementations as there are models in transformers... I think it makes more sense to have specific customization details in transformers, next to the model that actually uses the custom tokenizer.

Member

> Otherwise, the next thing we'll discuss is whether we should have default vocabularies downloaded automatically with each specific implementation

You mean have all the things that made the success of Transformers? 😜

Just kidding. Well, OK for me to keep these in Transformers then.

        return symbols


class _TransfoXLDelimiterLookupTokenizer(BaseTokenizer):
Member

Can we move this class upstream in tokenizers now that we have a word-level model?

It would be more consistent.

Member Author

cc @n1t0, what do you think? I can put the content of this into tokenizers.implementations.

Contributor

cf comment for OpenAIGPTCharBPETokenizer

@LysandreJik merged commit 3f3fa7f into master on Feb 19, 2020
@mfuntowicz deleted the tokenizers-v2 branch on February 19, 2020 19:41