Incorrect Document Content in BlenderBot Tokenizer

The docstring states: `The BlenderBot tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will be encoded differently whether it is at the beginning of the sentence (without space) or not.` However, the examples given in the `BlenderbotTokenizer` docstring to illustrate this show the same output in both cases:

https://github.com/huggingface/transformers/blob/bd469c40659ce76c81f69c7726759d249b4aef49/src/transformers/models/blenderbot/tokenization_blenderbot.py#L105

The same issue also occurs in `BlenderbotTokenizerFast`:

https://github.com/huggingface/transformers/blob/bd469c40659ce76c81f69c7726759d249b4aef49/src/transformers/models/blenderbot/tokenization_blenderbot_fast.py#L64
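
To check which behavior the tokenizers actually show, one option is to encode the same text with and without a leading space and compare the ids. The snippet below is only a sketch: `compare_leading_space` and `load_blenderbot_tokenizer` are hypothetical helper names introduced here, and the `facebook/blenderbot-3B` checkpoint name is an assumption about which model the docstring examples use.

```python
from typing import Callable, Dict, List, Tuple


def compare_leading_space(
    tokenizer: Callable[[str], Dict[str, List[int]]], text: str
) -> Tuple[List[int], List[int], bool]:
    """Encode `text` with and without a leading space.

    Returns both id lists and a flag that is True when they differ.
    (Hypothetical helper, not part of transformers.)
    """
    without_space = list(tokenizer(text)["input_ids"])
    with_space = list(tokenizer(" " + text)["input_ids"])
    return without_space, with_space, without_space != with_space


def load_blenderbot_tokenizer(name: str = "facebook/blenderbot-3B"):
    """Load the slow BlenderBot tokenizer (checkpoint name is an assumption)."""
    # Imported lazily so compare_leading_space works without transformers installed.
    from transformers import BlenderbotTokenizer

    return BlenderbotTokenizer.from_pretrained(name)
```

Calling `compare_leading_space(load_blenderbot_tokenizer(), "Hello world")` downloads the tokenizer files and returns the two id lists. If the flag is `False`, the description is what needs correcting; if it is `True`, the example outputs do. The same helper should work with `BlenderbotTokenizerFast`, since it only relies on the tokenizer being callable and returning `input_ids`.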


Incorrect Document Content in BlenderBot Tokenizer #19938
