Incorrect Document Content in BlenderBot Tokenizer #19938

@chujiezheng

Description

The BlenderBot tokenizer has been trained to treat spaces as part of the tokens (a bit like sentencepiece), so a word is encoded differently depending on whether it is at the beginning of the sentence (without a preceding space) or not. However, the examples in the BlenderBot tokenizer documentation (BlenderbotTokenizer) are the same for both cases:

The same issue also occurs in BlenderbotTokenizerFast:
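
For illustration only (this is not the snippet from the documentation), here is a minimal toy sketch of what "spaces are part of the tokens" means in practice. The vocabulary `TOY_VOCAB`, its ids, and the helper `toy_encode` are hypothetical and are not BlenderBot's real vocabulary or API. The point is only that a correct example should show different ids for `"Hello world"` and `" Hello world"`.

```python
from typing import List

# Hypothetical toy vocabulary: the ids are made up for illustration
# and are NOT the real BlenderBot token ids.
TOY_VOCAB = {
    "Hello": 0,    # word at the start of the text, no leading space
    " Hello": 1,   # same word preceded by a space
    " world": 2,
    "world": 3,
}


def toy_encode(text: str) -> List[int]:
    """Greedy longest-match encoding where a leading space belongs to the token."""
    ids = []
    pos = 0
    while pos < len(text):
        # Pick the longest vocabulary entry that matches at the current position.
        match = max(
            (tok for tok in TOY_VOCAB if text.startswith(tok, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no toy token matches at position {pos}")
        ids.append(TOY_VOCAB[match])
        pos += len(match)
    return ids


print(toy_encode("Hello world"))   # [0, 2]
print(toy_encode(" Hello world"))  # [1, 2]
```

With a loaded `BlenderbotTokenizer`, the analogous check would be to compare `tokenizer("Hello world")["input_ids"]` with `tokenizer(" Hello world")["input_ids"]`. Per the description above, the two lists should differ.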
