Move from_preset to base tokenizer classes #648

@jbischof

This PR would replicate #621 for our three tokenizer layers: WordPieceTokenizer, BytePairTokenizer, and SentencePieceTokenizer. Unlike #621, we do not need to create new classes; we only need to move some properties and methods to the existing ones.

Side note: Originally I wanted to move from_preset to tokenizer.Tokenizer, but preset loading differs for the three subclasses above.

Example steps for WordPieceTokenizer:

  • Copy the following methods/properties from BertTokenizer to the base class:
    • presets (returning {})
    • from_preset, with string formatting params so the docstring can adapt to preset and class names
  • Remove from_preset from each subclass and fill in the string params for the from_preset docstring.
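
The steps above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual KerasNLP code: `BaseTokenizerSketch`, `BertTokenizerSketch`, the `classproperty` helper, the `FROM_PRESET_DOC` template, and the `bert_tiny_demo` preset are all hypothetical names. Filling in the docstring through `__init_subclass__` is one possible approach, assumed here for illustration.

```python
# Hypothetical docstring template; the placeholders are the "string params".
FROM_PRESET_DOC = """Instantiate a {class_name} from a preset vocabulary.

Args:
    preset: string. Must be one of {preset_names}.
"""


class classproperty(property):
    """Minimal read-only property that resolves on the class (assumed helper)."""

    def __get__(self, instance, owner):
        return self.fget(owner)


class BaseTokenizerSketch:
    """Stand-in for a base tokenizer layer; only preset handling is modeled."""

    def __init__(self, vocabulary, lowercase=False):
        self.vocabulary = vocabulary
        self.lowercase = lowercase

    @classproperty
    def presets(cls):
        # The base class ships no presets; model subclasses override this.
        return {}

    @classmethod
    def from_preset(cls, preset, **kwargs):
        if preset not in cls.presets:
            raise ValueError(
                f"`preset` must be one of {sorted(cls.presets)}; got {preset!r}."
            )
        # Preset config first, then caller overrides.
        config = {**cls.presets[preset], **kwargs}
        return cls(**config)

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)

        # Give each subclass its own thin wrapper so it can carry its own
        # docstring without touching the base class docstring.
        def from_preset(calling_cls, preset, **kw):
            return super(cls, calling_cls).from_preset(preset, **kw)

        from_preset.__doc__ = FROM_PRESET_DOC.format(
            class_name=cls.__name__,
            preset_names=", ".join(sorted(cls.presets)),
        )
        cls.from_preset = classmethod(from_preset)


class BertTokenizerSketch(BaseTokenizerSketch):
    """Model-specific subclass: only declares presets, no from_preset."""

    @classproperty
    def presets(cls):
        # Made-up preset for illustration only.
        return {
            "bert_tiny_demo": {
                "vocabulary": ["[UNK]", "the", "fox"],
                "lowercase": True,
            }
        }


tokenizer = BertTokenizerSketch.from_preset("bert_tiny_demo")
```

With this layout, the subclass body shrinks to its `presets` definition, while `help(BertTokenizerSketch.from_preset)` still shows the subclass name and its preset names.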

Following #621, add a mention of from_preset loading to the generic constructor docstring.

This process should be repeated for every XXTokenizer in the models/ folder.

Labels: stat:contributions welcome, type:feature