Closed
Labels
stat:contributions welcome, type:feature
Description
This PR would be a replication of #621 for our three tokenizer layers: WordPieceTokenizer, BytePairTokenizer, and SentencePieceTokenizer. Unlike #621, we do not need to create new classes; we merely move some properties and methods to the existing ones.
Side note: Originally I wanted to move from_preset to tokenizer.Tokenizer, but preset loading differs for the three subclasses above.
Example steps for WordPieceTokenizer:
- Copy over the following methods/properties to the base class from BertTokenizer:
  - presets (return {})
  - from_preset, with string formatting params to adapt to preset and class names
- Remove from_preset from each subclass and fill in the string params for the from_preset docstring.
Following #621, add a mention of from_preset loading to the generic constructor docstring.
This process should be repeated for every XXTokenizer in the models/ folder.
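The steps above can be sketched roughly as follows. This is a minimal, self-contained illustration of the pattern, not the library's actual code: the classes are plain-Python stand-ins for the real layers, and the classproperty helper, the docstring template, the __init_subclass__ hook, and the preset name "toy_bert_uncased" are all assumptions made for the example. Real preset loading (for instance, fetching vocabulary files) is left out.

```python
class classproperty(property):
    """Hypothetical helper: a read-only property evaluated on the class."""

    def __get__(self, _, owner_cls):
        return self.fget(owner_cls)


# Assumed docstring template; the placeholders are the "string params".
_FROM_PRESET_DOC = """Instantiate a {model_name} from a preset.

Args:
    preset: string. Must be one of {preset_names}.
"""


class WordPieceTokenizer:
    """Stand-in for the generic layer. Subclasses may also be built
    with `from_preset()` when they define presets."""

    def __init__(self, vocabulary, lowercase=False):
        self.vocabulary = vocabulary
        self.lowercase = lowercase

    @classproperty
    def presets(cls):
        # The generic layer ships no presets of its own.
        return {}

    @classmethod
    def from_preset(cls, preset, **overrides):
        """Instantiate a tokenizer from a preset name."""
        if preset not in cls.presets:
            raise ValueError(
                f"`preset` must be one of {list(cls.presets)}; got {preset!r}."
            )
        config = {**cls.presets[preset], **overrides}
        return cls(**config)

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)

        # Give each subclass its own thin wrapper so that its docstring
        # can be filled in without touching the base class docstring.
        def from_preset(calling_cls, preset, **overrides):
            return super(cls, calling_cls).from_preset(preset, **overrides)

        from_preset.__doc__ = _FROM_PRESET_DOC.format(
            model_name=cls.__name__,
            preset_names=", ".join(f'"{name}"' for name in cls.presets),
        )
        cls.from_preset = classmethod(from_preset)


class BertTokenizer(WordPieceTokenizer):
    """Model-specific subclass: it only declares its presets."""

    @classproperty
    def presets(cls):
        # Hypothetical preset with an inline toy vocabulary.
        return {
            "toy_bert_uncased": {
                "vocabulary": ["[PAD]", "[UNK]", "the", "quick"],
                "lowercase": True,
            }
        }


tokenizer = BertTokenizer.from_preset("toy_bert_uncased")
```

With this layout, a model-specific tokenizer no longer carries its own from_preset; it inherits the loading logic and only supplies presets, while the per-class docstring is generated from the shared template.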