Merge pull request #455 from gahdritz/main
Fix a bug w.r.t. how local tokenizers are handled
dirkgr authored Mar 12, 2024
2 parents ed47c29 + d297f88 commit 7eb7f3d
Showing 2 changed files with 13 additions and 5 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
```diff
@@ -18,6 +18,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Fixed

 - Don't log garbage on nodes that aren't rank 0
+- Don't crash in the HF code when we are referring to a tokenizer in a local file


 ## [v0.2.5](https://github.com/allenai/OLMo/releases/tag/v0.2.5) - 2024-03-06
```
17 changes: 12 additions & 5 deletions olmo/tokenizer.py
```diff
@@ -111,11 +111,18 @@ def from_checkpoint(cls, checkpoint_dir: PathOrStr) -> Tokenizer:
         model_config = ModelConfig.load(config_path, key="model")

         # Initialize tokenizer and validate vocab size.
-        tokenizer = cls.from_pretrained(
-            tokenizer_config.identifier,
-            eos_token_id=model_config.eos_token_id,
-            pad_token_id=model_config.pad_token_id,
-        )
+        if Path(tokenizer_config.identifier).is_file():
+            tokenizer = cls.from_file(
+                tokenizer_config.identifier,
+                eos_token_id=model_config.eos_token_id,
+                pad_token_id=model_config.pad_token_id,
+            )
+        else:
+            tokenizer = cls.from_pretrained(
+                tokenizer_config.identifier,
+                eos_token_id=model_config.eos_token_id,
+                pad_token_id=model_config.pad_token_id,
+            )
         if model_config.vocab_size != tokenizer.vocab_size:
             raise OLMoConfigurationError("vocab size mismatch between config and tokenizer")
         return tokenizer
```
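The fix dispatches on whether the tokenizer identifier points to a file on disk (load it locally via `from_file`) or is a Hugging Face hub name (fall back to `from_pretrained`). A minimal, standard-library-only sketch of that dispatch, assuming a hypothetical `resolve_source` helper (the name is illustrative, not from the repo):

```python
from pathlib import Path


def resolve_source(identifier: str) -> str:
    # Mirrors the branch added in this commit: an identifier that exists
    # as a file on disk is treated as a local tokenizer file; anything
    # else is assumed to be a Hugging Face hub identifier.
    if Path(identifier).is_file():
        return "file"
    return "pretrained"
```

Before this commit, `from_pretrained` was called unconditionally, so a local path like `tokenizers/my_tokenizer.json` was sent to the hub-resolution code and crashed there.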
