Ensure tokens don't end up with leading or trailing whitespace #79
Open

PeterReid wants to merge 1 commit into hexgrad:main from
Conversation
Previously, two spaces, for example between sentences, would lead to the token following the spaces being prefixed by a space. That would lead to it registering as not in the lexicon, and then passing the prefixed word into the fallback.
joshwhiton added a commit to joshwhiton/misaki that referenced this pull request on Dec 30, 2025:
- PR hexgrad#90: Restrict spacy<4 to avoid pre-release/yanked versions. Fixes Python 3.13 compatibility issues with thinc/blis dependencies.
- PR hexgrad#79: Strip whitespace from merged tokens. Fixes lexicon lookup failures when multiple spaces appear between words.
There are a few places this strip() call could have been added, but I tried to choose the most sensible one. Wherever it goes, it should be somewhere that affects both the lexicon check and the fallback.
As an example of what was going on: "Sentence(SPACE)one.(SPACE)(SPACE)And(SPACE)two." would end up trying to find " And" in the lexicon, failing, and then passing that string to the fallback. The transformer-based fallback has no idea what to make of the leading space (it never appears in the training set) and was producing random-seeming phonemes.
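The failure mode above can be sketched in a few lines. This is a toy illustration, not the misaki code: the `lexicon` dict, the `lookup` helper, and the phoneme strings are all made up for demonstration, but they show why a space-prefixed token misses the lexicon and why stripping before the lookup (and therefore before any fallback) fixes it.

```python
# Toy lexicon standing in for misaki's real word->phoneme table (hypothetical).
lexicon = {"and": "ænd", "two": "tu"}

def lookup(token: str):
    """Return phonemes if the normalized token is in the lexicon, else None.

    strip() is the fix from this PR: without it, a double space between
    sentences leaves the next token as " And", which misses the lexicon
    and would fall through to the transformer fallback.
    """
    return lexicon.get(token.strip().lower())

# A double space yields a token like " And"; the raw form misses the lexicon:
assert " and" not in lexicon
# ...but the stripped token hits it:
assert lookup(" And") == "ænd"
```

Because the strip happens before the lookup, the same normalized string is what any fallback would receive, so both code paths see a clean token.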