Tokenize spaces when explicitly requested and reduce iterations in occurrence counting loop #6
The original `tokenize` method assembled a sequence of words using the space character as a delimiter, then counted tokens using `split()` on the space character. This fails if the words passed to the tokenizer contain spaces. That is not normally the case when the input is a string or an array of words, but when single characters are used as tokens, spaces should be supported. Supporting them allows cosine comparisons on character arrays and ensures results match other cosine distance implementations. The new test added demonstrates this (it passes with the new code and fails with the original).
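A minimal sketch of the failure mode (in Python, with hypothetical function names; the library's actual tokenizer API may differ): round-tripping tokens through a space-joined string mangles any token that is itself a space, whereas keeping the token list intact preserves it.

```python
def tokenize_naive(words):
    # Original approach (illustrative): join tokens with spaces, then
    # split on the space character. A token that IS a space (" ") turns
    # into runs of separators and comes back as empty strings.
    return " ".join(words).split(" ")

def tokenize_fixed(words):
    # Fixed approach (illustrative): keep the token list as-is and never
    # round-trip it through a space-delimited string.
    return list(words)

# Single characters as tokens, including a space token:
chars = list("a b")               # ['a', ' ', 'b']
naive = tokenize_naive(chars)     # the space token is lost
fixed = tokenize_fixed(chars)     # the space token is preserved
```

With the naive version, the three-token input comes back as four tokens (`['a', '', '', 'b']`), which is why character-array cosine comparisons disagreed with other implementations.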
Additionally, the original loop that counts occurrences of each nGram iterated over all words in every segment, which could be quite inefficient. This version stops looping once `nGramMax` tokens have been counted and is up to 10x faster when comparing arrays of around 100 words.
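The early-exit idea can be sketched as follows (a hypothetical Python sketch; `count_occurrences`, its parameters, and the exact bound are illustrative assumptions, not the library's actual code): instead of scanning every token of every segment, each inner scan is capped at the first `ngram_max` tokens.

```python
def count_occurrences(segments, ngram, ngram_max):
    """Count segments whose first ngram_max tokens contain `ngram`.

    Illustrative sketch of the optimization: the inner loop is bounded
    by ngram_max rather than the full segment length, so long segments
    are not scanned past the point where a match can still be counted.
    """
    n = len(ngram)
    count = 0
    for seg in segments:
        # Early exit: only examine the first ngram_max tokens of the segment.
        limit = min(len(seg), ngram_max)
        for i in range(limit - n + 1):
            if seg[i:i + n] == ngram:
                count += 1
                break  # one hit per segment is enough; stop scanning
    return count

segments = [list("abc"), list("bcd")]
hits = count_occurrences(segments, list("bc"), ngram_max=3)
```

Bounding the scan this way is what makes the reported speedup plausible: for ~100-word arrays, most of each segment is never touched.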