Tokenize spaces when explicitly requested and reduce iterations in occurrence counting loop #6

Open
colin-brown wants to merge 1 commit into master

Conversation

colin-brown

The original tokenize method assembled a sequence of words using the space character as a delimiter, then counted tokens using split() on the space character. This fails if there are spaces in the words passed to the tokenizer. This is not normally the case when used with a string or array of words, but in the case where you want to use single characters as your tokens, spaces should be supported. This allows cosine comparisons to be done on character arrays and ensures results match other cosine distance implementations.
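To illustrate, here is a minimal sketch of the failure mode and of a tokenizer that keeps space tokens; the `tokenize` function below is a simplified stand-in written for this example, not the library's actual implementation:

```js
// Character-level tokens, one of which is a space
const chars = ['a', ' ', 'b']

// Original approach: join on spaces, then split on spaces to count tokens
const joined = chars.join(' ')   // "a   b" — the space token blends into the delimiters
console.log(joined.split(' '))   // [ 'a', '', '', 'b' ] — the space token is lost

// Sketch of a tokenizer that preserves spaces when an array is supplied
function tokenize(input) {
  if (Array.isArray(input)) return input.slice()  // keep every element, including ' '
  return input.split(/\s+/).filter(Boolean)       // plain whitespace split for strings
}

console.log(tokenize(chars))     // [ 'a', ' ', 'b' ] — spaces survive as tokens
```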

The new test added demonstrates this (passes with this new code, fails with the original).

Additionally, the original loop that counts occurrences of each nGram looped through all words in every segment. This could be quite inefficient. This version stops looping once nGramMax tokens have been counted and is up to 10x faster when comparing arrays of around 100 words.
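As a rough sketch of that optimization (the function name, signature, and occurrence map below are assumptions for illustration, not the library's actual code), the inner loop can be bounded by nGramMax instead of the full token list:

```js
// Count n-gram occurrences, stopping each inner pass once nGramMax tokens have been consumed
function countNGrams(tokens, nGramMin, nGramMax) {
  const occurrences = {}
  for (let i = 0; i < tokens.length; i++) {
    let gram = ''
    // Only the next nGramMax tokens can contribute to an n-gram starting at i,
    // so the inner loop stops there rather than scanning to the end of the segment
    for (let n = 1; n <= nGramMax && i + n <= tokens.length; n++) {
      gram += tokens[i + n - 1]
      if (n >= nGramMin) {
        occurrences[gram] = (occurrences[gram] || 0) + 1
      }
    }
  }
  return occurrences
}

// Example: bigrams over a character array
console.log(countNGrams(['a', ' ', 'b'], 2, 2))   // { 'a ': 1, ' b': 1 }
```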

…ed as expected. Improved tokenizer to 1. create tokens for spaces when the supplied array includes them 2. optimize the occurrence counting loop so that iteration stops once nGramMax tokens are acquired
@colin-brown
Author

This is my first time using GitHub to fork and make a pull request, so please let me know if I have done anything wrong or if you have any questions about the changes.

@andreekeberg
Owner

Hi @colin-brown, just updating this to let you know your pull request has been noted and is very appreciated! I'm currently working on a new version of the library with some other updates to the tokenizer, and I will review your contribution and try to merge it into the new version as soon as possible.

If I have any questions while reviewing and/or merging with my other updates, I will let you know in this thread!

Once again, thank you for your contribution, it's incredibly fun to see how much interest this project has generated, something I didn't expect while building the first version!

@colin-brown
Author

Hi @andreekeberg, great to hear! Let me know if I can be of any assistance. The unit tests I added should tell you whether the support for spaces is working.
