Tokenize spaces when explicitly requested and reduce iterations in occurrence counting loop #6
The original `tokenize` method assembled a sequence of words using the space character as a delimiter, then counted tokens using `split()` on the space character. This fails if the words passed to the tokenizer contain spaces. That is not normally the case when the input is a string or an array of words, but when single characters are used as tokens, spaces should be supported. Supporting them allows cosine comparisons on character arrays and ensures results match other cosine distance implementations. The new test added demonstrates this (it passes with the new code and fails with the original).
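A minimal sketch of the failure mode (in Python, with hypothetical function names; the library's actual tokenizer API may differ): round-tripping tokens through a space-joined string mangles any token that is itself a space, whereas keeping the token list intact preserves it.

```python
def tokenize_naive(words):
    # Original approach (illustrative): join tokens with spaces, then
    # split on the space character. A token that IS a space (" ") turns
    # into runs of separators and comes back as empty strings.
    return " ".join(words).split(" ")

def tokenize_fixed(words):
    # Fixed approach (illustrative): keep the token list as-is and never
    # round-trip it through a space-delimited string.
    return list(words)

# Single characters as tokens, including a space token:
chars = list("a b")               # ['a', ' ', 'b']
naive = tokenize_naive(chars)     # the space token is lost
fixed = tokenize_fixed(chars)     # the space token is preserved
```

With the naive version, the three-token input comes back as four tokens (`['a', '', '', 'b']`), which is why character-array cosine comparisons disagreed with other implementations.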
Additionally, the original loop that counts occurrences of each nGram iterated over all words in every segment, which could be quite inefficient. This version stops looping once `nGramMax` tokens have been counted and is up to 10x faster when comparing arrays of around 100 words.
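The early-exit idea can be sketched as follows (a hypothetical Python sketch; `count_occurrences`, its parameters, and the exact bound are illustrative assumptions, not the library's actual code): instead of scanning every token of every segment, each inner scan is capped at the first `ngram_max` tokens.

```python
def count_occurrences(segments, ngram, ngram_max):
    """Count segments whose first ngram_max tokens contain `ngram`.

    Illustrative sketch of the optimization: the inner loop is bounded
    by ngram_max rather than the full segment length, so long segments
    are not scanned past the point where a match can still be counted.
    """
    n = len(ngram)
    count = 0
    for seg in segments:
        # Early exit: only examine the first ngram_max tokens of the segment.
        limit = min(len(seg), ngram_max)
        for i in range(limit - n + 1):
            if seg[i:i + n] == ngram:
                count += 1
                break  # one hit per segment is enough; stop scanning
    return count

segments = [list("abc"), list("bcd")]
hits = count_occurrences(segments, list("bc"), ngram_max=3)
```

Bounding the scan this way is what makes the reported speedup plausible: for ~100-word arrays, most of each segment is never touched.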