
Separate tokenizer from hasher #162

Merged · 43 commits · Mar 5, 2018

Conversation

@piroor (Contributor) commented Jun 30, 2017

I'm separating the tokenizing operations from the hasher, as a first step toward #131. I've introduced these new modules and classes:

  • Tokenizer::Whitespace
  • Tokenizer::Token
  • TokenFilter::Stopword
  • TokenFilter::Stemmer

For testability and flexibility, they remain separate for now. As a next step, I plan to introduce a mechanism for switching the tokenizer and related modules.

How about this approach?
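As a rough illustration of the separation being proposed, here is a minimal sketch: the tokenizer splits text, and each token filter transforms the token list. These are simplified stand-ins, not the PR's actual implementations.

```ruby
# Illustrative sketch only; the real classes live in ClassifierReborn.
module Tokenizer
  module Whitespace
    # Split a string on runs of whitespace into tokens.
    def self.tokenize(str)
      str.split(/\s+/).reject(&:empty?)
    end
  end
end

module TokenFilter
  module Stopword
    # A tiny illustrative stopword list.
    STOPWORDS = %w[a an the of].freeze

    # Drop common words that carry little signal.
    def self.filter(tokens)
      tokens.reject { |t| STOPWORDS.include?(t.downcase) }
    end
  end
end

tokens = Tokenizer::Whitespace.tokenize("The cat sat on the mat")
tokens = TokenFilter::Stopword.filter(tokens)
# tokens => ["cat", "sat", "on", "mat"]
```

Keeping each stage behind its own module makes it straightforward to test a tokenizer or filter in isolation and to swap stages later.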

@Ch4s3 (Member) commented Jul 31, 2017

Sorry for the delay. I'll try to take a look tomorrow.

@Ch4s3 (Member) commented Nov 9, 2017

I'm super behind, but I'll get to this soon.

@Ch4s3 (Member) commented Nov 20, 2017

I finally took a deeper look at this. It looks really nice! If you're interested in continuing, I'll happily merge this once it's done and documented.

@piroor (Contributor, Author) commented Nov 21, 2017

Thank you for reviewing! My hands are full at the moment, so I'll write the documentation for this next month.

@Ch4s3 (Member) commented Nov 21, 2017

@piroor no worries, I can relate. I'll probably roll out a release soon, and catch this PR on the next one.

@ibnesayeed (Contributor)

I will keep an eye on this and will look into the code later when I get some cycles to spare.

@Ch4s3 (Member) commented Dec 15, 2017

Thanks @ibnesayeed

@piroor force-pushed the separate-tokenizer-from-hasher branch from 01528ca to 67caa82 on January 16, 2018
@piroor (Contributor, Author) commented Jan 16, 2018

Sorry for the long delay. I've added descriptions for the newly introduced (separated) classes and modules.

I've also made further changes so that the tokenizer and filters are customizable. Usage of the new options is documented in docs/bayes.md.

@Ch4s3 (Member) commented Jan 16, 2018

@piroor I'll take a look tomorrow.

@Ch4s3 (Member) commented Mar 2, 2018

This looks pretty good overall. I need to dig in a bit more once we handle #172 in the next day or so. I'll try to target this for the 2.3 release next week.

Thanks for your patience!

@Ch4s3 mentioned this pull request on Mar 2, 2018
@piroor changed the title from "Separate tokenizer from hasher (WIP)" to "Separate tokenizer from hasher" on Mar 2, 2018
@ibnesayeed (Contributor)

Finally, I got a chance to look at it today. It generally looks good to me, except for a few places where passing a method would have been easier but a module is required instead. For example, the :tokenizer and :token_filters options could accept the corresponding methods directly, rather than a module that implements methods with very specific names. Having default implementations in modules is still fine, as long as we pass methods rather than the modules, like below:

filters = [
  CatFilter.filter,
  ClassifierReborn::TokenFilters::Stopword.filter,
]
classifier = ClassifierReborn::Bayes.new tokenizer: BigramTokenizer.tokenize, token_filters: filters

This signature will make it easier to write an inline custom tokenizer or filter, while more complex ones can be wrapped in a module when necessary.

@piroor (Contributor, Author) commented Mar 5, 2018

@ibnesayeed the code you suggested won't work as you expected, because

filters = [
  CatFilter.filter,
  ClassifierReborn::TokenFilters::Stopword.filter,
]

filters is not an array of the methods themselves; it is an array of the values returned by calling those methods.
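The distinction can be shown in plain Ruby. CatFilter below is a hypothetical stand-in mirroring the snippet above; Ruby's Object#method is one way to reify a method as a passable object.

```ruby
# Calling a method vs. reifying it as an object (illustrative only).
module CatFilter
  def self.filter(tokens)
    tokens + ["cat"] # hypothetical filter behavior
  end
end

called  = CatFilter.filter([])       # invokes filter immediately => ["cat"]
reified = CatFilter.method(:filter)  # a Method object; nothing runs yet
later   = reified.call([])           # invoked on demand => ["cat"]
```

So `CatFilter.filter` in the snippet above evaluates the filter right away, rather than handing the classifier something it can call later.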

But I agree that the options should accept a lambda. I think I should rename the fixed method names tokenize and filter to call; then each option can accept both a module and a lambda.

@piroor (Contributor, Author) commented Mar 5, 2018

As of commit 958d3a0, the :tokenizer and :token_filters options accept lambdas.
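A sketch of the resulting duck-typed contract: anything responding to call works, so a module and a lambda are interchangeable. The pipeline and names below are illustrative assumptions, not the gem's actual internals.

```ruby
# A module that satisfies the `call` contract (illustrative stand-in).
module BigramTokenizer
  def self.call(str)
    words = str.split(/\s+/)
    words.each_cons(2).map { |a, b| "#{a} #{b}" }
  end
end

# A filter expressed as a plain lambda: same contract, no module needed.
downcase_filter = ->(tokens) { tokens.map(&:downcase) }

# Hypothetical pipeline runner showing how such options could be consumed.
def run_pipeline(text, tokenizer:, token_filters: [])
  token_filters.reduce(tokenizer.call(text)) { |toks, f| f.call(toks) }
end

result = run_pipeline("Good Morning Vietnam",
                      tokenizer: BigramTokenizer,
                      token_filters: [downcase_filter])
# result => ["good morning", "morning vietnam"]
```

Standardizing on call lets simple inline lambdas and full modules coexist behind one interface.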

@piroor force-pushed the separate-tokenizer-from-hasher branch from 3eeae4e to 81824f5 on March 5, 2018
@Ch4s3 (Member) commented Mar 5, 2018

This looks good to me.

@ibnesayeed (Contributor)

The code LGTM! (I have not tested it though).

@Ch4s3 merged commit 605b261 into jekyll:master on Mar 5, 2018
@piroor (Contributor, Author) commented Mar 5, 2018

Thanks!!

@Ch4s3 (Member) commented Mar 5, 2018

Thanks for the contribution!


filters = [
CatFilter,
white_filter

@piroor, are we missing a comma at the end here? Also, the last element of the array has an unnecessary comma.
