Add unit tests for tokenizers and filters #1156

Merged: 2 commits into main, Sep 27, 2021

Conversation

mocobeta (Contributor)

Some tokenizers/filters seem to have no unit tests (e.g. SimpleTokenizer and StopWordFilter).
I think it would be nice to add basic tests for them to support future development. To start with, I added a test module for SimpleTokenizer; does that make sense?
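Not the exact code in this PR, but roughly the shape such a test takes — a minimal sketch assuming the Tokenizer/TokenStream API tantivy had at the time (the test name and input string are illustrative):

```rust
use tantivy::tokenizer::{SimpleTokenizer, TokenStream, Tokenizer};

// Illustrative sketch, not the PR's actual test code.
#[test]
fn test_simple_tokenizer() {
    // SimpleTokenizer splits on non-alphanumeric characters.
    let mut stream = SimpleTokenizer.token_stream("Hello, happy tax payer!");
    let mut texts = Vec::new();
    while stream.advance() {
        texts.push(stream.token().text.clone());
    }
    assert_eq!(texts, vec!["Hello", "happy", "tax", "payer"]);
}
```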

@PSeitz (Contributor) commented Sep 20, 2021

@mocobeta Covering these with tests is a really good idea.

@mocobeta (Contributor, Author)

@PSeitz thanks for your reply. I'll try to add tests for the other components as well.

@fulmicoton (Collaborator)

@mocobeta Awesome. I'm happy to merge this, but the PR is still in draft state.

@mocobeta (Contributor, Author)

Thanks, @fulmicoton. I'd like to include a few more tests for the other tokenizers, and then I'll mark this ready for review soon.

@mocobeta mocobeta marked this pull request as ready for review September 24, 2021 14:45
@codecov bot commented Sep 24, 2021

Codecov Report

Merging #1156 (1a9d2d2) into main (2c78b31) will increase coverage by 0.08%.
The diff coverage is 100.00%.

[Impacted file tree graph]

@@            Coverage Diff             @@
##             main    #1156      +/-   ##
==========================================
+ Coverage   93.78%   93.86%   +0.08%     
==========================================
  Files         203      203              
  Lines       33654    33994     +340     
==========================================
+ Hits        31561    31910     +349     
+ Misses       2093     2084       -9     
Impacted Files Coverage Δ
src/tokenizer/alphanum_only.rs 90.00% <100.00%> (+90.00%) ⬆️
src/tokenizer/lower_caser.rs 100.00% <100.00%> (ø)
src/tokenizer/raw_tokenizer.rs 100.00% <100.00%> (ø)
src/tokenizer/remove_long.rs 100.00% <100.00%> (+3.84%) ⬆️
src/tokenizer/simple_tokenizer.rs 100.00% <100.00%> (ø)
src/tokenizer/stop_word_filter.rs 77.94% <100.00%> (+23.17%) ⬆️
src/tokenizer/whitespace_tokenizer.rs 92.45% <100.00%> (+4.21%) ⬆️
src/indexer/segment_updater.rs 90.90% <0.00%> (-1.38%) ⬇️
src/directory/watch_event_router.rs 95.41% <0.00%> (-0.77%) ⬇️
src/postings/stacker/expull.rs 93.44% <0.00%> (-0.44%) ⬇️
... and 16 more

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2c78b31...1a9d2d2.

@mocobeta (Contributor, Author)

I think this covers all existing tokenizers/filters. The tests added here are quite basic, but I hope they will be of some help.
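As an illustration of what a filter test looks like, a stop-word test can chain a tokenizer and a filter through TextAnalyzer — again a minimal sketch under the same API assumptions as above, with a hypothetical input:

```rust
use tantivy::tokenizer::{SimpleTokenizer, StopWordFilter, TextAnalyzer, TokenStream};

// Illustrative sketch, not the PR's actual test code.
#[test]
fn test_stop_word_filter() {
    // Chain SimpleTokenizer with a StopWordFilter that drops "the" and "is".
    let analyzer = TextAnalyzer::from(SimpleTokenizer)
        .filter(StopWordFilter::remove(vec!["the".to_string(), "is".to_string()]));
    let mut stream = analyzer.token_stream("the answer is 42");
    let mut texts = Vec::new();
    while stream.advance() {
        texts.push(stream.token().text.clone());
    }
    assert_eq!(texts, vec!["answer", "42"]);
}
```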

@fulmicoton (Collaborator)

@mocobeta It definitely helps! Thank you!

@fulmicoton fulmicoton merged commit 74e36c7 into quickwit-oss:main Sep 27, 2021