Add unit tests for tokenizers and filters #1156

Merged: 2 commits into main, Sep 27, 2021

Conversation

mocobeta (Contributor)

Some tokenizers/filters seem to have no unit tests (e.g. SimpleTokenizer and StopWordFilter).
I think it would be nice to add basic tests for them to support future development. To start with, I added a test module for SimpleTokenizer; does that make sense?
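Not the exact code in this PR, but roughly the shape such a test takes — a minimal sketch assuming the Tokenizer/TokenStream API tantivy had at the time (the test name and input string are illustrative):

```rust
use tantivy::tokenizer::{SimpleTokenizer, TokenStream, Tokenizer};

// Illustrative sketch, not the PR's actual test code.
#[test]
fn test_simple_tokenizer() {
    // SimpleTokenizer splits on non-alphanumeric characters.
    let mut stream = SimpleTokenizer.token_stream("Hello, happy tax payer!");
    let mut texts = Vec::new();
    while stream.advance() {
        texts.push(stream.token().text.clone());
    }
    assert_eq!(texts, vec!["Hello", "happy", "tax", "payer"]);
}
```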

@PSeitz (Contributor) commented Sep 20, 2021

@mocobeta Covering these with tests is a really good idea.

@mocobeta (Contributor, Author)

@PSeitz thanks for your reply. I'll try to add tests for the other components as well.

@fulmicoton (Collaborator)

@mocobeta Awesome. I'm happy to merge this, but the PR is still in draft state.

@mocobeta (Contributor, Author)

Thanks, @fulmicoton. I'd like to include a few more tests for the other tokenizers, and then I'll mark this ready for review soon.

@mocobeta mocobeta marked this pull request as ready for review September 24, 2021 14:45
@codecov bot commented Sep 24, 2021

Codecov Report

Merging #1156 (1a9d2d2) into main (2c78b31) will increase coverage by 0.08%.
The diff coverage is 100.00%.

[Impacted file tree graph]

@@            Coverage Diff             @@
##             main    #1156      +/-   ##
==========================================
+ Coverage   93.78%   93.86%   +0.08%     
==========================================
  Files         203      203              
  Lines       33654    33994     +340     
==========================================
+ Hits        31561    31910     +349     
+ Misses       2093     2084       -9     
Impacted Files Coverage Δ
src/tokenizer/alphanum_only.rs 90.00% <100.00%> (+90.00%) ⬆️
src/tokenizer/lower_caser.rs 100.00% <100.00%> (ø)
src/tokenizer/raw_tokenizer.rs 100.00% <100.00%> (ø)
src/tokenizer/remove_long.rs 100.00% <100.00%> (+3.84%) ⬆️
src/tokenizer/simple_tokenizer.rs 100.00% <100.00%> (ø)
src/tokenizer/stop_word_filter.rs 77.94% <100.00%> (+23.17%) ⬆️
src/tokenizer/whitespace_tokenizer.rs 92.45% <100.00%> (+4.21%) ⬆️
src/indexer/segment_updater.rs 90.90% <0.00%> (-1.38%) ⬇️
src/directory/watch_event_router.rs 95.41% <0.00%> (-0.77%) ⬇️
src/postings/stacker/expull.rs 93.44% <0.00%> (-0.44%) ⬇️
... and 16 more

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2c78b31...1a9d2d2.

@mocobeta (Contributor, Author)

I think this covers all existing tokenizers/filters. The tests added here are quite basic, but I hope they will be of some help.
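As an illustration of what a filter test looks like, a stop-word test can chain a tokenizer and a filter through TextAnalyzer — again a minimal sketch under the same API assumptions as above, with a hypothetical input:

```rust
use tantivy::tokenizer::{SimpleTokenizer, StopWordFilter, TextAnalyzer, TokenStream};

// Illustrative sketch, not the PR's actual test code.
#[test]
fn test_stop_word_filter() {
    // Chain SimpleTokenizer with a StopWordFilter that drops "the" and "is".
    let analyzer = TextAnalyzer::from(SimpleTokenizer)
        .filter(StopWordFilter::remove(vec!["the".to_string(), "is".to_string()]));
    let mut stream = analyzer.token_stream("the answer is 42");
    let mut texts = Vec::new();
    while stream.advance() {
        texts.push(stream.token().text.clone());
    }
    assert_eq!(texts, vec!["answer", "42"]);
}
```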

@fulmicoton (Collaborator)

@mocobeta It definitely helps! Thank you!

@fulmicoton fulmicoton merged commit 74e36c7 into quickwit-oss:main Sep 27, 2021