Benchmark against pragmatic segmenter #2
Good point. When I developed the first version, segtok, there were no good benchmark datasets for sentence segmentation around that sufficiently covered the tricky cases this library handles. That is, all I found were examples of trivial sentence segmentation problems that virtually any statistical tagger can do well on, too. But if someone has a pointer to a really tough test set with things like author abbreviations, enumerations, typos, mathematical and scientific content, and/or social-domain text (which might abuse sentence-terminal markers), that would be worth adding. Otherwise, I think the 50+ test cases I have collected as examples of such problems are my current "benchmark": I haven't found a single other library that can handle all of them.
That being said, what I am currently neither interested in nor have time for is comparing my library manually against another, case by case. So if someone wants to fulfill the specific request made by Immortalin here (or you yourself?), please feel free to make that comparison. I am sure either library will have its particular strengths. For an unbiased comparison, though, what matters more is an impartial sentence segmentation dataset that covers the trickier cases we find in the wild.
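To make the idea of an impartial benchmark concrete, here is a minimal, stdlib-only sketch of a scoring harness. It is not syntok's API; the naive regex splitter and the tiny hand-made gold set (modelled on the "author abbreviations" case mentioned above) are hypothetical placeholders showing how any segmenter could be scored against gold segmentations:

```python
import re

# Deliberately naive baseline: split after sentence-terminal punctuation.
# It illustrates why trivial splitters fail on tricky cases and why a
# shared gold dataset is needed for fair comparison.
def naive_split(text):
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Tiny hand-made gold set in the spirit of the tricky cases discussed
# above; a real benchmark would need far broader coverage.
GOLD = [
    ("Simple case. Two sentences.",
     ["Simple case.", "Two sentences."]),
    ("See Smith et al. (2020) for details. It holds.",
     ["See Smith et al. (2020) for details.", "It holds."]),
]

def accuracy(segmenter, gold):
    # Exact-match accuracy over whole documents: a segmentation counts
    # only if every sentence boundary is correct.
    correct = sum(1 for text, expected in gold if segmenter(text) == expected)
    return correct / len(gold)

if __name__ == "__main__":
    # The baseline handles the trivial case but wrongly splits after
    # the abbreviation "al.", so it scores 0.5 on this tiny set.
    print(f"naive baseline accuracy: {accuracy(naive_split, GOLD):.2f}")
```

Any candidate segmenter (syntok, pySBD, the Pragmatic Segmenter behind a wrapper) could be dropped in place of `naive_split` to get a comparable score.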
Another interesting tool to compare/benchmark against: https://github.com/nipunsadvilkar/pySBD Note that pySBD is supposedly based on the Pragmatic Segmenter.
For my use case, syntok works just perfectly. Thanks @fnl for this project!
No description provided.