German ordinal numbers lead to over splitting #28
Comments
Hi Nick, thanks for the kind words, and glad you find syntok useful. Yes, agreed: for the German month-based date cases, setting up a few rules should be easy, and I will try to do that as soon as possible. For other ordinals, if we find enough hard data to support a sensible rule (like a number followed by a terminal and a sequence of upper-case letters, e.g. "1. FC"), such rules could be added, in theory. You have probably figured this out by now, but the last example you showed does not over-split simply because the terminal is followed by a lower-case letter.
Hey Florian, thank you for your comprehensive explanation. Agreed on performance: I was trying an ML-based tool for this too, which got one or two of these right but was much slower. Yes, dates would be low-hanging fruit and would handle many cases. Also, two successive uppercase letters mostly indicate a proper noun in all targeted languages, but it depends on which case happens more often. In German it's rather difficult to construct a (false positive) sentence that ends with a number while the next one starts with a proper noun. "Das ist nun Sieg Nr 3. FC-Köln-Fans sind außer sich." is possible, but sounds a bit made-up 😄 For our case it's not super important to get everything 100% right, as we're feeding an ML tool with raw masses of sentences anyway, and a tiny number of wrongly split sentences will not cause us any headaches. Thanks!
I queried the English Wikipedia with the following regex:
Overall in English, with this pattern, I can only find the "1. FC" case that should not be split, but more importantly, a number of cases that should be split.
Therefore, it might be worth elevating the specific expression "1. FC" to a special no-split rule, as well as handling the day of the month, dot, name of month case for German month names. Any other thoughts or ideas? Any good ways to prove a different viewpoint?
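To make the two proposed no-split rules concrete, here is a minimal sketch written as a post-processing step over sentence strings that have already been split. It is not syntok's API and not necessarily how a fix would be implemented there: the function name `merge_ordinal_splits`, the month list, and the regular expressions are all assumptions for illustration.

```python
import re

# German month names (illustrative list; abbreviations are not covered).
GERMAN_MONTHS = (
    "Januar", "Februar", "März", "April", "Mai", "Juni", "Juli",
    "August", "September", "Oktober", "November", "Dezember",
)

# Fragment ends with a day of the month (1-31) followed by a dot.
ENDS_WITH_DAY = re.compile(r"(?:^|\s)(?:[1-9]|[12]\d|3[01])\.$")
# Fragment starts with a German month name as a whole word.
STARTS_WITH_MONTH = re.compile(r"^(?:%s)\b" % "|".join(GERMAN_MONTHS))
# Special case for the expression "1. FC".
ENDS_WITH_ONE = re.compile(r"(?:^|\s)1\.$")
STARTS_WITH_FC = re.compile(r"^FC\b")


def merge_ordinal_splits(sentences):
    """Re-join neighbours that were split at '<day>. <Month>' or '1. FC'."""
    merged = []
    for sentence in sentences:
        if merged:
            prev = merged[-1]
            date_case = ENDS_WITH_DAY.search(prev) and STARTS_WITH_MONTH.match(sentence)
            club_case = ENDS_WITH_ONE.search(prev) and STARTS_WITH_FC.match(sentence)
            if date_case or club_case:
                # Hypothetical policy: glue the two fragments with one space.
                merged[-1] = prev + " " + sentence
                continue
        merged.append(sentence)
    return merged
```

Because only "1." followed by "FC" is treated as a club name, a pair such as "Das ist nun Sieg Nr 3." and "FC-Köln-Fans sind außer sich." stays split. A sentence that really ends in "1." and is followed by one starting with "FC" would still be merged incorrectly, so this remains a heuristic.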
Maybe one would have to add a simple language detection algorithm to properly solve this case while still being open to any language that uses the Latin alphabet?
Cool, didn't know one could regex search Wikipedia. The first sentences could easily appear in German too, though.
Therefore, I think the suggested rule (do not split when two uppercase letters follow) would cause false negatives in any language that allows a sentence to start with its subject without an article, where that subject is a proper noun. The month names after a dot, though, appear quite often in most Latin-alphabet languages and deserve special treatment. I have to admit, I'm not proficient enough in Python (currently) to write a PR myself.
No worries, I can make those changes. However, my plate is quite full right now, with multiple issues pending, so it might take a week or two until I have a fix for this out there. Hope that's not a problem for you!
Of course not, I'm grateful this library exists at all!
Hi, just want to mention that I've also stumbled upon this issue. I have a concern about a "simple language detection" algorithm, as it can be quite tricky to detect a language. I would prefer to pass the language as a parameter into sentence segmentation, as I already know the language of the sentences I want to split.
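As a rough sketch of what passing the language explicitly could look like, the check below takes a language code and looks up language-specific month names instead of guessing the language from the text. Everything here is hypothetical: the `is_ordinal_date` helper, the `MONTHS_AFTER_ORDINAL` table, and the language codes are illustrative assumptions, and syntok does not necessarily expose such a parameter.

```python
import re

# Hypothetical per-language table: month names that may follow "<day>."
# without starting a new sentence. Only German is filled in here.
MONTHS_AFTER_ORDINAL = {
    "de": (
        "Januar", "Februar", "März", "April", "Mai", "Juni", "Juli",
        "August", "September", "Oktober", "November", "Dezember",
    ),
    "en": (),  # assumption: no dotted-ordinal date rule for English
}

# Left fragment ends with a day of the month (1-31) followed by a dot.
DAY_WITH_DOT = re.compile(r"(?:^|\s)(?:[1-9]|[12]\d|3[01])\.$")


def is_ordinal_date(left, right, language):
    """True if `left` ends in '<day>.' and `right` starts with a month name
    listed for `language`; unknown languages never suppress a split."""
    months = MONTHS_AFTER_ORDINAL.get(language, ())
    if not months or not DAY_WITH_DOT.search(left):
        return False
    words = right.split()
    first_word = words[0].rstrip(".,;:!?") if words else ""
    return first_word in months
```

A caller who already knows the input is German would pass `"de"` and suppress the split only when this check returns `True`; for any other language code the default splitting behaviour is left untouched.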
This is a German text containing ordinal numbers. (The original text passed to syntok does not contain \n; the line breaks were just added for readability here.) It is split into the following parts:
I understand that this is very difficult to get right in German, where an uppercase word can follow the ordinal number.
Dates like "3. Juni" might be detectable, though. Interestingly, the last part does not split at "Friedrich II.", unlike all the other examples.
Besides that, syntok seems to be a sublime sentence splitter for German, thank you for this. 🙏