Apostrophes at start or end of word seem to mess up the segmenter #9
Comments
Ooh, this is a tricky one 🙈 The segmenter assumes the apostrophes form a quoted section, and the library doesn't do sentence boundary detection inside quoted sections. I've been mulling over rules to detect this situation, but it's a hard one.
I am honestly not sure either. A ' at the end of a word, as in thinkin', could be treated as a normal character. However, it is probably not realistic to tell the difference between a quoted section and excessive ' usage.
The trouble is Currently the library doesn't do quote pair detection; it just has a set of regexes for probable quoted pairs and naively ignores sentence breaks between pairs, I think. By distinguishing between open and close quotes we'd at least detect the first
An alternative that just occurred to me is if the initial character following the quote is lowercase, as in
If the first quote seen is not preceded by a whitespace character, we could assume it is not the start of a quote. For the end of a quote, we could use the lowercase solution you mentioned.
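To make the two heuristics discussed above concrete, here is a rough Python sketch. It is not the library's code: `classify_apostrophes` and its labels are hypothetical names, and the reading of the "lowercase" rule (an apostrophe followed by a lowercase letter is treated as part of a word rather than as a quote mark) is an assumption about what was meant. It will misclassify genuine quotes that start with a lowercase word or are followed by one, so treat it as a starting point only.

```python
def classify_apostrophes(text: str) -> list[tuple[int, str]]:
    """Label each ' in text as 'opening', 'closing', or 'literal'.

    Hypothetical helper illustrating the heuristics from this thread;
    it is not part of the library.
    """
    labels = []
    open_pending = False  # True while an opening quote awaits its close
    for i, ch in enumerate(text):
        if ch != "'":
            continue
        prev = text[i - 1] if i > 0 else " "  # treat start of text as whitespace
        rest = text[i + 1:]
        nxt = rest[:1] or " "                 # treat end of text as whitespace
        if prev.isalnum() and nxt.isalnum():
            label = "literal"                 # word-internal: don't, I've
        elif prev.isspace():
            # Candidate opening quote: only quotes preceded by whitespace qualify.
            if nxt.islower():
                label = "literal"             # assumed elision: 'bout
            else:
                label = "opening"
                open_pending = True
        else:
            # Candidate closing quote: preceded by a non-whitespace character.
            following = rest.lstrip()[:1]     # first non-space character after it
            if not open_pending or following.islower():
                label = "literal"             # dropped letter: thinkin', likin'
            else:
                label = "closing"
                open_pending = False
        labels.append((i, label))
    return labels


if __name__ == "__main__":
    sample = "was thinkin' 'bout returning it"
    print(classify_apostrophes(sample))
```

A segmenter could then suppress sentence breaks only between an `opening` and its matching `closing`, and treat every `literal` apostrophe as an ordinary character.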
Given this text
When it first arrived, I thought it was huge, and was thinkin' 'bout returning it, even though it is the size they say it is, It just seemed really large in person. I kept it and started using it. It is very easy to use with the instruction manual in hand, and I don't need that anymore for the things I do. I've scanned, copied, enlarged and printed double sided. All verry intuitive now. Prints clean and clear, bought a two pack of extra capacity black ink cartridges from Epson, delivered they were only $37, which I thought was reasonable, and it doesn't even look that big anymore. I am likin' it more all the time, and real happy with my choice.
If I convert the last "likin'" to "likin" then it segments into 2 phrases.
If I convert the first "thinkin'" to "thinkin" it segments to 1 phrase.
If I convert the first "'bout" to "bout" then it segments to 7 phrases.