Question marks at the end swallowed #39
Comments
@danielkingai2 : Is the given example, "Fig. ??", an actual sentence from some text? Seems like acronym with
I would need to add |
It can result from text that is parsed from a pdf. The problem is that it becomes challenging to use the output if the output text doesn't match the input text (even if the sentence splitting is wrong, it is better to retain the original text). It looks like it only happens when the ?? is at the end of the input sequence and after a sentence split by pysbd (see examples below). Would it be easier to add a case to handle a sequence ending with ?? rather than special casing the abbreviation?
|
Yes, I agree. Can you please try adding the following line after https://github.com/nipunsadvilkar/pySBD/blob/master/pysbd/processor.py#L182: txt = re.sub(r'☇$', '??', txt) |
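For readers following along, here is a minimal standalone sketch of what that one-line substitution does. It assumes `☇` is the internal placeholder that stands in for `??` during segmentation; the function name `restore_trailing_double_question` is hypothetical and is not part of pysbd.

```python
import re


def restore_trailing_double_question(txt: str) -> str:
    """Turn a placeholder at the very end of the text back into '??'.

    Assumption: the character below is the stand-in used for '??' while the
    text is being segmented. The helper name is made up for illustration.
    """
    # `$` anchors the match to the end of the string, so only a trailing
    # placeholder is restored; one in the middle of the text is left alone.
    return re.sub(r'☇$', '??', txt)


print(restore_trailing_double_question("See Fig. ☇"))
print(restore_trailing_double_question("See Fig. ☇ for details."))
```

Because the pattern is anchored with `$`, this only covers a placeholder at the end of the input, which is the case described above. A placeholder that ends a sentence in the middle of the text would not be touched by this line alone.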
seems better with that, at least it doesn't truncate the text
|
Yes, I really need to come up with some assertion logic to map the respective sentences back to the original text. This is the main reason why I've been working on the https://github.com/nipunsadvilkar/pySBD/tree/sentence-char-span branch, because even if pysbd fails to find proper sentence. |
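As a rough illustration of the kind of check being discussed, the sketch below walks through the original text and reports any non-whitespace text that none of the output sentences cover. It is not pysbd code and not what the sentence-char-span branch does; `find_lost_text` is a hypothetical helper that works on any list of sentence strings.

```python
from typing import List


def find_lost_text(original: str, sentences: List[str]) -> List[str]:
    """Return the chunks of `original` that no sentence accounts for.

    Hypothetical helper: sentences are expected to appear in order, and
    whitespace-only gaps between them are ignored.
    """
    lost = []
    cursor = 0
    for sent in sentences:
        stripped = sent.strip()
        idx = original.find(stripped, cursor)
        if idx == -1:
            # The sentence text was altered, so it cannot be mapped back.
            raise ValueError(f"sentence not found in original: {stripped!r}")
        gap = original[cursor:idx]
        if gap.strip():
            lost.append(gap)
        cursor = idx + len(stripped)
    # Anything left after the last sentence was dropped from the output.
    tail = original[cursor:]
    if tail.strip():
        lost.append(tail)
    return lost


# A truncated output leaves the trailing question marks unaccounted for.
print(find_lost_text("See Fig. ??", ["See Fig."]))
```

A check like this could run in a test suite over the segmenter's output, so that swallowed characters show up as a failure instead of silently changing the text.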
In the meantime, would you mind merging that fix you came up with? |
Yes, sure! And sorry for the multiple bugs; I ported it from https://github.com/diasks2/pragmatic_segmenter to Python, and my main criterion was getting the golden rules tests to pass. Some of the issues (better to call them edge cases) which you created earlier are not accounted for in pragmatic_segmenter, though I have tried to fix them in pysbd. Thanks for making pysbd more robust :) |
No worries, I totally understand that you've ported the Ruby gem! Definitely appreciate your responsiveness. Working with PDF-parsed text causes all kinds of edge cases. |
Yeah, I concur. I work with a lot of OCR text myself, so I know the pain of unformatted text |
@danielkingai2 : Fixed above bug & have released |
Nice! |
Looks like the example with just question marks is good now:
but the example with double question marks as a token at the end of a sentence still loses the question marks:
looks like this is the minimal repro: