
Question marks at the end swallowed #39

Closed
dakinggg opened this issue Oct 24, 2019 · 11 comments
Labels
bug edge-cases update rules to account for the edge cases

Comments

@dakinggg
Contributor

Looks like the example with just question marks is good now:

>>> segmenter.segment("??")
['??']

but the example with double question marks as a token at the end of a sentence still loses the question marks:

>>> segmenter.segment("T stands for the vector transposition. As shown in Fig. ??")
['T stands for the vector transposition.', 'As shown in Fig.']

Looks like this is the minimal repro:

>>> segmenter.segment("Fig. ??")
['Fig.']
@nipunsadvilkar
Owner

nipunsadvilkar commented Oct 24, 2019

@danielkingai2: Is the given example - "Fig. ??" - an actual sentence from some text? An abbreviation followed by ?? seems unlikely. I tried the following:

In [2]: seg.segment('Fig. 20??')                                                                                                          
Out[2]: ['Fig.', '20??']

In [3]: seg.segment('Fig. 20 ??')                                                                                                         
Out[3]: ['Fig.', '20 ??']

In [4]: seg.segment('Fig. ??')                                                                                                            
Out[4]: ['Fig.']

I would need to add the fig abbreviation to keep the above tokens intact. Will look into it.

@dakinggg
Contributor Author

dakinggg commented Oct 24, 2019

It can result from text that is parsed from a PDF. The problem is that it becomes challenging to use the output if the output text doesn't match the input text (even if the sentence splitting is wrong, it is better to retain the original text). It looks like it only happens when the ?? is at the end of the input sequence and after a sentence split by pysbd (see examples below). Would it be easier to add a case to handle a sequence ending with ?? rather than special-casing the abbreviation?

>>> segmenter.segment("This text talks about Fig. ??. It is a figure.")
['This text talks about Fig.', '??.', 'It is a figure.']
>>> segmenter.segment("This text talks about Fig. ??.")
['This text talks about Fig.', '??.']
>>> segmenter.segment("This text talks about Fig. ?? .")
['This text talks about Fig.', '?? .']
>>> segmenter.segment("This text talks about Fig. ?? which is a figure.")
['This text talks about Fig.', '?? which is a figure.']
>>> segmenter.segment("This text talks about Fig ??")
['This text talks about Fig ??']
>>> segmenter.segment("This text talks about Fig. ??")
['This text talks about Fig.']
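The invariant described above (segmentation may split wrongly, but must never lose characters of the input) can be sketched as a simple check. This is a minimal sketch with my own function name, not part of pysbd's API:

```python
def segments_preserve_text(original: str, segments: list) -> bool:
    """Check that joining the segments loses no characters of the input.

    Whitespace is ignored in the comparison, since segmenters commonly
    trim spaces at sentence boundaries.
    """
    joined = "".join("".join(s.split()) for s in segments)
    return joined == "".join(original.split())

# The truncating case from this issue fails the check:
#   segments_preserve_text("Fig. ??", ["Fig."])        -> False
# while a non-lossy (even if wrongly split) result passes:
#   segments_preserve_text("Fig. ??", ["Fig.", "??"])  -> True
```

A check like this could serve as the "assertion logic" for mapping output back to input that is discussed later in this thread.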

@nipunsadvilkar
Owner

Yes, I agree.

Can you please try adding the following line after https://github.com/nipunsadvilkar/pySBD/blob/master/pysbd/processor.py#L182

txt = re.sub(r'☇$', '??', txt)
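For context, a standalone sketch of what that substitution does (assuming, per the line above, that ☇ is a placeholder pysbd substitutes for "??" internally while processing):

```python
import re

def restore_trailing_double_question(txt: str) -> str:
    # If the internal placeholder survives at the very end of the text,
    # restore the original "??" so the output matches the input.
    return re.sub(r'☇$', '??', txt)

# restore_trailing_double_question("Fig. ☇")  -> "Fig. ??"
# Text without a trailing placeholder is left untouched.
```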

@dakinggg
Contributor Author

Seems better with that; at least it doesn't truncate the text:

>>> segmenter.segment("This text talks about Fig. ??")
['This text talks about Fig.', '??']
>>> segmenter.segment("This text talks about Fig ??")
['This text talks about Fig ?', '?']

@nipunsadvilkar
Owner

Yes, I really need to come up with some assertion logic to map the respective sentences back to the original text. This is the main reason I've been working on the https://github.com/nipunsadvilkar/pySBD/tree/sentence-char-span branch: even if pysbd fails to find a proper sentence boundary, tok.is_sent_start would remain False and we would still get the original text at the end.
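A minimal sketch of the char-span idea (my own names, not the branch's actual API): if each sentence is represented as a (start, end) character span into the original text, joining the slices reproduces the input exactly, so characters can never be silently dropped.

```python
def reconstruct_from_spans(text: str, spans: list) -> str:
    # Each span is a (start, end) pair into the original text; when the
    # spans tile the whole input, joining the slices reproduces it exactly.
    return "".join(text[start:end] for start, end in spans)

# Example: two spans tiling "Fig. ??" as pysbd might split it.
# reconstruct_from_spans("Fig. ??", [(0, 5), (5, 7)])  -> "Fig. ??"
```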

@dakinggg
Contributor Author

In the meantime, would you mind merging that fix you came up with?

@nipunsadvilkar
Owner

nipunsadvilkar commented Oct 24, 2019

Yes, sure!

And sorry for the multiple bugs; I ported pysbd from https://github.com/diasks2/pragmatic_segmenter to Python, and my main criterion was getting the Golden Rules tests to pass.

Some of the issues (better to call them edge cases) you created earlier are not accounted for in pragmatic_segmenter, though I have tried to fix them in pysbd. Thanks for making pysbd more robust :)

@dakinggg
Contributor Author

No worries, I totally understand that you've ported the Ruby gem! I definitely appreciate your responsiveness. Working with PDF-parsed text causes all kinds of edge cases.

@nipunsadvilkar
Owner

Yeah, I concur. I myself work with a lot of OCR text, so I know the pain of unformatted text.

@nipunsadvilkar
Owner

@danielkingai2: Fixed the above bug and released the char-span functionality today. I didn't release it yesterday since I wanted to add tests and update the docs.

@dakinggg
Contributor Author

Nice!

@nipunsadvilkar nipunsadvilkar added the edge-cases update rules to account for the edge cases label Oct 25, 2019