
Question marks at the end swallowed #39

Closed
dakinggg opened this issue Oct 24, 2019 · 11 comments
Labels
bug edge-cases update rules to account for the edge cases

Comments

@dakinggg
Contributor

Looks like the example with just question marks is good now:

>>> segmenter.segment("??")
['??']

but the example with double question marks as a token at the end of a sentence still loses the question marks:

>>> segmenter.segment("T stands for the vector transposition. As shown in Fig. ??")
['T stands for the vector transposition.', 'As shown in Fig.']

Looks like this is the minimal repro:

>>> segmenter.segment("Fig. ??")
['Fig.']
@nipunsadvilkar
Owner

nipunsadvilkar commented Oct 24, 2019

@danielkingai2: Is the given example - "Fig. ??" - an actual sentence from some text? An abbreviation followed by ?? seems unlikely. I tried the following:

In [2]: seg.segment('Fig. 20??')                                                                                                          
Out[2]: ['Fig.', '20??']

In [3]: seg.segment('Fig. 20 ??')                                                                                                         
Out[3]: ['Fig.', '20 ??']

In [4]: seg.segment('Fig. ??')                                                                                                            
Out[4]: ['Fig.']

I would need to add the fig abbreviation to keep the above tokens intact. Will look into it.

@dakinggg
Contributor Author

dakinggg commented Oct 24, 2019

It can result from text that is parsed from a PDF. The problem is that it becomes challenging to use the output if the output text doesn't match the input text (even if the sentence splitting is wrong, it is better to retain the original text). It looks like it only happens when the ?? is at the end of the input sequence and after a sentence split by pysbd (see examples below). Would it be easier to add a case to handle a sequence ending with ?? rather than special-casing the abbreviation?

>>> segmenter.segment("This text talks about Fig. ??. It is a figure.")
['This text talks about Fig.', '??.', 'It is a figure.']
>>> segmenter.segment("This text talks about Fig. ??.")
['This text talks about Fig.', '??.']
>>> segmenter.segment("This text talks about Fig. ?? .")
['This text talks about Fig.', '?? .']
>>> segmenter.segment("This text talks about Fig. ?? which is a figure.")
['This text talks about Fig.', '?? which is a figure.']
>>> segmenter.segment("This text talks about Fig ??")
['This text talks about Fig ??']
>>> segmenter.segment("This text talks about Fig. ??")
['This text talks about Fig.']
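The invariant described above (segmentation may split wrongly, but must never lose characters of the input) can be sketched as a simple check. This is a minimal sketch with my own function name, not part of pysbd's API:

```python
def segments_preserve_text(original: str, segments: list) -> bool:
    """Check that joining the segments loses no characters of the input.

    Whitespace is ignored in the comparison, since segmenters commonly
    trim spaces at sentence boundaries.
    """
    joined = "".join("".join(s.split()) for s in segments)
    return joined == "".join(original.split())

# The truncating case from this issue fails the check:
#   segments_preserve_text("Fig. ??", ["Fig."])        -> False
# while a non-lossy (even if wrongly split) result passes:
#   segments_preserve_text("Fig. ??", ["Fig.", "??"])  -> True
```

A check like this could serve as the "assertion logic" for mapping output back to input that is discussed later in this thread.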

@nipunsadvilkar
Owner

Yes, I agree.

Can you please try adding the following line after https://github.com/nipunsadvilkar/pySBD/blob/master/pysbd/processor.py#L182

txt = re.sub(r'☇$', '??', txt)
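For context, a standalone sketch of what that substitution does (assuming, per the line above, that ☇ is a placeholder pysbd substitutes for "??" internally while processing):

```python
import re

def restore_trailing_double_question(txt: str) -> str:
    # If the internal placeholder survives at the very end of the text,
    # restore the original "??" so the output matches the input.
    return re.sub(r'☇$', '??', txt)

# restore_trailing_double_question("Fig. ☇")  -> "Fig. ??"
# Text without a trailing placeholder is left untouched.
```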

@dakinggg
Contributor Author

Seems better with that; at least it doesn't truncate the text:

>>> segmenter.segment("This text talks about Fig. ??")
['This text talks about Fig.', '??']
>>> segmenter.segment("This text talks about Fig ??")
['This text talks about Fig ?', '?']

@nipunsadvilkar
Owner

Yes, I really need to come up with some assertion logic to map the respective sentences back to the original text. This is the main reason I've been working on the https://github.com/nipunsadvilkar/pySBD/tree/sentence-char-span branch: even if pysbd fails to find a proper sentence boundary, tok.is_sent_start would remain False and we would still get the original text at the end.
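A minimal sketch of the char-span idea (my own names, not the branch's actual API): if each sentence is represented as a (start, end) character span into the original text, joining the slices reproduces the input exactly, so characters can never be silently dropped.

```python
def reconstruct_from_spans(text: str, spans: list) -> str:
    # Each span is a (start, end) pair into the original text; when the
    # spans tile the whole input, joining the slices reproduces it exactly.
    return "".join(text[start:end] for start, end in spans)

# Example: two spans tiling "Fig. ??" as pysbd might split it.
# reconstruct_from_spans("Fig. ??", [(0, 5), (5, 7)])  -> "Fig. ??"
```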

@dakinggg
Contributor Author

In the meantime, would you mind merging that fix you came up with?

@nipunsadvilkar
Owner

nipunsadvilkar commented Oct 24, 2019

Yes, sure!

And sorry for the multiple bugs; I ported pysbd from https://github.com/diasks2/pragmatic_segmenter to Python, and my main criterion was getting the Golden Rules tests to pass.

Some of the issues (better to call them edge cases) you created earlier are not accounted for in pragmatic_segmenter, though I have tried to fix them in pysbd. Thanks for making pysbd more robust :)

@dakinggg
Contributor Author

No worries, I totally understand that you've ported the Ruby gem! I definitely appreciate your responsiveness. Working with PDF-parsed text causes all kinds of edge cases.

@nipunsadvilkar
Owner

Yeah, I concur. I myself work with a lot of OCR text, so I know the pain of unformatted text.

@nipunsadvilkar
Owner

@danielkingai2: Fixed the above bug and released the char-span functionality today. I didn't release it yesterday since I wanted to add tests and update the docs.

@dakinggg
Contributor Author

Nice!

@nipunsadvilkar nipunsadvilkar added the edge-cases update rules to account for the edge cases label Oct 25, 2019