Handle irregularities between pySBD & pySBD + spaCy sentence output #59

nipunsadvilkar · 2020-02-12T10:01:46Z

The pySBD spaCy pipeline component takes a token-based approach: it sets is_sent_start to True or False on each token, depending on the Spans obtained from pySBD's character offsets. For each sentence, the offsets give a slice doc.text[start:end] containing the sentence text, and we pass them to the doc.char_span method to create a Span object whose first Token needs to have its is_sent_start attribute set to True. However, if the character indices don't map to a valid span (that is, they don't align with token boundaries), doc.char_span returns None and the sentence start cannot be set. This is why the sentence output of pySBD and pySBD + spaCy differs.

The inability to get a Span object from pySBD character offsets can be tackled by deconstructing the Doc object, the way the PKSHATechnology-Research/camphr authors do in get_doc_char_span, which uses destruct_token.
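To make the failure mode and the proposed remedy concrete, here is a minimal pure-Python sketch. It is not camphr's implementation and it does not call spaCy: tokens are modelled as hypothetical (start_char, end_char) tuples, strict_char_span is a simplified stand-in for the strict alignment that doc.char_span performs, and split_tokens_at is an assumed helper illustrating the idea of splitting tokens so that sentence offsets line up with token boundaries. The example text, token offsets, and sentence offsets are made up for illustration.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical stand-in for spaCy tokens: (start_char, end_char) per token.
TokenSpan = Tuple[int, int]


def strict_char_span(
    tokens: List[TokenSpan], start: int, end: int
) -> Optional[Tuple[int, int]]:
    """Simplified model of strict alignment: return the token index range
    [i, j) covering text[start:end], or None if either offset does not
    coincide with a token boundary."""
    starts = {s: i for i, (s, _) in enumerate(tokens)}
    ends = {e: i for i, (_, e) in enumerate(tokens)}
    if start in starts and end in ends and starts[start] <= ends[end]:
        return starts[start], ends[end] + 1
    return None


def split_tokens_at(
    tokens: List[TokenSpan], offsets: Iterable[int]
) -> List[TokenSpan]:
    """Split every token that straddles one of the given character offsets,
    so that each offset ends up on a token boundary."""
    cuts = sorted(set(offsets))
    result: List[TokenSpan] = []
    for s, e in tokens:
        inner = [c for c in cuts if s < c < e]
        points = [s, *inner, e]
        result.extend(zip(points, points[1:]))
    return result


text = "Hello world.This is it."
# Assumed tokenization in which "world.This" is kept as a single token.
tokens = [(0, 5), (6, 16), (17, 19), (20, 22), (22, 23)]
# Assumed sentence offsets, as a segmenter might report them.
sentences = [(0, 12), (12, 23)]

# Neither sentence aligns with the original token boundaries.
misaligned = [strict_char_span(tokens, s, e) for s, e in sentences]

# After splitting tokens at the sentence offsets, both sentences align.
boundaries = [offset for span in sentences for offset in span]
split_tokens = split_tokens_at(tokens, boundaries)
aligned = [strict_char_span(split_tokens, s, e) for s, e in sentences]

# The first token of each aligned span is where is_sent_start would be True.
sent_start_indices = [span[0] for span in aligned if span is not None]
```

In this model, the sentence boundary at offset 12 falls inside the "world.This" token, so both lookups fail until that token is split into "world." and "This". A real fix would additionally have to rebuild the Doc with the new tokens, which this sketch leaves out.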



nipunsadvilkar · 2020-06-09T17:07:21Z

Fixed in #63.

nipunsadvilkar added bug enhancement labels Feb 12, 2020

nipunsadvilkar mentioned this issue Feb 12, 2020

Different segmentation with Spacy and when using pySBD directly #55

Closed

nipunsadvilkar pinned this issue Mar 3, 2020

nipunsadvilkar mentioned this issue May 26, 2020

✨ 💫 sent char_span through with spaCy & regex & ♻️ Refactoring for more languages support #63

Merged

4 tasks

nipunsadvilkar added a commit that referenced this issue May 29, 2020

🎨 ✅ Add tests for resolved issues

68dc962

Fixes #49, #53, #55, #59

nipunsadvilkar closed this as completed Jun 9, 2020

nipunsadvilkar unpinned this issue Jun 9, 2020
