Handle irregularities between pySBD & pySBD + spaCy sentence output #59

Closed
nipunsadvilkar opened this issue Feb 12, 2020 · 1 comment

Comments

@nipunsadvilkar
Owner

The pySBD spaCy pipeline component is token-based: it sets is_sent_start to True or False on each token, based on Span objects built from pySBD's character offsets. For each sentence we call doc.char_span(start, end), where doc.text[start:end] is the sentence text, and set is_sent_start to True on the first Token of the resulting Span. However, if the character indices don't map to a valid span, doc.char_span returns None, so no sentence start can be set for that sentence. This is why the sentence output of pySBD and of pySBD + spaCy can differ.
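To make the failure mode concrete, here is a minimal sketch that does not depend on spaCy. It is a simplified stand-in for strict character-to-token alignment, not spaCy's actual implementation. The token offsets, the sentence offsets (assumed here to include the trailing space after the first sentence), and the helper name strict_char_span are all hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical example: token (start, end) character offsets for the text.
text = "Hi there. Bye now."
tokens: List[Tuple[int, int]] = [
    (0, 2),    # "Hi"
    (3, 8),    # "there"
    (8, 9),    # "."
    (10, 13),  # "Bye"
    (14, 17),  # "now"
    (17, 18),  # "."
]


def strict_char_span(
    tokens: List[Tuple[int, int]], start: int, end: int
) -> Optional[Tuple[int, int]]:
    """Return (first_token_index, last_token_index) for the character range.

    Mimics strict alignment: the result is None unless `start` is exactly
    the start of some token and `end` is exactly the end of some token.
    """
    starts = {s: i for i, (s, _) in enumerate(tokens)}
    ends = {e: i for i, (_, e) in enumerate(tokens)}
    if start not in starts or end not in ends:
        return None
    return starts[start], ends[end]


# Assumed segmenter output: the first sentence span includes the trailing space.
sentence_offsets = [(0, 10), (10, 18)]

for start, end in sentence_offsets:
    print(repr(text[start:end]), strict_char_span(tokens, start, end))
```

In this sketch the first sentence ends at offset 10, which falls in whitespace rather than at the end of a token, so the lookup returns None even though the slice text[0:10] is a perfectly good sentence. The second sentence aligns on both sides and resolves to tokens 3 through 5.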

One way to work around the missing Span objects is to deconstruct the Doc object, as the authors of PKSHATechnology-Research/camphr do in get_doc_char_span, which uses destruct_token.
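As a rough illustration of the general idea of working from token offsets instead of requiring an exact Span, the sketch below marks sentence starts using only each sentence's start offset. This is not camphr's get_doc_char_span or destruct_token, whose internals are not reproduced here. The function name sentence_start_flags, the token offsets, and the sentence offsets are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple


def sentence_start_flags(
    tokens: List[Tuple[int, int]], sent_spans: List[Tuple[int, int]]
) -> List[bool]:
    """Return one flag per token: True if a sentence is taken to start there.

    Each sentence start offset is snapped to the first token that ends
    after it. An offset inside whitespace snaps forward to the next token,
    and an offset inside a token snaps to that token. Sentence end offsets
    are ignored, so a misaligned end cannot drop a boundary.
    """
    token_ends = [end for _, end in tokens]
    flags = [False] * len(tokens)
    for sent_start, _ in sent_spans:
        idx = bisect_right(token_ends, sent_start)
        if idx < len(tokens):  # skip offsets that lie past the last token
            flags[idx] = True
    return flags


# Hypothetical offsets for the text "Hi there. Bye now."
tokens = [(0, 2), (3, 8), (8, 9), (10, 13), (14, 17), (17, 18)]
# Assumed segmenter output: the first span includes its trailing space.
sent_spans = [(0, 10), (10, 18)]

flags = sentence_start_flags(tokens, sent_spans)
print(flags)
```

In a pipeline component these flags would be written to each token's is_sent_start. The snapping is lossy: if two sentence starts land in the same token, they collapse into one boundary. Recent spaCy versions also accept an alignment_mode argument on doc.char_span ("strict", "contract" or "expand"), which may cover the same cases without custom code; whether that is available depends on the installed version.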
