Note: (XLM-)RoBERTa-based SpanMarker models require text preprocessing #23
Hey @tomaarsen, I may have a possible alternative solution: in Flair we can also construct and predict from sentences that are given by the user. For this tokenization problem we use the v1 version of https://github.com/fnl/segtok/blob/master/segtok/tokenizer.py#L210. I think this could easily be added in the Model Hub inference logic. Another alternative: instead of implementing it only on the Model Hub side, maybe it can be implemented in …
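The idea above could be sketched roughly as follows. This is a pure-stdlib stand-in for segtok-style word tokenization (the helper `simple_word_tokenize` and the commented `model.predict` call are hypothetical illustrations, not segtok's or SpanMarker's actual code):

```python
import re

def simple_word_tokenize(text: str) -> list[str]:
    """Rough stdlib approximation of segtok-style word tokenization:
    splits off punctuation marks and common English contractions."""
    # Separate contractions such as "doesn't" -> "does n't"
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text)
    # Emit "n't", runs of word characters, and single punctuation marks as tokens
    return re.findall(r"n't|\w+|[^\w\s]", text)

tokens = simple_word_tokenize("Tom doesn't live in Amsterdam, right?")
print(tokens)
# → ['Tom', 'does', "n't", 'live', 'in', 'Amsterdam', ',', 'right', '?']

# The pre-split words could then be handed to a SpanMarker model, e.g.:
# model.predict(" ".join(tokens))
```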
I'll certainly consider this approach, whether with `segtok` or something else. By default, perhaps I can apply the tokenization only if …
I've discovered that the issue only persisted for XLM-RoBERTa, and I've been able to tackle it in f2edd06!
Hello!
This is a heads up that (XLM-)RoBERTa-based SpanMarker models require text to be preprocessed to separate punctuation from words.
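For illustration, a minimal regex-based sketch of such preprocessing (the helper `separate_punctuation` is hypothetical, not part of the SpanMarker library):

```python
import re

def separate_punctuation(text: str) -> str:
    """Surround each punctuation mark with spaces so it becomes its own
    word, as (XLM-)RoBERTa-based SpanMarker models expect."""
    return re.sub(r"\s*([^\w\s])\s*", r" \1 ", text).strip()

print(separate_punctuation("Tom lives in Amsterdam, Netherlands."))
# → Tom lives in Amsterdam , Netherlands .
```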
This is a consequence of the RoBERTa tokenizer distinguishing `,` and ` ,` as different tokens, and the SpanMarker model is only familiar with the ` ,` variant.

Another alternative is to use the spaCy integration, which preprocesses the text into words for you!
The (m)BERT-based SpanMarker models do not require this preprocessing.