Note: (XLM-)RoBERTa-based SpanMarker models require text preprocessing #23
Hey @tomaarsen, I may have a possible alternative solution: in Flair we can also construct and predict from sentences that are given by the user. For this tokenization problem we use the v1 version of https://github.com/fnl/segtok/blob/master/segtok/tokenizer.py#L210. I think this could easily be added in the Model Hub inference logic. Another alternative: instead of implementing it only on the Model Hub side, maybe it can be implemented in …
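The idea above could be sketched roughly as follows. This is a pure-stdlib stand-in for segtok-style word tokenization (the helper `simple_word_tokenize` and the commented `model.predict` call are hypothetical illustrations, not segtok's or SpanMarker's actual code):

```python
import re

def simple_word_tokenize(text: str) -> list[str]:
    """Rough stdlib approximation of segtok-style word tokenization:
    splits off punctuation marks and common English contractions."""
    # Separate contractions such as "doesn't" -> "does n't"
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text)
    # Emit "n't", runs of word characters, and single punctuation marks as tokens
    return re.findall(r"n't|\w+|[^\w\s]", text)

tokens = simple_word_tokenize("Tom doesn't live in Amsterdam, right?")
print(tokens)
# → ['Tom', 'does', "n't", 'live', 'in', 'Amsterdam', ',', 'right', '?']

# The pre-split words could then be handed to a SpanMarker model, e.g.:
# model.predict(" ".join(tokens))
```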
I'll certainly consider this approach, whether with `segtok` or something else. By default, perhaps I can apply the tokenization only if …
I've discovered that the issue only persisted for XLM-RoBERTa, and I've been able to tackle it in f2edd06!
Hello!
This is a heads up that (XLM-)RoBERTa-based SpanMarker models require text to be preprocessed to separate punctuation from words.
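For illustration, a minimal regex-based sketch of such preprocessing (the helper `separate_punctuation` is hypothetical, not part of the SpanMarker library):

```python
import re

def separate_punctuation(text: str) -> str:
    """Surround each punctuation mark with spaces so it becomes its own
    word, as (XLM-)RoBERTa-based SpanMarker models expect."""
    return re.sub(r"\s*([^\w\s])\s*", r" \1 ", text).strip()

print(separate_punctuation("Tom lives in Amsterdam, Netherlands."))
# → Tom lives in Amsterdam , Netherlands .
```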
This is a consequence of the RoBERTa tokenizer distinguishing `,` and ` ,` as different tokens, and the SpanMarker model is only familiar with the ` ,` variant.

Another alternative is to use the spaCy integration, which preprocesses the text into words for you!
The (m)BERT-based SpanMarker models do not require this preprocessing.