Hi, I'm not sure if this is something to do with your models in particular, or just a limitation of certain Hugging Face base models, but the tokenizers associated with your models have their model_max_length attribute undefined. This means longer texts will not be truncated to the model's maximum input size (even when passing truncation=True), and the model will then fail with an index-out-of-range error when looking up the embeddings.
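To illustrate, here's a minimal sketch of the failure mode. The checkpoint name is just illustrative; any model whose tokenizer_config.json lacks model_max_length should behave the same:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unitary/unbiased-toxic-roberta")

# With model_max_length undefined, transformers falls back to a huge
# sentinel value (int(1e30)), so truncation=True truncates nothing.
print(tokenizer.model_max_length)

encoded = tokenizer("some very long text, " * 1000, truncation=True)
print(len(encoded["input_ids"]))  # far beyond 512; the model's embedding
                                  # lookup later fails with index out of range
```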
Actually, I've just seen that e.g. unitary/multilingual-toxic-xlm-roberta (also based on XLM-RoBERTa) doesn't fail in the same way, probably because it does define model_max_length in its tokenizer_config.json.
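You can see the difference directly on the loaded tokenizer (sketch; I've only checked the multilingual model, and the exact value is whatever its config defines):

```python
from transformers import AutoTokenizer

ok = AutoTokenizer.from_pretrained("unitary/multilingual-toxic-xlm-roberta")
# Prints a finite value (presumably 512 for an XLM-R model) read from
# tokenizer_config.json, instead of the int(1e30) sentinel.
print(ok.model_max_length)
```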
Just in case you want to keep this in mind for possible improvements in your model configs... I'd have thought that huggingface's base models would have such parameters set by default when not explicitly stated, but it seems that's not the case :(
For reference, I now automatically "fix" all Hugging Face pipelines with a version of the code below:
```python
import logging

LOG = logging.getLogger(__name__)

MAX_SEQUENCE_LENGTH = 512
"""For now we don't need texts longer than this, ever."""


def ensure_tokenizer_max_length(tokenizer, model):
    """Ensure the tokenizer has a max length defined (#tokens) at which to truncate.

    Unfortunately many tokenizers don't seem to have this defined by default,
    which will lead to failure when using their resulting non-truncated
    outputs in a model which does have a maximum size.
    """
    max_length = getattr(tokenizer, "model_max_length", None)
    if max_length is None or max_length > MAX_SEQUENCE_LENGTH:
        LOG.warning(f"Tokenizer's model_max_length={max_length} probably wasn't set correctly.")
        # First try the per-checkpoint defaults shipped with the tokenizer class.
        default_lengths = getattr(tokenizer, "max_model_input_sizes", {})
        if default_lengths:
            k, v = next(iter(default_lengths.items()))
            LOG.warning(f"Found and will use default max length for model {k}={v}.")
            max_length = v
        else:
            # Fall back to the position embedding count from the model config.
            model_len = model.config.to_dict().get("max_position_embeddings")
            if model_len is not None:
                LOG.warning(f"Found no default max length but model defines max_position_embeddings={model_len}")
                if model_len in (514, 130):
                    # RoBERTa-style models allocate 2 extra position slots
                    # (padding offset), so the usable length is 2 less.
                    model_len -= 2
                    LOG.warning(f"Corrected max length to be {model_len}.")
                max_length = model_len
            else:
                LOG.warning(f"Couldn't determine appropriate max length. Will use default of {MAX_SEQUENCE_LENGTH}.")
                max_length = MAX_SEQUENCE_LENGTH
    tokenizer.model_max_length = max_length
```
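Usage looks something like this (again a sketch; the checkpoint name is illustrative):

```python
from transformers import pipeline

# Patch the pipeline's tokenizer once, before running inference.
pipe = pipeline("text-classification", model="unitary/unbiased-toxic-roberta")
ensure_tokenizer_max_length(pipe.tokenizer, pipe.model)

# Long inputs are now truncated instead of crashing the embedding lookup.
print(pipe("some very long text, " * 1000, truncation=True))
```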