Missing tokenizer model_max_length #17

Open
buhrmann opened this issue Sep 21, 2021 · 0 comments

@buhrmann

Hi, not sure if this is something to do with your models in particular, or just a limitation of certain huggingface base models, but the tokenizers associated with your models for some reason have their model_max_length attribute undefined. This means longer texts will not be truncated to the maximum size of the model (even when passing truncation=True), and the model will then fail with an index-out-of-range error when accessing the embeddings.
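
To illustrate, a minimal reproduction sketch (the checkpoint name is a placeholder for any affected model):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("some/affected-model")  # hypothetical name

# When model_max_length isn't set, transformers falls back to a huge sentinel
# value, so truncation=True effectively never truncates.
print(tokenizer.model_max_length)  # a very large integer instead of e.g. 512
enc = tokenizer("word " * 1000, truncation=True)
print(len(enc["input_ids"]))  # well over 512 tokens, more than the model can embed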

Actually, I've just seen that e.g. unitary/multilingual-toxic-xlm-roberta (also based on XLM-RoBERTa) doesn't fail in the same way, presumably because it does define model_max_length in its tokenizer_config.json.
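
For what it's worth, the model-side fix should just be a matter of setting the attribute once and re-saving the tokenizer, which serializes it into tokenizer_config.json (a sketch, assuming 512 is the right limit for these XLM-RoBERTa-based models):

tokenizer.model_max_length = 512
tokenizer.save_pretrained("path/to/model")  # hypothetical path; writes model_max_length into tokenizer_config.json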

Just in case you want to keep this in mind for possible improvements in your model configs... I'd have thought that huggingface's base models would have such parameters set by default when not explicitly stated, but it seems that's not the case :(

For reference, I now automatically "fix" all huggingface pipelines with a version of the code below:

import logging

LOG = logging.getLogger(__name__)

MAX_SEQUENCE_LENGTH = 512
"""For now we don't need texts longer than this, ever."""


def ensure_tokenizer_max_length(tokenizer, model):
    """Ensure the tokenizer has a max. length (in tokens) at which to truncate.

    Unfortunately many tokenizers don't seem to have this defined by default, which
    leads to failures when their non-truncated outputs are fed to a model that does
    have a maximum size.
    """
    # An unset model_max_length shows up either as None or as a huge sentinel value.
    max_length = getattr(tokenizer, "model_max_length", None)
    if max_length is None or max_length > MAX_SEQUENCE_LENGTH:
        LOG.warning(f"Tokenizer's model_max_length={max_length} probably wasn't set correctly.")

        # First try the per-checkpoint defaults some tokenizer classes ship with.
        default_lengths = getattr(tokenizer, "max_model_input_sizes", {})
        if default_lengths:
            k, v = next(iter(default_lengths.items()))
            LOG.warning(f"Found and will use default max length for model {k}={v}.")
            max_length = v
        else:
            # Otherwise fall back to the model config's positional embedding size.
            model_len = model.config.to_dict().get("max_position_embeddings")
            if model_len is not None:
                LOG.warning(f"Found no default max length but model defines max_position_embeddings={model_len}")
                # (XLM-)RoBERTa-style models reserve two positions, so the usable
                # length is max_position_embeddings - 2 (e.g. 514 -> 512).
                if model_len in (514, 130):
                    model_len -= 2
                    LOG.warning(f"Corrected max length to be {model_len}.")
                max_length = model_len
            else:
                LOG.warning(f"Couldn't determine appropriate max length. Will use default of {MAX_SEQUENCE_LENGTH}.")
                max_length = MAX_SEQUENCE_LENGTH

    tokenizer.model_max_length = max_length
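
For example (again with a hypothetical checkpoint name):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("some/affected-model")
model = AutoModelForSequenceClassification.from_pretrained("some/affected-model")
ensure_tokenizer_max_length(tokenizer, model)

# Truncation now works as expected:
enc = tokenizer("word " * 1000, truncation=True)
assert len(enc["input_ids"]) <= tokenizer.model_max_length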