Missing tokenizer model_max_length #17

Open
buhrmann opened this issue Sep 21, 2021 · 0 comments

@buhrmann

Hi, not sure if this is something to do with your models in particular, or just a limitation of certain huggingface base models, but the tokenizers associated with your models for some reason have their model_max_length attribute undefined. This means longer texts will not be truncated to the maximum size of the model (even when passing truncation=True), and the model will then fail with an index-out-of-range error when accessing the embeddings.
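
To illustrate, a minimal reproduction sketch (the checkpoint name is a placeholder for any affected model):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("some/affected-model")  # hypothetical name

# When model_max_length isn't set, transformers falls back to a huge sentinel
# value, so truncation=True effectively never truncates.
print(tokenizer.model_max_length)  # a very large integer instead of e.g. 512
enc = tokenizer("word " * 1000, truncation=True)
print(len(enc["input_ids"]))  # well over 512 tokens, more than the model can embed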

Actually, I've just seen that e.g. unitary/multilingual-toxic-xlm-roberta (also based on XLM-RoBERTa) doesn't fail in the same way, presumably because it does define model_max_length in its tokenizer_config.json.
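
For what it's worth, the model-side fix should just be a matter of setting the attribute once and re-saving the tokenizer, which serializes it into tokenizer_config.json (a sketch, assuming 512 is the right limit for these XLM-RoBERTa-based models):

tokenizer.model_max_length = 512
tokenizer.save_pretrained("path/to/model")  # hypothetical path; writes model_max_length into tokenizer_config.json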

Just in case you want to keep this in mind for possible improvements in your model configs... I'd have thought that huggingface's base models would have such parameters set by default when not explicitly stated, but it seems that's not the case :(

For reference, I now automatically "fix" all huggingface pipelines with a version of the code below:

import logging

LOG = logging.getLogger(__name__)

MAX_SEQUENCE_LENGTH = 512
"""For now we don't need texts longer than this, ever."""


def ensure_tokenizer_max_length(tokenizer, model):
    """Ensure the tokenizer has a max. length (in tokens) at which to truncate.

    Unfortunately many tokenizers don't seem to have this defined by default, which
    leads to failures when their non-truncated outputs are fed to a model that does
    have a maximum size.
    """
    # An unset model_max_length shows up either as None or as a huge sentinel value.
    max_length = getattr(tokenizer, "model_max_length", None)
    if max_length is None or max_length > MAX_SEQUENCE_LENGTH:
        LOG.warning(f"Tokenizer's model_max_length={max_length} probably wasn't set correctly.")

        # First try the per-checkpoint defaults some tokenizer classes ship with.
        default_lengths = getattr(tokenizer, "max_model_input_sizes", {})
        if default_lengths:
            k, v = next(iter(default_lengths.items()))
            LOG.warning(f"Found and will use default max length for model {k}={v}.")
            max_length = v
        else:
            # Otherwise fall back to the model config's positional embedding size.
            model_len = model.config.to_dict().get("max_position_embeddings")
            if model_len is not None:
                LOG.warning(f"Found no default max length but model defines max_position_embeddings={model_len}")
                # (XLM-)RoBERTa-style models reserve two positions, so the usable
                # length is max_position_embeddings - 2 (e.g. 514 -> 512).
                if model_len in (514, 130):
                    model_len -= 2
                    LOG.warning(f"Corrected max length to be {model_len}.")
                max_length = model_len
            else:
                LOG.warning(f"Couldn't determine appropriate max length. Will use default of {MAX_SEQUENCE_LENGTH}.")
                max_length = MAX_SEQUENCE_LENGTH

    tokenizer.model_max_length = max_length
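
For example (again with a hypothetical checkpoint name):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("some/affected-model")
model = AutoModelForSequenceClassification.from_pretrained("some/affected-model")
ensure_tokenizer_max_length(tokenizer, model)

# Truncation now works as expected:
enc = tokenizer("word " * 1000, truncation=True)
assert len(enc["input_ids"]) <= tokenizer.model_max_length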