Conversation

@LysandreJik (Member) commented:

As of now the tokenizers output a specific warning when an encoded sequence is longer than the maximum specified sequence length, which is model-specific:

Token indices sequence length is longer than the specified maximum sequence length for this model (X > 1024). Running this sequence through the model will result in indexing errors
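
For illustration only (a hedged sketch, not from the PR: the "gpt2" checkpoint and its 1024-token limit are assumptions chosen to match the "X > 1024" above), the warning can be triggered like this:

```python
from transformers import GPT2Tokenizer

# "gpt2" is used purely for illustration; its model-specific maximum
# sequence length is 1024, matching the "X > 1024" in the warning above.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Encoding a sequence longer than 1024 tokens triggers the warning.
ids = tokenizer.encode("word " * 2000)
print(len(ids))  # > 1024, so running the model would hit indexing errors
```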

The warning is currently emitted from the convert_tokens_to_ids method, which leads to two issues:

This PR slightly changes the behavior so that both of the aforementioned issues are solved: the warning moves into the prepare_for_model method and is raised only when no max_length is specified.
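
A minimal sketch of what the relocated check might look like (the standalone function, its name, and the model_max_len parameter are my assumptions; the actual change lives inside the tokenizer's prepare_for_model method):

```python
import logging

logger = logging.getLogger(__name__)

def warn_if_too_long(ids, model_max_len, max_length=None):
    # Sketch: warn only when the caller did not request truncation via
    # max_length and the encoded sequence exceeds the model-specific limit.
    if max_length is None and len(ids) > model_max_len:
        logger.warning(
            "Token indices sequence length is longer than the specified "
            "maximum sequence length for this model (%d > %d). Running this "
            "sequence through the model will result in indexing errors",
            len(ids),
            model_max_len,
        )
```

Presumably, when max_length is given the tokenizer truncates to it, so no indexing error can occur and the warning is suppressed.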

LysandreJik requested a review from thomwolf on November 14, 2019
LysandreJik changed the title from "oken indices sequence length is longer than the specified maximum sequence length for this model" to "Token indices sequence length is longer than the specified maximum sequence length for this model" on Nov 14, 2019

@thomwolf (Member) left a comment:


LGTM, cleaner indeed!

thomwolf merged commit 0be9ae7 into master on Nov 14, 2019
julien-c deleted the max-length-warning branch on November 16, 2019