Token indices sequence length is longer than the specified maximum sequence length for this model #1833

LysandreJik · 2019-11-14T15:39:05Z

Currently, the tokenizers emit a warning when an encoded sequence is longer than the model-specific maximum sequence length:

Token indices sequence length is longer than the specified maximum sequence length for this model (X > 1024). Running this sequence through the model will result in indexing errors

The warning is currently emitted in the convert_tokens_to_ids method, which leads to two issues:

- Calling encode or encode_plus with a max_length specified still outputs the warning, because convert_tokens_to_ids runs before the truncation is done. (cf token indices sequence length is longer than the specified maximum sequence length #1791)
- Since prepare_for_model was introduced, I personally feel that all modifications related to the model should happen in that method and not in tokenize or convert_tokens_to_ids.

This PR moves the warning into the prepare_for_model method, where it is emitted only if no max_length is specified. This addresses both issues above.
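To make the intended control flow concrete, here is a minimal, standalone sketch of the idea. It is not the library's actual code: the function bodies, the `MAX_LEN` constant, the `model_max_len` parameter, the toy vocabulary lookup, and the logger name are all assumptions made for illustration. Only the method names and the warning text come from the description above.

```python
import logging

logger = logging.getLogger("tokenizer_sketch")  # hypothetical logger name

MAX_LEN = 1024  # hypothetical model-specific limit, matching the example warning


def convert_tokens_to_ids(tokens, vocab):
    # After the change, this step only maps tokens to ids; no length check here.
    # Unknown tokens map to id 0 in this toy version.
    return [vocab.get(tok, 0) for tok in tokens]


def prepare_for_model(ids, max_length=None, model_max_len=MAX_LEN):
    if max_length is not None:
        # The caller asked for truncation, so the length warning is skipped.
        return ids[:max_length]
    if len(ids) > model_max_len:
        # No max_length given: warn that the sequence exceeds the model limit.
        logger.warning(
            "Token indices sequence length is longer than the specified "
            "maximum sequence length for this model (%d > %d). Running this "
            "sequence through the model will result in indexing errors",
            len(ids),
            model_max_len,
        )
    return ids
```

In this sketch, calling `prepare_for_model(ids, max_length=512)` truncates silently, while `prepare_for_model(ids)` on an over-long sequence returns it unchanged and logs the warning once.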

thomwolf

LGTM, cleaner indeed!

Reorganized max_len warning

a67e747

LysandreJik requested a review from thomwolf November 14, 2019 15:39

LysandreJik changed the title ~~oken indices sequence length is longer than the specified maximum sequence length for this model~~ Token indices sequence length is longer than the specified maximum sequence length for this model Nov 14, 2019

thomwolf approved these changes Nov 14, 2019


thomwolf merged commit 0be9ae7 into master Nov 14, 2019

julien-c deleted the max-length-warning branch November 16, 2019 02:20
