Conversation

byi8220 (Contributor) commented Mar 5, 2024

What does this PR do?

Fixes #28591

This PR does the following:

  1. Default the value of IdeficsProcessor.__call__()'s padding param to 'longest' instead of False.
  2. Remove the hardcoded text padding logic in that same function, and fully trust the tokenizer to pad correctly.
  3. Add assertions for the correctness of attention masks in the unit test IdeficsProcessorTest.test_tokenizer_padding (see the sketch after this list).
  4. Add IdeficsProcessorTest.test_tokenizer_left_padding, a copy of test_tokenizer_padding that uses the default tokenizer (which left-pads). This could have been parameterized, but duplicating it felt fine.
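
As a rough sketch of the kind of attention mask assertion added (the tensors below are made-up placeholders, not the actual test fixtures):

```python
import torch

# Placeholder right-padded ids (pad id 0) and the mask the processor should return.
input_ids = torch.tensor([[5, 6, 7, 0, 0],
                          [5, 6, 7, 8, 9]])
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1]])

# Every padded position must be masked out, every real token kept.
assert torch.equal(attention_mask, (input_ids != 0).long())
```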

Context

IIUC, this code currently always hand-performs an additional padding pass after the tokenizer performs its padding, according to this code: https://github.com/huggingface/transformers/blob/main/src/transformers/models/idefics/processing_idefics.py#L348-L354

And since the output tensors are stacked together in this code, they all have to be of the same size, i.e. padded or otherwise normalized.

Note: since this changes the default padding, anyone who calls this function with an explicit padding=False will likely now encounter an error.
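
To illustrate why (a generic PyTorch sketch, not the idefics code itself): stacking only works when every sequence has the same length.

```python
import torch

a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5])

try:
    torch.stack([a, b])  # unpadded sequences of different lengths
except RuntimeError as err:
    # "stack expects each tensor to be equal size" (exact wording may vary by version)
    print(err)
```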

Testing

Unit tests run with pytest ./tests/models/idefics/ pass (129 passed, 115 skipped, 8 warnings in 19.22s).

Did not run the slow tests, since running all tests with RUN_SLOW=yes crashes my workstation.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker
@amyeroberts

ArthurZucker (Collaborator) commented

Thanks for the PR! Reviewing asap

amyeroberts (Contributor) left a comment

Hi @bri25yu, thanks for working on this!

Making the padding logic consistent within the processor is a good idea, but I don't think we should change the default behaviour.

```diff
  self,
  prompts: Union[List[TextInput], List[List[TextInput]]],
- padding: Union[bool, str, PaddingStrategy] = False,
+ padding: Union[bool, str, PaddingStrategy] = "longest",
```
Contributor

I don't think we should change the default here for two reasons:

  • It doesn't match the default behaviour for most processing classes
  • It changes the default behaviour, which can be considered a breaking change

Contributor Author

You may have tagged the wrong person 🙃

> It doesn't match the default behaviour for most processing classes

Does the idefics model support non-padded inputs? From my understanding of the original issue, it seems they expect some padding even when the argument is not passed.

> It changes the default behaviour, which can be considered a breaking change

I'm not sure if this is a bug, but the default behavior appears to have been inaccurate to begin with. Even if the user passes in padding=False, lines 347-354 seem to forcibly pad the text input to the maximum sequence length anyway.
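
For illustration, the kind of pattern I mean (a made-up sketch with placeholder token ids, not the actual lines from processing_idefics.py):

```python
# Illustrative sketch of forcibly padding every encoding to the batch maximum,
# regardless of the `padding` argument; not the actual library code.
encodings = [[101, 7, 8, 102], [101, 9, 102]]  # placeholder token ids
pad_token_id = 0                               # placeholder pad id
max_seq_len = max(len(enc) for enc in encodings)

padded_ids = [enc + [pad_token_id] * (max_seq_len - len(enc)) for enc in encodings]
print(padded_ids)  # [[101, 7, 8, 102], [101, 9, 102, 0]]
```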

Contributor

> You may have tagged the wrong person 🙃

Whoops, yes, sorry about that.

> Does the idefics model support non-padded inputs? From my understanding of the original issue, it seems they expect some padding even when the argument is not passed.

No, but no model does: if batch_size > 1 and the input sequences are of different lengths, padding is required, yet processors and tokenizers still do not pad the inputs by default.
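
For example, with any standard tokenizer (the checkpoint below is just a placeholder):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
texts = ["a short prompt", "a noticeably longer prompt with several more tokens"]

# Default behaviour: no padding, so the encodings have different lengths...
unpadded = tokenizer(texts)
print([len(ids) for ids in unpadded["input_ids"]])

# ...and they only stack into batch tensors once padding is requested explicitly.
padded = tokenizer(texts, padding="longest", return_tensors="pt")
print(padded["input_ids"].shape, padded["attention_mask"].shape)
```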

> I'm not sure if this is a bug, but the default behavior appears to have been inaccurate to begin with. Even if the user passes in padding=False, lines 347-354 seem to forcibly pad the text input to the maximum sequence length anyway.

In this case, the forcible padding when padding=False should be removed

byi8220 (Contributor Author) commented Mar 11, 2024

> In this case, the forcible padding when padding=False should be removed

Yes, this behavior was removed as part of this PR. After this PR, padding=False or not setting padding will not pad the input.

I have modified the PR to default to padding=False, and the unit tests (and one of the integration tests) to explicitly specify padding='longest'. I guess my only concern was that by changing this behavior, a user who was relying on the default behavior would find their code broken overnight.

Contributor

> I guess my only concern was that by changing this behavior, a user who was relying on the default behavior would find their code broken overnight.

@byi8220 Hmm, yes, this is tricky and that's a good point. OK, in this case, I think your original solution of setting padding='longest' as the default is best, ideally with a comment linking to this issue to explain why the default is different, and a description of the False option added to the docstring.
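
Roughly something like this (an illustrative sketch of the suggested comment and docstring wording, not actual implementation code):

```python
from typing import List, Union

class IdeficsProcessorSketch:
    """Minimal stand-in used only to illustrate the suggested signature and docstring."""

    def __call__(
        self,
        prompts: Union[List[str], List[List[str]]],
        # Unlike most processors, default to "longest" so the stacked input_ids /
        # attention_mask tensors always share one length; see issue #28591.
        padding: Union[bool, str] = "longest",
    ):
        """
        padding (`bool` or `str`, defaults to `"longest"`):
            `False` or `"do_not_pad"`: no padding. This only works if all prompts
            tokenize to the same length, since the outputs are stacked into one tensor.
        """
        raise NotImplementedError("illustrative sketch only")
```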

Contributor Author

Done

@ArthurZucker ArthurZucker changed the title Attention mask bug with padding [ProcessingIdefics] Attention mask bug with padding Mar 19, 2024
amyeroberts (Contributor) left a comment

Thanks for working on this and adding tests!

@amyeroberts amyeroberts merged commit 75b76a5 into huggingface:main Apr 4, 2024
Development

Successfully merging this pull request may close these issues.

Idefics - AttentionMasks wrongly set with padding='longest'
