add initial design for uniform processors + align model #31197
molbap merged 49 commits into huggingface:main
Conversation
cc @amyeroberts for a review with a narrower scope than the parent PR 😁

cc @amyeroberts, sorry to divide my pings 😅 The PR is big and I wanted to split it up; this one should be mergeable and serve as the basis for the rest, and I'll rebase afterwards (@qubvel welcome if you want to take a look!). It includes the kwargs merging just mentioned in the other PR, moved to processing common!
amyeroberts
left a comment
Thanks!
Same comment as for the other PR - it would be good to move the kwarg prep logic out of the config, and add tests to make sure we can properly control tokenizer kwargs with `tokenizer.init_kwargs` and the input kwargs.
cc @qubvel
qubvel
left a comment
Nice to see _merge_kwargs as a separate method, this is exactly what came to my mind while reviewing the first PR 🙂
```python
_defaults = {
```
Suggested change:

```python
_defaults = {
    "padding": "max_length",
    "max_length": 64,
```
should work no? Or does it not update the default for type-hints?
yes it works for sure, this was to have a structured dict for defaults. Can change :)
ah, now I remember, it actually can't work like that, since TypedDicts don't support default values; they are made to hold the layout. They can have arbitrary class attributes, but those won't be passed as defaults the way a dataclass would (and with a dataclass we'd lose the dict typing), hence the manual operation
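A minimal runnable illustration of the point above (the class and key names here are made up for the example, not taken from the library):

```python
from typing import TypedDict

# A TypedDict only describes a dict's layout: instantiating one builds a
# plain dict containing exactly the keys you pass, so there is no mechanism
# for per-key default values (unlike a dataclass).
class TextKwargs(TypedDict, total=False):
    padding: str
    max_length: int

kwargs = TextKwargs(padding="max_length")
assert kwargs == {"padding": "max_length"}  # no defaults were injected
assert type(kwargs) is dict

# Defaults therefore have to be merged in manually:
defaults = {"padding": "longest", "max_length": 64}
merged = {**defaults, **kwargs}
assert merged == {"padding": "max_length", "max_length": 64}
```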
ok got it, thanks! Let's maybe add a comment about this!
Do we have a comment for future code inspectors? I'm assuming here isn't the best place (we don't want it for all models) but didn't find a corresponding one elsewhere on a quick skim
On that: there's documentation in `processing_utils.ProcessingKwargs`, and I added a comment nudging users to check there for it!
I have updated the PR description to be more self-contained.
amyeroberts
left a comment
Looks great!
Just a few small comments
@amyeroberts FYI: kept digging into the kwargs merging logic and found an edge case that was giving unreliable results in the tokenizer. Refactored and doubled the number of tests to avoid further trickery (including an edge case found earlier by @qubvel); the logic should be easier to read now. Nothing else changed, and tests should pass reliably.
What does this PR do?
Adds a uniform signature for processors. This PR adds the initial design + one model for the larger #30511.
Usage
As before, kwargs that are passed to processors at `__call__` time take priority. However, per-modality processors can be instantiated with their own kwargs, and if those are not overridden at call time, they serve as defaults. Type hinting of kwargs is preserved if they are passed as structured dictionary entries:

It also works with kwargs passed without nesting:

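A small runnable sketch of how un-nested kwargs can be routed to the right modality group by consulting each typed dict's declared keys (the group names and helper here are hypothetical, not the actual transformers code):

```python
from typing import TypedDict

# Hypothetical per-modality kwarg groups, for illustration only.
class TextKwargs(TypedDict, total=False):
    padding: str
    truncation: bool

class ImagesKwargs(TypedDict, total=False):
    do_resize: bool
    size: int

def dispatch_flat_kwargs(**kwargs):
    """Route flat kwargs to the modality whose TypedDict declares them."""
    groups = {"text_kwargs": TextKwargs, "images_kwargs": ImagesKwargs}
    routed = {name: {} for name in groups}
    for key, value in kwargs.items():
        for name, typed_dict in groups.items():
            if key in typed_dict.__annotations__:
                routed[name][key] = value
    return routed

routed = dispatch_flat_kwargs(padding="max_length", do_resize=False)
assert routed["text_kwargs"] == {"padding": "max_length"}
assert routed["images_kwargs"] == {"do_resize": False}
```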
Merging of kwargs and handling priority order is done in `processing_utils` through a dedicated method. The order of operations is as follows:

1. Defaults declared in the model's `ProcessingKwargs` typed dict are applied first.
2. Kwargs the per-modality processor (e.g. the tokenizer) was instantiated with override those defaults.
3. Kwargs passed at `__call__` time, nested per modality or flat, take final priority.
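A simplified sketch of that priority merge (the names `merge_kwargs`, `TextKwargs`, and `DEFAULTS` are made up for this example and are not the actual transformers implementation):

```python
from typing import TypedDict

class TextKwargs(TypedDict, total=False):
    padding: str
    max_length: int

# Lowest-priority layer: defaults declared alongside the typed dict.
DEFAULTS = {"padding": "longest", "max_length": 64}

def merge_kwargs(tokenizer_init_kwargs, **call_kwargs):
    """Merge from lowest to highest priority: declared defaults, then the
    kwargs the tokenizer was instantiated with, then call-time kwargs
    (nested under 'text_kwargs' or passed flat)."""
    merged = dict(DEFAULTS)
    merged.update(
        {k: v for k, v in tokenizer_init_kwargs.items()
         if k in TextKwargs.__annotations__}
    )
    nested = call_kwargs.pop("text_kwargs", {})
    flat = {k: v for k, v in call_kwargs.items()
            if k in TextKwargs.__annotations__}
    merged.update(nested)
    merged.update(flat)
    return merged

# Nested call-time kwargs override init-time kwargs:
out = merge_kwargs({"padding": "max_length"}, text_kwargs={"max_length": 32})
assert out == {"padding": "max_length", "max_length": 32}
# Flat call-time kwargs do too:
out = merge_kwargs({"padding": "max_length"}, padding="do_not_pad")
assert out == {"padding": "do_not_pad", "max_length": 64}
```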
Missing: