Uniformize kwargs for chameleon processor #32181
yonigozlan merged 12 commits into huggingface:main
Conversation
Force-pushed from b184e46 to 2f4163a
@zucchini-nlp this should now also be ready for review
molbap left a comment:
Thanks for the contribution! Just one extra default I don't get; other than that, this looks fine.
zucchini-nlp left a comment:
Great job! I think we have to swap the args order and then it will be ready to merge.
```python
if isinstance(component_class_name, tuple):
    if "_fast" in component_class_name[0]:
        component_class_name = component_class_name[0]
    else:
        component_class_name = component_class_name[1]
```
Same question as in the other PR: why do we need to overwrite this and look for the fast tokenizer? Or is it FastImageProcessor?
Some of the common tests error out with the base tokenizer. I've yet to investigate why, but it's likely unrelated to this PR.
Yes, it would be nice to see what exactly is causing the error, in case there is a bug in tokenizers.
Ah, I remember now: the slow LlamaTokenizer expects the vocab file to be present, but it's neither in the official repo nor does it get saved to the temp dir when we do `processor.save_pretrained(self.tmpdirname)` in `setUp`. I'll add this as a comment.
Ah right, chameleon never had a slow tokenizer. Okay, in that case we can and probably should remove the slow-tokenizer option for chameleon here, so that the tuple is `(None, FastTokenizer)`:
transformers/src/transformers/models/auto/tokenization_auto.py
Lines 111 to 115 in 85345bb
And then add a check for Noneness in the generic `get_component`.
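A minimal sketch of what such a Noneness check could look like. The helper name and the tuple convention (fast variant last) are assumptions for illustration here, not the actual transformers implementation:

```python
# Illustrative sketch only: resolve a tokenizer class name that may be a
# plain string or a (slow, fast) tuple where either entry can be None.
def resolve_component_class_name(component_class_name):
    if isinstance(component_class_name, tuple):
        # Keep only the entries that are actually available.
        candidates = [name for name in component_class_name if name is not None]
        if not candidates:
            raise ValueError("No tokenizer class is available for this component.")
        # By convention the fast variant comes last, so prefer it.
        component_class_name = candidates[-1]
    return component_class_name
```

With this shape, a `(None, "LlamaTokenizerFast")` tuple resolves to the fast tokenizer without any special-casing of the `"_fast"` suffix.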
Just removed Chameleon's slow tokenizer and removed the custom `get_component` in `ChameleonProcessorTest` (we don't need the extra check because there's only one tokenizer left).
tests/test_processing_common.py
Outdated
```python
images=image_input,
return_tensors="pt",
size={"height": 214, "width": 214},
crop_size={"height": 214, "width": 214},
```
@yonigozlan I think you removed `crop_size` from the common tests, and it had something to do with some image processors accepting/not accepting certain kwargs?
@zucchini-nlp Yes, but actually it would be nice to have both here. @molbap had some CI tests crash because `crop_size` was removed here while the image processor had `do_center_crop` set to `True` by default, which canceled out `size`. Having both handles the cases where either `do_center_crop` is `True` in the image processor by default, or `crop_size` is not supported by the image processor.
So I am for keeping this and merging this PR before some other kwargs uniformization PRs
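As a toy illustration of the interaction described above (a simplified model with assumed defaults, not the real image-processor code): when `do_center_crop` is `True` by default, the crop step runs after the resize and its size wins, so a test that only passes `size` can be silently overridden by a default `crop_size`:

```python
# Assumed default crop for illustration; real image processors each define
# their own defaults in their config.
DEFAULT_CROP = {"height": 224, "width": 224}

def simulate_preprocess(size, crop_size=None, do_center_crop=True):
    """Return the final (height, width) after a resize and an optional center crop."""
    height, width = size["height"], size["width"]  # resize step
    if do_center_crop:
        # The crop runs after the resize, so its dimensions override `size`.
        crop = crop_size if crop_size is not None else DEFAULT_CROP
        height, width = crop["height"], crop["width"]
    return height, width
```

Passing only `size={"height": 214, "width": 214}` here still yields a 224x224 output because the default crop wins; passing `crop_size` explicitly restores the expected 214x214, which is why the common test sets both.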
By the way @zucchini-nlp, we might need to increase the priority of this PR because of this.
I have this change in my other PR too, but I forgot we haven't merged it yet
Sorry, I was out for a while. Yes, I think another contributor also reported the issue and wanted to open a PR to fix the conversion script. Feel free to open a PR if there isn't one, as this issue isn't at all related to processor kwargs.
Force-pushed from 0dbc570 to b252643, then from b252643 to 5082630
Thanks so much for your contribution @leloykun! This PR was a bit outdated compared to main, so I rebased it and fixed some other nits, but otherwise it all looks good to me.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```diff
-    "LlamaTokenizer" if is_sentencepiece_available() else None,
-    "LlamaTokenizerFast" if is_tokenizers_available() else None,
-),
+(None, "LlamaTokenizerFast" if is_tokenizers_available() else None),
```
Why remove the slow tokenizer here?
There was a vocab file missing, if I understood correctly, but I will see if it can be added back.
```diff
 attributes = ["image_processor", "tokenizer"]
-tokenizer_class = ("LlamaTokenizer", "LlamaTokenizerFast")
+tokenizer_class = "LlamaTokenizerFast"
```
Same here - why remove the slow tokenizer option?
```python
return_tensors (`str` or [`~utils.TensorType`], *optional*):
    If set, will return tensors of a particular framework. Acceptable values are:

    - `'tf'`: Return TensorFlow `tf.constant` objects.
    - `'pt'`: Return PyTorch `torch.Tensor` objects.
    - `'np'`: Return NumPy `np.ndarray` objects.
    - `'jax'`: Return JAX `jnp.ndarray` objects.
```
This should stay in the docstring for the moment, as it's required for users to get the right output from the processor to pass to the model
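A toy sketch of the `return_tensors` convention the docstring describes. The dispatch function below is illustrative only, not the actual processor code:

```python
import numpy as np

def convert_outputs(token_ids, return_tensors=None):
    """Wrap a list of token ids in the framework the caller asked for."""
    if return_tensors is None:
        return token_ids  # plain Python list
    if return_tensors == "np":
        return np.array(token_ids)
    if return_tensors == "pt":
        import torch  # imported lazily so the dependency stays optional
        return torch.tensor(token_ids)
    raise ValueError(f"Unsupported return_tensors value: {return_tensors!r}")
```

A model's forward pass expects framework tensors, which is why `return_tensors="pt"` is the usual choice when feeding PyTorch models, and why the docstring needs to keep listing the accepted values.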
There are no custom tests, but it still inherits the tests from `ProcessorTesterMixin`.
Sorry, I reviewed too quickly and thought this was a scrap file. We should keep it and:
- Update the checkpoint
- Add a copyright header
The changes look good to me, thanks for the help @yonigozlan!!
amyeroberts left a comment:
Thanks for updating!
Commits:
* uniformize kwargs of Chameleon
* fix linter nit
* rm stride default
* add tests for chameleon processor
* fix tests
* add comment on get_component
* rm Chameleon's slow tokenizer
* add check order images text + nit
* update docs and tests
* Fix LlamaTokenizer tests
* fix gated repo access
* fix wrong import

Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
What does this PR do?
Uniformizes kwargs of Chameleon processors as discussed in #31911
Currently a draft. Will set as ready for review once this PR gets merged: #32013. The other PR will take longer to complete, but this can now be merged.
Who can review?
@zucchini-nlp @molbap