Support tokenization_kwargs override (#29794)
Conversation
Signed-off-by: piood <2477084691@qq.com>
Code Review
This pull request adds support for overriding tokenization behavior by introducing a tokenization_kwargs parameter to the embedding, classification, and scoring APIs. The changes are well-implemented by plumbing this new parameter through the call stack. However, I've found a critical issue in the test utility function get_inputs where the passed processor_kwargs dictionary is modified in-place within a loop, which could lead to incorrect test behavior. I've provided a code suggestion to fix this.
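The in-place mutation issue can be illustrated with a minimal sketch. The function and argument names (`get_inputs`, `processor_kwargs`) follow the review comment; the bodies are hypothetical stand-ins, not the actual test utility:

```python
def get_inputs_buggy(prompts, processor_kwargs):
    # BUG: aliases the caller's dict and mutates it inside the loop,
    # so state leaks across iterations and back to the caller.
    inputs = []
    for prompt in prompts:
        kwargs = processor_kwargs
        kwargs["text"] = prompt
        inputs.append(dict(kwargs))
    return inputs


def get_inputs_fixed(prompts, processor_kwargs):
    # FIX: take a shallow copy per iteration; the caller's dict is untouched.
    inputs = []
    for prompt in prompts:
        kwargs = dict(processor_kwargs)
        kwargs["text"] = prompt
        inputs.append(kwargs)
    return inputs


shared = {"padding": "max_length"}
get_inputs_buggy(["a", "b"], shared)
print("text" in shared)   # True: the caller's dict was polluted

shared = {"padding": "max_length"}
get_inputs_fixed(["a", "b"], shared)
print("text" in shared)   # False: the caller's dict is unchanged
```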
/gemini review
Code Review
This pull request adds support for overriding tokenization behavior by introducing a tokenization_kwargs parameter in various API methods. The changes are well-implemented for the most part, correctly plumbing the new parameter through the call stack. The tests have also been updated to cover the new functionality. However, I've identified a potential issue in _cross_encoding_score where tokenization_kwargs is used in two different places in the call chain, which could lead to confusion and bugs. My review comment details this concern.
/gemini review
Code Review
This pull request adds support for tokenization_kwargs to customize tokenization behavior in embedding, classification, and scoring APIs. The changes correctly plumb the new parameter through the different layers of the application. The tests have also been updated to cover the new functionality. I've found a couple of high-severity issues where the user-provided tokenization_kwargs dictionary is modified in-place, which can lead to unexpected side effects. I've provided suggestions to fix this by creating a copy of the dictionary.
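The suggested fix is a defensive-copy pattern at the API boundary, sketched here with a hypothetical helper name (not the actual vLLM internals): copy the user's tokenization_kwargs before injecting defaults, so the caller's dict is never modified.

```python
def resolve_tokenization_kwargs(tokenization_kwargs=None):
    # Copy before adding defaults: the caller may reuse the same dict
    # across calls and must not see our mutations.
    kwargs = dict(tokenization_kwargs or {})
    kwargs.setdefault("truncation", True)
    return kwargs


user = {"padding": "max_length", "max_length": 64}
resolved = resolve_tokenization_kwargs(user)
print("truncation" in resolved)  # True: default applied to the copy
print("truncation" in user)      # False: the user's dict is unchanged
```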
/gemini review
Code Review
This pull request adds support for overriding tokenization behavior via tokenization_kwargs. The changes are well-implemented across the embedding and classification APIs. However, I've identified a few critical issues where passing conflicting keys in tokenization_kwargs could lead to a TypeError. I've also noted a high-severity issue where multimodal arguments could be dropped during cross-encoder scoring.
Additionally, the public score() method was not updated to accept tokenization_kwargs, which makes this new feature inaccessible for scoring tasks. Please consider adding this parameter to the score() method and passing it to the internal _embedding_score and _cross_encoding_score methods to complete the feature.
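Both points can be sketched together; the class and stub bodies below are illustrative, with only the method names (`score`, `_embedding_score`, `_cross_encoding_score`) taken from the review. The public method forwards tokenization_kwargs to the internal paths, and a key the code also passes explicitly is popped and given user precedence, avoiding the TypeError that a duplicated keyword argument would raise.

```python
def tokenize(text, truncation=True, **kwargs):
    # Stand-in for a tokenizer call; returns the kwargs it was invoked with.
    return {"text": text, "truncation": truncation, **kwargs}


class ScorerSketch:
    def score(self, text_1, text_2, tokenization_kwargs=None, cross_encode=False):
        # Forward the override so the feature is reachable from the public API.
        if cross_encode:
            return self._cross_encoding_score(text_1, text_2, tokenization_kwargs)
        return self._embedding_score(text_1, text_2, tokenization_kwargs)

    def _embedding_score(self, text_1, text_2, tokenization_kwargs):
        kwargs = dict(tokenization_kwargs or {})
        # Pop the key we also pass explicitly and let the user's value win;
        # passing the same keyword twice would raise
        # "got multiple values for keyword argument 'truncation'".
        truncation = kwargs.pop("truncation", True)
        return tokenize(text_1 + " " + text_2, truncation=truncation, **kwargs)

    def _cross_encoding_score(self, text_1, text_2, tokenization_kwargs):
        return self._embedding_score(text_1, text_2, tokenization_kwargs)


scorer = ScorerSketch()
out = scorer.score("query", "doc",
                   tokenization_kwargs={"truncation": False, "max_length": 64})
print(out["max_length"])  # 64

# Without the pop, a conflicting key raises at the call site:
try:
    tokenize("x", truncation=True, **{"truncation": False})
except TypeError:
    print("TypeError")
```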
All tests still pass locally after implementing the suggested fixes. Please review, thanks! @DarkLight1337
Force-pushed 2caeb7c to bdfb80f
DarkLight1337 left a comment
Thanks and sorry for the delay!
Signed-off-by: piood <2477084691@qq.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Purpose
Add support for tokenization_kwargs parameter in embedding, classification, and scoring APIs to allow users to override tokenization behavior during inference.
This enhancement enables users to customize tokenization parameters (e.g., padding, max_length, truncation) when calling embedding-related methods (embed(), classify(), score()), providing more flexibility for handling different input scenarios.
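As a rough sketch of what the override controls, here is a toy whitespace "tokenizer" with hypothetical helper names; this is not the vLLM implementation, only an illustration of how a forwarded tokenization_kwargs dict changes padding and truncation behavior:

```python
def encode(text, padding=False, max_length=None, truncation=True):
    # Toy whitespace tokenizer: splits on spaces, then applies the
    # truncation/padding settings the kwargs control.
    tokens = text.split()
    if truncation and max_length is not None:
        tokens = tokens[:max_length]
    if padding == "max_length" and max_length is not None:
        tokens = tokens + ["<pad>"] * (max_length - len(tokens))
    return tokens


def embed(text, tokenization_kwargs=None):
    # The new parameter is forwarded to the tokenizer call unchanged.
    return encode(text, **(tokenization_kwargs or {}))


print(len(embed("a b c", tokenization_kwargs={"padding": "max_length",
                                              "max_length": 8})))  # 8
print(embed("a b c"))  # ['a', 'b', 'c']: defaults apply when no override
```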
Fix #27566 (comment)
Test Plan
Run the SiGLIP multimodal pooling tests with custom tokenization parameters:
pytest tests/models/multimodal/pooling/test_siglip.py

The test suite now validates:
Test Result
All SiGLIP pooling tests pass successfully with the following configurations:
✅ test_models_text - validates custom tokenization_kwargs (padding="max_length", max_length=64)
✅ test_models_image - validates default behavior with empty tokenization_kwargs
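A hypothetical shape for such checks (the real tests live in tests/models/multimodal/pooling/test_siglip.py; the runner stub and names below are illustrative only):

```python
def run_pooling(prompt, tokenization_kwargs):
    # Stand-in for the model runner; records the kwargs it would tokenize with.
    return {"prompt": prompt, "tokenization_kwargs": dict(tokenization_kwargs)}


def test_models_text_sketch():
    # Mirrors the custom-override case: padding="max_length", max_length=64.
    out = run_pooling("a photo of a cat",
                      {"padding": "max_length", "max_length": 64})
    assert out["tokenization_kwargs"]["max_length"] == 64


def test_models_image_sketch():
    # Mirrors the default case: empty tokenization_kwargs.
    out = run_pooling("a photo of a cat", {})
    assert out["tokenization_kwargs"] == {}


test_models_text_sketch()
test_models_image_sketch()
```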