-
Notifications
You must be signed in to change notification settings - Fork 31.9k
[TokenClassification] Label realignment for subword aggregation #11622
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[TokenClassification] Label realignment for subword aggregation #11622
Conversation
sgugger
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for re-opening a clean PR!
| ], | ||
| ] | ||
|
|
||
| ungrouped_inputs_all_scores = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be in a fixture too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Moved to json fixture along w/ the already existing inputs above this.
|
@sgugger the failed test ( |
|
No, it's just flaky, don't worry! |
LysandreJik
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your work! This LGTM.
I'm putting links to relevant comments from the previous PR:
- This keeps backwards compatibility #10568 (review)
- Completing the work that was started here, left for a future PR #10568 (comment)
@Narsil could you give it a quick look if you have time? It shouldn't change anything to the current pipeline behavior, but offers a cleaner AggregationStrategy as an opt-in.
|
Hey @francescorubbo, @Narsil pointed out a few issues with the current implementation that we'll take a look at today/tomorrow. Namely, the code becomes a bit complex as we keep adding features to that pipeline so it might be time for a slightly larger refactor, and some code is model-specific, such as this line which wouldn't work on non BERT-like tokenizers: We're taking a look at what can be done and will come back to you in a bit. Thanks again for your patience. |
|
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
What does this PR do?
Fixes #10263, #10763
See also #10568
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
@LysandreJik
@sgugger
What capabilities have been added ?
label realignment: token predictions for subwords can be realigned with 4 different strategies
first(default): the prediction for the first token in the word is assigned to all subword tokensmax: the highest confidence prediction among the subword tokens is assigned to all subword tokensaverage: the average pool of the predictions for all subwords is assigned to all subword tokensWhat are the expected changes from the current behavior?
New flag
aggregation_strategyenables realignment.Already existing flag
ignore_subwordsactually enables merging subwords.Example use cases with code sample enabled by the PR
Previous use cases with code sample that see the behavior changes