Conversation

@francescorubbo
Contributor

What does this PR do?

Fixes #10263, #10763
See also #10568

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@LysandreJik
@sgugger

What capabilities have been added?

Label realignment: token predictions for subwords can be realigned with different strategies (see the sketch below):

  • first (default): the prediction for the first subword token in the word is assigned to all subword tokens
  • max: the most confident prediction among the subword tokens is assigned to all subword tokens
  • average: the average pool of the predictions over all subword tokens is assigned to all subword tokens
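
For illustration only (this is not code from the PR): a minimal sketch of how the three strategies could combine per-subword softmax scores, assuming a NumPy array of shape (num_subwords, num_labels) and hypothetical label columns [O, B-PER, I-PER]:

import numpy as np

# Hypothetical softmax scores for the subwords "Must" and "##erman"
subword_scores = np.array([
    [0.05, 0.05, 0.90],  # "Must"
    [0.10, 0.20, 0.70],  # "##erman"
])

# first: use the first subword's prediction for the whole word
first = subword_scores[0]

# max: use the prediction of the most confident subword
max_strategy = subword_scores[subword_scores.max(axis=1).argmax()]

# average: mean-pool the scores across subwords, then take the argmax
average = subword_scores.mean(axis=0)

for name, scores in [("first", first), ("max", max_strategy), ("average", average)]:
    print(name, scores.argmax(), float(scores.max()))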

What are the expected changes from the current behavior?

The new aggregation_strategy flag enables realignment.
The already existing ignore_subwords flag now actually merges subwords into whole words.

Example use cases enabled by the PR (with code samples)

import transformers

ner = transformers.pipeline(
    'ner',
    model='elastic/distilbert-base-cased-finetuned-conll03-english',
    tokenizer='elastic/distilbert-base-cased-finetuned-conll03-english',
    ignore_labels=[],
    ignore_subwords=False,
    aggregation_strategy='average'
)
ner('Mark Musterman')
[
    {
        'word': 'Mark',
        'score': 0.999686598777771,
        'index': 1,
        'start': 0,
        'end': 4,
        'is_subword': False,
        'entity': 'B-PER'
    },
    {
        'word': 'Must',
        'score': 0.9995412826538086,
        'index': 2,
        'start': 5,
        'end': 9,
        'is_subword': False,
        'entity': 'I-PER'
    },
    {
        'word': '##erman',
        'score': 0.9996127486228943,
        'index': 3,
        'start': 9,
        'end': 14,
        'is_subword': True,
        'entity': 'I-PER'
    }
]
ner = transformers.pipeline(
    'ner',
    model='elastic/distilbert-base-cased-finetuned-conll03-english',
    tokenizer='elastic/distilbert-base-cased-finetuned-conll03-english',
    ignore_labels=[],
    ignore_subwords=True,
    aggregation_strategy='average'
)
ner('Mark Musterman')
[
    {
        'word': 'Mark',
        'score': 0.999686598777771,
        'index': 1,
        'start': 0,
        'end': 4,
        'is_subword': False,
        'entity': 'B-PER'
    },
    {
        'word': 'Musterman',
        'score': 0.9995412826538086,
        'index': 2,
        'start': 5,
        'end': 9,
        'is_subword': False,
        'entity': 'I-PER'
    }
]

Previous use cases whose behavior changes (with code sample)

ner = transformers.pipeline(
    'ner',
    model='elastic/distilbert-base-cased-finetuned-conll03-english',
    tokenizer='elastic/distilbert-base-cased-finetuned-conll03-english',
    ignore_labels=[],
    ignore_subwords=True
)
ner('Mark Musterman')
[
    {
        'word': 'Mark',
        'score': 0.999686598777771,
        'entity': 'B-PER',
        'index': 1,
        'start': 0,
        'end': 4
    },
    {
        'word': 'Must',
        'score': 0.9995412826538086,
        'entity': 'I-PER',
        'index': 2,
        'start': 5,
        'end': 9
    },
    {
        'word': '##erman',
        'score': 0.9996127486228943,
        'entity': 'I-PER',
        'index': 3,
        'start': 9,
        'end': 14
    }
]

Collaborator

@sgugger sgugger left a comment

Thanks for re-opening a clean PR!


ungrouped_inputs_all_scores = [
Collaborator

Should this be in a fixture too?

Contributor Author

Good point! Moved to json fixture along w/ the already existing inputs above this.

@francescorubbo
Contributor Author

@sgugger the failing test (test_gpt2_model_past_large_inputs) seems unrelated to the changes in this PR. Any thoughts on what might be going on and how to resolve it?

@sgugger
Collaborator

sgugger commented May 9, 2021

No, it's just flaky, don't worry!

Member

@LysandreJik LysandreJik left a comment

Thank you for your work! This LGTM.

I'm putting links to relevant comments from the previous PR:

@Narsil could you give it a quick look if you have time? It shouldn't change anything in the current pipeline behavior, but it offers a cleaner AggregationStrategy as an opt-in.

@LysandreJik
Member

Hey @francescorubbo, @Narsil pointed out a few issues with the current implementation that we'll take a look at today/tomorrow. Namely, the code becomes a bit complex as we keep adding features to this pipeline, so it might be time for a slightly larger refactor, and some code is model-specific, such as this line, which wouldn't work on non-BERT-like tokenizers:

subwords[0]["word"] += "".join([sub["word"].split("##")[1] for sub in subwords[1:]])

We're taking a look at what can be done and will come back to you in a bit. Thanks again for your patience.
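
For context only (not part of this PR): one tokenizer-agnostic alternative is to merge subwords by their character offsets in the original text instead of stripping the model-specific "##" prefix. A minimal sketch, assuming token dicts carry 'start'/'end' offsets as in the pipeline output above:

def merge_subwords(original_text, subwords):
    # Merge a word's tokens by slicing the original text between the first
    # token's start offset and the last token's end offset, instead of
    # concatenating pieces and stripping a model-specific "##" prefix.
    merged = dict(subwords[0])
    merged["word"] = original_text[subwords[0]["start"]:subwords[-1]["end"]]
    merged["end"] = subwords[-1]["end"]
    return merged

# Example with the tokens from the output above:
tokens = [
    {"word": "Must", "start": 5, "end": 9},
    {"word": "##erman", "start": 9, "end": 14},
]
print(merge_subwords("Mark Musterman", tokens)["word"])  # Musterman

Slicing the original text avoids any assumption about the tokenizer's subword marker.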

@github-actions
Contributor

github-actions bot commented Jun 6, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Development

Successfully merging this pull request may close these issues.

NER label re-alignment always expects B labelled first sub-words
