Conversation

@francescorubbo
Contributor

What does this PR do?

Fixes #10263, #10763
See also #10568

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@LysandreJik
@sgugger

What capabilities have been added?

Label realignment: token predictions for subwords can be realigned with different strategies (see the sketch below):

  • first (default): the prediction for the first subword token in the word is assigned to all subword tokens
  • max: the most confident prediction among the subword tokens is assigned to all subword tokens
  • average: the average pool of the predictions over all subword tokens is assigned to all subword tokens
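
For illustration only (this is not code from the PR): a minimal sketch of how the three strategies could combine per-subword softmax scores, assuming a NumPy array of shape (num_subwords, num_labels) and hypothetical label columns [O, B-PER, I-PER]:

import numpy as np

# Hypothetical softmax scores for the subwords "Must" and "##erman"
subword_scores = np.array([
    [0.05, 0.05, 0.90],  # "Must"
    [0.10, 0.20, 0.70],  # "##erman"
])

# first: use the first subword's prediction for the whole word
first = subword_scores[0]

# max: use the prediction of the most confident subword
max_strategy = subword_scores[subword_scores.max(axis=1).argmax()]

# average: mean-pool the scores across subwords, then take the argmax
average = subword_scores.mean(axis=0)

for name, scores in [("first", first), ("max", max_strategy), ("average", average)]:
    print(name, scores.argmax(), float(scores.max()))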

What are the expected changes from the current behavior?

The new aggregation_strategy flag enables realignment.
The already existing ignore_subwords flag now actually merges subwords into whole words.

Example use cases enabled by the PR (with code samples)

import transformers

ner = transformers.pipeline(
    'ner',
    model='elastic/distilbert-base-cased-finetuned-conll03-english',
    tokenizer='elastic/distilbert-base-cased-finetuned-conll03-english',
    ignore_labels=[],
    ignore_subwords=False,
    aggregation_strategy='average'
)
ner('Mark Musterman')
[
    {
        'word': 'Mark',
        'score': 0.999686598777771,
        'index': 1,
        'start': 0,
        'end': 4,
        'is_subword': False,
        'entity': 'B-PER'
    },
    {
        'word': 'Must',
        'score': 0.9995412826538086,
        'index': 2,
        'start': 5,
        'end': 9,
        'is_subword': False,
        'entity': 'I-PER'
    },
    {
        'word': '##erman',
        'score': 0.9996127486228943,
        'index': 3,
        'start': 9,
        'end': 14,
        'is_subword': True,
        'entity': 'I-PER'
    }
]
ner = transformers.pipeline(
    'ner',
    model='elastic/distilbert-base-cased-finetuned-conll03-english',
    tokenizer='elastic/distilbert-base-cased-finetuned-conll03-english',
    ignore_labels=[],
    ignore_subwords=True,
    aggregation_strategy='average'
)
ner('Mark Musterman')
[
    {
        'word': 'Mark',
        'score': 0.999686598777771,
        'index': 1,
        'start': 0,
        'end': 4,
        'is_subword': False,
        'entity': 'B-PER'
    },
    {
        'word': 'Musterman',
        'score': 0.9995412826538086,
        'index': 2,
        'start': 5,
        'end': 9,
        'is_subword': False,
        'entity': 'I-PER'
    }
]

Previous use cases whose behavior changes (with code sample)

ner = transformers.pipeline(
    'ner',
    model='elastic/distilbert-base-cased-finetuned-conll03-english',
    tokenizer='elastic/distilbert-base-cased-finetuned-conll03-english',
    ignore_labels=[],
    ignore_subwords=True
)
ner('Mark Musterman')
[
    {
        'word': 'Mark',
        'score': 0.999686598777771,
        'entity': 'B-PER',
        'index': 1,
        'start': 0,
        'end': 4
    },
    {
        'word': 'Must',
        'score': 0.9995412826538086,
        'entity': 'I-PER',
        'index': 2,
        'start': 5,
        'end': 9
    },
    {
        'word': '##erman',
        'score': 0.9996127486228943,
        'entity': 'I-PER',
        'index': 3,
        'start': 9,
        'end': 14
    }
]

Collaborator

@sgugger sgugger left a comment

Thanks for re-opening a clean PR!


ungrouped_inputs_all_scores = [
Collaborator

Should this be in a fixture too?

Contributor Author

Good point! Moved to json fixture along w/ the already existing inputs above this.

@francescorubbo
Contributor Author

@sgugger the failing test (test_gpt2_model_past_large_inputs) seems unrelated to the changes in this PR. Any thoughts on what might be going on and how to resolve it?

@sgugger
Collaborator

sgugger commented May 9, 2021

No, it's just flaky, don't worry!

Member

@LysandreJik LysandreJik left a comment

Thank you for your work! This LGTM.

I'm putting links to relevant comments from the previous PR:

@Narsil could you give it a quick look if you have time? It shouldn't change anything in the current pipeline behavior, but it offers a cleaner AggregationStrategy as an opt-in.

@LysandreJik
Member

Hey @francescorubbo, @Narsil pointed out a few issues with the current implementation that we'll take a look at today/tomorrow. Namely, the code becomes a bit complex as we keep adding features to this pipeline, so it might be time for a slightly larger refactor, and some code is model-specific, such as this line, which wouldn't work on non-BERT-like tokenizers:

subwords[0]["word"] += "".join([sub["word"].split("##")[1] for sub in subwords[1:]])

We're taking a look at what can be done and will come back to you in a bit. Thanks again for your patience.
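
For context only (not part of this PR): one tokenizer-agnostic alternative is to merge subwords by their character offsets in the original text instead of stripping the model-specific "##" prefix. A minimal sketch, assuming token dicts carry 'start'/'end' offsets as in the pipeline output above:

def merge_subwords(original_text, subwords):
    # Merge a word's tokens by slicing the original text between the first
    # token's start offset and the last token's end offset, instead of
    # concatenating pieces and stripping a model-specific "##" prefix.
    merged = dict(subwords[0])
    merged["word"] = original_text[subwords[0]["start"]:subwords[-1]["end"]]
    merged["end"] = subwords[-1]["end"]
    return merged

# Example with the tokens from the output above:
tokens = [
    {"word": "Must", "start": 5, "end": 9},
    {"word": "##erman", "start": 9, "end": 14},
]
print(merge_subwords("Mark Musterman", tokens)["word"])  # Musterman

Slicing the original text avoids any assumption about the tokenizer's subword marker.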

@github-actions
Contributor

github-actions bot commented Jun 6, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Development

Successfully merging this pull request may close these issues.

NER label re-alignment always expects B labelled first sub-words
