
@elk-cloner elk-cloner commented Mar 6, 2021

What does this PR do?

Fixes #10263


What capabilities have been added?

label realignment: token predictions for subwords can be realigned with 4 different strategies

  • default: reset all subword token predictions except for the first token
  • first: the prediction for the first token in the word is assigned to all subword tokens
  • max: the highest-confidence prediction among the subword tokens is assigned to all subword tokens
  • average: the average pool of the predictions over all subwords is assigned to all subword tokens

ignore subwords: separately, subwords can be ignored by merging subword tokens into whole words
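As a rough illustration of the four strategies, the per-subword score matrix of a single word could be realigned as sketched below (a standalone sketch, not the PR's implementation; the helper name and array shapes are assumptions):

```python
import numpy as np

def realign_word_scores(scores: np.ndarray, strategy: str) -> np.ndarray:
    """Return one realigned score row per subword of a single word.

    scores: (n_subwords, n_labels) softmax scores for the word's subwords.
    """
    if strategy == "default":
        # reset all subword predictions except for the first token
        out = np.zeros_like(scores)
        out[0] = scores[0]
        return out
    if strategy == "first":
        # assign the first subword's prediction to all subword tokens
        return np.tile(scores[0], (scores.shape[0], 1))
    if strategy == "max":
        # assign the highest-confidence subword prediction to all subword tokens
        best = scores.max(axis=-1).argmax()
        return np.tile(scores[best], (scores.shape[0], 1))
    if strategy == "average":
        # assign the average pool of all subword predictions to all subword tokens
        return np.tile(scores.mean(axis=0), (scores.shape[0], 1))
    raise ValueError(f"Unknown strategy: {strategy}")
```

After realignment, taking `argmax` per row yields one consistent label per word, which is what the merging step relies on.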

What are the expected changes from the current behavior?

  • New flag subword_label_re_alignment enables realignment.
  • Already existing flag ignore_subwords actually enables merging subwords.

Example use cases with code sample enabled by the PR

ner = transformers.pipeline(
    'ner',
    model='elastic/distilbert-base-cased-finetuned-conll03-english',
    tokenizer='elastic/distilbert-base-cased-finetuned-conll03-english',
    ignore_labels=[],
    ignore_subwords=False,
    subword_label_re_alignment='average'
)
ner('Mark Musterman')
[
    {
        'word': 'Mark',
        'score': 0.999686598777771,
        'index': 1,
        'start': 0,
        'end': 4,
        'is_subword': False,
        'entity': 'B-PER'
    },
    {
        'word': 'Must',
        'score': 0.9995412826538086,
        'index': 2,
        'start': 5,
        'end': 9,
        'is_subword': False,
        'entity': 'I-PER'
    },
    {
        'word': '##erman',
        'score': 0.9996127486228943,
        'index': 3,
        'start': 9,
        'end': 14,
        'is_subword': True,
        'entity': 'I-PER'
    }
]
ner = transformers.pipeline(
    'ner',
    model='elastic/distilbert-base-cased-finetuned-conll03-english',
    tokenizer='elastic/distilbert-base-cased-finetuned-conll03-english',
    ignore_labels=[],
    ignore_subwords=True,
    subword_label_re_alignment='average'
)
ner('Mark Musterman')
[
    {
        'word': 'Mark',
        'score': 0.999686598777771,
        'index': 1,
        'start': 0,
        'end': 4,
        'is_subword': False,
        'entity': 'B-PER'
    },
    {
        'word': 'Musterman',
        'score': 0.9995412826538086,
        'index': 2,
        'start': 5,
        'end': 9,
        'is_subword': False,
        'entity': 'I-PER'
    }
]

Previous use cases with code sample that see the behavior changes

ner = transformers.pipeline(
    'ner',
    model='elastic/distilbert-base-cased-finetuned-conll03-english',
    tokenizer='elastic/distilbert-base-cased-finetuned-conll03-english',
    ignore_labels=[],
    ignore_subwords=True
)
ner('Mark Musterman')
[
    {
        'word': 'Mark',
        'score': 0.999686598777771,
        'entity': 'B-PER',
        'index': 1,
        'start': 0,
        'end': 4
    },
    {
        'word': 'Must',
        'score': 0.9995412826538086,
        'entity': 'I-PER',
        'index': 2,
        'start': 5,
        'end': 9
    },
    {
        'word': '##erman',
        'score': 0.9996127486228943,
        'entity': 'I-PER',
        'index': 3,
        'start': 9,
        'end': 14
    }
]

input_ids = tokens["input_ids"].cpu().numpy()[0]

score = np.exp(entities) / np.exp(entities).sum(-1, keepdims=True)
labels_idx = score.argmax(axis=-1)
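The softmax in the snippet above can be checked on a toy logit matrix (illustrative values only; the variant here subtracts the row max first, a standard trick for numerical stability that does not change the result):

```python
import numpy as np

# toy logits for 2 tokens over 3 labels (illustrative values only)
entities = np.array([[2.0, 1.0, 0.1],
                     [0.5, 2.5, 0.5]])

# softmax as in the snippet above, shifted by the row max for numerical stability
shifted = entities - entities.max(-1, keepdims=True)
score = np.exp(shifted) / np.exp(shifted).sum(-1, keepdims=True)

# each row now sums to 1; argmax picks the most likely label per token,
# while the full score distribution is kept for the "average" strategy
labels_idx = score.argmax(axis=-1)
```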
Contributor Author

Because we are going to set the labels according to the strategy, we need the scores for all labels, especially when using the "average" strategy.

(idx, label_idx)
for idx, label_idx in enumerate(labels_idx)
if (self.model.config.id2label[label_idx] not in self.ignore_labels) and not special_tokens_mask[idx]
idx for idx in range(score.shape[0]) if not special_tokens_mask[idx]
Contributor Author

At this step we can only filter on special_tokens_mask, because we don't yet have the labels for the other tokens.

"word": word,
"score": score[idx][label_idx].item(),
"entity": self.model.config.id2label[label_idx],
"score": score[idx],
Contributor Author

We need the scores for all labels.


entities += [entity]

if self.subword_label_re_alignment:
Contributor Author

We are going to set the labels according to the strategy; if subword_label_re_alignment == False, we leave the labels as they were predicted.


def sub_words_label(sub_words: List[dict]) -> dict:
score = np.stack([sub["score"] for sub in sub_words])
if strategy == "default":
Contributor Author

as @joshdevins said: "If training with padded sub-words/label for first sub-word only, e.g. Max Mustermann → Max Must ##erman ##n → B-PER I-PER X X
Use the label from the first sub-word (default)"

task: str = "",
grouped_entities: bool = False,
subword_label_re_alignment: Union[bool, str] = False,
ignore_subwords: bool = False,
Contributor

Should the ignore_subwords flag be removed then?

Contributor Author

I think so; we should remove the ignore_subwords flag. @LysandreJik, @Narsil, I left some of the old code in just to pass the tests (😅 I'm new to tests). Can you help me with the tests? (For example, should I remove this test or change it somehow?)

Contributor

Yeah, I would think the test can be repurposed for the new flag. It would also be good to assert correctness besides execution (it looks like the current test doesn't check the resulting output?). I'm happy to contribute directly to your branch, if it helps. Let me know.

Member

It would be nice to keep it for backwards compatibility purposes. Can the capabilities enabled by that flag be achieved with the new flag introduced in this PR?

ignore_labels=["O"],
task: str = "",
grouped_entities: bool = False,
subword_label_re_alignment: Union[bool, str] = False,
Contributor

I wonder if aggregate_subwords would be a more suitable name?

Member

I would understand aggregate_subwords better than subword_label_re_alignment

Member

Or would aggregate_strategy be even better, as we're actually prompting for a strategy?

Member

Having this accept enum parameters as value would be great, similar to what we do with PaddingStrategy:

class PaddingStrategy(ExplicitEnum):
"""
Possible values for the ``padding`` argument in :meth:`PreTrainedTokenizerBase.__call__`. Useful for tab-completion
in an IDE.
"""
LONGEST = "longest"
MAX_LENGTH = "max_length"
DO_NOT_PAD = "do_not_pad"
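Following the PaddingStrategy pattern, the strategy argument could be exposed as an enum. The sketch below uses the stdlib enum module for self-containment (transformers' ExplicitEnum mainly adds a friendlier error message); the name AggregationStrategy and its members are only the suggestion from this thread, not settled API:

```python
from enum import Enum

class AggregationStrategy(str, Enum):
    """Possible values for the subword aggregation strategy.
    Useful for tab-completion in an IDE."""
    DEFAULT = "default"
    FIRST = "first"
    MAX = "max"
    AVERAGE = "average"

# a plain string can be cast into the corresponding member,
# mirroring how the padding argument accepts both forms
strategy = AggregationStrategy("average")
```

Subclassing str keeps equality with plain strings working, so call sites that pass `"average"` and call sites that pass `AggregationStrategy.AVERAGE` behave the same.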

def set_subwords_label(self, entities: List[dict], strategy: str) -> dict:
def sub_words_label(sub_words: List[dict]) -> dict:
score = np.stack([sub["score"] for sub in sub_words])
if strategy == "default":
Contributor

What happens if strategy is set to True?

Contributor Author

Sorry, my bad, you're right. When strategy is True we should fall back to the "default" strategy. I'll fix this.

@francescorubbo (Contributor)

Thank you for addressing this! I left some minor comments/questions.

@elk-cloner elk-cloner left a comment

@francescorubbo it would be great if you could fix the tests; let me know if you have any questions about my code.


Ensure existing behavior for `ignore_subwords` and `grouped_entities`
arguments is preserved for backward compatibility.
elk-cloner and others added 4 commits March 21, 2021 09:39
Restore compatibility with existing NER pipeline tests
The refactor addresses bugs for corner cases uncovered when testing each
scenario of label re-alignment with or without ignore_subwords.
Refactor label re-alignment in NER pipeline and add tests
@francescorubbo (Contributor)

@LysandreJik I think this is ready for review now.

@LysandreJik (Member)

Hey @elk-cloner, @francescorubbo! That's amazing work you've done here. The added tests are a wonderful addition and will ensure the pipeline is as robust as it can be.

To make reviews easier, could you please fill in the PR description or add a comment mentioning the changes? For example:

  • What capabilities have been added
  • What are the expected changes from the current behavior

And optionally, if you have the time to:

  • Example use cases with code sample enabled by the PR
  • Previous use cases with code sample that see the behavior changes

If you don't have time to do any of that, that's perfectly fine - just let me know and I'll take care of it as soon as I have a bit of availability.

Thanks again for the great work you've done here!

@elk-cloner elk-cloner closed this Mar 30, 2021
@elk-cloner elk-cloner reopened this Mar 30, 2021

joshdevins commented Mar 30, 2021

This looks good. I'm wondering if you can add some tests to verify the expected behaviour of two other scenarios from the bug report.

Specifically, the tests in the PR seem to ensure:
Accenture → A ##cc ##ent ##ure → B-ORG O O O → Accenture (ORG)

...but does not make assertions for mixed B/I/O labels in the same word:
Max Mustermann → Max Must ##erman ##n → B-PER I-PER I-PER O → Max Mustermann (PER)

...or inner entity labels surrounded by O labels:
Elasticsearch → El ##astic ##sea ##rch → O O I-MISC O → Elasticsearch (MISC)
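The three scenarios above can be captured as pure-function test cases against a word-level resolver. The helper below is only a hypothetical sketch, not the PR's code; it keeps the first non-O entity type seen among a word's subwords, which is enough to satisfy all three expected outcomes:

```python
from typing import List, Optional

def resolve_word_entity(subword_labels: List[str]) -> Optional[str]:
    """Collapse per-subword BIO labels into one entity type for the word,
    or None if every subword is tagged O."""
    for label in subword_labels:
        if label != "O":
            return label.split("-", 1)[1]  # strip the B-/I- prefix
    return None

# the three scenarios from the bug report
assert resolve_word_entity(["B-ORG", "O", "O", "O"]) == "ORG"          # Accenture
assert resolve_word_entity(["B-PER", "I-PER", "I-PER", "O"]) == "PER"  # Mustermann
assert resolve_word_entity(["O", "O", "I-MISC", "O"]) == "MISC"        # Elasticsearch
```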


francescorubbo commented Apr 4, 2021

@joshdevins Thank you for suggesting to test those additional scenarios. Testing for those helped me identify some bugs in the previous implementation. I believe the new test should cover all three scenarios now.


francescorubbo commented Apr 4, 2021

@LysandreJik I'll add the requested notes here, as I don't seem to have permissions to edit the PR description. Maybe @elk-cloner can transfer some of the info there.


@elk-cloner (Contributor, Author)

Thank you, @francescorubbo, I added them to the PR.

Hamel Husain and others added 2 commits April 27, 2021 10:04
* finish quicktour

* fix import

* fix print

* explain config default better

* Update docs/source/quicktour.rst

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>
* fix docs for decoder_input_ids

* revert the changes for bart and mbart

francescorubbo commented Apr 28, 2021

There was a new release of the black library which touched a lot of files, so you will need to rebase your PR on master to have the quality tests pass again.

I did merge master (see 031f3ef). Shouldn't it address that?


cceyda commented Apr 28, 2021

I think originally there was also mention of saving the aggregation_strategy to the model config?
since it makes the most sense to use the same strategy the model was trained on, ignoring subwords or else.

@joshdevins (Contributor)

I think originally there was also mention of saving the aggregation_strategy to the model config?
since it makes the most sense to use the same strategy the model was trained on, ignoring subwords or else.

@cceyda Yes, this was my original proposal, but I think it might be too much for one PR. I would not close the original issue (#10263) until the other items are addressed, but perhaps a new/smaller PR can address saving the strategy used at training/evaluation time to the model config file.

sgugger and others added 19 commits April 28, 2021 09:10
* Update min versions in README and add Flax

* Adapt index
…xt_pair` parameter (huggingface#11486)

* Update tokenization_utils_base.py

* add assertion

* check batch len

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <[email protected]>

* add error message

Co-authored-by: Sylvain Gugger <[email protected]>
Ensure existing behavior for `ignore_subwords` and `grouped_entities`
arguments is preserved for backward compatibility.
The refactor addresses bugs for corner cases uncovered when testing each
scenario of label re-alignment with or without ignore_subwords.
Subwords can be skipped independently of label realignment.
The `aggregation_strategy` argument can be either string
or an AggregationStrategy enum. If a string, we attempt to cast
into the corresponding AggregationStrategy enum.
Given that the label realignment is now only applied
when subwords are ignored, the default strategy does
not need to reset the score for all subwords.
@francescorubbo (Contributor)

ugh...this ^ is why I hate rebasing on big project repos...
@sgugger from a cursory look the 215 (!) file diffs look legit, please let me know if this PR needs any more work before you can merge.

@francescorubbo (Contributor)

@LysandreJik @sgugger Is there more work needed for this PR? If the rebase is an issue, I can create a new PR with only the relevant changes, but we would lose the commit history.


sgugger commented May 6, 2021

We can't see the diff of the PR anymore after the rebase, so you should close this one and open a new one from the same branch please. (GitHub completely sucks at properly showing rebases, unless you force push after the rebase.)

Linked issue: NER label re-alignment always expects B labelled first sub-words (#10263)