[WIP] Ner pipeline grouped_entities fixes #5970
Conversation
[Bug Fix] two B entities of the same type are not grouped together: (B-type1 B-type1) != (B-type1 I-type1)
[Bug Fix] add an option `ignore_subwords` to ignore subsequent ##wordpieces in predictions, because some models are trained on only the first token of a word and not on the subsequent wordpieces (the BERT NER default). So it makes sense to do the same thing at inference time.
The simplest fix is to group the subwords with the first wordpiece.
[TODO] how to handle ignored scores? Just set them to 0 (or NaN) and calculate a zero-invariant mean? (see the sketch after this list)
[TODO] handle a wordpiece prefix other than ##? Possible approaches:
- get it from the tokenizer? But currently most tokenizers don't have a wordpiece_prefix property.
- have an _is_subword(token) helper.
[Feature add] added a `skip_special_tokens` option, because it is harder to remove special tokens after grouping.
[Additional Changes] remove the B/I prefix on returned grouped_entities
[Feature Request/TODO] return indices?
[Bug TODO] can't use a fast tokenizer with grouped_entities ('BertTokenizerFast' object has no attribute 'convert_tokens_to_string')
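A minimal sketch of the score handling the TODO above asks about, in the spirit of the later edit note (NaN plus nanmean). The entity dicts and the is_subword flag here are illustrative, not the pipeline's actual internals:

    import numpy as np

    # Illustrative ungrouped entities; is_subword marks the ##wordpieces.
    entities = [
        {"word": "New", "score": 0.99, "is_subword": False},
        {"word": "##ark", "score": 0.40, "is_subword": True},
    ]

    # Set ignored subword scores to NaN, then average with nanmean so the
    # ignored positions do not affect the group score.
    scores = np.array([np.nan if e["is_subword"] else e["score"] for e in entities])
    group_score = float(np.nanmean(scores))  # 0.99: only the first wordpiece counts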
Thanks for this @cceyda, this generally looks good to me with just some comments around:
- I think we can hard-code skip_special_tokens=True, similar to the approach in TextGenerationPipeline.
- I'm wondering why we shouldn't group B-type entities? Hoping you can elaborate your rationale here.
- Please make sure to add the test cases to NerPipelineTests in test_pipelines. These should be similar to the test cases you referenced in this PR (that used to fail, but now pass with these changes).
- Please update the tests to account for these changes. The current failures seem to be due to the removal of the prefixes in "entity type".
Thanks!
I'm wondering whether the B & I part should maybe be separated from the entity type part, in the sense that you average the entities (disregarding the B/I part) and vice versa. I now have the feeling that only the first subtoken decides whether the complete word is a B or an I.
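To make the B/I rule under discussion concrete, here is a minimal standalone sketch of BIO grouping; it is not the pipeline's actual implementation. A new group starts on every B- tag or when the entity type changes; I- tags extend the current group:

    tags = ["B-PER", "I-PER", "B-PER", "B-ORG", "I-ORG"]

    groups, current = [], []
    for tag in tags:
        prefix, _, etype = tag.partition("-")
        if prefix == "B" or not current or current[-1][1] != etype:
            # B- always opens a new group; so does a type change.
            if current:
                groups.append(current)
            current = [(prefix, etype)]
        else:
            current.append((prefix, etype))
    if current:
        groups.append(current)

    # groups -> [[('B','PER'), ('I','PER')], [('B','PER')], [('B','ORG'), ('I','ORG')]]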
I want to complete this but ran into another issue while working on it: all [UNK] tokens get mapped to [UNK] in the output instead of the actual input token (because the code goes from ids to tokens). Also, [UNK]s get lost when using skip_special_tokens (#6863).
Dear @cceyda, In the last couple of days I started working with Huggingface's transformers, especially NER classification. I ran into issues that have been previously addressed in the other issues you mentioned at the beginning, especially that subtokens classified as 'O' were not properly merged with the full token. For example (Dutch), the grouped entities I get differ from what I expect, because the considered subtokens are classified as 'O': {'word': '[CLS]', 'score': 0.9999999403953552, 'entity': 'O', 'index': 0}. I believe your pull request addresses these issues properly. I was wondering whether there is still the intention to solve them. Disclaimer: I am a total newbie to git (just set up an account), so please be mild, haha. Thank you in advance, Monique
@cceyda I actually want this PR to move forward. Are you okay with collaborating on your fork (you can add me as a collaborator)? I can help out with some of the failing issues so we can get this merged 😄
@enzoampil I have added you as a collaborator.
It is not optimal, but it worked for my use cases.
I have changed the … And honestly, I don't know why subwords shouldn't be ignored in most cases (unless there is a need for some custom logic that determines a word's tag, e.g. by averaging the wordpieces, in which case grouped_entities shouldn't be used 🤔). [todo]
Codecov Report
@@ Coverage Diff @@
## master #5970 +/- ##
===========================================
+ Coverage 52.05% 78.36% +26.30%
===========================================
Files 236 168 -68
Lines 43336 32338 -10998
===========================================
+ Hits 22560 25341 +2781
+ Misses 20776 6997 -13779
Continue to review full report at Codecov.
Dear @cceyda, The last two days I worked on your branch to see how it performs on my own input texts. When I use the following line of code (as you suggest under 'Usage' above): pipeline('ner', model=model, tokenizer=tokenizer, ignore_labels=[], grouped_entities=True, skip_special_tokens=True, ignore_subwords=True) I get the error: TypeError: __init__() got an unexpected keyword argument 'skip_special_tokens'. Looking in transformers.pipelines, specifically at the TokenClassificationPipeline, it seems that it is not yet implemented. Or am I missing something? Best, Monique
@Monique497 sorry for the delay. skip_special_tokens is now the default behavior, so you no longer pass it:

    from transformers import (
        AutoModelForTokenClassification,
        AutoTokenizer,
        pipeline,
    )

    model = AutoModelForTokenClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)  # note the fast tokenizer use

    # ignore_subwords=True by default
    nlp = pipeline("ner", model=model, tokenizer=tokenizer, grouped_entities=True)
    inputs = "test sentence"
    output = nlp(inputs)

    # if you are not using a fast tokenizer, you can pass an offset mapping yourself:
    nlp(inputs, offset_mapping=mappings_you_calculate)

For custom tokenizers with a non-standard subword prefix (e.g. %%), something like this:

    def sub_fn(token):
        # treat tokens with the custom prefix as subwords
        return token.startswith("%%")

    tokenizer.is_subword_fn = sub_fn

    def convert_tokens_to_string(self, tokens):
        out_string = " ".join(tokens).replace(" %%", "").strip()
        return out_string

    tokenizer.convert_tokens_to_string = convert_tokens_to_string

@enzoampil what are your thoughts on this?
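For completeness, one way to compute character offsets with a fast tokenizer; mappings_you_calculate above is a placeholder, and the exact format the pipeline expects may differ:

    # Each entry of offset_mapping is a (start, end) character index pair
    # into the original input string; only fast tokenizers support this.
    enc = tokenizer("test sentence", return_offsets_mapping=True)
    mappings_you_calculate = enc["offset_mapping"]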
tests/test_pipelines.py (outdated):

    for ungrouped_input, grouped_result in zip(ungrouped_ner_inputs, expected_grouped_ner_results):
        self.assertEqual(nlp.group_entities(ungrouped_input), grouped_result)
    if nlp.grouped_entities:
Conditioned so that the grouped_entities=False tests won't fail because of the grouped_entities=True cases.
LGTM
tests/test_pipelines.py (outdated):

    for ungrouped_input, grouped_result in zip(ungrouped_ner_inputs, expected_grouped_ner_results):
        self.assertEqual(nlp.group_entities(ungrouped_input), grouped_result)
    if nlp.grouped_entities:
        if nlp.ignore_subwords:
Added cases for ignore_subwords=True and False.
What else can I do to get this PR merged? It has been a while.
tests/test_pipelines.py (outdated):

    - {"entity_group": "B-PER", "score": 0.9997273534536362, "word": "Andrés Pastrana"},
    - {"entity_group": "B-ORG", "score": 0.8589080572128296, "word": "Farc"},
    + {"entity_group": "PER", "score": 0.999369223912557, "word": "Consuelo Araújo Noguera"},
    + {"entity_group": "PER", "score": 0.9997771680355072, "word": "Andrés Pastrana"},
A test for hyphenated names (e.g. Juanita Gomez-Cortez) would be useful, especially given that the fast and slow tokenizers have different code paths for reconstructing the original text. I had to implement grouping of named entities myself recently and was tripped up by that corner case.
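A hedged sketch of what such a test case might look like; the sentence and the expected output shape are hypothetical:

    # Hypothetical corner case: a hyphenated name should stay in one group.
    output = nlp("Juanita Gomez-Cortez lives in Boston.")
    # expected shape, roughly:
    # [{"entity_group": "PER", "word": "Juanita Gomez-Cortez", ...}, ...]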
    corresponding token in the sentence.
    """
    inputs = self._args_parser(*args, **kwargs)
Changed the is_subword detection logic: it now compares the length of the token (##token) with the length of the original text span from the offset mapping (assuming subword pieces get prefixed by something).
In case the user wants some other logic, they can first get the ungrouped entities, add an is_subword: bool field to each entity, and call pipeline.group_entities themselves (see the sketch after the diff below).
    word_ref = sentence[start_ind:end_ind]
    word = self.tokenizer.convert_ids_to_tokens([int(input_ids[idx])])[0]
    is_subword = len(word_ref) != len(word)
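A minimal sketch of that custom route, assuming the entity fields described in this thread ('word', plus a user-added is_subword flag); nlp.group_entities is the method the tests above call:

    # Run the pipeline ungrouped, attach your own is_subword flag,
    # then group the entities yourself.
    nlp = pipeline("ner", model=model, tokenizer=tokenizer)
    entities = nlp("some sentence")
    for e in entities:
        e["is_subword"] = e["word"].startswith("##")  # your own detection logic here
    grouped = nlp.group_entities(entities)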
fixed as suggested! I agree it is much cleaner this way. Umm it looks like fast tokenizers have a convert_tokens_to_string method now? 😕
src/transformers/pipelines.py (outdated):

    task identifier: :obj:`"sentiment-analysis"` (for classifying sequences according to positive or negative
    sentiments).
    If multiple classification labels are available (:obj:`model.config.num_labels >= 2`), the pipeline will run
Added this to check offset_mapping if provided (it does a simple batch_size check).
    for model_name in self.small_models:
        nlp = pipeline(
    -       task="ner", model=model_name, tokenizer=model_name, grouped_entities=True, ignore_subwords=False
    +       task="ner", model=model_name, tokenizer=tokenizer, grouped_entities=True, ignore_subwords=True
I don't know why the tests are failing; I get normal results when running them outside of the test suite 😕. Edit: I was just being careless 🤦. I should still add cases for non-fast tokenizers.
Thanks for iterating! I'll check this today.
LysandreJik
left a comment
Played around with it, works well, and the implementation seems robust. Looks good, LGTM! Thanks for iterating.
@stefan-it, do you want to take a quick look?
Merging this as soon as it's green, thank you for iterating on the PR! Sorry this took so long to merge.
Thanks @LysandreJik and congrats @cceyda !! 😄
FYI this broke the NER pipeline:

    from transformers import pipeline

    nlp = pipeline("ner")
    nlp("My name is Alex and I live in New York")

crashes with an error. Trying to see if this can be quickly patched, otherwise we'll revert the PR while we patch this.
Oops! Although returning UNK tokens with slow tokenizers is not the best, I agree that not forcing a fast tokenizer, with a default of ignore_subwords=True, looks better for keeping compatibility. I saw a bit late that the _args_parser line was mis-merged during this PR's merge, and I see it is fixed/improved in the patch. I wasn't sure how to test the offset_mapping argument with the new test structure (which looks good in the patch). Sorry for the trouble 😅 @LysandreJik
No worries, thanks for taking a look at the patch!
There are many issues with the NER pipeline when using grouped_entities=True:
#5077
#4816
#5730
#5609
#6514
#5541
Edit: ignored subwords' scores are also ignored, by setting them to NaN and using nanmean
Edit: B entities of different types are separated (as per the BIO tag definition)
Edit: skip_special_tokens is now the default behavior
Edit: ignore_subwords is now the default behavior
Edit: more flexibility for custom non-standard tokenizers through tokenizer.is_subword_fn and tokenizer.convert_tokens_to_string
Edit: fixed UNK-token-related bugs by mapping UNK tokens to the correct original string; use a fast tokenizer or pass offset_mapping
Usage
pipeline('ner', model=model, tokenizer=tokenizer, ignore_labels=[], grouped_entities=True, ignore_subwords=True)