This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

[Feature] Add NER suffix feature #1123

Open
wants to merge 1 commit into base: v0.x

Conversation

XINGXIAOYU

Description

Add a parameter "tagging_first_token" so you can choose whether to use the first piece or the last piece of each word. The first piece captures the prefix feature of a word, and the last piece captures the suffix feature.
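
For illustration, a minimal sketch of what "first piece" vs. "last piece" means here, assuming a WordPiece-style tokenizer (the helper name and the tokenization shown are hypothetical, not code from this PR):

def pick_tagged_piece(subword_pieces, tagging_first_token=True):
    """Return the subword piece that carries the word-level tag."""
    return subword_pieces[0] if tagging_first_token else subword_pieces[-1]

pieces = ['fire', '##fighter']            # one word split into two pieces
print(pick_tagged_piece(pieces, True))    # 'fire'       -> prefix feature
print(pick_tagged_piece(pieces, False))   # '##fighter'  -> suffix feature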

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Comments

  • If this change is backward incompatible, explain why it must be made.
  • Interesting edge cases to note here

@XINGXIAOYU XINGXIAOYU requested a review from a team as a code owner January 19, 2020 08:03
@codecov

codecov bot commented Jan 19, 2020

Codecov Report

Merging #1123 into master will decrease coverage by 0.12%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1123      +/-   ##
==========================================
- Coverage   88.34%   88.21%   -0.13%     
==========================================
  Files          66       66              
  Lines        6290     6290              
==========================================
- Hits         5557     5549       -8     
- Misses        733      741       +8
Impacted Files Coverage Δ
src/gluonnlp/model/bert.py 88.06% <0%> (-4.12%) ⬇️
src/gluonnlp/utils/files.py 45.9% <0%> (+3.27%) ⬆️

@mli
Member

mli commented Jan 19, 2020

Job PR-1123/1 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1123/1/index.html

@@ -77,6 +77,8 @@ def parse_args():
help='Learning rate for optimization')
arg_parser.add_argument('--warmup-ratio', type=float, default=0.1,
help='Warmup ratio for learning rate scheduling')
+arg_parser.add_argument('--tagging-first-token', type=str2bool, default=True,
Contributor


How about parser.add_argument('--tag-last-token', action='store_true')? It seems simpler to call finetune_bert.py --tag-last-token than finetune_bert.py --tagging-first-token=False.

In either case, please update the test case in scripts/tests/ to invoke finetune_bert.py with both options. You can parametrize the test following, for example, Haibin's recent PR: https://github.com/dmlc/gluon-nlp/pull/1121/files#diff-fa82d34d543ff657c2fe09553bd0fa34R234
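
For reference, a rough sketch of the suggested flag with a plain argparse parser (apart from --tag-last-token itself, the variable names are illustrative, not taken from finetune_bert.py):

import argparse

parser = argparse.ArgumentParser()
# Boolean switch: omit it to tag the first subword piece (default behaviour),
# pass it to tag the last subword piece instead.
parser.add_argument('--tag-last-token', action='store_true',
                    help='Tag the last subword piece of each word instead of the first.')

args = parser.parse_args(['--tag-last-token'])
tagging_first_token = not args.tag_last_token   # False when the flag is passed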

Author


Sure, I will update it.

@sxjscience
Member

Have you found any performance differences?

@XINGXIAOYU
Author

@sxjscience I've tried the default parameters set in the scripts on the CoNLL-2003 dataset. The performance using the suffix feature is a little lower than with the prefix feature.

@sxjscience
Member

I think we can try the following (a rough sketch follows the list):

  • use the state corresponding to the first subword token
  • use the state corresponding to the last subword token
  • use average pooling over the states of all the subwords
  • use max pooling over the states of all the subwords
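
A rough sketch of those four options on the hidden states of one word's subword pieces, written with NumPy for brevity (illustrative only; the actual script would operate on MXNet NDArrays):

import numpy as np

def word_representation(subword_states, mode='first'):
    """subword_states: (num_pieces, hidden_size) states of one word's pieces."""
    if mode == 'first':
        return subword_states[0]
    if mode == 'last':
        return subword_states[-1]
    if mode == 'avg':
        return subword_states.mean(axis=0)
    if mode == 'max':
        return subword_states.max(axis=0)
    raise ValueError('unknown mode: %s' % mode)

states = np.random.randn(3, 768)               # e.g. a word split into 3 pieces
vec = word_representation(states, mode='avg')  # (768,) word-level vector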

@sxjscience
Member

One problem is that since we are using self-attention, we are able to tailor the attention weights to cover the first, last, and average cases. Thus, I don't think selecting the first/last token will impact the performance much.

@XINGXIAOYU
Author

@sxjscience In a classification task, I think it does not matter. But in a sequence labeling task, each word has one label. If we break the word 'w' into several subwords [sw1, sw2, ...], then only sw1 will have the label, and the labels of the others will be set to NULL. I don't think that makes sense.
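
To make the concern concrete, a hypothetical sketch of how word-level labels are spread over subword pieces, with only the selected piece keeping the real tag and every other piece getting a null tag (NULL_TAG's value and the tokenization below are illustrative stand-ins, not the script's exact code):

NULL_TAG = 'X'   # stand-in for the script's null/padding tag

def expand_tags(word_tags, pieces_per_word, tagging_first_token=True):
    """Give each word's tag to one subword piece; all other pieces get NULL_TAG."""
    subword_tags = []
    for tag, pieces in zip(word_tags, pieces_per_word):
        for i in range(len(pieces)):
            keep = (i == 0) if tagging_first_token else (i == len(pieces) - 1)
            subword_tags.append(tag if keep else NULL_TAG)
    return subword_tags

# 'firefighter' -> ['fire', '##fighter']
print(expand_tags(['B-MISC'], [['fire', '##fighter']], tagging_first_token=True))
# ['B-MISC', 'X']  -- only the first piece carries the label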

@@ -81,7 +85,7 @@ def main(config):
train_config.dropout_prob)

dataset = BERTTaggingDataset(text_vocab, None, None, config.test_path,
- config.seq_len, train_config.cased, tag_vocab=tag_vocab)
+ config.seq_len, train_config.cased, tag_vocab=tag_vocab,tagging_first_token=config.tagging_first_token)
Contributor


Please add a space after the comma.

@sxjscience
Member

sxjscience commented Jan 20, 2020 via email

A reasonable approach is to mask the loss corresponding to the other sub-word tokens and only use the state of the first subword as the contextualized word embedding.

entries.append(PredictedToken(text=text,
true_tag=true_tag, pred_tag=pred_tag))
tmptext = ''
else:
Contributor


Can both cases be merged here? For example, if len(tmptext) == 0, you can still have text = tmptext + token_text which is equivalent to token_text.
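
A tiny illustration of why the two branches are equivalent when tmptext starts out empty (standalone snippet, not the script's actual control flow):

tmptext = ''
token_text = 'fighter'

# With an empty prefix, the concatenation is a no-op...
assert tmptext + token_text == token_text

# ...so one line can serve both the "fresh word" case and the case where
# earlier NULL-tagged pieces have been accumulated into tmptext.
tmptext = 'fire'
text = tmptext + token_text   # 'firefighter'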

true_tag=true_tag, pred_tag=pred_tag))

if true_tag == NULL_TAG:
tmptext += token_text
Contributor


Better to name it tmp_text. Or what about partial_text?

@XINGXIAOYU
Author

@sxjscience Agree with you, and I'll try this method.

@liuzh47
Contributor

liuzh47 commented Jan 20, 2020

A reasonable approach is to mask the loss corresponding to the other sub-word tokens and only use the state of the first subword as the contextualized word embedding.

I am confused about this part. Why is masking the loss of the other sub-word tokens reasonable? For example, in NER tasks, the suffix is much more important than the prefix in words like firefighter.

@sxjscience
Member

A reasonable approach is to mask the loss corresponding to the other sub-word tokens and only use the state of the first subword as the contextualized word embedding.

I am confused about this part. Why is masking the loss of the other sub-word tokens reasonable? For example, in NER tasks, the suffix is much more important than the prefix in words like firefighter.

Since we are using attention, the higher-level state associated with fire has already captured the information about fighter. The suffix would be more appropriate if we were using an LSTM. However, we are using attention, so we are actually attending to the whole input.
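
For concreteness, a sketch of the loss-masking idea under discussion: only positions that carry a real (non-null) word tag contribute to the tagging loss. Written in plain NumPy as an illustration; the name flag_nonnull is hypothetical and this is not the PR's training code.

import numpy as np

def masked_nll(log_probs, labels, flag_nonnull):
    """Average negative log-likelihood over the flagged subword positions only.

    log_probs:    (seq_len, num_tags) per-position log-probabilities
    labels:       (seq_len,) tag ids (arbitrary at masked positions)
    flag_nonnull: (seq_len,) 1 where the piece carries the word's tag, else 0
    """
    picked = log_probs[np.arange(len(labels)), labels]
    mask = flag_nonnull.astype(float)
    return -(picked * mask).sum() / max(mask.sum(), 1.0)

logits = np.random.randn(4, 5)
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
labels = np.array([2, 0, 0, 3])
flag = np.array([1, 0, 0, 1])     # only the first pieces of two words count
loss = masked_nll(log_probs, labels, flag)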

@chenw23
Member

chenw23 commented Jun 19, 2020

@sxjscience Do you think we should continue with this pull request?
If we should, maybe I can take over the remaining work that needs to be done on this pull request.

@szha szha changed the base branch from master to v0.x August 13, 2020 02:17

6 participants