Change position_id calculation to fix last token being ignored. #1
Conversation
Hey! Thanks for the PR! Just thinking out loud here: the [SEP] before this PR has a position id that simply follows the sentence (so it depends on sentence length). Out of curiosity, have you tried using 0, just like the padding? I'll definitely retrain the demo model (or a smaller version of it that I can train in a few minutes) to try and figure out whether this indeed resolves the issue. I'll likely work on that tonight or tomorrow morning, as I'm currently investigating the performance downsides of an approach that can improve training times by ~30%.
|
Yes, exactly. The [SEP] before this PR has a position id that follows the sentence. I haven't tried using 0 just like the padding... which actually makes a lot of sense. I'll try it next time! |
It definitely could be. I think it's a little confusing that the position ID of a token that gets ignored should affect performance at all. Might be indicative of another problem somewhere. (For example, perhaps the position ID of the [SEP] is accidentally used as the position ID of the last "real" token in some situations instead.) |
Just trained a quick model with CoNLL with this, it still doesn't work for me locally :(
>>> model.predict("I want to go to Paris.")
[{'span': 'Paris', 'label': 'LOC', 'score': 0.9978813529014587, 'char_start_index': 16, 'char_end_index': 21}]
>>> model.predict("I want to go to Paris")
[]
Time to do some more debugging |
This line: position_ids[num_tokens] = 1 doesn't actually override the [SEP] position ID in a 6-token setup ("I want to go to Paris"); it turns the next one (right after the [SEP]) into a 1 instead of a 0. |
True, it should have been the [SEP] index instead. So I was making sep=0 🤦♀️ and adding a 1 after it; perhaps reaching the right conclusion with the wrong solution 🤣
I think this means we can just use 0 for the sep token |
I've just tried using 0 for the [SEP] token, but that didn't work either. I looked into the logits for "Paris" in "I want to go to Paris": index 0 ("Outside") barely wins over index 1 ("Location"). For "I want to go to Paris." it is "Location" with confidence. I'm now training a model on a modified version of CoNLL that has all dots at the end of sentences stripped. If that model is able to classify Paris without a dot, then the former models may have just learned that the final token is never an entity. EDIT: The model still won't classify Paris!
Clearly "Outside".
Clearly "Position". I'll try your solution too! |
I've narrowed it down 🎉. It's caused by these two lines:
These "final cases" cover the situation where the last token is a part of the entity. The start and end indices (i.e. the second and third elements in the tuples) should always be at least one apart, so that tokens[start_idx: end_idx] shows the correct span, e.g.:
>>> ['I', 'want', 'to', 'go', 'to', 'Paris'][5:6]
['Paris']
However, with this approach, the idx can be 5 at maximum, when the end index needs to be able to reach 6 instead. As a result, the NER tags end up marking the last word as outside any entity, instead of with its true label.
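For illustration, here is a minimal sketch of that off-by-one (not the actual lines from the repository, just the pattern being described, with hypothetical variable names):

# The buggy enumeration caps the (exclusive) end index at len(tokens) - 1,
# so a span that includes the final token can never be produced.
tokens = ["I", "want", "to", "go", "to", "Paris"]
buggy_spans = [(s, e) for s in range(len(tokens)) for e in range(s + 1, len(tokens))]
fixed_spans = [(s, e) for s in range(len(tokens)) for e in range(s + 1, len(tokens) + 1)]
print((5, 6) in buggy_spans)  # False: "Paris" can never be labeled as a span
print((5, 6) in fixed_spans)  # True: tokens[5:6] == ['Paris']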
In other words, it'll always learn that the very last word is never an entity, even if it actually is one. Now, the model may still "break free" from this and output the last word as an entity. After all, it may have also learned that Paris is a very strong clue for "Location", and so it sometimes (almost) predicts it. Proof that the fix works:
from datasets import Dataset
from span_marker import SpanMarkerModel, Trainer
from transformers import TrainingArguments
train_dataset = Dataset.from_dict({
"tokens": ["Tom wants to go to Paris".split()],
"ner_tags": [[1, 0, 0, 0, 0, 2]],
})
labels = ["O", "Person", "Location"]
model_name = "prajjwal1/bert-tiny"
model = SpanMarkerModel.from_pretrained(
model_name,
labels=labels,
model_max_length=256,
marker_max_length=128,
entity_max_length=4,
)
args = TrainingArguments(
output_dir="models/last_token_fix",
learning_rate=5e-5,
per_device_train_batch_size=32,
per_device_eval_batch_size=32,
num_train_epochs=300,
weight_decay=0.01,
report_to="none",
push_to_hub=False,
logging_first_step=True,
logging_steps=50,
bf16=True,
dataloader_num_workers=0,
warmup_ratio=0.1,
seed=42,
save_total_limit=3,
)
trainer = Trainer(
model=model,
args=args,
train_dataset=train_dataset,
eval_dataset=train_dataset,
)
trainer.train()
metrics = trainer.evaluate()
print(metrics)
print(model.predict("Tom wants to go to Paris"))
produces
|
I'll close this now. Thanks for letting me know about this issue and for the help with chasing it down! |
🤯 Labels were the first thing I looked at, because it felt like a classic missing-last-yield bug (can't tell you how many times I've missed processing a last batch that way), but the real culprit was elsewhere. This brings me to wonder why my wrong fix was working for me, and it was working very consistently too 🤔 I think it was because:
Well, this was a wild debugging ride. It all makes sense now! |
Ideally, yeah. I think that's why PL-Marker does +2 for RoBERTa models. |
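For context, RoBERTa in Hugging Face transformers derives position ids from the padding index, so real tokens are numbered starting at padding_idx + 1 = 2. A small sketch in plain PyTorch that mirrors that rule (not the library internals):

import torch

# RoBERTa-style position ids: pads keep padding_idx (1), real tokens are
# numbered cumulatively starting at padding_idx + 1 = 2.
def roberta_style_position_ids(input_ids, padding_idx=1):
    mask = input_ids.ne(padding_idx).int()
    return torch.cumsum(mask, dim=1) * mask + padding_idx

input_ids = torch.tensor([[0, 31414, 232, 2, 1, 1]])  # <s> Hello world </s> <pad> <pad>
print(roberta_style_position_ids(input_ids))  # -> [[2, 3, 4, 5, 1, 1]]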
So, as we talked about on Slack, the current model ignores the last token, as also seen in the demo.
We add +2 to position_ids, which is done so we can use 0 for the [PAD] token. But currently we are not using position_id = 1 for any token, and the [SEP] token, which is a special token, gets whatever id follows the sentence (depending on sentence length). I guess this somehow confuses the model into always guessing that the last token is a non-entity. I suggest using this unused position_id = 1 for the [SEP] token. I trained a model this way and it solved the last-token-ignoring problem (proven empirically 😆 for an Electra model).
In my understanding, position_ids are calculated from batch_encoding.word_to_tokens, so the word indices go like 1,1,1,2,2,3, to which we also add +2 so it becomes 3,3,3,4,4,5 (for reference, the spans don't include [CLS] and [SEP]).
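A rough sketch of how that mapping could be computed with a Hugging Face fast tokenizer (illustrative only; the tokenizer choice and the loop are my assumptions, not the library's code):

from transformers import AutoTokenizer

# Every subword token gets the 1-based index of the word it belongs to,
# plus the +2 offset, so that 0 stays free for [PAD].
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # assumed backbone
words = "this is a sentence".split()
encoding = tokenizer(words, is_split_into_words=True, add_special_tokens=False)
position_ids = [0] * len(encoding["input_ids"])
for word_index, _ in enumerate(words):
    span = encoding.word_to_tokens(word_index)  # token span covered by this word
    for token_index in range(span.start, span.end):
        position_ids[token_index] = word_index + 1 + 2  # 1-based word id, then +2
print(position_ids)  # e.g. [3, 4, 5, 6] when every word is a single token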
One can argue that the [SEP] token having an id that follows the sentence makes sense, but I think it just confuses the model. This might also be because the last token never actually has a "span": it is always just itself ("sentence" in the example above), which then gets affected more by being next to an always-ignored token, [SEP].
So with this change, position_ids would look like: 2 [CLS], 3, 4, 5, 1 [SEP], 0 [PAD], 0 [PAD], 0 [PAD], 0 [PAD]. I don't know how this should differ for different model backbones... maybe some models use more special tokens and the position_ids would need different values. I suggest re-training the demo example and seeing whether it works before merging, and whether it makes logical sense.