Change position_id calculation to fix last token being ignored. #1

Closed · wants to merge 1 commit

Conversation

@cceyda commented Apr 12, 2023

As we discussed on Slack, the current model ignores the last token, as can also be seen in the demo.

We add +2 to the position_ids so that 0 can be used for the [PAD] token. But currently position_id=1 is not used for any token, and the [SEP] token, which is a special token, gets whatever id follows the sentence (so it depends on sentence length). I suspect this somehow confuses the model into always predicting the last token as a non-entity. I suggest using the unused position_id=1 for the [SEP] token.

I trained a model this way and it solved the last-token-ignoring problem (proven empirically 😆 for an ELECTRA model).

In my understanding, position_ids are calculated like this (a small sketch follows the span list below):

  • input tokens are: "[CLS], My, test, sentence, [SEP]" -> 0, 1, 2, 3, 4, to which we add +2, meaning [CLS] always has id 2 and our actual word tokens start from 3.
  • this tracks because the "start_position_ids" for spans are calculated via batch_encoding.word_to_tokens and go 1, 1, 1, 2, 2, 3 -> to which we also add +2, so they become 3, 3, 3, 4, 4, 5.
  • same for the "end_position_ids": 1, 2, 3, 2, 3, 3 -> to which we also add +2, so they become 3, 4, 5, 4, 5, 5.

For reference (spans don't include [CLS] & [SEP]):

spans=
0:(0, 1) # My
1:(0, 2) # My test
2:(0, 3) # My test sentence
3:(1, 2) # test 
4:(1, 3) # test sentence
5:(2, 3) # sentence
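
Here's a small sketch of the calculation above (my reconstruction of the arithmetic for this one example, not the library code; the word-to-token shift is hard-coded because there is one token per word and a single [CLS] in front):

import torch

tokens = ["[CLS]", "My", "test", "sentence", "[SEP]"]
position_ids = torch.arange(len(tokens)) + 2  # tensor([2, 3, 4, 5, 6]): [CLS] -> 2, words start at 3

# spans over the 3 words (no [CLS]/[SEP]), same as the reference list above
spans = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

# reproducing the numbers above: starts shift by 1 for [CLS] and then +2, ends take the exclusive word index and then +2
start_position_ids = [start + 1 + 2 for start, _ in spans]  # [3, 3, 3, 4, 4, 5]
end_position_ids = [end + 2 for _, end in spans]            # [3, 4, 5, 4, 5, 5]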

One could argue that it makes sense for the [SEP] token to have an id that follows the sentence, but I think it just confuses the model. This might also be because the last token never actually has a "span" of its own; it is always just itself ("sentence" in the example above), which then gets affected more by being next to [SEP], a token that is always ignored.

So with this change the position_ids would look like: 2[CLS], 3, 4, 5, 1[SEP], 0[PAD], 0[PAD], 0[PAD], 0[PAD]

I don't know how this should differ across model backbones... maybe some models use more special tokens and the position_ids would need different values. I suggest re-training the demo example before merging and checking whether it works & whether it makes logical sense.

@tomaarsen (Owner)

Hey! Thanks for the PR!
I believe that your understanding of the position_ids is exactly correct.

Just thinking out loud here: before this PR, the [SEP] has a position id of num_tokens + 1 (normally it would have num_tokens - 1, and then we add +2 based on PL-Marker's approach). And this PR overrides the [SEP] position id to be 1 instead. Is that right?

Out of curiosity, have you tried using 0 just like the padding?

I'll definitely retrain the demo model (or a smaller version of it that I can train in a few minutes) to try and figure out if this indeed resolves the issue. I'll likely work on that tonight or tomorrow morning as I'm currently investigating the performance downsides of an approach that can improve training times by ~30%.

  • Tom Aarsen

@tomaarsen added the bug label on Apr 12, 2023
@cceyda (Author) commented Apr 12, 2023

Yes, exactly. Before this PR, the [SEP] has a position id of num_tokens + 1, and this PR overrides the [SEP] position id to be 1 instead.

I haven't tried using 0 just like the padding... which actually makes a lot of sense. I'll try it next time!
I'm still not 100% sure how the [SEP] token's position_id has such an effect on the results. Maybe it is a combined effect with something else that I'm not sure of.

@tomaarsen (Owner)
It definitely could be. I think it's a little confusing that the position ID of a token that gets ignored should affect performance at all. It might be indicative of another problem somewhere. (For example, perhaps the position ID of the [SEP] is accidentally used as the position ID of the last "real" token in some situations instead.)

@tomaarsen (Owner)
Just trained a quick model on CoNLL with this, and it still doesn't work for me locally :(

>>> model.predict("I want to go to Paris.") 
[{'span': 'Paris', 'label': 'LOC', 'score': 0.9978813529014587, 'char_start_index': 16, 'char_end_index': 21}]
>>> model.predict("I want to go to Paris")  
[]

Time to do some more debugging

@tomaarsen (Owner)
This line

position_ids[num_tokens] = 1

Actually turns

2, 3, 4, 5, 6, 7, 8, 9, 0, 0, 0, 0, 0, ...
^                    ^
C                    S
L                    E
S                    P

into

2, 3, 4, 5, 6, 7, 8, 9, 1, 0, 0, 0, 0, ...

in a 6-word setup ("I want to go to Paris"). So it doesn't override the [SEP] position ID; it turns the next one (the first padding position) into a 1 instead of a 0.
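
A minimal reproduction of that indexing (total_size picked arbitrarily just for illustration):

import torch
import torch.nn.functional as F

num_tokens, total_size = 8, 13  # [CLS] + 6 words + [SEP], padded to an arbitrary length

position_ids = torch.arange(num_tokens, dtype=torch.int) + 2
position_ids = F.pad(position_ids, (0, total_size - len(position_ids)), value=0)
print(position_ids.tolist())  # [2, 3, 4, 5, 6, 7, 8, 9, 0, 0, 0, 0, 0]

position_ids[num_tokens] = 1  # index num_tokens is the first padding slot, not [SEP]
print(position_ids.tolist())  # [2, 3, 4, 5, 6, 7, 8, 9, 1, 0, 0, 0, 0]

# What was intended would be position_ids[num_tokens - 1] = 1, which hits [SEP].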

@cceyda (Author) commented Apr 13, 2023

True, it should have been position_ids[num_tokens - 1] = 1 for it to hit the [SEP] token... which makes it that much weirder that it worked. So I checked my code again and I had actually used this:

position_ids = torch.arange(num_tokens - 1, dtype=torch.int) + 2  # note the -1: one position id too few
position_ids = F.pad(position_ids, (0, total_size - len(position_ids)), value=0)  # so [SEP] is left at the pad value 0
position_ids[num_tokens] = 1  # and the slot *after* [SEP] (the first padding position) becomes 1

So I was making [SEP]=0 🤦‍♀️ and adding a 1 after it. Perhaps reaching the right conclusion with the wrong solution 🤣

2[CLS], 3, 4, 5, 6, 7, 8, 0[SEP], 1, 0, 0, 0, 0, ...

I think this means we can just use 0 for the [SEP] token.

@tomaarsen (Owner) commented Apr 13, 2023

I've just tried using 0 for the [SEP] token, but that didn't want to work either. I looked into the logits for Paris in my example, and they're:

tensor([ 4.2867,  4.1023, -1.4353, -1.5833, -3.2249], device='cuda:0',
       grad_fn=<SelectBackward0>)

i.e. index 0 with "Outside" barely wins over index 1 with "Location".
For comparison, the logits for Paris if it's followed by a dot are:

tensor([ 0.1040,  6.9440, -1.7654, -0.5792, -1.1679], device='cuda:0',
       grad_fn=<SelectBackward0>)

Which is "Location" with confidence.

I'm now training a model on a modified version of CoNLL that has all dots at the end of sentences stripped. If that model is able to classify Paris without a dot, then the former models may have just learned that the final token is never an entity.


EDIT: The model still won't classify Paris!
Logits for "Paris" without a dot at the end of the sentence:

tensor([ 9.4466, -1.8438, -2.2387, -2.8667, -3.5171], device='cuda:0',
       grad_fn=<SelectBackward0>)

Clearly "Outside".
Logits for "Paris" with a dot at the end (which it has never encountered in training):

tensor([-1.6024,  7.3148, -1.9134,  0.1508, -1.5686], device='cuda:0',
       grad_fn=<SelectBackward0>)

Clearly "Location".

I'll try your solution too!

@tomaarsen (Owner)
I've narrowed it down 🎉 . It's caused by these two lines:

yield (reduced_label_id, start_idx, idx)

yield (entity_label_id, start_idx, idx)

These "final cases" cover the situation where the last token is a part of the entity. The start and end indices (i.e. the second and third elements in the tuples) should always be at least one apart, so that tokens[start_idx: end_idx] shows the correct span, e.g.:

>>> ['I', 'want', 'to', 'go', 'to', 'Paris'][5:6]
['Paris']

However, with this approach, the idx can be 5 at maximum, when the end index needs to be 6 instead. As a result, the ner tags [0, 0, 0, 0, 0, 1] (where 1 is e.g. B-LOC) are normalized to [(1, 5, 5)], while they should be [(1, 5, 6)]. So the tokenizer, which is responsible for converting the normalized labels (i.e. [(1, 5, 5)] in our case) back to a full list of labels, will produce e.g.:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Instead of

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

In other words, the model will always learn that the very last word is not an entity, even when it actually is. Now, the model may still "break free" from this and output the last word as an entity: after all, it may also have learned that Paris is a very strong clue for "Location", and so sometimes it (almost) predicts it.
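
For clarity, here's a heavily simplified standalone sketch of that normalization (not the actual library code; it treats the tags as flat label ids rather than B-/I- tags), just to show the off-by-one at the final yield:

def normalize(ner_tags):
    # Collapse a flat tag sequence into (label_id, start, end) tuples with an exclusive end index.
    start_idx, current = None, 0
    for idx, label in enumerate(ner_tags):
        if label != current:
            if current != 0:
                yield (current, start_idx, idx)
            start_idx, current = idx, label
    if current != 0:
        # The buggy final case yielded (current, start_idx, idx), i.e. an end index one too
        # small; it has to be len(ner_tags) so that tokens[start_idx:end_idx] includes the last token.
        yield (current, start_idx, len(ner_tags))

print(list(normalize([0, 0, 0, 0, 0, 1])))  # [(1, 5, 6)] -- was [(1, 5, 5)] before the fix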


Proof that the fix works:

from datasets import Dataset
from span_marker import SpanMarkerModel, Trainer
from transformers import TrainingArguments


train_dataset = Dataset.from_dict({
    "tokens": ["Tom wants to go to Paris".split()],
    "ner_tags": [[1, 0, 0, 0, 0, 2]],
})
labels = ["O", "Person", "Location"]

model_name = "prajjwal1/bert-tiny"
model = SpanMarkerModel.from_pretrained(
    model_name,
    labels=labels,
    model_max_length=256,
    marker_max_length=128,
    entity_max_length=4,
)

args = TrainingArguments(
    output_dir="models/last_token_fix",
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=300,
    weight_decay=0.01,
    report_to="none",
    push_to_hub=False,
    logging_first_step=True,
    logging_steps=50,
    bf16=True,
    dataloader_num_workers=0,
    warmup_ratio=0.1,
    seed=42,
    save_total_limit=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,
)

trainer.train()

metrics = trainer.evaluate()
print(metrics)

print(model.predict("Tom wants to go to Paris"))

produces

[{'span': 'Tom', 'label': 'Person', 'score': 0.954353928565979, 'char_start_index': 0, 'char_end_index': 3}, {'span': 'Paris', 'label': 'Location', 'score': 0.9681439399719238, 'char_start_index': 19, 'char_end_index': 24}]

@tomaarsen (Owner)
I'll close this now. Thanks for letting me know about this issue and the help with chasing it down!

@tomaarsen closed this on Apr 13, 2023
@cceyda (Author) commented Apr 14, 2023

🤯 The labels were the first thing I looked at, because it felt like a classic missing-last-yield bug (I can't tell you how many times I've missed processing a last batch that way), but the yield was there and I missed that the idx was off (very nice catch 🦅👀 🥳).
Then I only continued debugging the inference part, which had no mistakes (spans & indexes were all okay). So when the position_id change worked, I assumed it was the model! Lesson learned: when debugging model behaviour you never know!

This makes me wonder why my wrong fix was working for me, and it was working quite correctly too 🤔 I think it was because:
the model learned to predict the wrongly labeled "...ent, no_ent, [SEP], ..." as "...ent, ent, [SEP], ..." when the [SEP] position id was more clearly marked (vs. when it was just num_tokens + 1)... it was influenced more by "token occurrence" to predict the last token as an entity, but not so much that it predicted all non-entity tokens as entities. Example: "Paris." was correctly predicted as "Paris" with "." not being an entity, as you said, because of token occurrence strength.
Also, wouldn't how position_ids are calculated depend on how the base model was pretrained? 🤔 Will need to look more into that.

Well, this was a wild debugging ride & it all makes sense now!
Thank you!

@tomaarsen (Owner) commented Apr 14, 2023

wouldn't how position_ids are calculated depend on how the base model was pretrained

Ideally, yeah. I think that's why PL-Marker does +2 for RoBERTa models.
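
For reference, my understanding of how RoBERTa assigns position ids (a sketch of the idea, not the transformers source): padding positions keep the padding index 1, and real tokens are numbered from 2 onwards, which would line up with that +2 offset.

import torch

def roberta_style_position_ids(input_ids: torch.Tensor, padding_idx: int = 1) -> torch.Tensor:
    # Padding keeps padding_idx; real tokens count up from padding_idx + 1 = 2.
    mask = input_ids.ne(padding_idx).int()
    return torch.cumsum(mask, dim=1) * mask + padding_idx

input_ids = torch.tensor([[0, 100, 200, 300, 2, 1, 1]])  # <s> w1 w2 w3 </s> <pad> <pad>
print(roberta_style_position_ids(input_ids))             # tensor([[2, 3, 4, 5, 6, 1, 1]])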
