
Data Collator Incorrect When Using a Decoder Prefix #96

Open
seanlgoldberg opened this issue Aug 8, 2023 · 0 comments
https://github.com/microsoft/i-Code/blob/main/i-Code-Doc/core/trainers/data_collator.py

Hello!

In the __call__ of the DataCollator class, the maximum decoder feature length is determined by the maximum size of the labels:

max_feature_len_decoder = max([f["labels"].shape[0] for f in features])

This makes the 'target_len_decoder' variable dependent on the label size only. Thus, if you're using a decoder prefix (via 'decoder_input_ids'), the sequence gets incorrectly truncated to the size of the labels:

if key in ['decoder_input_ids', 'labels', 'decoder_attention_mask', 'decoder_seg_data']:
    batched_feature = torch.stack([pad_sequence_native(f[key], target_len_decoder, pad_value) for f in features], dim=0)

As a result, whenever you have a decoder prefix longer than the labels, the prefix gets cut off. This may not be an issue for UDOP pretraining, but it is very much an issue for something like question-answering fine-tuning.
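To make the failure mode concrete, here is a minimal sketch. It uses pad_or_truncate as a hypothetical stand-in for pad_sequence_native (assumed to pad a 1-D tensor up to the target length or cut it off there), and made-up feature lengths:

```python
import torch

def pad_or_truncate(t, target_len, pad_value=0):
    # Hypothetical stand-in for pad_sequence_native: assumed to pad a 1-D
    # tensor up to target_len with pad_value, or cut it off at target_len.
    if t.shape[0] >= target_len:
        return t[:target_len]
    pad = torch.full((target_len - t.shape[0],), pad_value, dtype=t.dtype)
    return torch.cat([t, pad], dim=0)

# Made-up feature: a short answer (labels) and a longer decoder prefix,
# e.g. a question prompt fed to the decoder during QA fine-tuning.
features = [{
    "labels": torch.arange(4),               # 4 label tokens
    "decoder_input_ids": torch.arange(12),   # 12-token decoder prefix
}]

# Current behaviour: the target length is taken from the labels alone.
target_len_decoder = max(f["labels"].shape[0] for f in features)  # -> 4

batched_prefix = torch.stack(
    [pad_or_truncate(f["decoder_input_ids"], target_len_decoder) for f in features],
    dim=0,
)
print(batched_prefix.shape)  # torch.Size([1, 4]): 8 of the 12 prefix tokens dropped
```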

A better way to calculate the length would be:

max_feature_len_decoder = max([f["labels"].shape[0]+f['decoder_input_ids'].shape[0] for f in features])
