In the __call__ of the DataCollator class, the maximum feature length for the decoder is determined by the maximum size of the labels:
max_feature_len_decoder = max([f["labels"].shape[0] for f in features])
This makes the 'target_len_decoder' variable depend on the label size only. As a result, if you're using a decoder prefix (via 'decoder_input_ids'), the sequence is incorrectly truncated to the length of the labels:
if key in ['decoder_input_ids', 'labels', 'decoder_attention_mask', 'decoder_seg_data']:
    batched_feature = torch.stack([pad_sequence_native(f[key], target_len_decoder, pad_value) for f in features], dim=0)
So whenever the decoder prefix is longer than the labels, it gets silently cut off. This may not be an issue for UDOP pretraining, but it very much is for something like question-answering fine-tuning.
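To make the truncation concrete, here is a minimal sketch in plain Python (lists in place of tensors). pad_or_truncate below is a hypothetical stand-in for pad_sequence_native, assumed to pad short sequences and cut off long ones at the target length:

```python
# Hypothetical stand-in for pad_sequence_native: pad `seq` to `target_len`
# with `pad_value`, or truncate it if it is longer than `target_len`.
def pad_or_truncate(seq, target_len, pad_value=0):
    return (seq + [pad_value] * (target_len - len(seq)))[:target_len]

# One feature: 5 label tokens, but an 8-token decoder prefix.
feature = {
    "labels": [1, 2, 3, 4, 5],
    "decoder_input_ids": [10, 11, 12, 13, 14, 15, 16, 17],
}

# Current collator logic: the target length comes from the labels only.
target_len_decoder = len(feature["labels"])  # 5

truncated = pad_or_truncate(feature["decoder_input_ids"], target_len_decoder)
print(truncated)  # [10, 11, 12, 13, 14] -- the last 3 prefix tokens are dropped
```

Anything beyond the label length vanishes before the model ever sees it, with no warning.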
A better way to calculate the length would be:
max_feature_len_decoder = max([f["labels"].shape[0] + f["decoder_input_ids"].shape[0] for f in features])
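A quick sketch of why this fix works, again with plain lists in place of tensors (so len(...) replaces .shape[0]); the feature values are made up for illustration:

```python
# Two batch features with differing label and decoder-prefix lengths.
features = [
    {"labels": [1, 2, 3],        "decoder_input_ids": [7, 8, 9, 10]},  # 3 + 4
    {"labels": [1, 2, 3, 4, 5],  "decoder_input_ids": [7, 8]},         # 5 + 2
]

# Old computation: labels only -> 5. Feature 0 needs 7 slots, so its
# prefix + labels could not fit and would be truncated.
old_len = max(len(f["labels"]) for f in features)

# Proposed computation: labels + prefix per feature -> max(7, 7) = 7,
# which is large enough for every feature in the batch.
max_feature_len_decoder = max(
    len(f["labels"]) + len(f["decoder_input_ids"]) for f in features
)

print(old_len, max_feature_len_decoder)  # 5 7
```

Sizing the decoder dimension to the per-feature sum guarantees padding only ever grows a sequence, never shortens it.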
https://github.com/microsoft/i-Code/blob/main/i-Code-Doc/core/trainers/data_collator.py