
Conversation

fabianlim (Collaborator) commented Jul 23, 2025

Actually, I believe that padding-free (packing) is correct.

  • this PR fixes the non-packing case, where concatenated_forward mistakenly assumed pad_token_id = 0
  • the seq_idx in the padding-free concatenated_forward was computed incorrectly (see the sketch after this list)
  • NOTE: this also reverts Remove Unused DPO Function allenai/open-instruct#794, which removed our preference_span_search function.
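
For context: in a padding-free (packed) batch, seq_idx maps every token position to the index of the sequence it belongs to, and is typically derived from the cumulative sequence lengths. A minimal sketch of one correct construction, assuming a flash-attention-style cu_seqlens layout (the helper name is illustrative, not the code from this PR):

    import torch

    def seq_idx_from_cu_seqlens(cu_seqlens: torch.Tensor) -> torch.Tensor:
        # cu_seqlens holds cumulative boundaries, e.g. [0, 3, 7, 9] for
        # three packed sequences of lengths 3, 4, and 2.
        seqlens = cu_seqlens[1:] - cu_seqlens[:-1]  # per-sequence lengths
        # Repeat each sequence's index once per token it contributes.
        return torch.repeat_interleave(
            torch.arange(seqlens.numel(), dtype=torch.int32), seqlens
        )

    # tensor([0, 0, 0, 1, 1, 1, 1, 2, 2], dtype=torch.int32)
    print(seq_idx_from_cu_seqlens(torch.tensor([0, 3, 7, 9])))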

Checking

  • purple: no packing
  • brown: packing
  • purple (dotted): packing before fix
[image: training curves comparing the three runs above]

        simpo_loss,
        wpo_loss,
    )
    from open_instruct.padding_free_collator import (
garrett361 (Owner): Intentionally deleted?

fabianlim (Collaborator, Author): Yes, there is a duplicate two lines below.

    if not packing:
        concatenated_batch = concatenated_inputs(batch)
        try:
            pad_token_id = model.tokenizer.pad_token_id
garrett361 (Owner): Do models usually have a tokenizer attr? I'm finding they don't for bamba and granite, unless the attr is set somewhere after init.

fabianlim (Collaborator, Author): They usually will, though the model is typed as torch.nn.Module, so I thought that in the OI use case it may not be guaranteed.
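
Along those lines, a defensive lookup could avoid both the hard-coded 0 and the missing-attribute case; a minimal sketch (the helper and its fallback argument are illustrative, not the code from this PR):

    import torch

    def resolve_pad_token_id(model: torch.nn.Module, fallback: int) -> int:
        # The model is typed as torch.nn.Module, so a .tokenizer attribute
        # is not guaranteed; getattr avoids an AttributeError for models
        # (e.g. bamba, granite) that never set one.
        tokenizer = getattr(model, "tokenizer", None)
        pad_token_id = getattr(tokenizer, "pad_token_id", None)
        return fallback if pad_token_id is None else pad_token_id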

fabianlim and others added 4 commits July 24, 2025 13:44
garrett361 (Owner) left a comment: LGTM, just one question.

garrett361 (Owner): Also, the quality checks are failing.

fabianlim (Collaborator, Author): @garrett361 I fixed the conflict and linted the files effectively changed in this commit. But because we had to pull upstream, there are a lot of files left unlinted.

garrett361 merged commit 1003a02 into main on Aug 6, 2025 (1 of 2 checks passed).