
Regarding the behavior of max_seq_length in SFTTrainer #2400

Open
Taiki-azrs opened this issue Nov 27, 2024 · 4 comments
Labels
📚 documentation · 👶 good first issue · 🏋 SFT

Comments

@Taiki-azrs

The SFTTrainer documentation states:

SFTTrainer always pads by default the sequences to the max_seq_length argument of the SFTTrainer.
https://huggingface.co/docs/trl/main/en/sft_trainer#best-practices

However, looking at the actual code, the tokenizer appears to be called with padding=False, so it does not seem to pad sequences to the max_seq_length value.
https://github.com/huggingface/trl/blob/main/trl/trainer/sft_trainer.py#L420

How does SFTTrainer ensure that sequences are padded to the max_seq_length value?
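
For reference, here is a minimal sketch (outside of TRL, with an arbitrary tokenizer and max length) of the call pattern in question: with padding=False, short sequences keep their own length and only sequences longer than max_length are truncated.

```python
# Minimal illustration, not the TRL source: tokenizing with padding=False and
# truncation up to a max length cuts long inputs but never pads short ones.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # arbitrary example tokenizer
max_seq_length = 8  # arbitrary example value

short = tokenizer("Hello world", truncation=True, padding=False, max_length=max_seq_length)
long_ = tokenizer("a very long sentence " * 20, truncation=True, padding=False, max_length=max_seq_length)

print(len(short["input_ids"]))  # 2 -> shorter than max_seq_length, no padding added
print(len(long_["input_ids"]))  # 8 -> truncated to max_seq_length
```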

@qgallouedec
Member

You're right, the documentation is wrong. Would you like to contribute by correcting it?

qgallouedec added the 📚 documentation, 👶 good first issue, and 🏋 SFT labels on Dec 13, 2024
@umbilnm
Contributor

umbilnm commented Dec 25, 2024

Hi! I can correct it. Based on the discussion, it seems we could take one of two approaches:

  1. Completely remove this mention from the “Best Practices” section.
  2. Update the text to clarify that truncation (rather than padding) happens by default.

Could you let me know which approach is better?

@qgallouedec
Member

Probably option 2. What do you think?

@umbilnm
Contributor

umbilnm commented Dec 25, 2024

OK. Also, if max_seq_length isn't specified, the trainer sets it to min(1024, tokenizer.model_max_length) (not 2048), so the revised text may look like this:

SFTTrainer truncates sequences by default to the max_seq_length specified. If max_seq_length is not provided, the trainer sets it to the minimum of tokenizer.model_max_length and 1024. Ensure you verify this setting before training to avoid unintended behavior.
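
For clarity, a small sketch of that fallback (not the exact TRL code; the tokenizer name is just an example):

```python
# Sketch of the default described above: when max_seq_length is not given,
# fall back to the smaller of 1024 and the tokenizer's model_max_length.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # arbitrary example tokenizer

max_seq_length = None  # i.e. the user did not specify it
if max_seq_length is None:
    max_seq_length = min(1024, tokenizer.model_max_length)

print(max_seq_length)  # 1024 for gpt2, since its model_max_length is 1024
```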
