
Pad prompts to the right in T5 examples and add EOS token to seq2seq prompts #422

Merged

Conversation

mikljohansson
Contributor

The default T5 tokenizer will generate padded input_ids tensors like below (0 is BOS and PAD token, 1 is EOS token)

prompt = ["Some prompt</s><pad><pad>", "A longer prompt</s>"]
input_ids = [[123, 345, ..., 1, 0, 0], [543, 123, ..., 1]]

I think there were two issues that might have caused the inputs to T5 to differ from what the model was trained on:

  • The T5 example scripts didn't configure the padding_side for the tokenizer, so it defaulted to "left", which caused prompts to be left-padded instead of right-padded
  • The PromptPipeline class applied add_special_tokens=False when tokenizing prompts, so the EOS token </s> was not added to the prompt for seq2seq models
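The effect of those two fixes can be sketched with a toy padder. This is a minimal illustration only, not the actual Hugging Face tokenizer; the token ids follow the T5 convention described above (0 = PAD, 1 = EOS):

```python
PAD, EOS = 0, 1  # T5 convention: 0 is the pad token id, 1 is </s>

def pad_batch(batch, padding_side="right", add_eos=True):
    """Append EOS (if requested) and pad every sequence to the batch max length."""
    if add_eos:
        batch = [ids + [EOS] for ids in batch]
    max_len = max(len(ids) for ids in batch)
    padded = []
    for ids in batch:
        pads = [PAD] * (max_len - len(ids))
        padded.append(ids + pads if padding_side == "right" else pads + ids)
    return padded

prompts = [[123, 345], [543, 123, 678]]

# Right padding with EOS matches what the default T5 tokenizer produces:
print(pad_batch(prompts, "right"))  # [[123, 345, 1, 0], [543, 123, 678, 1]]

# Left padding (the previous behavior) shifts the prompt away from the start:
print(pad_batch(prompts, "left"))   # [[0, 123, 345, 1], [543, 123, 678, 1]]
```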

@alexandremuzio
Contributor

Hi @mikljohansson, I'm not sure I quite got it. Did this also affect training?
Also, do you have an example of before/after tokenizations?

@mikljohansson
Contributor Author

Yep. Without this fix I found that the tokenized seq2seq prompts looked like this

Before:

prompt = ["<pad><pad>Some prompt</s>", "A longer prompt</s>"]
input_ids = [[0, 0, 123, 345, ..., 1], [543, 123, ..., 1]]

But with this change applied the prompts look like what the default T5 tokenizer would create, which is

Fixed:

prompt = ["Some prompt</s><pad><pad>", "A longer prompt</s>"]
input_ids = [[123, 345, ..., 1, 0, 0], [543, 123, ..., 1]]

For a decoder-only model it would be correct to left-pad, so the model can continue completing immediately after the prompt. However, for seq2seq/encoder-decoder models the default behavior seems to be to pad on the right side of the prompts. I'm not quite sure how much practical difference it makes in training, given that the PAD tokens aren't part of the attention_mask anyway.
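The point about PAD tokens being excluded via the attention_mask can be sketched as follows. This is an illustration only, not the Hugging Face implementation, and it uses 0 as the PAD id per the T5 convention above:

```python
PAD = 0  # T5 pad token id

def attention_mask(input_ids):
    """1 for real tokens, 0 for padding; masked positions are ignored by attention."""
    return [[int(tok != PAD) for tok in ids] for ids in input_ids]

right_padded = [[123, 345, 1, 0, 0]]
left_padded = [[0, 0, 123, 345, 1]]

# The same real tokens are attended to in both cases, just at different positions:
print(attention_mask(right_padded))  # [[1, 1, 1, 0, 0]]
print(attention_mask(left_padded))   # [[0, 0, 1, 1, 1]]
```

Note that this toy mask cannot distinguish a genuine leading BOS (also id 0 in T5) from padding, which is one reason real tokenizers compute the mask while padding rather than after the fact.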

Anecdotally, I've observed a not-insignificant improvement in the loss curve when training a seq2seq model with versus without the fixes in the 3 PRs I made, but I haven't measured the effects in detail. I'd been walking through the code with a debugger because I was having some issues getting it to work, and fixed a few of the issues I think I found while debugging. Hope it can help :)

Collaborator

@maxreciprocate left a comment


Hey, it's looking good! Although I don't see any difference between left and right padding using this trivial example, there is a difference with regard to add_special_tokens when compared to the version from main. There is also an idea to pin padding_side (#424) and make it immutable/selected depending on the architecture used.

https://wandb.ai/sorry/trlx/reports/Pad-prompts-to-the-right-in-T5-examples-and-add-EOS-token-to-seq2seq-prompts-422---Vmlldzo0MTk3NDY2

import trlx
from trlx.data.default_configs import default_ppo_config

config = default_ppo_config()

config.tokenizer.padding_side = "left"
config.tokenizer.tokenizer_path = config.model.model_path = "t5-small"
config.model.model_arch_type = "seq2seq"

config.method.gen_kwargs["max_new_tokens"] = 2
config.train.batch_size = 8
config.train.total_steps = 1024

prompts = ["Hi!", "Hey!", "Good morning yall!", "Yo!"] * 4

trlx.train(
    reward_fn=lambda samples, **kwargs: [s.endswith("!") for s in samples],
    prompts=prompts,
    eval_prompts=prompts,
    config=config,
)
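The pin-padding_side-by-architecture idea from #424 could look roughly like this. The function name and the arch-type strings here are illustrative assumptions; "seq2seq" mirrors the model_arch_type value used in the config above:

```python
def default_padding_side(model_arch_type: str) -> str:
    """Select a padding side from the model architecture instead of config.

    Decoder-only models generate immediately after the prompt, so they need
    left padding; encoder-decoder (seq2seq) models pad on the right.
    """
    return "right" if model_arch_type == "seq2seq" else "left"

print(default_padding_side("seq2seq"))  # right
print(default_padding_side("causal"))   # left
```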

@maxreciprocate maxreciprocate merged commit 70ca6c6 into CarperAI:main Apr 29, 2023