Shift labels during finetuning and provide correct padding mask #101
base: main
Conversation
This change makes me slightly nervous as there will be consequences for the results we report in the paper. We should therefore spend some time evaluating the key metrics before and after this change. To do this, we should do the following:
- Slice phi-2 25% on alpaca, 2048 calibration samples [no change in ppl expected]
- RFT with hyperparams `--ppl-eval-dataset alpaca --finetune-dataset alpaca --finetune-train-nsamples 8000 --finetune-train-seqlen 1024 --finetune-train-batch-size 3 --lora-alpha 10 --lora-dropout 0.05 --lora-r 32 --eval-steps 128 --lora-target-modules attn_head_and_mlp` (NB not the lm head as well, as this wasn't done in the paper) [change expected]
- LM eval on `--tasks piqa` should suffice [change expected]
The LM eval on piqa before the change should match exactly what we report in Table 10 in the appendix of the paper.
@@ -62,6 +62,21 @@ def get_train_dataloader(self) -> DataLoader:

    def get_eval_dataloader(self, _) -> DataLoader:
        return self.test_loader

    def compute_loss(self, model, inputs, return_outputs=False):
As this is most of the calculation for ppl, can we make this one and our gpu_utils implementation consistent, and add a test? Better still would be to call a common subroutine from here and from the ppl calc implementation. NB Pashmina's ppl refactoring PR should go in first.
Yes, the perplexity calculation should have the same changes; good idea to combine them.
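For illustration, here is a minimal sketch of what such a shared subroutine could look like (the name `masked_next_token_loss` and its signature are hypothetical, not code from this repository): shift the labels and the attention mask by one position and average the cross-entropy over the non-padding positions only.

```python
import torch
import torch.nn.functional as F


def masked_next_token_loss(
    logits: torch.Tensor,         # (batch, seq_len, vocab_size)
    labels: torch.Tensor,         # (batch, seq_len)
    attention_mask: torch.Tensor  # (batch, seq_len): 1 for real tokens, 0 for padding
) -> torch.Tensor:
    """Mean cross-entropy over non-padding next-token predictions (hypothetical helper)."""
    # Logits at position i predict the token at position i + 1, so drop the last
    # logit and the first label/mask entry to align predictions with targets.
    shift_logits = logits[..., :-1, :]
    shift_labels = labels[..., 1:]
    shift_mask = attention_mask[..., 1:]

    # Per-token loss, then average over the unmasked (non-padding) positions only.
    per_token = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    )
    per_token = per_token * shift_mask.reshape(-1)
    return per_token.sum() / shift_mask.sum()
```

Both `compute_loss` and the perplexity calculation could then call the same routine; perplexity would simply be `torch.exp` of this mean loss over the evaluation set, which keeps the two code paths consistent by construction.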
        labels = inputs.pop('labels')
        attention_mask = inputs["attention_mask"]
        outputs = model(**inputs)
        labels = labels[..., 1:].contiguous()
Nit: we should double check whether `contiguous` is required here, as we don't use it in our ppl implementation.
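For what it's worth, `.contiguous()` only matters if the shifted tensor is later flattened with `.view()`; both `.reshape()` and `F.cross_entropy` accept non-contiguous tensors. A small standalone illustration (plain PyTorch, unrelated to the repository code):

```python
import torch

labels = torch.randint(0, 100, (4, 16))  # (batch, seq_len)
shifted = labels[..., 1:]                # slicing the last dim of a batched tensor

print(shifted.is_contiguous())           # False when batch size > 1
shifted.reshape(-1)                      # fine: reshape copies if it has to
try:
    shifted.view(-1)                     # RuntimeError: view requires contiguous memory
except RuntimeError as err:
    print(err)
shifted.contiguous().view(-1)            # fine after an explicit copy
```

So whether `.contiguous()` is needed here depends on whether the loss code flattens with `.view()` or with `.reshape()`.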
So after some digging I found that our [...] Instead the [...]
In the change that is added here the loss is being called under the [...]
This PR makes sure that the labels are shifted by one position when computing the loss, so that each position predicts the next token, and provides a correct mask that masks out the padding tokens while keeping the other special tokens.
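Picking up the reviewer's request for a test, here is a minimal sketch of what such a test could check, assuming the hypothetical `masked_next_token_loss` helper sketched earlier in the thread: the vectorised loss should match a naive per-position cross-entropy that skips padded label positions.

```python
import torch
import torch.nn.functional as F

# Assumes the hypothetical masked_next_token_loss(logits, labels, attention_mask)
# sketched earlier in this thread is available from wherever it ends up living.

def test_masked_next_token_loss_matches_naive_loop():
    torch.manual_seed(0)
    batch, seq_len, vocab = 2, 8, 32
    logits = torch.randn(batch, seq_len, vocab)
    labels = torch.randint(0, vocab, (batch, seq_len))
    attention_mask = torch.ones(batch, seq_len, dtype=torch.long)
    attention_mask[0, -3:] = 0  # pretend the first sequence is padded at the end

    # Naive reference: logits at position i are scored against the label at i + 1,
    # and positions whose label is padding are skipped entirely.
    terms = []
    for b in range(batch):
        for i in range(seq_len - 1):
            if attention_mask[b, i + 1] == 1:
                terms.append(
                    F.cross_entropy(logits[b, i].unsqueeze(0), labels[b, i + 1].unsqueeze(0))
                )
    expected = torch.stack(terms).mean()

    actual = masked_next_token_loss(logits, labels, attention_mask)
    torch.testing.assert_close(actual, expected)
```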