Shift labels during finetuning and provide correct padding mask #101

Open
wants to merge 1 commit into main
Conversation

LianaMikael (Contributor)
This PR makes sure that the labels are shifted to the right when computing the loss, and provides a correct mask that excludes the padding tokens while keeping the other special tokens.
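As a rough sketch of what the description amounts to (assuming the repository's CustomTrainer subclasses transformers.Trainer; the reduction and variable names below are illustrative, not the PR's exact code):

```python
import torch.nn.functional as F
from transformers import Trainer


class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # Pop the labels so the model forward does not compute its own loss.
        labels = inputs.pop("labels")
        attention_mask = inputs["attention_mask"]
        outputs = model(**inputs)

        # Logits at position t predict the token at position t + 1,
        # so drop the last logit and the first label/mask entry.
        logits = outputs.logits[..., :-1, :]
        shift_labels = labels[..., 1:]
        shift_mask = attention_mask[..., 1:].float()

        # Per-token cross-entropy, averaged over non-padded positions only.
        token_loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            shift_labels.reshape(-1),
            reduction="none",
        )
        loss = (token_loss * shift_mask.reshape(-1)).sum() / shift_mask.sum()
        return (loss, outputs) if return_outputs else loss
```

The key points are that the labels are popped so the model does not compute its own unshifted loss, and that the attention mask (shifted the same way as the labels) decides which positions count.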

@nailimixaM (Collaborator) left a comment

This change makes me slightly nervous, as it will have consequences for the results we report in the paper. We should therefore take the time to evaluate the key metrics before and after this change. To do this:

  1. Slice phi-2 25% on alpaca, 2048 calibration samples [no change on ppl expected]
  2. RFT with hyperparams: --ppl-eval-dataset alpaca --finetune-dataset alpaca --finetune-train-nsamples 8000 --finetune-train-seqlen 1024 --finetune-train-batch-size 3 --lora-alpha 10 --lora-dropout 0.05 --lora-r 32 --eval-steps 128 --lora-target-modules attn_head_and_mlp (NB not the lm head as well, as this wasn't done in the paper) [change expected]
  3. LM eval with --tasks piqa should suffice [change expected]

The LM eval on piqa before the change should match exactly what we report in Table 10 in the appendix of the paper.

@@ -62,6 +62,21 @@ def get_train_dataloader(self) -> DataLoader:

    def get_eval_dataloader(self, _) -> DataLoader:
        return self.test_loader

    def compute_loss(self, model, inputs, return_outputs=False):
Collaborator

As this is most of the calculation for ppl, can we make this implementation and our gpu_utils implementation consistent, and add a test? Better still would be to call a common subroutine from here and from the ppl calc implementation. NB: Pashmina's ppl refactoring PR should go in first.
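One possible shape for such a shared subroutine, sketched here purely for illustration (the helper name and signature are assumptions, not the existing gpu_utils API):

```python
import torch
import torch.nn.functional as F


def masked_next_token_loss(logits, labels, mask):
    """Mean cross-entropy over unmasked positions, with labels shifted so
    that the logits at step t are scored against the token at step t + 1."""
    shift_logits = logits[..., :-1, :]
    shift_labels = labels[..., 1:]
    shift_mask = mask[..., 1:].float()

    token_loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    )
    return (token_loss * shift_mask.reshape(-1)).sum() / shift_mask.sum()


# The two call sites would then agree by construction, e.g.
# ppl = torch.exp(masked_next_token_loss(logits, labels, attention_mask))
```

A unit test could then compare the trainer loss against the log of the reported perplexity on a small batch, since the two would agree by construction.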

Contributor (Author)

Yes, the perplexity calculation should have the same changes; good idea to combine them.

        labels = inputs.pop('labels')
        attention_mask = inputs["attention_mask"]
        outputs = model(**inputs)
        labels = labels[..., 1:].contiguous()
Collaborator

Nit: we should double-check whether contiguous() is required here, as we don't use it in our ppl implementation.
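For context, whether .contiguous() is needed depends on how the shifted tensor is flattened afterwards: slicing off the first column leaves a non-contiguous view, so .view() fails without it, whereas .reshape() (or passing the sliced tensor directly to the loss) works either way. A quick PyTorch illustration, not the repository's code:

```python
import torch

labels = torch.randint(0, 100, (4, 16))  # (batch, seq_len)
shifted = labels[..., 1:]                # a view over the original storage

print(shifted.is_contiguous())           # False
print(shifted.reshape(-1).shape)         # fine: reshape copies when needed
# shifted.view(-1)                       # RuntimeError without .contiguous()
print(shifted.contiguous().view(-1).shape)  # fine after .contiguous()
```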

@nailimixaM (Collaborator)

So after some digging I found that our CustomTrainer's loss_fn never gets called! I confirmed this by commenting out the loss_fn -> the code still runs fine.

Instead the Trainer's default compute_loss is called, which calculates the correct loss here. The LabelSmoother with smoothing 0 (by default) calculates the cross-entropy loss, using a padding mask based on the ignore token. @LianaMikael could you check if that implementation makes sense, or if we need to override it? Thanks
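For reference, the mechanism described above relies on the ignore index, which defaults to -100 in both torch.nn.functional.cross_entropy and transformers' LabelSmoother: any position whose label equals the ignore index contributes nothing to the loss. A toy example (made-up shapes, not the Trainer's actual call site):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 4, 10)               # (batch, seq, vocab)
labels = torch.tensor([[3, 7, -100, -100]])  # last two positions are padding

# Positions labelled -100 are skipped entirely when averaging the loss.
loss = F.cross_entropy(
    logits.reshape(-1, 10), labels.reshape(-1), ignore_index=-100
)
```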

@LianaMikael (Contributor, Author)

> So after some digging I found that our CustomTrainer's loss_fn never gets called! I confirmed this by commenting out the loss_fn -> the code still runs fine.
>
> Instead the Trainer's default compute_loss is called, which calculates the correct loss here. The LabelSmoother with smoothing 0 (by default) calculates the cross-entropy loss, using a padding mask based on the ignore token. @LianaMikael could you check if that implementation makes sense, or if we need to override it? Thanks

In the change added here, the loss is computed inside the compute_loss function. I think we need to use the padding mask that we created rather than just ignoring all padding tokens, since the padding token id is also used for other special tokens; this change should take care of that.
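To make that distinction concrete, here is a small illustration with made-up token ids (not the project's tokenizer): masking on the pad id alone also drops any real special token that shares that id, whereas the attention mask only drops the actual padding positions.

```python
import torch

pad_token_id = 0  # hypothetical: the same id is reused for another special token

# One sequence: three ordinary tokens, one real special token sharing the pad
# id, then two genuine padding positions.
input_ids = torch.tensor([[5, 17, 23, 0, 0, 0]])
attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])

id_based_mask = (input_ids != pad_token_id).long()
print(id_based_mask)   # tensor([[1, 1, 1, 0, 0, 0]]) -> also drops the real special token
print(attention_mask)  # tensor([[1, 1, 1, 1, 0, 0]]) -> keeps it
```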
