
eval_loss of the same set of data differs when using different batch sizes #24839

Closed

namespace-Pt opened this issue Jul 15, 2023 · 3 comments

namespace-Pt (Contributor) commented Jul 15, 2023

System Info

  • transformers version: 4.30.0
  • Platform: Linux-5.4.0-147-generic-x86_64-with-glibc2.31
  • Python version: 3.10.12
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

eval_loss of the same set of data from the same model (gpt-neo, flan-t5, llama, ...) differs when using different batch sizes.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM

# padding_side is a tokenizer argument, not a model argument
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125m", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")

# bug also happens on flan-t5
# tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
# model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

# set pad token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# For the following inputs, eval_loss is different when using different batch sizes
samples = [
    "Sheldon: So if a photon is directed through a plane with two slits in it and either slit is observed it will not go through both slits. If it's unobserved it will, however, if it's observed after it's left the plane but before it hits its target, it will not have gone through both slits. Leonard: Agreed, what's your point? Sheldon: There's no point, I just think it's a good idea for a tee-shirt. Leonard: Excuse me? Receptionist: Hang on. Leonard: One across is Aegean, eight down is Nabakov, twenty-six across is MCM, fourteen down is… move your finger… phylum, which makes fourteen across Port-au-Prince. See, Papa Doc's capital idea, that's Port-au-Prince. Haiti. Receptionist: Can I help you? Leonard: Yes. Um, is this the High IQ sperm bank? Receptionist: If you have to ask, maybe you shouldn't be here. Sheldon: I think this is the place. Receptionist: Fill these out. Leonard: Thank-you. We'll be right back. Receptionist: Oh, take your time. I'll just finish my crossword puzzle. Oh wait. (They sit and begin to fill in forms). Sheldon: Leonard, I don't think I can do this. Leonard: What, are you kidding? You're a semi-pro. Sheldon: No. We are committing genetic fraud. There's no guarantee that our sperm is going to generate high IQ offspring, think about that. Sheldon: So if a photon is directed through a plane with two slits in it and either slit is observed it will not go through both slits. If it's unobserved it will, however, if it's observed after it's left the plane but before it hits its target, it will not have gone through both slits. Leonard: Agreed, what's your point? Sheldon: There's no point, I just think it's a good idea for a tee-shirt. Leonard: Excuse me?",
    "Sheldon: Are you still mad about the sperm bank? Leonard: No. Sheldon: You want to hear an interesting thing about stairs? Leonard: Not really. Sheldon: If the height of a single step is off by as little as two millimetres, most people will trip. Leonard: I don't care. Two millimetres? That doesn't seem right. Sheldon: No, it's true, I did a series of experiments when I was twelve, my father broke his clavicle. Leonard: Is that why they sent you to boarding school? Sheldon: No, that was the result of my work with lasers. Leonard: New neighbour? Sheldon: Evidently. Leonard: Significant improvement over the old neighbour. Sheldon: Two hundred pound transvestite with a skin condition, yes she is. Penny: Oh, hi! Leonard: Hi. Sheldon: Hi. Leonard: Hi. Sheldon: Hi. Penny: Hi? Leonard: We don't mean to interrupt, we live across the hall. Penny: Oh, that's nice. Leonard: Oh… uh… no… we don't live together… um… we live together but in separate, heterosexual bedrooms. Penny: Oh, okay, well, guess I'm your new neighbour, Penny. Leonard: Leonard, Sheldon. Penny: Hi. Leonard: Hi. Sheldon: Hi. Penny: Hi. Leonard: Hi. Well, uh, oh, welcome to the building. Penny: Thankyou, maybe we can have coffee sometime. Leonard: Oh, great. Penny: Great. Sheldon: Great. Leonard: Great. Well, bye. Penny: Bye. Sheldon: Bye. Leonard: Bye. Leonard: Should we have invited her for lunch? Sheldon: No. We're going to start Season Two of Battlestar Galactica. Leonard: We already watched the Season Two DVDs. Sheldon: Not with commentary. Leonard: I think we should be good neighbours, invite her over, make her feel welcome. Sheldon: We never invited Louis-slash-Louise over. Leonard: Well, then that was wrong of us. Sheldon: Are you still mad about the sperm bank? Leonard: No. Sheldon: You want to hear an interesting thing about stairs? Leonard: Not really.",
    "Leonard: Okay, well, make yourself at home. Penny: Okay, thankyou. Leonard: You're very welcome. Penny: This looks like some serious stuff, Leonard, did you do this? Sheldon: Actually that's my work. Penny: Wow. Sheldon: Yeah, well, it's just some quantum mechanics, with a little string theory doodling around the edges. That part there, that's just a joke, it's a spoof of the Bourne-Oppenheimer approximation. Penny: So you're like, one of those, beautiful mind genius guys. Sheldon: Yeah. Penny: This is really impressive. Leonard: I have a board. If you like boards, this is my board. Penny: Holy smokes. Sheldon: If by holy smokes you mean a derivative restatement of the kind of stuff you can find scribbled on the wall of any men's room at MIT, sure. Leonard: What? Sheldon: Oh, come on. Who hasn't seen this differential below “here I sit broken hearted?” Leonard: At least I didn't have to invent twenty-six dimensions just to make the math come out. Sheldon: I didn't invent them, they're there. Leonard: In what universe? Sheldon: In all of them, that is the point. Penny: Uh, do you guys mind if I start? Sheldon: Um, Penny, that's where I sit. Penny: So, sit next to me. Sheldon: No, I sit there. Penny: What's the difference? Sheldon: What's the difference? Leonard: Here we go. Sheldon: In the winter that seat is close enough to the radiator to remain warm, and yet not so close as to cause perspiration. In the summer it's directly in the path of a cross breeze created by open windows there, and there. Leonard: Okay, well, make yourself at home. Penny: Okay, thankyou. Leonard: You're very welcome. Penny: This looks like some serious stuff, Leonard, did you do this?",
    "Leonard: Uh, there it goes, it sticks, I'm sorry. Penny: Okay. Thanks. Leonard: You're welcome, oh, you're going to step right, okay, I'll…. Penny: Hey, Leonard? Leonard: The hair products are Sheldon's. Penny: Um, okay. Can I ask you a favour. Leonard: A favour? Sure, you can ask me a favour, I would do you a favour for you. Penny: It's okay if you say no. Leonard: Oh, I'll probably say yes. Penny: It's just not the kind of thing you ask a guy you've just met. Leonard: Wow. Leonard: Uh, there it goes, it sticks, I'm sorry. Penny: Okay. Thanks. Leonard: You're welcome, oh, you're going to step right, okay, I'll…. Penny: Hey, Leonard?"
]

model.eval()
with torch.no_grad():
    # feed all data in one batch
    all_batch_samples = tokenizer(samples, return_tensors="pt", padding="max_length", max_length=480, truncation=True)
    labels = all_batch_samples["input_ids"].clone()
    labels[labels == tokenizer.pad_token_id] = -100
    all_batch_samples["labels"] = labels
    outputs = model(**all_batch_samples)
    all_loss = outputs.loss

    # feed one data sample per batch (batch size is 1)
    losses = []
    for i in range(len(all_batch_samples["input_ids"])):
        batch_samples = tokenizer(samples[i], return_tensors="pt", padding="max_length", max_length=480, truncation=True)
        labels = batch_samples["input_ids"].clone()
        labels[labels == tokenizer.pad_token_id] = -100
        batch_samples["labels"] = labels

        for k, v in batch_samples.items():
            # always true
            assert (all_batch_samples[k][i] == batch_samples[k]).all()
        losses.append(model(**batch_samples).loss)
    losses = torch.stack(losses)

print(f"BS=1: {losses.mean()}", "*"*5, f"BS=all: {all_loss}", "*"*5, f"Losses: {losses}")

# BS=1: 3.6513803005218506 ***** BS=all: 3.6280925273895264 ***** Losses: tensor([3.5703, 3.4178, 3.8621, 3.7554])

Expected behavior

I think the loss should be exactly the same with different batch sizes. I wonder why the deviation happens.

ydshieh self-assigned this Jul 24, 2023

ydshieh (Collaborator) commented Jul 24, 2023

Hi @namespace-Pt

Thank you for reporting.

This is because these are causal LM models, where the loss is computed over the non-padding tokens.

The loss (returned from the model's forward) is the total loss divided by the number of non-padding tokens sent to the model.

In your case (4 examples), they have 438, 461, 423 and 183 non-padding tokens, a total of 1505.

For each single example, the (averaged) loss is 2.5674, 2.7242, 2.9536 and 2.3945. Multiplying by the corresponding number of non-padding tokens, we get 1124.5172, 1255.8704, 1249.3870 and 438.1949. Summing them gives the total loss of 4067.9697.

Divided by 1505 (the total number of non-padding tokens in the batch), we get 4067.9697 / 1505 = 2.7031, which is the loss we get when sending the batch to the model. (There is a slight precision issue above, but it's fine)
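In code, this is just the difference between a micro-average (token-weighted) and a macro-average (per-example). A minimal sketch using the numbers above:

import torch

# per-example mean losses and non-padding token counts from above
per_example_loss = torch.tensor([2.5674, 2.7242, 2.9536, 2.3945])
num_tokens = torch.tensor([438., 461., 423., 183.])

macro_avg = per_example_loss.mean()                                   # averaging the BS=1 losses
micro_avg = (per_example_loss * num_tokens).sum() / num_tokens.sum()  # what the batched forward returns
print(macro_avg.item(), micro_avg.item())  # ≈ 2.660 vs ≈ 2.703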

This is known and not a real issue. However, if you want to have full control, you can call the model's forward without labels and compute the loss in your own code.
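For example, a minimal sketch of that manual computation (assuming batch is a tokenized batch with labels built as in your reproduction script, i.e. padding positions set to -100):

import torch
import torch.nn.functional as F

with torch.no_grad():
    logits = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]).logits

# shift so that tokens < n predict token n (standard causal LM alignment)
shift_logits = logits[:, :-1, :]
shift_labels = batch["labels"][:, 1:]

# per-token loss; positions labelled -100 contribute 0
per_token_loss = F.cross_entropy(
    shift_logits.transpose(1, 2), shift_labels, ignore_index=-100, reduction="none"
)
valid = (shift_labels != -100).float()

# mean loss per example, then a macro-average that does not depend on how you batch
per_example_loss = (per_token_loss * valid).sum(dim=1) / valid.sum(dim=1)
print(per_example_loss.mean())

Averaging per_example_loss this way gives the same value no matter how the examples are split into batches.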

ydshieh (Collaborator) commented Jul 24, 2023

There is a more detailed discussion in #24725.

namespace-Pt (Contributor, Author)

Got it. Thank you. So it should not be a macro-average.
