
reproducing the aesthetic experiment #3

Closed
seashell123 opened this issue Jul 12, 2023 · 7 comments

Comments

@seashell123

I am trying to reproduce the aesthetic experiment on a single GPU. I made the following changes to the config:

config.sample.batch_size = 1
config.sample.num_batches_per_epoch = 256
config.train.batch_size = 1
config.train.gradient_accumulation_steps = 128
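
For reference, the effective sample and training batch sizes these settings imply (assuming a single GPU; whether they exactly match the multi-GPU defaults is my own assumption) work out as follows:

```python
# Sanity check of the effective batch sizes implied by the config above (1 GPU assumed).
num_gpus = 1

# sample.batch_size * sample.num_batches_per_epoch * num_gpus
samples_per_epoch = 1 * 256 * num_gpus            # = 256

# train.batch_size * train.gradient_accumulation_steps * num_gpus
samples_per_gradient_update = 1 * 128 * num_gpus  # = 128

print(samples_per_epoch, samples_per_gradient_update)
```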

My results are summarized in the following figure:
[figure omitted]

A few questions I have regarding the results:

  1. The paper generates "stylized line drawings". However, neither the reference nor my results for ddpo-pytorch show this behavior.
  2. Why does my reward already start off at 5.5, which is higher than the value the paper's reward curve ends at (5.1)?
  3. Are there reference reward curves corresponding to teaser.jpg that we could compare with for all four experiments?
@kvablack
Owner

@seashell123 Great questions! You're 100% right that there's a big inconsistency between the aesthetic quality results in the original paper and this PyTorch release. I do know the reasons; I just haven't had the chance to document them anywhere, and I suppose here is as good a place as any.

TL;DR: the "stylized line drawing" results are only reproducible on TPUs

When I first went to reproduce our results in PyTorch, I found the same thing as you -- that the aesthetic quality experiment was not working. As a sanity check, I ran the PyTorch aesthetic scorer and Jax aesthetic scorer on the same images and found that they produced completely different results! This explains both the qualitative difference in the images and the quantitative difference in the reward curves.

After some debugging, I traced the problem to the CLIP embedding. The LAION aesthetic scorer is implemented as a small MLP applied to the CLIP embedding of the image. The problem wasn't in the MLP, but in the CLIP image encoder. The Jax and PyTorch versions of the HuggingFace transformers CLIPModel [1] were giving completely different embeddings for images that were identical down to the pixel. For the image I tested, the cosine similarity was only 0.8.
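
In case it's useful, a minimal sketch of that comparison (not my exact script; the checkpoint name, image path, and preprocessing below are illustrative) looks something like this:

```python
# Minimal sketch: compare the PyTorch and Flax CLIP image embeddings for one image.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, FlaxCLIPModel

ckpt = "openai/clip-vit-large-patch14"  # illustrative; the LAION scorer uses a ViT-L/14 CLIP, if I recall correctly
processor = CLIPProcessor.from_pretrained(ckpt)
pt_model = CLIPModel.from_pretrained(ckpt)
flax_model = FlaxCLIPModel.from_pretrained(ckpt)  # add from_pt=True if the checkpoint has no Flax weights

image = Image.open("test_image.png")

with torch.no_grad():
    pt_emb = pt_model.get_image_features(**processor(images=image, return_tensors="pt"))[0].numpy()
flax_emb = np.asarray(flax_model.get_image_features(**processor(images=image, return_tensors="np"))[0])

cos_sim = float(np.dot(pt_emb, flax_emb) / (np.linalg.norm(pt_emb) * np.linalg.norm(flax_emb)))
print(cos_sim)  # ~1.0 when both sides run on CPU/GPU; it was only ~0.8 with the Jax side on a TPU
```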

After a bunch more investigation (that was more like trial-and-error via banging my head against a wall in slightly different ways than proper debugging), I finally made a surprising discovery -- the difference was not actually being caused by Jax vs. PyTorch, but by floating-point formats. I was running all of the Jax code on TPUs, which use the bfloat16 format for multiplication by default [2]. This is implemented automatically at a hardware level, meaning that it happens even if all of your inputs, outputs, and parameters are float32. It turns out that if you run your Jax code on a CPU or GPU, you do get identical CLIP embeddings (and thus, aesthetic scores) to PyTorch running on a CPU or GPU. Similarly, if you call .bfloat16() on your model in PyTorch, you get similar results to the Jax + TPU version!
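
As a rough illustration of the kind of check I mean (again a sketch with an illustrative checkpoint and image, not the code I actually ran):

```python
# Sketch: how far does casting the CLIP image encoder to bfloat16 move the embedding?
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

ckpt = "openai/clip-vit-large-patch14"  # illustrative
processor = CLIPProcessor.from_pretrained(ckpt)
pixel_values = processor(images=Image.open("test_image.png"), return_tensors="pt").pixel_values

with torch.no_grad():
    emb_fp32 = CLIPModel.from_pretrained(ckpt).get_image_features(pixel_values=pixel_values)
    emb_bf16 = (
        CLIPModel.from_pretrained(ckpt)
        .bfloat16()
        .get_image_features(pixel_values=pixel_values.bfloat16())
        .float()
    )

# Noticeably below 1.0, and in the same ballpark as the Jax + TPU embedding,
# but not bit-for-bit identical to it.
print(torch.nn.functional.cosine_similarity(emb_fp32, emb_bf16).item())
```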

The key word in that last sentence is "similar". Using PyTorch's .bfloat16() produces aesthetic scores that are close enough to the Jax + TPU version to convince me that bfloat16 is indeed the culprit. However, casting all of the model weights, inputs, and outputs to bfloat16 is definitely not the exact same as whatever a TPU does at a hardware level [3]. I tried to reproduce the paper results using this PyTorch .bfloat16() version of the aesthetic scorer, but I had no luck, probably due to these tiny differences. This is the point at which I gave up.

To cap it off, I finally thought to check these two versions of the aesthetic scorer against ground-truth images from the LAION dataset. As you might expect, the less precise TPU/bfloat16 version is totally wrong -- it gives the ugliest meme a similar score to the most aesthetically beautiful painting.

The grand irony of all this is that even though this means the aesthetic quality results in the paper are totally wrong, I personally think the "stylized line drawings" we got are way more eye-catching than the "correct" results. In a way, they're made even cooler by the fact that they were produced by some bizarre deep learning alchemy involving the interaction between bfloat16, a CLIP ViT, and a pretrained MLP.

Moving forward, I'm definitely going to re-run the experiments and update the paper. It'll be sad to see the stylized line drawings go, though; maybe I'll keep them in the appendix somewhere. You're also welcome to pick up where I left off and try to exactly reproduce the weird TPU/bfloat16 aesthetic scores without a TPU. Regardless, thanks for taking such an interest in DDPO, and for motivating me to finally write up this story!

[1] Spoiler: it was not HuggingFace's fault! And it turns out they are aware of the issue. However, it might be nice to add a big warning on the Flax versions of their models since the difference was so shockingly large (at least for the CLIP image encoder).
[2] Apparently this is possible to change (a sketch of the relevant knob is below these footnotes) -- however, it's not really the kind of thing you find out about until you run directly into it by having an issue like this.
[3] If you really want to get into the details, the bfloat16 blog post says that each multiply-accumulate uses bfloat16 for multiplication and float32 for accumulation. The same blog post also says that "while it is possible to observe the effects of bfloat16, this typically requires careful numerical analysis of the computation’s outputs," so... lol.
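
For completeness, the knob footnote [2] refers to looks roughly like this in Jax; whether forcing float32 matmuls on a TPU exactly recovers the CPU/GPU numerics is something I haven't verified:

```python
# Sketch: overriding the TPU's default bfloat16 matmul precision in Jax.
import jax
import jax.numpy as jnp

# Globally, before any computation:
jax.config.update("jax_default_matmul_precision", "float32")

# Or locally, around a specific piece of code:
with jax.default_matmul_precision("float32"):
    x = jnp.ones((256, 256))
    y = x @ x  # matmuls in this block request full float32 precision, even on TPU
```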

@kvablack
Owner

@seashell123 With regard to your last question, I just added reference reward curves to the readme!

@seashell123
Author

Thank you for such a detailed explanation, @kvablack! Good to know. And thanks for posting the reference reward curves!

@JacobYuan7

It is indeed a detailed explanation that resolves the concern!

@bhattg

bhattg commented Sep 18, 2023

Hi @kvablack, I just saw that there are some new reward curves for the aesthetic score, which seem to be working well now. Can you please let me know what changed between the reward curves in the older and newer readme?

@kvablack
Owner

@bhattg Yes, I re-ran the aesthetic experiment with the gradient synchronization fixed (#10) and it did seem to perform better.

@bhattg

bhattg commented Sep 18, 2023

Thanks for the quick response!
