
reproducing the aesthetic experiment #3

Closed
seashell123 opened this issue Jul 12, 2023 · 7 comments

Comments

@seashell123

I am trying to reproduce the aesthetic experiment on a single GPU. I made the following changes to the config:

config.sample.batch_size = 1
config.sample.num_batches_per_epoch = 256
config.train.batch_size = 1
config.train.gradient_accumulation_steps = 128
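
For reference, the effective sample and training batch sizes these settings imply (assuming a single GPU; whether they exactly match the multi-GPU defaults is my own assumption) work out as follows:

```python
# Sanity check of the effective batch sizes implied by the config above (1 GPU assumed).
num_gpus = 1

# sample.batch_size * sample.num_batches_per_epoch * num_gpus
samples_per_epoch = 1 * 256 * num_gpus            # = 256

# train.batch_size * train.gradient_accumulation_steps * num_gpus
samples_per_gradient_update = 1 * 128 * num_gpus  # = 128

print(samples_per_epoch, samples_per_gradient_update)
```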

My results are summarized in the following figure:
[figure omitted]

A few questions I have regarding the results:

  1. The paper generates "stylized line drawings". However, neither the reference nor my results for ddpo-pytorch show this behavior.
  2. Why does my reward already start off at 5.5, which is higher than the value the paper's reward curve ends at (5.1)?
  3. Are there reference reward curves corresponding to teaser.jpg that we could compare with for all four experiments?
@kvablack
Owner

@seashell123 Great questions! You're 100% right that there's a big inconsistency between the aesthetic quality results in the original paper and this PyTorch release. I do know the reasons; I just haven't had the chance to document them anywhere, and I suppose here is as good a place as any.

TL;DR: the "stylized line drawing" results are only reproducible on TPUs

When I first went to reproduce our results in PyTorch, I found the same thing as you -- that the aesthetic quality experiment was not working. As a sanity check, I ran the PyTorch aesthetic scorer and Jax aesthetic scorer on the same images and found that they produced completely different results! This explains both the qualitative difference in the images and the quantitative difference in the reward curves.

After some debugging, I traced the problem to the CLIP embedding. The LAION aesthetic scorer is implemented as a small MLP applied to the CLIP embedding of the image. The problem wasn't in the MLP, but in the CLIP image encoder. The Jax and PyTorch versions of the HuggingFace transformers CLIPModel [1] were giving completely different embeddings for images that were identical down to the pixel. For the image I tested, the cosine similarity was only 0.8.
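
In case it's useful, a minimal sketch of that comparison (not my exact script; the checkpoint name, image path, and preprocessing below are illustrative) looks something like this:

```python
# Minimal sketch: compare the PyTorch and Flax CLIP image embeddings for one image.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, FlaxCLIPModel

ckpt = "openai/clip-vit-large-patch14"  # illustrative; the LAION scorer uses a ViT-L/14 CLIP, if I recall correctly
processor = CLIPProcessor.from_pretrained(ckpt)
pt_model = CLIPModel.from_pretrained(ckpt)
flax_model = FlaxCLIPModel.from_pretrained(ckpt)  # add from_pt=True if the checkpoint has no Flax weights

image = Image.open("test_image.png")

with torch.no_grad():
    pt_emb = pt_model.get_image_features(**processor(images=image, return_tensors="pt"))[0].numpy()
flax_emb = np.asarray(flax_model.get_image_features(**processor(images=image, return_tensors="np"))[0])

cos_sim = float(np.dot(pt_emb, flax_emb) / (np.linalg.norm(pt_emb) * np.linalg.norm(flax_emb)))
print(cos_sim)  # ~1.0 when both sides run on CPU/GPU; it was only ~0.8 with the Jax side on a TPU
```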

After a bunch more investigation (that was more like trial-and-error via banging my head against a wall in slightly different ways than proper debugging), I finally made a surprising discovery -- the difference was not actually being caused by Jax vs. PyTorch, but by floating-point formats. I was running all of the Jax code on TPUs, which use the bfloat16 format for multiplication by default [2]. This is implemented automatically at a hardware level, meaning that it happens even if all of your inputs, outputs, and parameters are float32. It turns out that if you run your Jax code on a CPU or GPU, you do get identical CLIP embeddings (and thus, aesthetic scores) to PyTorch running on a CPU or GPU. Similarly, if you call .bfloat16() on your model in PyTorch, you get similar results to the Jax + TPU version!
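
As a rough illustration of the kind of check I mean (again a sketch with an illustrative checkpoint and image, not the code I actually ran):

```python
# Sketch: how far does casting the CLIP image encoder to bfloat16 move the embedding?
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

ckpt = "openai/clip-vit-large-patch14"  # illustrative
processor = CLIPProcessor.from_pretrained(ckpt)
pixel_values = processor(images=Image.open("test_image.png"), return_tensors="pt").pixel_values

with torch.no_grad():
    emb_fp32 = CLIPModel.from_pretrained(ckpt).get_image_features(pixel_values=pixel_values)
    emb_bf16 = (
        CLIPModel.from_pretrained(ckpt)
        .bfloat16()
        .get_image_features(pixel_values=pixel_values.bfloat16())
        .float()
    )

# Noticeably below 1.0, and in the same ballpark as the Jax + TPU embedding,
# but not bit-for-bit identical to it.
print(torch.nn.functional.cosine_similarity(emb_fp32, emb_bf16).item())
```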

The key word in that last sentence is "similar". Using PyTorch's .bfloat16() produces aesthetic scores that are close enough to the Jax + TPU version to convince me that bfloat16 is indeed the culprit. However, casting all of the model weights, inputs, and outputs to bfloat16 is definitely not the exact same as whatever a TPU does at a hardware level [3]. I tried to reproduce the paper results using this PyTorch .bfloat16() version of the aesthetic scorer, but I had no luck, probably due to these tiny differences. This is the point at which I gave up.

To cap it off, I finally thought to check these two versions of the aesthetic scorer against ground-truth images from the LAION dataset. As you might expect, the less precise TPU/bfloat16 version is totally wrong -- it gives the ugliest meme a similar score to the most aesthetically beautiful painting.

The grand irony of all this is that even though this means the aesthetic quality results in the paper are totally wrong, I personally think the "stylized line drawings" we got are way more eye-catching than the "correct" results. In a way, they're made even cooler by the fact that they were produced by some bizarre deep learning alchemy involving the interaction between bfloat16, a CLIP ViT, and a pretrained MLP.

Moving forward, I'm definitely going to re-run the experiments and update the paper. It'll be sad to see the stylized line drawings go, though; maybe I'll keep them in the appendix somewhere. You're also welcome to pick up where I left off and try to exactly reproduce the weird TPU/bfloat16 aesthetic scores without a TPU. Regardless, thanks for taking such an interest in DDPO, and for motivating me to finally write up this story!

[1] Spoiler: it was not HuggingFace's fault! And it turns out they are aware of the issue. However, it might be nice to add a big warning on the Flax versions of their models since the difference was so shockingly large (at least for the CLIP image encoder).
[2] Apparently this is possible to change (a sketch of the relevant knob is below these footnotes) -- however, it's not really the kind of thing you find out about until you run directly into it by having an issue like this.
[3] If you really want to get into the details, the bfloat16 blog post says that each multiply-accumulate uses bfloat16 for multiplication and float32 for accumulation. The same blog post also says that "while it is possible to observe the effects of bfloat16, this typically requires careful numerical analysis of the computation’s outputs," so... lol.
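
For completeness, the knob footnote [2] refers to looks roughly like this in Jax; whether forcing float32 matmuls on a TPU exactly recovers the CPU/GPU numerics is something I haven't verified:

```python
# Sketch: overriding the TPU's default bfloat16 matmul precision in Jax.
import jax
import jax.numpy as jnp

# Globally, before any computation:
jax.config.update("jax_default_matmul_precision", "float32")

# Or locally, around a specific piece of code:
with jax.default_matmul_precision("float32"):
    x = jnp.ones((256, 256))
    y = x @ x  # matmuls in this block request full float32 precision, even on TPU
```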

@kvablack
Owner

@seashell123 With regard to your last question, I just added reference reward curves to the readme!

@seashell123
Author

Thank you for such a detailed explanation, @kvablack! Good to know. And thanks for posting the reference reward curves!

@JacobYuan7

It is indeed a detailed explanation that resolves the concern!

@bhattg

bhattg commented Sep 18, 2023

Hi @kvablack, I just saw that there are some new reward curves for the aesthetic score, which seem to be working well now. Can you please let me know what changed between the reward curves in the older and newer readme?

@kvablack
Owner

@bhattg Yes, I re-ran the aesthetic experiment with the gradient synchronization fixed (#10) and it did seem to perform better.

@bhattg

bhattg commented Sep 18, 2023

Thanks for the quick response!
