
Questions about input of VLDM #16

Open · zhaojiancheng007 opened this issue Oct 16, 2023 · 6 comments

@zhaojiancheng007 commented Oct 16, 2023

Excellent job!
I have some questions about the code.

[screenshot of the code in question]

Is the input of VLDM taken from batch_rgb (a.k.a. query_rgb)? But don't we lack query_rgb at the training stage?
And where does z_scale_factor come from? Why is it fixed to 0.18215?
Thanks!

@zhizdev (Owner) commented Oct 16, 2023

Hi!

batch_rgb is a batch of RGB images. We encode batch_rgb into batch latents, images_z. Our model predicts the latent images_z, not the RGB image directly. We use a frozen VAE.

z_scale_factor is a hyperparameter from Stable Diffusion (they multiply the VAE latents by it before performing diffusion).
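Roughly, the encode-and-scale step looks like this, assuming a diffusers-style AutoencoderKL (the checkpoint name is illustrative, not necessarily what this repo loads):

```python
import torch
from diffusers import AutoencoderKL

# Frozen VAE, as in Stable Diffusion (illustrative checkpoint).
vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"
)
vae.requires_grad_(False)

batch_rgb = torch.rand(4, 3, 256, 256) * 2 - 1  # RGB images in [-1, 1]

with torch.no_grad():
    posterior = vae.encode(batch_rgb).latent_dist
    images_z = posterior.sample() * 0.18215  # z_scale_factor from SD

# Diffusion runs on images_z; decoding divides by the same factor:
# rgb = vae.decode(images_z / 0.18215).sample
```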

@zhaojiancheng007 (Author)

Thanks for your prompt reply!

Maybe I got it wrong. The objective of this paper is: given 2 views, generate new views directly. So at the training stage, all we have is input_rgb (context views), and batch_rgb (query views) is what we want to generate. How can it be used as input to the VAE encoder and the diffusion model?

Or is this framework not instance-specific? That is, we train the model on many instances and generalize to new instances at the distillation stage? If so, can you explain a little bit more about it?

Thank you so much for your help!

@zhaojiancheng007 (Author)

Sorry, I have another question: have you tried single-view reconstruction with this framework?

Thanks for your help!

@zhizdev (Owner) commented Oct 17, 2023

VLDM is not instance-specific. It is trained on many instances. During training, the model sees input_rgb, input_cameras, and query_cameras. We have ground truth for query_rgb, which we use to supervise the model. Since we use a latent diffusion model, the model predicts latent codes internally.
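Schematically, the training step looks something like this (a sketch with hypothetical names; the scheduler is diffusers-style, and model stands in for the view-conditioned denoiser):

```python
import torch
import torch.nn.functional as F

def training_step(model, vae, scheduler, input_rgb, input_cameras,
                  query_rgb, query_cameras, z_scale_factor=0.18215):
    # Ground-truth query views exist at training time; encode them
    # with the frozen VAE to get the latent target x_0.
    with torch.no_grad():
        x0 = vae.encode(query_rgb).latent_dist.sample() * z_scale_factor

    # Standard DDPM-style noising of the latent target.
    noise = torch.randn_like(x0)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (x0.shape[0],), device=x0.device)
    x_t = scheduler.add_noise(x0, noise, t)

    # The denoiser sees the context views and both camera sets,
    # never the query RGB itself.
    pred = model(x_t, t, input_rgb, input_cameras, query_cameras)
    return F.mse_loss(pred, noise)
```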

Single-view reconstruction requires a bit more careful consideration, since the object scale is ambiguous.

@zhaojiancheng007 (Author)

Thanks for your patience. I may not be asking my question the right way.

[screenshot]

This view, which is batch_rgb in the code and is fed into the VAE encoder, is the query_rgb? So at the training stage, you pass the ground-truth query_rgb to the VAE encoder, and use its latent as the x_0 of the diffusion model?

I've been trying your model on a custom dataset. You said batch_rgb is a batch of images, and what I get out of the dataloader for one iteration has shape [1, B, H, W]. Can I understand these dimensions as follows?
1 --> batch_size of the dataloader
B --> sequence length, or the pre-defined sample_batch_size

By the same logic, the camera parameters right out of the dataloader should have matching shapes, for example:
T --> [1, B, 3, 1]
R --> [1, B, 3, 3]
focal_length --> [1, B, 2, 2]
principal_point --> [1, B, 2, 2]
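Here is a minimal dummy dataset reproducing the camera shapes above (all names are mine, not from the repo):

```python
import torch
from torch.utils.data import Dataset, DataLoader

# Dummy dataset emitting the per-instance camera shapes listed above.
class DummyCameraDataset(Dataset):
    def __init__(self, num_instances=10, B=8):
        self.n, self.B = num_instances, B

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return {
            "T": torch.zeros(self.B, 3, 1),
            "R": torch.eye(3).expand(self.B, 3, 3).clone(),
            "focal_length": torch.ones(self.B, 2, 2),
            "principal_point": torch.zeros(self.B, 2, 2),
        }

# batch_size=1 adds the leading 1, e.g. R -> [1, B, 3, 3].
loader = DataLoader(DummyCameraDataset(), batch_size=1)
batch = next(iter(loader))
print(batch["R"].shape)  # torch.Size([1, 8, 3, 3])
```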

I've made my own dataset and dataloader to output data in the above form, but every time the code needs to compute the inverse transform, it exits with an error:

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)

[screenshots of the stack trace]

Have you ever met this error in your experiments?

@zhizdev (Owner) commented Oct 20, 2023

Usually CUBLAS_STATUS_NOT_INITIALIZED means the program or environment cannot find a CUDA GPU.
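A quick generic sanity check (plain PyTorch, nothing repo-specific):

```python
import torch

# If either check fails, cuBLAS kernels (e.g. the matrix inverse above)
# will raise CUBLAS_STATUS_NOT_INITIALIZED.
assert torch.cuda.is_available(), "no CUDA device visible to PyTorch"
print(torch.cuda.get_device_name(0))
print(torch.version.cuda)  # CUDA version PyTorch was built with

# Force a cuBLAS call on an invertible (SPD) matrix to confirm the setup.
a = torch.randn(4, 4, device="cuda")
print(torch.linalg.inv(a @ a.T + torch.eye(4, device="cuda")))
```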
