
Questions about input of VLDM #16

Open · zhaojiancheng007 opened this issue Oct 16, 2023 · 6 comments

@zhaojiancheng007 commented Oct 16, 2023

Excellent job!
I have some questions about the code.

[screenshot of the code in question]

Is the input of VLDM taken from batch_rgb (a.k.a. query_rgb)? But don't we lack query_rgb at the training stage?
And where does z_scale_factor come from? Why is it fixed to 0.18215?
Thanks!

@zhizdev (Owner) commented Oct 16, 2023

Hi!

batch_rgb is a batch of RGB images. We encode batch_rgb into batch latents, images_z. Our model predicts the latent images_z, not the RGB image directly. We use a frozen VAE.

z_scale_factor is a hyperparameter from Stable Diffusion (they multiply the VAE latents by it before performing diffusion).
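Roughly, the encode-and-scale step looks like this, assuming a diffusers-style AutoencoderKL (the checkpoint name is illustrative, not necessarily what this repo loads):

```python
import torch
from diffusers import AutoencoderKL

# Frozen VAE, as in Stable Diffusion (illustrative checkpoint).
vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"
)
vae.requires_grad_(False)

batch_rgb = torch.rand(4, 3, 256, 256) * 2 - 1  # RGB images in [-1, 1]

with torch.no_grad():
    posterior = vae.encode(batch_rgb).latent_dist
    images_z = posterior.sample() * 0.18215  # z_scale_factor from SD

# Diffusion runs on images_z; decoding divides by the same factor:
# rgb = vae.decode(images_z / 0.18215).sample
```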

@zhaojiancheng007 (Author)

Thanks for your prompt reply!

Maybe I got it wrong. The objective of this paper is: given 2 views, generate new views directly. So at the training stage, all we have is input_rgb (context views), and batch_rgb (query views) is what we want to generate. How can it be used as input to the VAE encoder and the diffusion model?

Or is this framework not instance-specific? That is, we train the model on many instances and generalize to new instances at the distillation stage? If so, can you explain a little bit more about it?

Thank you so much for your help!

@zhaojiancheng007 (Author)

Sorry, I have another question: have you tried single-view reconstruction with this framework?

Thanks for your help!

@zhizdev (Owner) commented Oct 17, 2023

VLDM is not instance-specific. It is trained on many instances. During training, the model sees input_rgb, input_cameras, and query_cameras. We have ground truth for query_rgb, which we use to supervise the model. Since we use a latent diffusion model, the model predicts latent codes internally.
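Schematically, the training step looks something like this (a sketch with hypothetical names; the scheduler is diffusers-style, and model stands in for the view-conditioned denoiser):

```python
import torch
import torch.nn.functional as F

def training_step(model, vae, scheduler, input_rgb, input_cameras,
                  query_rgb, query_cameras, z_scale_factor=0.18215):
    # Ground-truth query views exist at training time; encode them
    # with the frozen VAE to get the latent target x_0.
    with torch.no_grad():
        x0 = vae.encode(query_rgb).latent_dist.sample() * z_scale_factor

    # Standard DDPM-style noising of the latent target.
    noise = torch.randn_like(x0)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (x0.shape[0],), device=x0.device)
    x_t = scheduler.add_noise(x0, noise, t)

    # The denoiser sees the context views and both camera sets,
    # never the query RGB itself.
    pred = model(x_t, t, input_rgb, input_cameras, query_cameras)
    return F.mse_loss(pred, noise)
```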

Single-view reconstruction requires a bit more careful consideration, since the object scale is ambiguous.

@zhaojiancheng007 (Author)

Thanks for your patience. I may not be asking my question the right way.

[screenshot]

This view, which is batch_rgb in the code and is fed into the VAE encoder, is the query_rgb? So at the training stage, you pass the ground-truth query_rgb to the VAE encoder, and use its latent as the x_0 of the diffusion model?

I've been trying your model on a custom dataset. You said batch_rgb is a batch of images, and what I get out of the dataloader for one iteration has shape [1, B, H, W]. Can I understand these dimensions as follows?
1 --> batch_size of the dataloader
B --> sequence length, or the pre-defined sample_batch_size

By the same logic, the camera parameters right out of the dataloader should have matching shapes, for example:
T --> [1, B, 3, 1]
R --> [1, B, 3, 3]
focal_length --> [1, B, 2, 2]
principal_point --> [1, B, 2, 2]
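Here is a minimal dummy dataset reproducing the camera shapes above (all names are mine, not from the repo):

```python
import torch
from torch.utils.data import Dataset, DataLoader

# Dummy dataset emitting the per-instance camera shapes listed above.
class DummyCameraDataset(Dataset):
    def __init__(self, num_instances=10, B=8):
        self.n, self.B = num_instances, B

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return {
            "T": torch.zeros(self.B, 3, 1),
            "R": torch.eye(3).expand(self.B, 3, 3).clone(),
            "focal_length": torch.ones(self.B, 2, 2),
            "principal_point": torch.zeros(self.B, 2, 2),
        }

# batch_size=1 adds the leading 1, e.g. R -> [1, B, 3, 3].
loader = DataLoader(DummyCameraDataset(), batch_size=1)
batch = next(iter(loader))
print(batch["R"].shape)  # torch.Size([1, 8, 3, 3])
```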

I've made my own dataset and dataloader to output data in the above form, but every time the code needs to compute the inverse transform, it exits with an error:

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)

[screenshots of the stack trace]

Have you ever met this error in your experiments?

@zhizdev (Owner) commented Oct 20, 2023

Usually CUBLAS_STATUS_NOT_INITIALIZED means the program or environment cannot find a CUDA GPU.
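A quick generic sanity check (plain PyTorch, nothing repo-specific):

```python
import torch

# If either check fails, cuBLAS kernels (e.g. the matrix inverse above)
# will raise CUBLAS_STATUS_NOT_INITIALIZED.
assert torch.cuda.is_available(), "no CUDA device visible to PyTorch"
print(torch.cuda.get_device_name(0))
print(torch.version.cuda)  # CUDA version PyTorch was built with

# Force a cuBLAS call on an invertible (SPD) matrix to confirm the setup.
a = torch.randn(4, 4, device="cuda")
print(torch.linalg.inv(a @ a.T + torch.eye(4, device="cuda")))
```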
