
Splatfacto fails due to the model being train mode instead of eval mode #3253

Closed
kstoneriv3 opened this issue Jun 24, 2024 · 3 comments · Fixed by #3430
Labels: bug (Something isn't working)

Comments

kstoneriv3 (Contributor) commented Jun 24, 2024

Describe the bug
After 2000 steps of the ns-train command (nerfstudio==1.1.3), the following error occurs.

Printing profiling stats, from longest to shortest duration in seconds
VanillaPipeline.get_average_eval_image_metrics: 0.2351
VanillaPipeline.get_average_image_metrics: 0.2226
VanillaPipeline.get_eval_image_metrics_and_images: 0.0689
Trainer.train_iteration: 0.0501
VanillaPipeline.get_train_loss_dict: 0.0414
Trainer.eval_iteration: 0.0010
Traceback (most recent call last):
  File "/usr/local/bin/ns-train", line 8, in <module>
    sys.exit(entrypoint())
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/scripts/train.py", line 262, in entrypoint
    main(
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/scripts/train.py", line 247, in main
    launch(
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/scripts/train.py", line 189, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/scripts/train.py", line 100, in train_loop
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/engine/trainer.py", line 298, in train
    self.eval_iteration(step)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/utils/decorators.py", line 70, in wrapper
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/utils/profiler.py", line 112, in inner
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/engine/trainer.py", line 545, in eval_iteration
    metrics_dict, images_dict = self.pipeline.get_eval_image_metrics_and_images(step=step)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/utils/profiler.py", line 112, in inner
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/pipelines/base_pipeline.py", line 341, in get_eval_image_metrics_and_images
    metrics_dict, images_dict = self.model.get_image_metrics_and_images(outputs, batch)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/models/splatfacto.py", line 926, in get_image_metrics_and_images
    combined_rgb = torch.cat([gt_rgb, predicted_rgb], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 239 but got size 959 for tensor number 1 in the list.
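
For what it's worth, the mismatch (239 vs. 959, roughly a 4x factor) would be consistent with the ground-truth image being downscaled on a train-mode code path while the prediction is rendered at full camera resolution. A minimal sketch of the failing concatenation, with hypothetical shapes:

    import torch

    # Hypothetical shapes: the ground truth has been downscaled (train-mode
    # resolution schedule) while the render is at full camera resolution.
    gt_rgb = torch.zeros(239, 240, 3)         # downscaled GT, (H, W, C)
    predicted_rgb = torch.zeros(959, 960, 3)  # full-resolution render
    # torch.cat along dim=1 requires every other dimension to match, so this
    # raises the RuntimeError shown above.
    combined_rgb = torch.cat([gt_rgb, predicted_rgb], dim=1)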

To Reproduce
It happens locally while using the viewer, but I don't have a simple way to reproduce it.

Expected behavior
No RuntimeError is raised here.

Screenshots
If applicable, add screenshots to help explain your problem.

Additional context
When I set a PDB breakpoint there for quick debugging, I saw that the model was somehow in train mode instead of eval mode. Replacing

metrics_dict, images_dict = self.model.get_image_metrics_and_images(outputs, batch)
with the following fixed the error for me, but it probably needs a more permanent fix.

        try:
            metrics_dict, images_dict = self.model.get_image_metrics_and_images(outputs, batch)
        except Exception:
            self.model.eval()  # The code fails here due to model.training == True for some reason.
            metrics_dict, images_dict = self.model.get_image_metrics_and_images(outputs, batch)

It is unclear why the model is still in train mode at this line, but that is how it worked for me.

I had the same error at the following line and the same fix worked.

metrics_dict, image_dict = self.model.get_image_metrics_and_images(outputs, batch)

An additional note (2 weeks after initial posting)

I found that the above try-except statement still fails in the except block in some cases if I interact heavily with the viewer and the number of images is large. The following is probably a better (temporary) fix.

        while True:
            try:
                metrics_dict, images_dict = self.model.get_image_metrics_and_images(outputs, batch)
                break
            except Exception:
                self.model.eval()  # The code fails here due to model.training == True for some reason.
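
A retry loop like this can also mask unrelated failures. An alternative temporary workaround (only a sketch; force_eval is a hypothetical helper, not part of nerfstudio) is to pin the model to eval mode for the duration of the metrics call and restore the previous mode afterwards:

    from contextlib import contextmanager

    @contextmanager
    def force_eval(model):
        # Hypothetical helper: put the model into eval mode for the metrics
        # call and restore the previous mode afterwards, instead of retrying
        # on arbitrary exceptions.
        was_training = model.training
        model.eval()
        try:
            yield
        finally:
            if was_training:
                model.train()

    # Usage inside get_eval_image_metrics_and_images (sketch):
    # with force_eval(self.model):
    #     metrics_dict, images_dict = self.model.get_image_metrics_and_images(outputs, batch)

This still races with whatever is flipping the mode from another thread, but it at least avoids swallowing unrelated exceptions.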
jb-ye (Collaborator) commented Jun 25, 2024

I have seen a similar issue before and don't yet have a clue how the viewer can mutate the training/eval mode of the model.

jb-ye added the bug label on Jun 25, 2024
Jovp commented Jul 18, 2024

Seeing the same issue here while using both the viewer and TensorBoard. My take is that the eval images are already downsampled when loaded (according to their path) and then get downsampled again.

The error arises during evaluation
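
A quick way to test that hypothesis (a sketch; it assumes the usual batch["image"] and outputs["rgb"] keys) is to log the shapes and the training flag right before the failing torch.cat in get_image_metrics_and_images:

    # Shows which tensor is at reduced resolution and whether the model is
    # (unexpectedly) still in train mode when the evaluation runs.
    print("gt:", batch["image"].shape,
          "pred:", outputs["rgb"].shape,
          "training:", self.training)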

LaFeuilleMorte commented

Same issue, waiting for a fix.
