
Splatfacto fails due to the model being train mode instead of eval mode #3253

Closed
kstoneriv3 opened this issue Jun 24, 2024 · 3 comments · Fixed by #3430
Labels: bug (Something isn't working)

Comments

kstoneriv3 (Contributor) commented Jun 24, 2024

Describe the bug
After 2000 steps of the ns-train command (nerfstudio==1.1.3), the following error occurs.

Printing profiling stats, from longest to shortest duration in seconds
VanillaPipeline.get_average_eval_image_metrics: 0.2351
VanillaPipeline.get_average_image_metrics: 0.2226
VanillaPipeline.get_eval_image_metrics_and_images: 0.0689
Trainer.train_iteration: 0.0501
VanillaPipeline.get_train_loss_dict: 0.0414
Trainer.eval_iteration: 0.0010
Traceback (most recent call last):
  File "/usr/local/bin/ns-train", line 8, in <module>
    sys.exit(entrypoint())
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/scripts/train.py", line 262, in entrypoint
    main(
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/scripts/train.py", line 247, in main
    launch(
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/scripts/train.py", line 189, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/scripts/train.py", line 100, in train_loop
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/engine/trainer.py", line 298, in train
    self.eval_iteration(step)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/utils/decorators.py", line 70, in wrapper
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/utils/profiler.py", line 112, in inner
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/engine/trainer.py", line 545, in eval_iteration
    metrics_dict, images_dict = self.pipeline.get_eval_image_metrics_and_images(step=step)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/utils/profiler.py", line 112, in inner
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/pipelines/base_pipeline.py", line 341, in get_eval_image_metrics_and_images
    metrics_dict, images_dict = self.model.get_image_metrics_and_images(outputs, batch)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/models/splatfacto.py", line 926, in get_image_metrics_and_images
    combined_rgb = torch.cat([gt_rgb, predicted_rgb], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 239 but got size 959 for tensor number 1 in the list.
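
For what it's worth, the mismatch (239 vs. 959, roughly a 4x factor) would be consistent with the ground-truth image being downscaled on a train-mode code path while the prediction is rendered at full camera resolution. A minimal sketch of the failing concatenation, with hypothetical shapes:

    import torch

    # Hypothetical shapes: the ground truth has been downscaled (train-mode
    # resolution schedule) while the render is at full camera resolution.
    gt_rgb = torch.zeros(239, 240, 3)         # downscaled GT, (H, W, C)
    predicted_rgb = torch.zeros(959, 960, 3)  # full-resolution render
    # torch.cat along dim=1 requires every other dimension to match, so this
    # raises the RuntimeError shown above.
    combined_rgb = torch.cat([gt_rgb, predicted_rgb], dim=1)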

To Reproduce
It happens locally while using the viewer, but I don't have a simple way to reproduce it.

Expected behavior
No RuntimeError is raised here.

Screenshots
If applicable, add screenshots to help explain your problem.

Additional context
When I set a PDB breakpoint there for quick debugging, I saw that the model was somehow in train mode instead of eval mode. Replacing

metrics_dict, images_dict = self.model.get_image_metrics_and_images(outputs, batch)
with the following fixed the error for me, but it probably needs a more permanent fix.

        try:
            metrics_dict, images_dict = self.model.get_image_metrics_and_images(outputs, batch)
        except Exception:
            self.model.eval()  # The code fails here due to model.training == True for some reason.
            metrics_dict, images_dict = self.model.get_image_metrics_and_images(outputs, batch)

It is unclear why the model is still in train mode at this line, but that is how it worked for me.

I had the same error at the following line and the same fix worked.

metrics_dict, image_dict = self.model.get_image_metrics_and_images(outputs, batch)

An additional note (2 weeks after initial posting)

I found that the above try-except statement still fails in the except block in some cases if I interact heavily with the viewer and the number of images is large. The following is probably a better (temporary) fix.

        while True:
            try:
                metrics_dict, images_dict = self.model.get_image_metrics_and_images(outputs, batch)
                break
            except Exception:
                self.model.eval()  # The code fails here due to model.training == True for some reason.
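
A retry loop like this can also mask unrelated failures. An alternative temporary workaround (only a sketch; force_eval is a hypothetical helper, not part of nerfstudio) is to pin the model to eval mode for the duration of the metrics call and restore the previous mode afterwards:

    from contextlib import contextmanager

    @contextmanager
    def force_eval(model):
        # Hypothetical helper: put the model into eval mode for the metrics
        # call and restore the previous mode afterwards, instead of retrying
        # on arbitrary exceptions.
        was_training = model.training
        model.eval()
        try:
            yield
        finally:
            if was_training:
                model.train()

    # Usage inside get_eval_image_metrics_and_images (sketch):
    # with force_eval(self.model):
    #     metrics_dict, images_dict = self.model.get_image_metrics_and_images(outputs, batch)

This still races with whatever is flipping the mode from another thread, but it at least avoids swallowing unrelated exceptions.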
jb-ye (Collaborator) commented Jun 25, 2024

I have seen a similar issue before and don't yet have a clue how the viewer can mutate the training/eval mode of the model.

jb-ye added the bug label on Jun 25, 2024
Jovp commented Jul 18, 2024

Seeing the same issue here while using both the viewer and TensorBoard. My take is that the eval images are already downsampled when loaded (according to their path) and then get downsampled again.

The error arises during evaluation
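
A quick way to test that hypothesis (a sketch; it assumes the usual batch["image"] and outputs["rgb"] keys) is to log the shapes and the training flag right before the failing torch.cat in get_image_metrics_and_images:

    # Shows which tensor is at reduced resolution and whether the model is
    # (unexpectedly) still in train mode when the evaluation runs.
    print("gt:", batch["image"].shape,
          "pred:", outputs["rgb"].shape,
          "training:", self.training)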

LaFeuilleMorte commented

Same issue, waiting for a fix.
