
Logging validation metrics with val_check_interval #5070

Closed
tchang1997 opened this issue Dec 10, 2020 · 5 comments · Fixed by #5351
Labels
question Further information is requested

Comments

tchang1997 commented Dec 10, 2020

❓ Questions and Help

What is your question?

My training epoch is very long, so I'd like to run mini-validation loops in the middle of the epoch, which I can do easily by setting val_check_interval=N. However, I'd also like to log the values produced by those mini-validation loops.

Is there a canonical way to log validation metrics within a training epoch while using val_check_interval in the Trainer?

Related issues:

#4980 -- However, I'm not trying to log step-wise validation metrics; I'm trying to log aggregate validation epoch metrics.
#4985 -- Similar, but I'm using the WandbLogger directly while this issue is more general.
#4652 -- Similar issue, but this is an older version of pytorch-lightning.

Code

Below is my code for the methods that involve logging. Unfortunately, I can't post the full repo, but here's the relevant configuration (loaded from YAML) for the Trainer and Logger objects; I've replicated the setup in a Colab linked in a comment below.

trainer:
    gradient_clip_val: 0.1
    accumulate_grad_batches: 8
    num_sanity_val_steps: 2
    gpus: 1
    max_epochs: 1
    val_check_interval: 2000
    limit_val_batches: 100
    move_metrics_to_cpu: True
logging:
    save_dir: "sentcomp_logs/"
    project: "sent-comp"
    offline: False
    name: "bart_full_run"

The following is the actual logging code. For context, I am training a Seq2Seq model from HuggingFace.

# THIS WORKS FINE!
def training_step(self, batch: BatchEncoding, batch_idx: int) -> torch.Tensor:
    outputs = self(**batch)
    loss = outputs[0]
    self.log("train_loss", loss, prog_bar=True, sync_dist=True)
    return loss

# DOES NOT WORK
def validation_step(self, batch: BatchEncoding, batch_idx: int) -> Dict[str, torch.Tensor]:
    outputs = self(**batch)
    val_loss = outputs[0]
    predictions, targets = self.decode(batch)  # returns lists of strings
    results = self.score(predictions, targets)  # returns a dictionary
    self.log("val_loss", val_loss, sync_dist=True)
    self.log_dict(results, sync_dist=True)
    return results

Expected: logs to my wandb project.
Actual: a warning: wandb: WARNING Step must only increase in log calls.

It seems like the global step of the validator is out of sync with my logging calls, as I get messages like:

Step 249 < 2000; dropping { <my metrics> }
Step 499 < 4000; dropping { <my metrics> }
Step 749 < 6000; dropping { <my metrics> }

The global step numbers are definitely consistent with the val_check_interval provided (2000 * n). However, I am unsure why the validation logging call results in an attempt to log at steps in increments of 250. Maybe I need to somehow sync the validation logging step with the training logging step? Or is pytorch-lightning supposed to take care of this internally?
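
One possible reading of those numbers, based on the accumulate_grad_batches: 8 setting in the config above (just my guess, not verified against the Lightning internals):

# If global_step only advances once per optimizer step, then with gradient
# accumulation the step counter lags the raw batch count by a factor of 8.
accumulate_grad_batches = 8
val_check_interval = 2000  # measured in training batches

steps_per_validation = val_check_interval // accumulate_grad_batches
print(steps_per_validation)  # 250 -- matching the 249 / 499 / 749 steps above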

Is this a bug, or am I doing something wrong/silly?

What have you tried?

Removing the log_dict call from validation steps and instead logging an aggregated dictionary in validation_epoch_end:

def validation_epoch_end(self, batches: List[Dict[str, torch.Tensor]]) -> None:
    metrics = {}
    for metric_key in batches[0]:  # average each metric across validation batches
        metrics[metric_key] = torch.stack([batch[metric_key] for batch in batches]).mean()
    self.log_dict(metrics, sync_dist=True)

This does not change the overall behavior.

What's your environment?

  • OS: Ubuntu 18.04
  • Packaging: pip
  • Version: pytorch-lightning 1.1.0 (master), PyTorch 1.7.0, CUDA 11.0
tchang1997 added the question label on Dec 10, 2020
github-actions (Contributor) commented:

Hi! Thanks for your contribution, great first issue!

tchang1997 (Author) commented Dec 11, 2020

Update:

Replicated in the BoringColab with a dummy dataset: https://colab.research.google.com/drive/1_o9L1kOmr8xtB-mS0y3bWWa2o1STh2bA

EDIT: I've forked pytorch-lightning locally and made some edits to trainer.py, training_loop.py, evaluation_loop.py, and logger_connector.py so that, when logging, the logger is aware of being called from within the training loop, even when an evaluation loop runs inside training (which happens when val_check_interval is < 1 or < num_batches).

There is some architecture for making the logger aware of where we are in the model lifecycle, but there is no logic for hooking the EvaluationLoop object up to the logger when it's called from within training. That makes sense for 99% of use cases -- but for monitoring validation metrics during a long training epoch, the WandbLogger previously threw a bunch of warnings and dropped the logs. Basically, the only change is a keyword that gets propagated down from the evaluation loop call to the logger connector.

When I replace the pytorch-lightning installation with my fork via pip install git+https://github.com/tchainzzz/pytorch-lightning.git@master --upgrade, I get the expected behavior outlined above in the Colab. To make sure the logging was functional, I inspected the generated logfile and performed online runs here, testing both a run with val_check_interval set to < 1 epoch and a run with val_check_interval left at its default.

If this looks good, I'm happy to make a PR. Here's a diff of my changes. Local CPU-only tests are passing (except for 2 onnxruntime dependency issues -- I can't get it installed). Granted, this feels like a bit of a niche/janky feature to me (having the eval loop be aware that it's "inside" the training loop feels intuitively odd), but it's a QOL update for using val_check_interval that suits my needs perfectly.

TL;DR: Made a fork where I fixed this issue. Doesn't seem to break old stuff. Happy to make a PR after discussion.

borisdayma (Contributor) commented:

Hi @tchainzzz

Within your validation loop (probably at "epoch_end"), you could try calling wandb.log({'val_loss': loss}, commit=False) directly, which would just use the last logged step.
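
Something like the following, adapted to the aggregation shown earlier in the issue (a rough sketch; it assumes the WandbLogger has already started a wandb run):

import torch
import wandb

# (inside the LightningModule)
def validation_epoch_end(self, batches):
    metrics = {
        key: torch.stack([batch[key] for batch in batches]).mean()
        for key in batches[0]
    }
    # commit=False attaches these values to the most recently logged wandb step
    # instead of forcing a new, smaller step value (which wandb would drop).
    wandb.log({k: v.item() for k, v in metrics.items()}, commit=False)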

tchang1997 (Author) commented:

That's a good point -- I hadn't thought of that. I've been working off of a local fork of the repo, which works. Thanks for the tip!

It seems like you're trying to address this and similar problems with a feature request -- I'll close this and see if any of my changes might be useful for that proposal.

rohitgr7 linked a pull request Jan 6, 2021 that will close this issue
borisdayma (Contributor) commented:

This was fixed with #5931.
You can now just log at any time with self.log('my_metric', my_value) and you won't have any dropped values.
Just choose your x-axis appropriately in the UI (either global_step or the auto-incremented step).
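
For anyone landing here later, a minimal sketch of what that looks like inside a LightningModule hook (the metric names and the contents of outputs are illustrative, not taken from the original repo):

# (inside the LightningModule)
def validation_epoch_end(self, outputs):
    # With the fix, logging here works even for mid-epoch validation runs
    # triggered by val_check_interval; nothing gets dropped anymore.
    avg_loss = torch.stack([o["val_loss"] for o in outputs]).mean()
    self.log("my_metric", avg_loss)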
