
Quality of life and helper callback functions #237

Merged (32 commits) on Jul 1, 2024

Conversation

@laserkelvin (Collaborator) commented Jun 7, 2024

This PR introduces a number of changes aimed at informing the user about what is happening under the hood, particularly during training.

One of the bigger philosophical changes is a shift towards logging with TensorBoardLogger and WandbLogger specifically, by writing functions tailored to them rather than treating loggers entirely in the abstract as before.

Summary

  • Changed the periodic boundary utilities to operate on Cartesian coordinates rather than fractional coordinates. Also added a warning message that inspects the coordinates as part of diagnostics.
  • For model training, task modules now include log_embeddings and log_embeddings_every_n_steps arguments that are saved to hparams; as the pair suggests, they let you log embedding vectors at regular intervals for analysis, e.g. to check that oversmoothing (where all of the embedding features become identical) is not occurring.
  • Introduced a TrainingHelperCallback, which is intended to help diagnose common training issues such as unused parameters, missing gradients, tiny gradients, etc. Complementary to the change above, there is an option to inject a forward hook into any encoder (assuming it produces an Embeddings structure), which is used to calculate the variance in the embeddings.
  • Introduced a ModelAutocorrelation callback, which performs an autocorrelation analysis on model parameters and gradients over the course of training. This gives some insight into the training dynamics, i.e. too much correlation is probably a bad sign. A rough usage sketch for these features follows this list.
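
As an illustration of how these pieces could be wired together (a minimal sketch: the import path is inferred from the callbacks.py file touched in this PR, while the constructor arguments and the task, encoder, and datamodule names are placeholders, not taken from the actual API):

    import pytorch_lightning as pl

    # Import path inferred from matsciml/lightning/callbacks.py; adjust as needed.
    from matsciml.lightning.callbacks import ModelAutocorrelation, TrainingHelperCallback

    # Hypothetical task setup: log embedding vectors every 50 steps so we can
    # watch for oversmoothing (embedding features collapsing to identical values).
    task = SomeTaskModule(              # placeholder for a matsciml task module
        encoder=some_encoder,           # placeholder encoder producing an Embeddings structure
        log_embeddings=True,
        log_embeddings_every_n_steps=50,
    )

    trainer = pl.Trainer(
        logger=pl.loggers.TensorBoardLogger("lightning_logs"),
        callbacks=[
            TrainingHelperCallback(),   # flags unused parameters, missing/tiny gradients, etc.
            ModelAutocorrelation(),     # autocorrelation analysis of parameters and gradients
        ],
        max_epochs=10,
    )
    trainer.fit(task, datamodule=some_datamodule)   # placeholder datamodule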

My intention is for the TrainingHelperCallback to act as a guide to best practices: we can refine it as we go and discover new things, and it should hopefully be useful for everyone, including new users.

@laserkelvin added the ux (User experience, quality of life changes) and training (Issues related to model training) labels on Jun 7, 2024
@laserkelvin (Collaborator, Author) commented:
I have somehow broken SAM and need to fix it before review.

@laserkelvin (Collaborator, Author) commented:
I think I have a lead on what the issue is: because of how SAM works, and because of the modifications for "stashing" embeddings in the batch structure, we now end up with two disjoint computational graphs, which causes backward to break.

This needs a bit of thought to fix...
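
For context, here is a minimal PyTorch sketch of that general failure mode (an illustration only, not the actual matsciml code): embeddings stashed during the first SAM pass get reused in the second pass, so the second backward runs over a graph whose buffers have already been freed.

    import torch

    encoder = torch.nn.Linear(4, 4)
    head = torch.nn.Linear(4, 1)
    x = torch.randn(2, 4)

    # First SAM pass: run the encoder, stash the embeddings, and backpropagate.
    cache = {"embeddings": encoder(x)}
    loss_first = head(cache["embeddings"]).sum()
    loss_first.backward()  # frees the saved tensors in the encoder's graph

    # Second SAM pass: reuse the stashed embeddings instead of re-running the encoder.
    loss_second = head(cache["embeddings"]).sum()
    loss_second.backward()  # RuntimeError: trying to backward through the graph a second time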

@laserkelvin (Collaborator, Author) commented Jun 10, 2024:

Confirmed this by changing BaseTaskModule.forward:

        if "embeddings" in batch:
            embeddings = batch.get("embeddings")
        else:
            embeddings = self.encoder(batch)
            batch["embeddings"] = embeddings
        outputs = self.process_embedding(embeddings)
        return outputs

Removing the branch and just running the encoder + processing the embeddings works (i.e. not trying to grab cached embeddings).

Ideally there would be a way to check whether the embeddings originated from the same computational graph, but that would take a lot more surgery than this PR warrants. I'll think of an alternative to this.

The reason we are stashing the embeddings is to benefit the multitask case, where we don't want to have to run the encoder X times for X tasks and datasets; a rough sketch of that pattern is below.
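
For reference, a sketch of the multitask pattern the caching is meant to support (hypothetical stand-in modules, not the matsciml implementation): the encoder runs once per batch and every task head reuses the same embeddings inside a single forward/backward.

    import torch

    encoder = torch.nn.Linear(16, 8)        # stand-in for a graph encoder
    heads = {
        "energy": torch.nn.Linear(8, 1),    # stand-in task heads
        "band_gap": torch.nn.Linear(8, 1),
    }
    batch = {"features": torch.randn(4, 16)}

    # Run the (potentially expensive) encoder once per batch...
    batch["embeddings"] = encoder(batch["features"])

    # ...and let each task reuse the cached embeddings instead of re-encoding.
    losses = {name: head(batch["embeddings"]).pow(2).mean() for name, head in heads.items()}
    total_loss = sum(losses.values())
    total_loss.backward()  # one backward over a single shared graph works fine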

@melo-gonzo (Collaborator) left a comment:


Some great features here, thanks for doing all of this! I threw in a few comments, I know you're still working on things.

Resolved review threads: matsciml/lightning/callbacks.py (four threads, one outdated), examples/callbacks/autocorrelation.py (one thread)
Commit: That way we don't do a double log, as forward might be called multiple times
@melo-gonzo (Collaborator) left a comment:


This will bring some great utilities and helpful debugging tools! Looks good to merge.

@laserkelvin merged commit 0e3a640 into IntelLabs:main on Jul 1, 2024
4 of 5 checks passed