
Commit 01b152f

Update docs for multiple optimizers in 2.0 (#16588)
1 parent fda354a commit 01b152f

File tree

6 files changed: +51 -171 lines

docs/source-pytorch/common/lightning_module.rst

+1 -2

@@ -1155,9 +1155,8 @@ See :ref:`manual optimization <common/optimization:Manual optimization>` for details
         self.manual_backward(loss)
         opt.step()

-This is recommended only if using 2+ optimizers AND if you know how to perform the optimization procedure properly. Note
-that automatic optimization can still be used with multiple optimizers by relying on the ``optimizer_idx`` parameter.
 Manual optimization is most useful for research topics like reinforcement learning, sparse coding, and GAN research.
+It is required when you are using 2+ optimizers because with automatic optimization, you can only use one optimizer.

 .. code-block:: python
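For context, here is a minimal self-contained sketch of the pattern the new wording points to: two optimizers driven entirely by manual optimization. The module, layer sizes, and optimizer choices are illustrative assumptions, not part of the commit.

.. code-block:: python

    import torch
    import pytorch_lightning as pl


    class TwoOptimizerModule(pl.LightningModule):  # hypothetical example module
        def __init__(self):
            super().__init__()
            self.automatic_optimization = False  # opt in to manual optimization
            self.model_a = torch.nn.Linear(8, 1)
            self.model_b = torch.nn.Linear(8, 1)

        def training_step(self, batch, batch_idx):
            opt_a, opt_b = self.optimizers()

            # first optimizer: zero_grad -> manual_backward -> step
            loss_a = self.model_a(batch).mean()
            opt_a.zero_grad()
            self.manual_backward(loss_a)
            opt_a.step()

            # second optimizer, handled independently in the same step
            loss_b = self.model_b(batch).mean()
            opt_b.zero_grad()
            self.manual_backward(loss_b)
            opt_b.step()

        def configure_optimizers(self):
            return (
                torch.optim.Adam(self.model_a.parameters(), lr=1e-3),
                torch.optim.SGD(self.model_b.parameters(), lr=1e-2),
            )
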
docs/source-pytorch/common/optimization.rst

+14 -158

@@ -14,7 +14,7 @@ Lightning offers two modes for managing the optimization process:
 For the majority of research cases, **automatic optimization** will do the right thing for you and it is what most
 users should use.

-For advanced/expert users who want to do esoteric optimization schedules or techniques, use **manual optimization**.
+For more advanced use cases like multiple optimizers, esoteric optimization schedules or techniques, use **manual optimization**.

 .. _manual_optimization:

@@ -39,7 +39,7 @@ Under the hood, Lightning does the following:
     for batch in data:

         def closure():
-            loss = model.training_step(batch, batch_idx, ...)
+            loss = model.training_step(batch, batch_idx)
             optimizer.zero_grad()
             loss.backward()
             return loss
@@ -48,33 +48,13 @@ Under the hood, Lightning does the following:

     lr_scheduler.step()

-In the case of multiple optimizers, Lightning does the following:
-
-.. code-block:: python
-
-    for epoch in epochs:
-        for batch in data:
-            for opt in optimizers:
-
-                def closure():
-                    loss = model.training_step(batch, batch_idx, optimizer_idx)
-                    opt.zero_grad()
-                    loss.backward()
-                    return loss
-
-                opt.step(closure)
-
-        for lr_scheduler in lr_schedulers:
-            lr_scheduler.step()
-
 As can be seen in the code snippet above, Lightning defines a closure with ``training_step()``, ``optimizer.zero_grad()``
 and ``loss.backward()`` for the optimization. This mechanism is in place to support optimizers which operate on the
 output of the closure (e.g. the loss) or need to call the closure several times (e.g. :class:`~torch.optim.LBFGS`).

-.. warning::
-
-    Before v1.2.2, Lightning internally calls ``backward``, ``step`` and ``zero_grad`` in the order.
-    From v1.2.2, the order is changed to ``zero_grad``, ``backward`` and ``step``.
+Should you still require the flexibility of calling ``.zero_grad()``, ``.backward()``, or ``.step()`` yourself, you can
+always switch to :ref:`manual optimization <manual_optimization>`.
+Manual optimization is required if you wish to work with multiple optimizers.


 Gradient Accumulation
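
The closure mechanism mirrors plain PyTorch. As a point of reference, here is a self-contained sketch of how :class:`~torch.optim.LBFGS` consumes a closure (toy data, illustrative only):

.. code-block:: python

    import torch

    model = torch.nn.Linear(4, 1)
    x, y = torch.randn(16, 4), torch.randn(16, 1)

    # LBFGS may re-evaluate the closure several times per step(),
    # which is why Lightning bundles training_step/zero_grad/backward into one.
    optimizer = torch.optim.LBFGS(model.parameters(), lr=0.1)


    def closure():
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        return loss


    optimizer.step(closure)
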
@@ -83,113 +63,6 @@ Gradient Accumulation
 .. include:: ../common/gradient_accumulation.rst


-Use Multiple Optimizers (like GANs)
-===================================
-
-To use multiple optimizers (optionally with learning rate schedulers), return two or more optimizers from
-:meth:`~pytorch_lightning.core.module.LightningModule.configure_optimizers`.
-
-.. testcode:: python
-
-    # two optimizers, no schedulers
-    def configure_optimizers(self):
-        return Adam(...), SGD(...)
-
-
-    # two optimizers, one scheduler for adam only
-    def configure_optimizers(self):
-        opt1 = Adam(...)
-        opt2 = SGD(...)
-        optimizers = [opt1, opt2]
-        lr_schedulers = {"scheduler": ReduceLROnPlateau(opt1, ...), "monitor": "metric_to_track"}
-        return optimizers, lr_schedulers
-
-
-    # two optimizers, two schedulers
-    def configure_optimizers(self):
-        opt1 = Adam(...)
-        opt2 = SGD(...)
-        return [opt1, opt2], [StepLR(opt1, ...), OneCycleLR(opt2, ...)]
-
-Under the hood, Lightning will call each optimizer sequentially:
-
-.. code-block:: python
-
-    for epoch in epochs:
-        for batch in data:
-            for opt in optimizers:
-                loss = train_step(batch, batch_idx, optimizer_idx)
-                opt.zero_grad()
-                loss.backward()
-                opt.step()
-
-        for lr_scheduler in lr_schedulers:
-            lr_scheduler.step()
-
-
-Step Optimizers at Arbitrary Intervals
-=======================================
-
-To do more interesting things with your optimizers such as learning rate warm-up or odd scheduling,
-override the :meth:`~pytorch_lightning.core.module.LightningModule.optimizer_step` function.
-
-.. warning::
-    If you are overriding this method, make sure that you pass the ``optimizer_closure`` parameter to
-    ``optimizer.step()`` function as shown in the examples because ``training_step()``, ``optimizer.zero_grad()``,
-    ``loss.backward()`` are called in the closure function.
-
-For example, here step optimizer A every batch and optimizer B every 2 batches.
-
-.. testcode:: python
-
-    # Alternating schedule for optimizer steps (e.g. GANs)
-    def optimizer_step(
-        self,
-        epoch,
-        batch_idx,
-        optimizer,
-        optimizer_idx,
-        optimizer_closure,
-    ):
-        # update generator every step
-        if optimizer_idx == 0:
-            optimizer.step(closure=optimizer_closure)
-
-        # update discriminator every 2 steps
-        if optimizer_idx == 1:
-            if (batch_idx + 1) % 2 == 0:
-                # the closure (which includes the `training_step`) will be executed by `optimizer.step`
-                optimizer.step(closure=optimizer_closure)
-            else:
-                # call the closure by itself to run `training_step` + `backward` without an optimizer step
-                optimizer_closure()
-
-    # ...
-    # add as many optimizers as you want
-
-Here we add a manual learning rate warm-up without an lr scheduler.
-
-.. testcode:: python
-
-    # learning rate warm-up
-    def optimizer_step(
-        self,
-        epoch,
-        batch_idx,
-        optimizer,
-        optimizer_idx,
-        optimizer_closure,
-    ):
-        # update params
-        optimizer.step(closure=optimizer_closure)
-
-        # skip the first 500 steps
-        if self.trainer.global_step < 500:
-            lr_scale = min(1.0, float(self.trainer.global_step + 1) / 500.0)
-            for pg in optimizer.param_groups:
-                pg["lr"] = lr_scale * self.hparams.learning_rate
-
-
 Access your Own Optimizer
 =========================

@@ -206,7 +79,6 @@ to perform a step, Lightning won't be able to support accelerators, precision and
         epoch,
         batch_idx,
         optimizer,
-        optimizer_idx,
         optimizer_closure,
     ):
         optimizer.step(closure=optimizer_closure)
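
Note that the learning-rate warm-up recipe removed above still translates directly to the new hook signature. A self-contained sketch, where the module, layer size, and ``learning_rate`` hyperparameter are assumptions for illustration:

.. code-block:: python

    import torch
    import pytorch_lightning as pl


    class WarmupModule(pl.LightningModule):  # hypothetical example module
        def __init__(self, learning_rate=1e-3):
            super().__init__()
            self.save_hyperparameters()
            self.layer = torch.nn.Linear(8, 1)

        def training_step(self, batch, batch_idx):
            return self.layer(batch).mean()

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=self.hparams.learning_rate)

        # 2.0 signature: no more ``optimizer_idx``
        def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_closure):
            # the closure runs training_step + zero_grad + backward
            optimizer.step(closure=optimizer_closure)

            # linear warm-up over the first 500 global steps
            if self.trainer.global_step < 500:
                lr_scale = min(1.0, float(self.trainer.global_step + 1) / 500.0)
                for pg in optimizer.param_groups:
                    pg["lr"] = lr_scale * self.hparams.learning_rate
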
@@ -220,7 +92,6 @@ to perform a step, Lightning won't be able to support accelerators, precision and
         epoch,
         batch_idx,
         optimizer,
-        optimizer_idx,
         optimizer_closure,
     ):
         optimizer = optimizer.optimizer
@@ -248,7 +119,7 @@ If you are using native PyTorch schedulers, there is no need to override this hook
         return [optimizer], [{"scheduler": scheduler, "interval": "epoch"}]


-    def lr_scheduler_step(self, scheduler, optimizer_idx, metric):
+    def lr_scheduler_step(self, scheduler, metric):
         scheduler.step(epoch=self.current_epoch)  # timm's schedulers need the epoch value
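
The hunk above shows the 2.0 hook in isolation. For reference, a fuller self-contained sketch, assuming ``timm`` is installed; the scheduler choice and module are illustrative:

.. code-block:: python

    import torch
    import pytorch_lightning as pl
    from timm.scheduler import CosineLRScheduler  # non-native scheduler, assumed installed


    class TimmSchedulerModule(pl.LightningModule):  # hypothetical example module
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(8, 1)

        def training_step(self, batch, batch_idx):
            return self.layer(batch).mean()

        def configure_optimizers(self):
            optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
            scheduler = CosineLRScheduler(optimizer, t_initial=10)
            return [optimizer], [{"scheduler": scheduler, "interval": "epoch"}]

        # 2.0 signature: ``optimizer_idx`` is gone
        def lr_scheduler_step(self, scheduler, metric):
            scheduler.step(epoch=self.current_epoch)  # timm schedulers take the epoch value
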
@@ -259,7 +130,7 @@ Configure Gradient Clipping

 To configure custom gradient clipping, consider overriding
 the :meth:`~pytorch_lightning.core.module.LightningModule.configure_gradient_clipping` method.
-Attributes ``gradient_clip_val`` and ``gradient_clip_algorithm`` from Trainer will be passed in the
+The attributes ``gradient_clip_val`` and ``gradient_clip_algorithm`` from Trainer will be passed in the
 respective arguments here and Lightning will handle gradient clipping for you. In case you want to set
 different values for your arguments of your choice and let Lightning handle the gradient clipping, you can
 use the inbuilt :meth:`~pytorch_lightning.core.module.LightningModule.clip_gradients` method and pass
@@ -270,31 +141,16 @@ the arguments along with your optimizer.
     method. If you want to customize gradient clipping, consider using
     :meth:`~pytorch_lightning.core.module.LightningModule.configure_gradient_clipping` method.

-For example, here we will apply gradient clipping only to the gradients associated with optimizer A.
+For example, here we will apply a stronger gradient clipping after a certain number of epochs:

 .. testcode:: python

-    def configure_gradient_clipping(self, optimizer, optimizer_idx, gradient_clip_val, gradient_clip_algorithm):
-        if optimizer_idx == 0:
-            # Lightning will handle the gradient clipping
-            self.clip_gradients(
-                optimizer, gradient_clip_val=gradient_clip_val, gradient_clip_algorithm=gradient_clip_algorithm
-            )
-
-Here we configure gradient clipping differently for optimizer B.
-
-.. testcode:: python
+    def configure_gradient_clipping(self, optimizer, gradient_clip_val, gradient_clip_algorithm):
+        if self.current_epoch > 5:
+            gradient_clip_val = gradient_clip_val * 2

-    def configure_gradient_clipping(self, optimizer, optimizer_idx, gradient_clip_val, gradient_clip_algorithm):
-        if optimizer_idx == 0:
-            # Lightning will handle the gradient clipping
-            self.clip_gradients(
-                optimizer, gradient_clip_val=gradient_clip_val, gradient_clip_algorithm=gradient_clip_algorithm
-            )
-        elif optimizer_idx == 1:
-            self.clip_gradients(
-                optimizer, gradient_clip_val=gradient_clip_val * 2, gradient_clip_algorithm=gradient_clip_algorithm
-            )
+        # Lightning will handle the gradient clipping
+        self.clip_gradients(optimizer, gradient_clip_val=gradient_clip_val, gradient_clip_algorithm=gradient_clip_algorithm)


 Total Stepping Batches
@@ -312,4 +168,4 @@ distributed setting into consideration so you don't have to derive it manually.
         scheduler = torch.optim.lr_scheduler.OneCycleLR(
             optimizer, max_lr=1e-3, total_steps=self.trainer.estimated_stepping_batches
         )
-        return [optimizer], [scheduler]
+        return optimizer, scheduler
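
Relatedly, if the scheduler should advance once per optimizer step rather than once per epoch (``OneCycleLR`` counts individual steps in ``total_steps``), the dict form still works. A hedged sketch with a hypothetical module:

.. code-block:: python

    import torch
    import pytorch_lightning as pl


    class OneCycleModule(pl.LightningModule):  # hypothetical example module
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(8, 1)

        def training_step(self, batch, batch_idx):
            return self.layer(batch).mean()

        def configure_optimizers(self):
            optimizer = torch.optim.SGD(self.parameters(), lr=1e-3)
            # estimated_stepping_batches already accounts for accumulation and distributed settings
            scheduler = torch.optim.lr_scheduler.OneCycleLR(
                optimizer, max_lr=1e-3, total_steps=self.trainer.estimated_stepping_batches
            )
            return [optimizer], [{"scheduler": scheduler, "interval": "step"}]
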

docs/source-pytorch/guides/speed.rst

+1 -1

@@ -431,7 +431,7 @@ This is enabled by default on ``torch>=2.0.0``.
 .. testcode::

     class Model(LightningModule):
-        def optimizer_zero_grad(self, epoch, batch_idx, optimizer, optimizer_idx):
+        def optimizer_zero_grad(self, epoch, batch_idx, optimizer):
             optimizer.zero_grad(set_to_none=True)

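For reference, ``set_to_none=True`` frees the gradient tensors instead of overwriting them with zeros, which is where the memory and speed benefit comes from. A tiny plain-PyTorch check, illustrative only:

.. code-block:: python

    import torch

    layer = torch.nn.Linear(2, 2)
    layer(torch.randn(3, 2)).sum().backward()
    assert layer.weight.grad is not None  # a real gradient tensor exists

    layer.zero_grad(set_to_none=True)
    assert layer.weight.grad is None  # grads are freed, not zeroed in place
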
docs/source-pytorch/model/build_model_advanced.rst

+1 -1

@@ -17,7 +17,7 @@ Inject custom code anywhere in the Training loop using any of the 20+ methods (:
 .. testcode::

     class LitModel(pl.LightningModule):
-        def backward(self, loss, optimizer, optimizer_idx):
+        def backward(self, loss):
             loss.backward()

 ----

docs/source-pytorch/model/manual_optimization.rst

+33 -8

@@ -3,11 +3,10 @@ Manual Optimization
 *******************

 For advanced research topics like reinforcement learning, sparse coding, or GAN research, it may be desirable to
-manually manage the optimization process.
+manually manage the optimization process, especially when dealing with multiple optimizers at the same time.

-This is only recommended for experts who need ultimate flexibility.
-Lightning will handle only accelerator, precision and strategy logic.
-The users are left with ``optimizer.zero_grad()``, gradient accumulation, model toggling, etc..
+In this mode, Lightning will handle only accelerator, precision and strategy logic.
+The users are left with ``optimizer.zero_grad()``, gradient accumulation, optimizer toggling, etc.

 To manually optimize, do the following:

@@ -18,6 +17,7 @@ To manually optimize, do the following:
 * ``optimizer.zero_grad()`` to clear the gradients from the previous training step
 * ``self.manual_backward(loss)`` instead of ``loss.backward()``
 * ``optimizer.step()`` to update your model parameters
+* ``self.toggle_optimizer()`` and ``self.untoggle_optimizer()`` if needed

 Here is a minimal example of manual optimization.

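A hedged sketch of where the new toggling bullet fits: a toy GAN-style module, assuming the 2.0 signatures this commit documents. Layer sizes and losses are illustrative.

.. code-block:: python

    import torch
    import pytorch_lightning as pl


    class ToyGAN(pl.LightningModule):  # hypothetical example module
        def __init__(self):
            super().__init__()
            self.automatic_optimization = False
            self.generator = torch.nn.Linear(4, 4)
            self.discriminator = torch.nn.Linear(4, 1)

        def training_step(self, batch, batch_idx):
            g_opt, d_opt = self.optimizers()

            # train generator: toggling sets requires_grad=False on all params
            # not owned by g_opt, so the discriminator weights stay frozen
            self.toggle_optimizer(g_opt)
            g_loss = -self.discriminator(self.generator(batch)).mean()
            g_opt.zero_grad()
            self.manual_backward(g_loss)
            g_opt.step()
            self.untoggle_optimizer(g_opt)

            # train discriminator on detached generator output
            self.toggle_optimizer(d_opt)
            d_loss = self.discriminator(self.generator(batch).detach()).mean()
            d_opt.zero_grad()
            self.manual_backward(d_loss)
            d_opt.step()
            self.untoggle_optimizer(d_opt)

        def configure_optimizers(self):
            g_opt = torch.optim.Adam(self.generator.parameters(), lr=1e-4)
            d_opt = torch.optim.Adam(self.discriminator.parameters(), lr=1e-4)
            return g_opt, d_opt
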
@@ -39,10 +39,6 @@ Here is a minimal example of manual optimization.
         self.manual_backward(loss)
         opt.step()

-.. warning::
-    Before 1.2, ``optimizer.step()`` was calling ``optimizer.zero_grad()`` internally.
-    From 1.2, it is left to the user's expertise.
-
 .. tip::
     Be careful where you call ``optimizer.zero_grad()``, or your model won't converge.
     It is good practice to call ``optimizer.zero_grad()`` before ``self.manual_backward(loss)``.
@@ -132,6 +128,7 @@ To perform gradient clipping with one optimizer with manual optimization, you can
 .. warning::
     * Note that ``configure_gradient_clipping()`` won't be called in Manual Optimization. Instead consider using ``self.clip_gradients()`` manually like in the example above.

+
 Use Multiple Optimizers (like GANs)
 ===================================
@@ -285,6 +282,34 @@ If you want to call schedulers that require a metric value after each epoch, consider
         if isinstance(sch, torch.optim.lr_scheduler.ReduceLROnPlateau):
             sch.step(self.trainer.callback_metrics["loss"])

+
+Optimizer Steps at Different Frequencies
+========================================
+
+In manual optimization, you are free to ``step()`` one optimizer more often than another one.
+For example, here we step the optimizer for the *generator* weights on every batch, and the optimizer for the
+*discriminator* weights only on every other batch.
+
+.. testcode:: python
+
+    # Alternating schedule for optimizer steps (e.g. GANs)
+    def training_step(self, batch, batch_idx):
+        g_opt, d_opt = self.optimizers()
+        ...
+
+        # update discriminator every other step
+        d_opt.zero_grad()
+        self.manual_backward(errD)
+        if (batch_idx + 1) % 2 == 0:
+            d_opt.step()
+
+        ...
+
+        # update generator every step
+        g_opt.zero_grad()
+        self.manual_backward(errG)
+        g_opt.step()
+
+
 Use Closure for LBFGS-like Optimizers
 =====================================

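The same idea generalizes to any step ratio. A sketch of a WGAN-style schedule, where ``N_CRITIC`` and the loss helpers are hypothetical names rather than Lightning API:

.. code-block:: python

    N_CRITIC = 5  # illustrative constant: critic steps per generator step


    def training_step(self, batch, batch_idx):
        g_opt, d_opt = self.optimizers()

        # critic/discriminator steps on every batch
        d_loss = self.compute_d_loss(batch)  # hypothetical helper
        d_opt.zero_grad()
        self.manual_backward(d_loss)
        d_opt.step()

        # generator steps only once every N_CRITIC batches
        if (batch_idx + 1) % N_CRITIC == 0:
            g_loss = self.compute_g_loss(batch)  # hypothetical helper
            g_opt.zero_grad()
            self.manual_backward(g_loss)
            g_opt.step()
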
docs/source-pytorch/starter/introduction.rst

+1 -1

@@ -282,7 +282,7 @@ Inject custom code anywhere in the Training loop using any of the 20+ methods (:
 .. testcode::

     class LitAutoEncoder(pl.LightningModule):
-        def backward(self, loss, optimizer, optimizer_idx):
+        def backward(self, loss):
             loss.backward()

 ----
