# Deep Deterministic Policy Gradient (DDPG)

**Paper Link:** [**https://arxiv.org/abs/1509.02971**](https://arxiv.org/abs/1509.02971).

Deep Deterministic Policy Gradient (DDPG) is a model-free, off-policy algorithm in Deep Reinforcement Learning (DRL) that combines the deterministic policy gradient with deep neural networks. It was developed by Timothy P. Lillicrap et al. in 2015. DDPG has been widely applied in continuous control tasks, achieving notable results in scenarios such as robotic control and simulated environments.

The following table lists some general features of the DDPG algorithm:

| Features of DDPG  | Values | Description                                                                |
|-------------------|--------|----------------------------------------------------------------------------|
| On-policy         | ❌     | The policy that collects data is the same as the policy being learned.     |
| Off-policy        | ✅     | The policy that collects data is different from the policy being learned.  |
| Model-free        | ✅     | No need to prepare an environment dynamics model.                          |
| Model-based       | ❌     | Need an environment model to train the policy.                             |
| Discrete Action   | ❌     | Deals with discrete action spaces.                                         |
| Continuous Action | ✅     | Deals with continuous action spaces.                                       |

## Actor-Critic Framework

DDPG builds upon the Actor-Critic framework. The actor network is responsible for generating actions based on the current state. It outputs a deterministic action directly, parameterized by a deep neural network. The critic network, on the other hand, estimates the Q-value of a state-action pair. The critic network takes both the state and the action as inputs and outputs the estimated Q-value, which indicates the long-term expected reward of taking a particular action in a given state.

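To make the two networks concrete, here is a minimal PyTorch sketch of an actor and a critic. This is an illustration rather than XuanCe's internal implementation; the layer sizes and the `tanh` output scaling are assumptions.

```python3
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a state to a deterministic action, scaled to the action bound."""
    def __init__(self, state_dim, action_dim, action_bound=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # squash the output to [-1, 1]
        )
        self.action_bound = action_bound

    def forward(self, state):
        return self.net(state) * self.action_bound

class Critic(nn.Module):
    """Maps a (state, action) pair to a scalar Q-value."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```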
## Policy Gradient

The actor in DDPG is updated using the policy gradient method. The policy gradient is calculated by taking the gradient of the expected return with respect to the actor's parameters. The goal is to find the policy that maximizes the expected return. The update rule for the actor network is based on the gradient of the Q-value with respect to the actor's parameters, which is calculated using the chain rule of calculus.
The update for the actor network parameters $\theta^\mu$ is given by:

$$
\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_a Q(s, a \mid \theta^Q) \bigg|_{s=s^i, a=\mu(s^i \mid \theta^\mu)} \nabla_{\theta^\mu} \mu(s \mid \theta^\mu) \bigg|_{s^i}
$$

where $J$ is the expected return, $N$ is the number of samples in a mini-batch, $\theta^Q$ are the parameters of the critic network, and $\mu(s \mid \theta^\mu)$ is the actor network.

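In practice, this gradient is obtained by maximizing the critic's estimate of the actor's own actions, i.e., minimizing $-\frac{1}{N}\sum_i Q(s^i, \mu(s^i \mid \theta^\mu) \mid \theta^Q)$ and letting automatic differentiation apply the chain rule. Below is a minimal sketch, assuming the `Actor` and `Critic` modules from the previous sketch; the dimensions and batch are illustrative.

```python3
import torch

# Assumes the Actor and Critic sketches above; dimensions are illustrative.
actor = Actor(state_dim=3, action_dim=1)
critic = Critic(state_dim=3, action_dim=1)
actor_optimizer = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(64, 3)  # a mini-batch of N = 64 states sampled from the replay buffer

# Deterministic policy gradient: ascend Q(s, mu(s)) in the actor parameters,
# implemented as gradient descent on the negated mean Q-value.
actor_loss = -critic(states, actor(states)).mean()
actor_optimizer.zero_grad()
actor_loss.backward()   # autograd applies the chain rule: grad_a Q * grad_theta mu
actor_optimizer.step()  # only the actor's parameters are updated by this optimizer
```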
## Critic Network Update

The critic network in DDPG is updated using the temporal difference (TD) error. The TD error is calculated as the difference between the target Q-value and the predicted Q-value. The target Q-value is calculated using the Bellman equation, similar to Q-learning. The update rule for the critic network is based on minimizing the mean-squared error (MSE) between the predicted Q-value and the target Q-value.
The target Q-value $y^i$ for a sample $(s^i, a^i, r^i, s^{i+1})$ is given by:

$$
y^i = r^i + \gamma Q'(s^{i+1}, \mu'(s^{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})
$$

where $\mu'(\cdot \mid \theta^{\mu'})$ is the target actor network, $Q'(\cdot, \cdot \mid \theta^{Q'})$ is the target critic network, and $\gamma$ is the discount factor.
The critic network parameters $\theta^Q$ are updated by minimizing the loss function:

$$
L = \frac{1}{N} \sum_{i=1}^{N} (y^i - Q(s^i, a^i \mid \theta^Q))^2
$$

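Combining the target computation with the MSE loss, one critic update step could look like the sketch below. The networks and batch dimensions are illustrative assumptions, and the target networks `target_actor` and `target_critic` are the slowly updated copies discussed in the next section.

```python3
import copy
import torch
import torch.nn.functional as F

# Assumes the Actor/Critic sketches above; target copies are explained in the next section.
actor, critic = Actor(state_dim=3, action_dim=1), Critic(state_dim=3, action_dim=1)
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
critic_optimizer = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99  # discount factor

# A hypothetical mini-batch (s, a, r, s') sampled from the replay buffer.
states, actions = torch.randn(64, 3), torch.randn(64, 1)
rewards, next_states = torch.randn(64, 1), torch.randn(64, 3)

with torch.no_grad():
    # y^i = r^i + gamma * Q'(s^{i+1}, mu'(s^{i+1}))  (terminal-state masking omitted for brevity)
    targets = rewards + gamma * target_critic(next_states, target_actor(next_states))

critic_loss = F.mse_loss(critic(states, actions), targets)  # mean-squared TD error
critic_optimizer.zero_grad()
critic_loss.backward()
critic_optimizer.step()
```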
## Target Networks and Experience Replay

Similar to DQN, DDPG uses target networks to stabilize the learning process. Separate target actor and target critic networks are maintained. The parameters of the target networks are updated slowly from the main networks using a soft update rule:

$$
\theta' \leftarrow \tau \theta + (1 - \tau) \theta'
$$

where $\tau$ is a small positive number much smaller than 1 (e.g., 0.001), controlling the rate of the update.
DDPG also employs experience replay. An experience replay buffer stores the agent's experiences $(s^i, a^i, r^i, s^{i+1})$. Mini-batches of experiences are randomly sampled from the buffer to train the actor and critic networks. This helps to break the correlation between consecutive samples and improves the stability and generalization of the learning algorithm.

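The soft update takes only a few lines to implement. Here is a sketch in PyTorch, assuming `actor` and `critic` are the main networks from the earlier sketches:

```python3
import copy
import torch

tau = 0.005  # illustrative value; the original paper uses 0.001

# Target networks start as exact copies of the main networks.
target_actor = copy.deepcopy(actor)
target_critic = copy.deepcopy(critic)

@torch.no_grad()
def soft_update(target, source, tau):
    """theta' <- tau * theta + (1 - tau) * theta', applied parameter-wise."""
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.mul_(1.0 - tau).add_(tau * sp)

# Called once per gradient step, after updating the main actor and critic.
soft_update(target_actor, actor, tau)
soft_update(target_critic, critic, tau)
```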
## Exploration

DDPG uses an exploration strategy to encourage the agent to explore the environment. Typically, a noise process is added to the actions generated by the actor network:

$$
\mu'(s_t) = \mu(s_t \mid \theta^{\mu}_t) + \mathcal{N}
$$

This noise $\mathcal{N}$ can be, for example, Gaussian noise or an Ornstein-Uhlenbeck process, as used in the original paper. The exploration noise helps the agent to explore different parts of the action space and discover better policies. As training progresses, the amount of exploration noise can be gradually reduced.

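As a sketch, Gaussian exploration noise can be added to the deterministic action and the result clipped to the valid action range. The noise scale and action bounds below are illustrative assumptions, not XuanCe defaults.

```python3
import numpy as np
import torch

def select_action(actor, state, noise_std=0.1, action_low=-1.0, action_high=1.0):
    """Deterministic action from the actor plus Gaussian exploration noise, clipped to the valid range."""
    with torch.no_grad():
        action = actor(torch.as_tensor(state, dtype=torch.float32)).numpy()
    action = action + np.random.normal(0.0, noise_std, size=action.shape)  # the noise term N
    return np.clip(action, action_low, action_high)

# Usage (illustrative): behavior_action = select_action(actor, current_state)
```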
Strengths of DDPG:

- Can handle continuous action spaces, making it suitable for a wide range of control tasks in robotics and other fields.
- Stabilizes the learning process through the use of target networks and experience replay, enabling more reliable training.
- Demonstrated effectiveness in various continuous control scenarios, achieving good performance in tasks like robotic manipulation and locomotion.

## Algorithm

The full algorithm for training DDPG is presented in Algorithm 1:

```{eval-rst}
.. image:: ./../../../../_static/figures/pseucodes/pseucode-DDPG.png
  :width: 80%
  :align: center
```

## Run DDPG in XuanCe

Before running DDPG in XuanCe, you need to prepare a conda environment and install ``xuance`` following
the [**installation steps**](./../../../usage/installation.rst#install-xuance).

### Run Built-in Demos

After completing the installation, you can open a Python console and run DDPG directly using the following commands:

```python3
import xuance
runner = xuance.get_runner(method='ddpg',
                           env='classic_control',  # Choices: classic_control, box2d, etc.
                           env_id='Pendulum-v1',  # Choices: Pendulum-v1, LunarLanderContinuous-v2, etc. (DDPG requires a continuous action space.)
                           is_test=False)
runner.run()  # Or runner.benchmark()
```

### Run With Self-defined Configs

If you want to run DDPG with different configurations, you can build a new ``.yaml`` file, e.g., ``my_config.yaml``.
Then, run DDPG with the following code block:

```python3
import xuance as xp
runner = xp.get_runner(method='ddpg',
                       env='classic_control',  # Choices: classic_control, box2d, etc.
                       env_id='Pendulum-v1',  # Choices: Pendulum-v1, LunarLanderContinuous-v2, etc.
                       config_path="my_config.yaml",  # The path of the my_config.yaml file should be correct.
                       is_test=False)
runner.run()  # Or runner.benchmark()
```

To learn more about the configurations, please visit the
[**tutorial of configs**](./../../configs/configuration_examples.rst).

### Run With Customized Environment

If you would like to run XuanCe's DDPG in your own environment that is not included in XuanCe,
you need to define the new environment following the steps in the
[**New Environment Tutorial**](./../../../usage/new_envs.rst).
Then, [**prepare the configuration file**](./../../../usage/new_envs.rst#step-2-create-the-config-file-and-read-the-configurations)
``ddpg_myenv.yaml``.

After that, you can run DDPG in your own environment with the following code:

```python3
import argparse
from xuance.common import get_configs
from xuance.environment import REGISTRY_ENV
from xuance.environment import make_envs
from xuance.torch.agents import DDPG_Agent

configs_dict = get_configs(file_dir="ddpg_myenv.yaml")
configs = argparse.Namespace(**configs_dict)
REGISTRY_ENV[configs.env_name] = MyNewEnv  # MyNewEnv is your custom environment class defined in the tutorial above.

envs = make_envs(configs)  # Make parallel environments.
Agent = DDPG_Agent(config=configs, envs=envs)  # Create a DDPG agent from XuanCe.
Agent.train(configs.running_steps // configs.parallels)  # Train the model for the specified number of steps.
Agent.save_model("final_train_model.pth")  # Save the model to model_dir.
Agent.finish()  # Finish the training.
```

## Citation

```{code-block} bibtex
@article{lillicrap2015continuous,
  title={Continuous control with deep reinforcement learning},
  author={Lillicrap, Timothy P and Hunt, Jonathan J and Pritzel, Alexander and Heess, Nicolas and Erez, Tom and Tassa, Yuval and Silver, David and Wierstra, Daan},
  journal={arXiv preprint arXiv:1509.02971},
  year={2015}
}
```

## APIs

### PyTorch

```{eval-rst}
.. automodule:: xuance.torch.agents.policy_gradient.ddpg_agent
  :members:
  :undoc-members:
  :show-inheritance:
```

### TensorFlow2

```{eval-rst}
.. automodule:: xuance.tensorflow.agents.policy_gradient.ddpg_agent
  :members:
  :undoc-members:
  :show-inheritance:
```

### MindSpore

```{eval-rst}
.. automodule:: xuance.mindspore.agents.policy_gradient.ddpg_agent
  :members:
  :undoc-members:
  :show-inheritance:
```