# Advantage Actor Critic (A2C)

The Advantage Actor Critic (A2C) algorithm is a reinforcement learning method that combines the policy gradient method with value function approximation. It uses an advantage function instead of the state-action value function to improve the stability and performance of learning.

This table lists some general features of the A2C algorithm:

| Features of A2C   | Values | Description                                               |
|-------------------|--------|-----------------------------------------------------------|
| On-policy         | ✅     | The behavior policy is the same as the target policy.     |
| Off-policy        | ❌     | The behavior policy is different from the target policy.  |
| Model-free        | ✅     | No need to prepare an environment dynamics model.         |
| Model-based       | ❌     | Need an environment model to train the policy.            |
| Discrete action   | ✅     | Supports discrete action spaces.                          |
| Continuous action | ✅     | Supports continuous action spaces.                        |

## Actor Critic (AC) Framework

The Actor Critic method selects actions with the actor and evaluates these actions with the critic; the two components cooperate with each other to improve.

### Critic

The critic, also called the value network, uses a neural network $Q^\pi(s,a;w)$ to approximate the action value function $Q^\pi(s,a)$. As in one-step Q-learning, the parameters $w$ of $Q^\pi(s,a;w)$ are learned by iteratively minimizing a sequence of loss functions, where the $i$-th loss function is defined as

$$
L_i(w_i)=\mathbb{E}[(r+\gamma Q(s',a';w_{i-1})-Q(s,a;w_i))^2],
$$

where $s'$ is the state encountered after state $s$ and $a'$ is the action taken in $s'$.
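
To make the loss concrete, here is a minimal PyTorch sketch of this one-step TD loss. It is an illustration rather than XuanCe's actual implementation: the `QCritic` network, its layer sizes, and the batch tensors `s`, `a`, `r`, `s_next`, `a_next` are assumptions made for the example.

```python3
import torch
import torch.nn as nn

class QCritic(nn.Module):
    """Hypothetical critic: maps a state to one Q-value per discrete action."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, s):
        return self.net(s)

def critic_loss(critic, s, a, r, s_next, a_next, gamma=0.99):
    """One-step TD loss: (r + gamma * Q(s', a') - Q(s, a))^2, averaged over the batch."""
    q_sa = critic(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a; w_i)
    with torch.no_grad():  # the bootstrap target is treated as a constant
        target = r + gamma * critic(s_next).gather(1, a_next.unsqueeze(1)).squeeze(1)
    return ((target - q_sa) ** 2).mean()
```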

### Actor

The actor, also called the policy network, is similar to the [**Policy Gradient (PG)**](./pg_agent.md) method: it directly optimizes the policy to maximize the expected return. Its objective function is

$$
J(\theta) = \mathbb{E}_{\pi_{\theta}}{[\sum_{t=0}^{\infty}{\gamma^t r_t}]}.
$$

To optimize the policy $\pi_\theta$, we calculate the gradient of the objective function $J(\theta)$ with respect to the parameters $\theta$:

$$
\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s) \, Q^{\pi_{\theta}}(s, a)].
$$

When training the actor, we use the action value $Q(s,a;w)$ produced by the critic as an approximation of $Q^{\pi_\theta}(s,a)$. By alternately training the actor and the critic, we ultimately maximize $J(\theta)$.
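
The corresponding actor update can be sketched as follows. This is again an illustrative PyTorch snippet, not XuanCe's code: `policy_logits` is assumed to be the output of a hypothetical policy network for a batch of states, and `q_values` the critic's estimates $Q(s,a;w)$ for the sampled actions.

```python3
import torch
import torch.nn.functional as F

def actor_loss(policy_logits, actions, q_values):
    """Policy gradient loss: -log pi(a|s) * Q(s, a), averaged over the batch."""
    log_probs = F.log_softmax(policy_logits, dim=-1)                   # log pi(.|s)
    log_prob_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi(a|s)
    # The critic's Q-values are treated as constants for the actor update.
    return -(log_prob_a * q_values.detach()).mean()
```

Minimizing this loss with gradient descent is equivalent to ascending the policy gradient above.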

## Advantage Actor Critic (A2C)

In the Actor-Critic framework described above, we use $Q(s,a;w)$ to update the policy. In Advantage Actor Critic (A2C), we instead use the advantage function, which measures the additional return obtained by taking a certain action relative to the state value:

$$
A(s_t,a_t)=Q(s_t,a_t)-V(s_t)\approx r_t+\gamma V(s_{t+1}) - V(s_t),
$$

where

$$
Q_\pi(s_t,a_t)-V_\pi(s_t)=\mathbb{E}[R_t+\gamma V_\pi(S_{t+1}) - V_\pi(S_t) \mid S_t=s_t, A_t=a_t].
$$
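
For example, if $r_t = 1$, $\gamma = 0.99$, $V(s_{t+1}) = 2.0$, and $V(s_t) = 2.5$, the estimated advantage is $1 + 0.99 \times 2.0 - 2.5 = 0.48$, indicating that the chosen action is expected to be slightly better than the average action in $s_t$.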

The advantage function reduces the variance of the policy gradient estimates. We can then rewrite the gradient of the objective function $J(\theta)$ as

$$
\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s) \, A^\pi(s,a)].
$$

For the critic network, the loss function becomes

$$
L(w)=\mathbb{E}[(r_t+\gamma V(s_{t+1};w)-V(s_t;w))^2].
$$
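
Putting the pieces together, the following is a minimal sketch of how the A2C actor and critic losses could be computed for one batch of transitions. It is an illustration rather than XuanCe's implementation: `actor`, `critic`, and the batch tensors `s`, `a`, `r`, `s_next`, `done` are assumptions made for the example.

```python3
import torch
import torch.nn.functional as F

def a2c_losses(actor, critic, s, a, r, s_next, done, gamma=0.99):
    """Compute the A2C actor and critic losses for one batch of transitions."""
    v_s = critic(s).squeeze(-1)                      # V(s_t; w)
    with torch.no_grad():
        v_next = critic(s_next).squeeze(-1)          # V(s_{t+1}; w), no gradient through the target
        target = r + gamma * (1.0 - done) * v_next   # bootstrap target r_t + gamma * V(s_{t+1})
        advantage = target - v_s.detach()            # A ~ r_t + gamma * V(s_{t+1}) - V(s_t)

    log_probs = F.log_softmax(actor(s), dim=-1)
    log_prob_a = log_probs.gather(1, a.unsqueeze(1)).squeeze(1)

    actor_loss = -(log_prob_a * advantage).mean()    # policy gradient weighted by the advantage
    critic_loss = F.mse_loss(v_s, target)            # (r_t + gamma * V(s_{t+1}) - V(s_t))^2
    return actor_loss, critic_loss
```

In practice, A2C implementations also commonly add an entropy regularization term to the actor loss to encourage exploration.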

## Framework

The structural framework of A2C, as implemented in XuanCe, is illustrated in the figure below.

```{eval-rst}
.. image:: ./../../../../_static/figures/algo_framework/a2c_framework.png
    :width: 80%
    :align: center
```

## Run A2C in XuanCe

Before running A2C in XuanCe, you need to prepare a conda environment and install ``xuance`` following the [**installation steps**](./../../../usage/installation.rst#install-xuance).

### Run Built-in Demos

After completing the installation, you can open a Python console and run A2C directly using the following commands:

```python3
import xuance
runner = xuance.get_runner(method='a2c',
                           env='classic_control',  # Choices: classic_control, box2d, atari.
                           env_id='CartPole-v1',  # Choices: CartPole-v1, LunarLander-v2, ALE/Breakout-v5, etc.
                           is_test=False)
runner.run()  # Or runner.benchmark()
```
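
To evaluate a trained model instead of training a new one, the same entry point can be reused with ``is_test=True``; this mirrors the training call above (check the XuanCe documentation for the exact testing workflow):

```python3
import xuance
runner = xuance.get_runner(method='a2c',
                           env='classic_control',
                           env_id='CartPole-v1',
                           is_test=True)  # Run in test mode with the saved model instead of training.
runner.run()
```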

### Run With Self-defined Configs

If you want to run A2C with different configurations, you can build a new ``.yaml`` file, e.g., ``my_config.yaml``.
Then, run A2C with the following code block:

```python3
import xuance as xp
runner = xp.get_runner(method='a2c',
                       env='classic_control',  # Choices: classic_control, box2d, atari.
                       env_id='CartPole-v1',  # Choices: CartPole-v1, LunarLander-v2, ALE/Breakout-v5, etc.
                       config_path="my_config.yaml",  # The path to the my_config.yaml file should be correct.
                       is_test=False)
runner.run()  # Or runner.benchmark()
```

To learn more about the configurations, please visit the [**tutorial of configs**](./../../configs/configuration_examples.rst).

### Run With Customized Environment

If you would like to run XuanCe's A2C in your own environment that is not included in XuanCe, you need to define the new environment following the steps in the [**New Environment Tutorial**](./../../../usage/new_envs.rst). Then, [**prepare the configuration file**](./../../../usage/new_envs.rst#step-2-create-the-config-file-and-read-the-configurations)
``a2c_myenv.yaml``.

After that, you can run A2C in your own environment with the following code:

```python3
import argparse
from xuance.common import get_configs
from xuance.environment import REGISTRY_ENV
from xuance.environment import make_envs
from xuance.torch.agents import A2C_Agent

configs_dict = get_configs(file_dir="a2c_myenv.yaml")
configs = argparse.Namespace(**configs_dict)
REGISTRY_ENV[configs.env_name] = MyNewEnv  # MyNewEnv is the environment class you defined following the tutorial above.

envs = make_envs(configs)  # Make parallel environments.
Agent = A2C_Agent(config=configs, envs=envs)  # Create an A2C agent from XuanCe.
Agent.train(configs.running_steps // configs.parallels)  # Train the model for the configured number of steps.
Agent.save_model("final_train_model.pth")  # Save the model to model_dir.
Agent.finish()  # Finish the training.
```

## APIs

### PyTorch

```{eval-rst}
.. automodule:: xuance.torch.agents.policy_gradient.a2c_agent
    :members:
    :undoc-members:
    :show-inheritance:
```

### TensorFlow2

```{eval-rst}
.. automodule:: xuance.tensorflow.agents.policy_gradient.a2c_agent
    :members:
    :undoc-members:
    :show-inheritance:
```

### MindSpore

```{eval-rst}
.. automodule:: xuance.mindspore.agents.policy_gradient.a2c_agent
    :members:
    :undoc-members:
    :show-inheritance:
```