Commit 295e63c (#90): a2c_agent.md commit

* pdqn_agent.md commit1
* pdqn_agent.md commit2
* a2c_agent.md commit-1

Parent: 670cb90 · 3 files changed: +170 −26 lines

# Advantage Actor Critic (A2C)

The Advantage Actor Critic (A2C) algorithm is a reinforcement learning method that combines the policy gradient method with value function approximation. It uses an advantage function instead of an action-state value function to improve the stability and performance of learning.

This table lists some general features of the A2C algorithm:

| Features of A2C   | Values | Description                                                 |
|-------------------|--------|-------------------------------------------------------------|
| On-policy         | ✅     | The evaluation policy is the same as the target policy.     |
| Off-policy        | ❌     | The evaluation policy is different from the target policy.  |
| Model-free        | ✅     | No need to prepare an environment dynamics model.           |
| Model-based       | ❌     | Need an environment model to train the policy.              |
| Discrete Action   | ✅     | Deals with discrete action spaces.                          |
| Continuous Action | ✅     | Deals with continuous action spaces.                        |

## Actor Critic (AC) Framework

The Actor-Critic method selects actions through the Actor and evaluates those actions through the Critic; the two components cooperate with each other to improve.

### Critic

The Critic, also called the value network, uses a neural network $Q^\pi(s,a;w)$ to approximate the action value function $Q^\pi(s,a)$. In one-step Q-Learning, the parameters $w$ of the action value function $Q^\pi(s,a;w)$ are learned by iteratively minimizing a sequence of loss functions, where the $i$-th loss function is defined as

$$
L_i(w_i)=\mathbb{E}[(r+\gamma Q(s',a';w_{i})-Q(s,a;w_i))^2],
$$

where $s'$ is the state encountered after state $s$.

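As a rough illustration of this loss, here is a minimal PyTorch-style sketch (not XuanCe's implementation; the function ``critic_td_loss``, the network ``q_net``, and the batch tensors are hypothetical placeholders for a discrete action space):

```python3
import torch
import torch.nn.functional as F

def critic_td_loss(q_net, s, a, r, s_next, a_next, gamma=0.99):
    """One-step TD loss L(w) = E[(r + gamma * Q(s', a'; w) - Q(s, a; w))^2]."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a; w)
    with torch.no_grad():  # treat the bootstrapped target as a fixed value
        q_next = q_net(s_next).gather(1, a_next.unsqueeze(1)).squeeze(1)  # Q(s', a'; w)
    return F.mse_loss(q_sa, r + gamma * q_next)
```
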
### Actor

The Actor, also called the policy network, is similar to the [**Policy Gradient (PG)**](./pg_agent.md) method. The Actor directly optimizes the policy to maximize the expected return. Its objective function is expressed as follows:

$$
J(\theta) = \mathbb{E}_{\pi_{\theta}}{[\sum_{t=0}^{\infty}{\gamma^t r_t}]}.
$$

To optimize the policy function $\pi_\theta$, we calculate the gradient of the objective function $J(\theta)$ with respect to the parameters $\theta$:

$$
\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log{\pi_{\theta}(a|s)}\,Q^{\pi_{\theta}}(s, a)].
$$

When training the actor, we use the action value $Q(s,a;w)$ obtained from the critic as an approximation of $Q^{\pi_{\theta}}(s,a)$. By alternately training the actor and the critic, we ultimately achieve the goal of maximizing $J(\theta)$.

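The corresponding actor update can be sketched as follows; this is again only an illustrative fragment (a discrete action space is assumed, and ``policy_net`` together with the critic estimate ``q_sa`` are hypothetical placeholders):

```python3
import torch

def actor_loss(policy_net, s, a, q_sa):
    """Surrogate loss whose gradient is -E[grad log pi(a|s) * Q(s, a)]."""
    dist = torch.distributions.Categorical(logits=policy_net(s))  # pi_theta(.|s)
    log_prob = dist.log_prob(a)                                   # log pi_theta(a|s)
    # Minimizing this loss performs gradient ascent on J(theta).
    return -(log_prob * q_sa.detach()).mean()
```
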
## Advantage Actor Critic (A2C)

In the above-mentioned Actor-Critic framework, we use $Q(s,a;w)$ to update the policy. In Advantage Actor Critic (A2C), we instead use the advantage function, which represents the additional reward for taking a certain action relative to the average state value:

$$
A(a_t,s_t)=Q(a_t,s_t)-V(s_t)\approx r_t+\gamma V(s_{t+1}) - V(s_t),
$$

where

$$
Q_\pi(a_t,s_t)-V_\pi(s_t)=\mathbb{E}[R_t+\gamma v_\pi(S_{t+1}) - v_\pi(S_t)|S_t=s_t].
$$

The advantage function reduces the variance of the policy gradient estimates. We can then rewrite the gradient of the objective function $J(\theta)$:

$$
\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log{\pi_{\theta}(a|s)}\,A^\pi(s,a)].
$$

For the critic network, we rewrite the loss function as:

$$
L(w)=\mathbb{E}[(r_t+\gamma V(s_{t+1};w)-V(s_t;w))^2].
$$

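Putting these two pieces together, a minimal sketch of the A2C losses might look as follows (again not XuanCe's implementation; ``policy_net``, ``value_net``, and the batch tensors are hypothetical placeholders, and a discrete action space is assumed):

```python3
import torch
import torch.nn.functional as F

def a2c_losses(policy_net, value_net, s, a, r, s_next, gamma=0.99):
    v_s = value_net(s).squeeze(-1)              # V(s_t; w)
    with torch.no_grad():
        v_next = value_net(s_next).squeeze(-1)  # V(s_{t+1}; w), used as a fixed target
    advantage = (r + gamma * v_next - v_s).detach()  # A(a_t, s_t) ~= r_t + gamma * V(s_{t+1}) - V(s_t)

    dist = torch.distributions.Categorical(logits=policy_net(s))
    actor_loss = -(dist.log_prob(a) * advantage).mean()  # policy gradient weighted by the advantage
    critic_loss = F.mse_loss(v_s, r + gamma * v_next)    # L(w) from the equation above
    return actor_loss, critic_loss
```
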
## Framework

The structural framework of A2C, as implemented in XuanCe, is illustrated in the figure below.

```{eval-rst}
.. image:: ./../../../../_static/figures/algo_framework/a2c_framework.png
    :width: 80%
    :align: center
```

## Run A2C in XuanCe

Before running A2C in XuanCe, you need to prepare a conda environment and install ``xuance`` following the [**installation steps**](./../../../usage/installation.rst#install-xuance).

### Run Built-in Demos

After completing the installation, you can open a Python console and run A2C directly using the following commands:

```python3
import xuance
runner = xuance.get_runner(method='a2c',
                           env='classic_control',  # Choices: classic_control, box2d, atari.
                           env_id='CartPole-v1',   # Choices: CartPole-v1, LunarLander-v2, ALE/Breakout-v5, etc.
                           is_test=False)
runner.run()  # Or runner.benchmark()
```

### Run With Self-defined Configs

If you want to run A2C with different configurations, you can build a new ``.yaml`` file, e.g., ``my_config.yaml``.
Then, run A2C with the following code block:

```python3
import xuance as xp
runner = xp.get_runner(method='a2c',
                       env='classic_control',  # Choices: classic_control, box2d, atari.
                       env_id='CartPole-v1',   # Choices: CartPole-v1, LunarLander-v2, ALE/Breakout-v5, etc.
                       config_path="my_config.yaml",  # The path to the my_config.yaml file should be correct.
                       is_test=False)
runner.run()  # Or runner.benchmark()
```

To learn more about the configurations, please visit the [**tutorial of configs**](./../../configs/configuration_examples.rst).

### Run With Customized Environment

If you would like to run XuanCe's A2C in your own environment that is not included in XuanCe, you need to define the new environment following the steps in the [**New Environment Tutorial**](./../../../usage/new_envs.rst). Then, [**prepare the configuration file**](./../../../usage/new_envs.rst#step-2-create-the-config-file-and-read-the-configurations) ``a2c_myenv.yaml``.

After that, you can run A2C in your own environment with the following code:

```python3
import argparse
from xuance.common import get_configs
from xuance.environment import REGISTRY_ENV
from xuance.environment import make_envs
from xuance.torch.agents import A2C_Agent

configs_dict = get_configs(file_dir="a2c_myenv.yaml")
configs = argparse.Namespace(**configs_dict)
REGISTRY_ENV[configs.env_name] = MyNewEnv  # MyNewEnv is your environment class, defined following the New Environment Tutorial.

envs = make_envs(configs)  # Make parallel environments.
Agent = A2C_Agent(config=configs, envs=envs)  # Create an A2C agent from XuanCe.
Agent.train(configs.running_steps // configs.parallels)  # Train the model for numerous steps.
Agent.save_model("final_train_model.pth")  # Save the model to model_dir.
Agent.finish()  # Finish the training.
```

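The class ``MyNewEnv`` registered above is the environment you define yourself. As a rough, hypothetical sketch only (the exact base class, attributes, and method signatures should be taken from the [**New Environment Tutorial**](./../../../usage/new_envs.rst), not from this fragment), such a class follows a Gymnasium-style interface:

```python3
import numpy as np
from gymnasium.spaces import Box, Discrete
from xuance.environment import RawEnvironment  # assumed import; see the tutorial for the exact base class

class MyNewEnv(RawEnvironment):
    def __init__(self, env_config):
        super(MyNewEnv, self).__init__()
        self.env_id = env_config.env_id
        self.observation_space = Box(-np.inf, np.inf, shape=(4,))  # example observation space
        self.action_space = Discrete(2)                            # example action space
        self.max_episode_steps = 200
        self._current_step = 0

    def reset(self, **kwargs):
        self._current_step = 0
        return self.observation_space.sample(), {}

    def step(self, action):
        self._current_step += 1
        observation = self.observation_space.sample()  # replace with your environment dynamics
        reward = 0.0
        terminated = False
        truncated = self._current_step >= self.max_episode_steps
        return observation, reward, terminated, truncated, {}

    def render(self, *args, **kwargs):
        return np.zeros([64, 64, 3])

    def close(self):
        return
```
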
## APIs

### PyTorch

```{eval-rst}
.. automodule:: xuance.torch.agents.policy_gradient.a2c_agent
    :members:
    :undoc-members:
    :show-inheritance:
```

### TensorFlow2

```{eval-rst}
.. automodule:: xuance.tensorflow.agents.policy_gradient.a2c_agent
    :members:
    :undoc-members:
    :show-inheritance:
```

### MindSpore

```{eval-rst}
.. automodule:: xuance.mindspore.agents.policy_gradient.a2c_agent
    :members:
    :undoc-members:
    :show-inheritance:
```

docs/source/documents/api/agents/drl/a2c_agent.rst: this file was deleted (−26 lines).
