For the course:
Implementing Artificial Neural Networks (ANNs) with Tensorflow (winter term 2019/20)
Topic:
A2C for continuous action spaces applied on the LunarLanderContinuous environment from OpenAI Gym
Participants:
Jonas Otten
Alexander Prochnow
Paul Jänsch
- Introduction/Motivation
- Theoretical Background
- Project development log
- The model and the experiment
  - 4.1 Original approach with MLP policy network
  - 4.2 An alternative actor: Recurrent policy network
- How to run training and testing
- Results
  - 6.1 MLP
  - 6.2 GRU
- Discussion
- References
As a final project of this course, one possible task was to identify an interesting research paper in the field of ANNs and reproduce its content with the knowledge gained during the course. In our case, we first decided on what we wanted to implement and then looked for suitable papers on which we could base our work.
Inspired by the lecture about Deep Reinforcement Learning (RL) held by Leon Schmid we wanted to take the final project as an opportunity to gather some hands-on experience in this fascinating area. So Reinforcement Learning it is. But where to start? We were looking for something which offers a nice tradeoff between accessibility, challenge and chance of success. The Gym environments provided by OpenAI seemed to do just that. Most of them are set up and ready to run within a minute and with them, there is no need to worry about how to access observations or perform actions. One can simply focus on implementing the wanted algorithm.
Speaking of which, from all the classic DRL techniques we knew so far, the Synchronous Advantage Actor-Critic algorithm (short A2C) seemed most appropriate for the extent of the project. Challenging but doable in a few weeks. This left us with two remaining questions to answer before we could start our project.
First, which environment exactly should our learning algorithm try to master using A2C? Since there are better solutions than A2C for environments with discrete action spaces, Leon recommended that we go with the LunarLanderContinuous environment.
And second, which A2C-related papers provide us with the necessary theoretical background and also practical inspiration on how to tackle the implementation? We would like to postpone the answer to this question to later sections. In particular, the next section about our theoretical background will show which knowledge our project is built on and where it emerged from. Later sections about the practical implementation also contain references to important resources.
In RL an agent interacts with an environment by observing a state $s_t$, taking an action $a_t$ according to a policy $\pi(a|s)$ and receiving a reward $R_{t+1}$ in return.
We assume that the environment is modeled by a Markov decision process (MDP), which consists of a state transition function $P(s_{t+1}|s_t, a_t)$ and a reward function $R(s_t, a_t)$.
In RL the goal is to maximize the cumulative discounted return at each timestep $t$, $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$, with $\gamma \in [0, 1)$ being the discount factor that weighs immediate against future rewards.
For large or continuous state spaces, such as the one of the LunarLander environment, tabular methods that store a separate value for every state become infeasible. In order to overcome this problem, function approximation is used to find a good parameterization of the policy $\pi_{\theta}(a|s)$ and of the value functions, nowadays typically with deep neural networks (DNNs).
A big problem in the usage of DNNs for RL is the difficulty of computing the gradient in methods which estimate the policy $\pi_{\theta}(a|s)$ directly: the gradient of the objective $J(\theta)$ depends on the stationary state distribution $d^{\pi}(s)$, which itself depends on the policy parameters and is hard to differentiate.
A reformulation of the gradient of the reward function, called the policy gradient theorem (proof: Sutton & Barto, 2018), avoids the calculation of the derivative of the stationary distribution:
$$ \begin{aligned} \nabla_{\theta} J(\theta) &= \nabla_{\theta} \sum_{s \in S} d^{\pi}(s) \sum_{a \in A} Q^{\pi}(s,a)\, \pi_{\theta}(a|s) \\ &\propto \sum_{s \in S} d^{\pi}(s) \sum_{a \in A} Q^{\pi}(s,a)\, \nabla_{\theta} \pi_{\theta}(a|s) \\ &= \mathbb{E}_{\pi}[Q^{\pi}(s,a) \nabla_{\theta} \ln{\pi_{\theta}(a|s)}] \end{aligned} $$
This formula holds under the assumptions that the state distribution $d^{\pi}(s)$ is the stationary distribution induced by following $\pi_{\theta}$ and that the actions are sampled from $\pi_{\theta}$ itself, i.e. everything inside the expectation is on-policy.
An algorithm that makes use of the policy gradient theorem is the actor-critic method. The critic updates the parameters $w$ of a value function (in our case the state-value function $V_w(s)$), while the actor updates the policy parameters $\theta$ in the direction suggested by the critic.
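To make this interplay concrete, the generic one-step actor-critic updates can be written as follows (a textbook formulation that uses the TD error $\delta_t$ as advantage estimate, not yet our exact batched implementation), with learning rates $\alpha_w$ and $\alpha_{\theta}$:

$$ \begin{aligned} \delta_t &= R_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t) \\ w &\leftarrow w + \alpha_w \, \delta_t \, \nabla_w V_w(s_t) \\ \theta &\leftarrow \theta + \alpha_{\theta} \, \delta_t \, \nabla_{\theta} \ln \pi_{\theta}(a_t|s_t) \end{aligned} $$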
Sources used:
- the book Reinforcement Learning: An Introduction 1
- the A3C paper2
- Lilian Weng's blogpost about Policy Gradient Algorithms3
- A2C Code provided by OpenAI
Here we describe how we approached the given problem, name the steps we have taken and explain the motivation for the decisions we made in the process of this project. (Readers only interested in the final result and explanations of the important code segments can skip this part and continue with the section "The model and the experiment".)
Instead of directly heading into the complex case of an environment with a continuous action space, we decided to first start with a simpler version of A2C, namely A2C for a discrete action space and without parallelization. For this, we took the CartPole gym environment. Mastering this environment was the objective of phase 1, which can also be seen as a pre-phase to phase 2 (the main phase).
Phase 1:
- getting the gym environment to run
- setting up two simple networks for the actor and the critic
- using the actor network to run one agent for arbitrarily many episodes and save the observations made
- using the saved observations to train both actor and critic based on the estimated return (a minimal sketch of this loop is given right after this list)
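A minimal sketch of such a Phase 1 loop is shown below. It assumes the classic Gym API (env.step returning four values) and uses illustrative network sizes and hyperparameters; it is meant to convey the structure rather than reproduce our exact code.

```python
import gym
import numpy as np
import tensorflow as tf

def compute_returns(rewards, gamma=0.99):
    """Discounted return G_t for every timestep of one episode."""
    returns = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# simple MLPs for the discrete CartPole action space (2 actions)
actor = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])
critic = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
actor_opt = tf.keras.optimizers.Adam(1e-3)
critic_opt = tf.keras.optimizers.Adam(1e-3)

env = gym.make("CartPole-v1")
for episode in range(500):
    # run one episode with the current actor and save the observations
    states, actions, rewards = [], [], []
    state, done = env.reset(), False
    while not done:
        probs = actor(state[None, :].astype(np.float32)).numpy()[0]
        action = np.random.choice(2, p=probs)
        next_state, reward, done, _ = env.step(action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state

    # train both actor and critic on the estimated returns of this episode
    returns = tf.constant(compute_returns(rewards))
    states_t = tf.constant(np.array(states, dtype=np.float32))
    with tf.GradientTape() as a_tape, tf.GradientTape() as c_tape:
        values = tf.squeeze(critic(states_t), axis=-1)
        advantage = returns - values
        probs_taken = tf.gather(actor(states_t), actions, batch_dims=1)
        actor_loss = -tf.reduce_mean(
            tf.math.log(probs_taken + 1e-8) * tf.stop_gradient(advantage))
        critic_loss = tf.reduce_mean(tf.square(advantage))
    actor_opt.apply_gradients(zip(
        a_tape.gradient(actor_loss, actor.trainable_variables),
        actor.trainable_variables))
    critic_opt.apply_gradients(zip(
        c_tape.gradient(critic_loss, critic.trainable_variables),
        critic.trainable_variables))
```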
Even with our simple network architecture, we were able to observe a considerable learning effect, finally leading to our agent mastering this simple environment. Although the training result was not stable enough (after several successful episodes the agent started to get worse again), we decided not to optimize our setup on the CartPole environment, but instead to switch to an environment with a continuous action space and optimize our learning there. This leads us to phase 2.
Phase 2:
- changing to the LunarLanderContinuous gym environment
- dividing the current Jupyter notebook into separate Python files (main.py, coordinator.py, agent.py, actor.py and critic.py)
- the agent is now responsible for
  - creating the environment,
  - running an episode and saving the observations,
  - computing the gradients for both networks and returning them to the coordinator
- the coordinator
  - creates the agent,
  - tells the agent to run an episode based on the current actor,
  - and uses the returned gradients to update the networks
- modifying the network architecture of the actor to match the new action space: it now has to return two pairs of mean and variance values, each pair describing one normal distribution from which we sample the action for the main and the side engine
- at this point we decided to implement parallel computation of episodes with multiple agents to speed up the learning (because up to this point we were not able to see any useful learning):
  - we looked at different parallelization packages and after some testing, we decided to go with Ray
  - Ray allowed us to run multiple agents on our CPUs/GPUs in parallel, thereby significantly boosting our learning
With the speed-up provided by the parallelization and further fixes of minor but sometimes critical issues, we were finally able to observe our agents learning useful behaviour in the LunarLander environment, up to the point where the Lander actually stopped crashing down on the moon every single time and made its first successful landings. That's one small step for RL research, one giant leap for our team.
But we were not quite satisfied with the result yet. The learning process was still very slow and so we decided to add one more ingredient: Long short-term memory or LSTM for short. Adding LSTM to the actor network is said to greatly improve its performance. Further, it might enable our agents to solve other environments, like the BipedalWalker, which require some kind of longer-lasting memory.
We advanced into the last phase of our project, which mainly deals with improvements like the implementation of LSTM but also with cleaning, restructuring and polishing the code to achieve its final form.
Phase 3:
- LSTM implementation:
  - adding pre-built Keras LSTM layers to the actor network
  - expanding the parameter list of the actor's constructor such that one can choose whether the network should use the newly added LSTM layers or the previously used Dense layers
  - the model does not train properly because the hidden states do not get reset in the forward pass during one update; training collapses after the cumulative return reaches a particular threshold
  - replacing the Keras LSTM layers with custom GRU cells and implementing state resetting through masks passed along with the input
  - the model trains
- setting up checkpointing for the custom layers
- having the code infer parameters from the environment
- adding an ArgumentParser to the main.py to allow for different settings to be used when calling the main.py (test/training run, type of actor network, number of agents used, type of environment); a sketch of this parser follows after this list
- cleaning the code:
- removing old unused code parts
- adding necessary comments to the code
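The command-line interface from the ArgumentParser item above could be set up roughly as follows. The flags match the ones described in the usage section further below, while the default values and the --env flag name are assumptions for illustration.

```python
import argparse

parser = argparse.ArgumentParser(description="A2C on LunarLanderContinuous")
parser.add_argument("--train", action="store_true", help="run a training session")
parser.add_argument("--test", action="store_true",
                    help="load the checkpoints and render the environment")
parser.add_argument("--network_type", choices=["mlp", "gru"], default="mlp",
                    help="type of the actor (policy) network")
parser.add_argument("--num_agents", type=int, default=8,
                    help="number of parallel agents (assumed default)")
parser.add_argument("--env", default="LunarLanderContinuous-v2",
                    help="Gym environment id (flag name assumed)")
args = parser.parse_args()
```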
This section makes up the main part of our report. Here we will highlight and explain the important parts of our project's implementation. We try to present the code in the most semantically logical and intuitive order to facilitate comprehension. The code itself is already structured into several classes, and we will always indicate which class we are currently talking about.
We are starting with the coordinator class because, as its name suggests, it organizes the use of every other class and also the whole procedure of the learning process. From there we will go step by step and jump into the other classes as they are coming up.
The instantiation of the coordinator happens in the main.py (Figure 1), and the execution of its __init__() method initializes everything needed for successful learning. The most crucial point in this part is probably the instantiation of the two neural networks which build the core of the A2C method, namely the Actor and the Critic.
As one can see here, network-related things like the loss function and the optimizers are also created at this point. But let's take the chance to go into both classes and look at the architectures of the networks (Figure 3 & 4).
The Critic:
The Actor:
This is just to gain a quick overview of the networks for now; we will explain our choices of e.g. activation functions once their purpose becomes apparent.
Back in the __init__() method of the coordinator, there is one more important step to talk about: the creation of the agents which will run on the environment in a parallel manner.
The instantiation of the agents exhibits an anomaly: the keyword .remote. This is necessary because the agent class is declared as a Ray remote class, which has the following implications when instantiated (a minimal sketch of the pattern follows the list):
- Instantiation must be done with Agent.remote() instead of Agent(), as seen in the screenshot
- A worker process is started on a single thread of the CPU
- An Agent object is instantiated on that worker
- Methods of the Agent class called on multiple Agents are executed on their respective workers and can therefore run in parallel, but must be called with agent_instance.function.remote()
- A remote function call immediately returns a task ID; the actual results can be obtained later when needed by calling ray.get(task_ID)
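As a minimal sketch of this pattern (simplified, not our actual Agent class):

```python
import ray

ray.init()  # start the Ray runtime once, e.g. at the top of main.py

@ray.remote
class Agent:
    def __init__(self, env_name):
        self.env_name = env_name  # each worker holds its own environment

    def observe(self):
        # ... would return the current state of this agent's environment
        return self.env_name

# Agent.remote() starts one worker process per agent
agents = [Agent.remote("LunarLanderContinuous-v2") for _ in range(4)]

# remote calls return task IDs immediately and run in parallel ...
task_ids = [agent.observe.remote() for agent in agents]
# ... while ray.get() blocks until the actual results are available
states = ray.get(task_ids)
```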
After instantiation we assign the first agent to be the "chief" (Figure 5). Its environment will be rendered during training, while the environments of the other agents will run in the background. This adds a fun way to watch the performance of our AI, other than graphs and numbers (not that those are not fun, too).
Besides the Ray-specific specialties, the agent class still has a normal __init__() method at which we want to take a short glance now:
Noteworthy here is the creation of the OpenAI Gym environment in which the agent will act (Figure 6, Line 11) and the instantiation of the agent's memory (Figure 6, Line 22). The memory is represented as an object of our Memory class. As expected, an object of this class is responsible for temporarily storing the observations an agent makes. This includes the states visited, actions taken, rewards received, flags indicating whether a terminal state was reached and, while not strictly an observation, a return estimate. We will have a look at the important methods of the Memory class when we are dealing with the agents executing actions and making observations.
The rest of the coordinator's __init__() handles the preparation of the TensorBoard in order to be able to inspect the training progress.
Now that our coordinator is fully operational we can start the training by calling its train()
method in the main.py (Figure 1, Line 35).
This method is the heart of the coordinator and is assisted by quite a lot of helper methods and also some other classes we have not talked about in detail yet. We will go through all of them and explain their use in the order they are needed in the train()
method.
First, we advance the environments a number of timesteps equal to our hyperparameter num_steps
by calling step_parallel(t)
accordingly (Figure 7, Line 73-74). Then, in later parts of the train()
method, we use the collected observations to update the networks. This way we update the network parameters only every num_steps
(e.g. 32) timesteps. Before we can get into how we update the networks though, we must first have our agents act in the environment and return observations to us. This is the purpose of the step_parallel(t)
method. It advances all environments from timestep t to t+1 by observing the current state and then computing an action to perform (Figure 8).
Observing the current states of the environments of multiple agents can be done in parallel by calling the agent.observe()
function on all agents (Figure 8, Line 132). Since observe() is a remote function, our list comprehension returns a list of task IDs and not the actual states; therefore we must call ray.get()
to obtain them. Taking a look at the observe()
function (Figure 9) we notice that if we are at the start of a new update, it will reset the agent's memory since we only want to take observations made in the current update cycle into account for the current network update (Figure 9, Line 26-27). We will elaborate on the memory class in the coming section. For now, all we want is the current state, which is stored in the self.state
attribute of the agent. If the previous episode was finished the environment will be reset and the attribute will instead contain the initial state of the new episode (Figure 9, Line 30-31).
Returning the current state of every agent to the coordinator, we are now ready to compute our next action for each agent. As described previously: In Lunar Lander, our agent's action consists of two values, one controlling the main engine, the other the side engines. The values can take on virtually any real number within [-1, 1]. We sample these values from two normal distributions per agent (Figure 8, Line 133-135), each with parameters mu and sigma, denoting the location and scale of the distribution. These parameters (one mu and sigma for the main engine distribution and one mu and sigma for the side engines distribution per agent) are the output of our Actor neural network. To compute them, we call the get_action_distribution()
function (Figure 10), which passes the current states of all environments to the actor network (Figure 4). It returns the mentioned mus and sigmas, which we use to create normal distribution objects (Figure 10, Line 145) that will now be sampled from (Figure 8, Line 135).
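Put together, one parallel step on the coordinator side looks roughly like the sketch below. The method and argument names mirror the ones discussed here, but the exact signatures are assumptions, and we use TensorFlow Probability for the distribution objects.

```python
import numpy as np
import ray
import tensorflow_probability as tfp

def get_action_distribution(actor, states):
    # the actor maps the batch of states to one Normal distribution per action value
    mu, sigma = actor(states)
    return tfp.distributions.Normal(loc=mu, scale=sigma)

def step_parallel(agents, actor, t):
    # gather the current state of every agent's environment in parallel
    states = np.array(ray.get([agent.observe.remote(t) for agent in agents]))

    # sample one action per agent (one value for the main and one for the side engines)
    dist = get_action_distribution(actor, states.astype(np.float32))
    actions = np.clip(dist.sample().numpy(), -1.0, 1.0)  # clipping is an extra safeguard

    # execute the sampled actions; each agent stores the transition in its memory
    return ray.get([agent.execute.remote(action, t)
                    for agent, action in zip(agents, actions)])
```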
At this point the reasons for our architectural choices for the actor network become apparent: The tanh of the mu output layer (Figure 4, Line 27) keeps the center of our normal distributions, i.e. the average of our sampled values, within [-1, 1], which is useful since this is exactly the action space. Similarly, for the sigma output layer (Figure 4, Line 28), a softplus activation ensures that the mathematical restrictions of the standard deviation are upheld, namely that it is strictly positive.
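A feed-forward actor with exactly these output heads could be sketched like this; the hidden layer sizes and the small offset on sigma are illustrative assumptions:

```python
import tensorflow as tf

class MLPActor(tf.keras.Model):
    def __init__(self, num_actions=2):
        super().__init__()
        self.hidden1 = tf.keras.layers.Dense(64, activation="relu")
        self.hidden2 = tf.keras.layers.Dense(64, activation="relu")
        # tanh keeps the distribution means inside the action space [-1, 1]
        self.mu = tf.keras.layers.Dense(num_actions, activation="tanh")
        # softplus keeps the standard deviations strictly positive
        self.sigma = tf.keras.layers.Dense(num_actions, activation="softplus")

    def call(self, states):
        x = self.hidden2(self.hidden1(states))
        return self.mu(x), self.sigma(x) + 1e-5  # offset avoids a collapsed sigma of 0
```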
Lastly, to complete our step in the environment, we execute the computed actions (Figure 8, Line 138). A deep dive into the agent's execute function is required before we return the agent's memories to the coordinator. That is because we have to form the agent's memories first. Let us take a look at how this is done.
The agent performs the given action on the environment and stores the resulting state, reward and done flag returned by the environment, then updates the internal state self.state
and the finished flag (Figure 11, Line 50-53).
Its observations are stored by the memory object instantiated from our Memory class (memory.py). It is initialized in the agent's __init__() as seen before (Figure 6, Line 22) and possesses numpy arrays to store states, actions, rewards, estimated returns and terminal booleans denoting whether a terminal state was reached. The agent's memory starts off empty (Figure 12, Line 10-15). Observations can be stored in the arrays via the index representing the timestep (Figure 12, Line 17-22).
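A reduced version of such a memory class could look like this; the field names mirror the description above, while the pre-allocated sizes are assumptions:

```python
import numpy as np

class Memory:
    def __init__(self, num_steps, state_dim=8, action_dim=2):
        # pre-allocated storage for one update cycle of num_steps timesteps
        self.states = np.zeros((num_steps, state_dim), dtype=np.float32)
        self.actions = np.zeros((num_steps, action_dim), dtype=np.float32)
        self.rewards = np.zeros(num_steps, dtype=np.float32)
        self.terminals = np.zeros(num_steps, dtype=bool)
        self.estimated_returns = np.zeros(num_steps, dtype=np.float32)

    def store(self, t, state, action, reward, terminal):
        # observations are indexed by the timestep within the current update cycle
        self.states[t] = state
        self.actions[t] = action
        self.rewards[t] = reward
        self.terminals[t] = terminal
```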
Now that the memories of the agents are filled with the exciting experiences of one timestep, they are eager to return them to the coordinator. But we have taught them well, so that they will only return them to the coordinator when the required num_steps
is reached (Figure 11, Line 56-59). Until then they repeat the observe and execute routine and only afterwards collectively return a list containing every agent's memory object to the coordinator.
Now it is time for the coordinator to utilize these memories to make the agents better.
We do this by first computing the discounted cumulative return (Figure 13, Line 76). As described in the theoretical background, our goal is to maximize the cumulative discounted return at each timestep; we store it as self.estimated_return. It is calculated by iterating over the reversed list of rewards (Figure 14, Line 54) and summing up all rewards, but discounting future rewards more heavily (Figure 14, Line 57), e.g. $G_1 = R_{t=1} + \gamma \cdot R_{t=2} + \gamma^2 \cdot R_{t=3} + ...$. Our function does this in a somewhat unintuitive backwards manner, which becomes clear when going through a small example. Consider a reward list with only 3 entries at the end of an episode and a cumulative_return initialized with 0. For the last timestep, estimated_return[3] is simply self.rewards[3], and cumulative_return now also equals self.rewards[3]. One step earlier, estimated_return[2] becomes self.rewards[2] + $\gamma \cdot$ cumulative_return. Finally, estimated_return[1] is self.rewards[1] + $\gamma \cdot$ (self.rewards[2] + $\gamma \cdot$ self.rewards[3]), which is exactly the discounted return $G_1$ from above.
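In code, this backwards accumulation only takes a few lines. The sketch below additionally resets the running sum at terminal timesteps, which is an assumption about how episode boundaries inside one update cycle are handled:

```python
import numpy as np

def compute_estimated_return(rewards, terminals, gamma=0.99):
    """Backwards pass over the rewards collected in one update cycle."""
    estimated_return = np.zeros(len(rewards), dtype=np.float32)
    cumulative_return = 0.0
    for t in reversed(range(len(rewards))):
        if terminals[t]:
            cumulative_return = 0.0  # do not discount across episode boundaries
        cumulative_return = rewards[t] + gamma * cumulative_return
        estimated_return[t] = cumulative_return
    return estimated_return

# e.g. three rewards of 1 at the end of an episode with gamma = 0.9 yield
# [1 + 0.9 + 0.81, 1 + 0.9, 1] = [2.71, 1.9, 1.0]
print(compute_estimated_return(np.array([1.0, 1.0, 1.0]),
                               np.array([False, False, True])))
```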
Back in the coordinator, concatenating all the observations made across all agents can then be done using the sum function (Figure 13, Line 77), as we have overridden the Memory class's __add__ method, i.e. defined what happens when two memory objects are added together: their observations are concatenated.
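Overriding __add__ for this purpose can be sketched as follows, building on the reduced Memory class above; the real implementation may keep additional bookkeeping:

```python
import numpy as np

class CombinableMemory(Memory):  # Memory as sketched above
    def __add__(self, other):
        combined = CombinableMemory(0)  # empty shell, fields are overwritten below
        combined.states = np.concatenate([self.states, other.states])
        combined.actions = np.concatenate([self.actions, other.actions])
        combined.rewards = np.concatenate([self.rewards, other.rewards])
        combined.terminals = np.concatenate([self.terminals, other.terminals])
        combined.estimated_returns = np.concatenate(
            [self.estimated_returns, other.estimated_returns])
        return combined

# sum() needs a start value that supports "+", e.g. the first memory:
# collective_memory = sum(memories[1:], memories[0])
```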
The memory object of the coordinator now contains the collective memory of all agents and their discounted returns. These are needed to compute the actor loss and critic loss, which we want to minimize, so we compute their gradients. This is coordinated by the _get_mean_gradients()
function (Figure 13, Line 80, Figure 15). Since we have two networks, two gradients are computed: The policy gradients maximize the actor loss, thereby maximizing the estimated return (Figure 15, Line 161). This might seem unconventional since the name "loss" suggests that it should be minimized, but as it is derived from the reward function, we want to maximize it. The term contains a log probability and will in fact converge to 0, because the logarithm of probability values (which are between 0 and 1) is always negative and approaches zero as the probabilities approach 1.
The critic gradients minimize the critic loss, which will minimize the Mean Squared Error for the state value function (Figure 15, Line 163).
Let's look at how we calculate the two losses. Firstly, we see that the final actor loss (Figure 16, Line 170) is adjusted by an entropy term. Adding the entropy term to the actor loss has been found to improve exploration, which minimizes the risk of convergence to an only locally optimal policy (A3C paper page 4). This adds a new hyperparameter, the entropy coefficient (ENTROPY_COEF
), which balances the amount of exploration.
The unmodified actor loss is returned from our _actor_loss
method, which first estimates the state value by passing all states to the critic (Figure 3). In the Lunar Lander environment, a state is a vector of 8 values, denoting different aspects within the environment, e.g. the coordinates of the vessel. So our critic takes this state vector as an input and outputs the state value. Applying L2-regularization has improved our critic loss during training (Figure 3, Line 10).
The state values are now used to get an estimate of the advantage (Figure 17, Line 183).
But our actor loss also consists of a second part. As described in the theoretical background, our actor gradients are given by $\mathbb{E}_{\pi}[A^{\pi}(s,a) \nabla_{\theta} \ln{\pi_{\theta}(a|s)}]$, i.e. the advantage takes the place of $Q^{\pi}(s,a)$ in the policy gradient theorem.
Using our previously mentioned get_action_distribution
function (Figure 10), we recompute the normal distributions from which we sampled our performed actions in each respective state (Figure 17, Line 186). Inputting the recorded action back into the normal distribution's log probability density function returns us the relative log probability of having taken exactly this action under the current policy.
Our actor loss therefore consists of the log probability and the advantage, which is used as a baseline for the log probability here to reduce the variance of the policy gradients (A3C paper page 3). These gradients must still be computed from the actor loss, which we do by using the tensorflow gradient tape (Figure 16, Line 177).
Having finished computing the policy gradients, we now move on to the critic gradients (Figure 15, Line 163). For those, we calculate the Mean Squared Error between the discounted cumulative returns (Figure 14) and the state values, which are again obtained by pushing the observed states through the critic network (Figure 16, Line 172-175). As we did with the actor loss, the gradient of the critic loss is now computed with the help of the gradient tape (Figure 16, Line 177), and the policy and critic gradients are then returned to the coordinator (see below: Figure 13, Line 80).
Finally, we can let the Adam optimizers travel through the loss spaces roughly oriented towards the direction of greatest descent given by our gradients, i.e. applying the gradients to update our models (Figure 13, Line 81-82).
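Condensed into one function, the update step corresponds to something like the sketch below. We write it in the common "minimize the negative objective" form, and ENTROPY_COEF as well as the helper signature are illustrative assumptions:

```python
import tensorflow as tf
import tensorflow_probability as tfp

ENTROPY_COEF = 0.01  # assumed value; balances the amount of exploration

def update(actor, critic, actor_opt, critic_opt, memory):
    states = tf.constant(memory.states)
    actions = tf.constant(memory.actions)
    returns = tf.constant(memory.estimated_returns)

    with tf.GradientTape() as actor_tape, tf.GradientTape() as critic_tape:
        values = tf.squeeze(critic(states), axis=-1)
        advantage = returns - values  # baseline reduces the variance of the gradients

        mu, sigma = actor(states)
        dist = tfp.distributions.Normal(loc=mu, scale=sigma)
        log_prob = tf.reduce_sum(dist.log_prob(actions), axis=-1)

        # maximize E[A * log pi] plus the entropy bonus, i.e. minimize the negative
        actor_loss = (-tf.reduce_mean(log_prob * tf.stop_gradient(advantage))
                      - ENTROPY_COEF * tf.reduce_mean(dist.entropy()))
        critic_loss = tf.reduce_mean(tf.square(returns - values))  # MSE

    actor_opt.apply_gradients(zip(
        actor_tape.gradient(actor_loss, actor.trainable_variables),
        actor.trainable_variables))
    critic_opt.apply_gradients(zip(
        critic_tape.gradient(critic_loss, critic.trainable_variables),
        critic.trainable_variables))
```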
Our A2C algorithm has now run through one iteration and performed a single coordinated update of our two networks. All agents will now start their next set of steps in the environment with the same network parameters, which is the aforementioned specialty of the A2C algorithm compared to A3C.
As an alternative neural network for the policy of the actor, we implemented a recurrent one in order to utilize the advantages of propagating hidden states. By doing so, we relax the Markov assumption of the underlying decision process, because information about previous states is taken into account to compute the distribution over the actions. Our hypothesis was that information about the impact of previously taken actions on the state of the environment positively influences training speed and task performance.
To set up a recurrent network we used gated recurrent units (GRU). These recurrent cells can be seen as a variation of the Long short-term memory (LSTM) cell. Similarly, a GRU cell utilizes gates, namely an update gate $z_t = \sigma(w_z x_t + u_z h_{t-1})$ and a reset gate $r_t = \sigma(w_r x_t + u_r h_{t-1})$, with the candidate hidden state $h'_t$ and the new hidden state $h_t$ given by
$$ \begin{aligned} h'_t &= \text{tanh}(w_h x_t + u_h (r_t \odot h_{t-1})) \\ h_t &= z_t \odot h'_t + (1-z_t) \odot h_{t-1} \end{aligned} $$
The complete operations happening in one GRU cell are depicted in Figure 18 and the notations in Figure 19. For a more detailed review of the GRU, see the work by Chung et al. (2014)4.
In the implementation of a recurrent neural network for the policy we faced some serious issues. The main challenge was the handling of the hidden states during the collection of observations and during the update of the network. It is difficult to reset the hidden state when the next timestep included in one batch belongs to a new episode. That is the reason why we chose to implement a custom GRU cell with the Keras API: it allows us to provide a mask for the hidden state, which resets it at terminal timesteps. This mask just contains zeros for each terminal state and ones for every other state (Figure 20, Line 121 - 125).
The hidden states returned by the actor (Figure 4) are saved between steps of the agent on the environment and between updates. During the update, the shape of the batch is three dimensional. The first dimension represents the agents, the second dimension the timesteps and the last the state vector (Figure 21). We found no solution to reset the hidden states of terminal timesteps during the forward pass of multiple timesteps at once without implementing a custom GRU. The recurrent policy returns mu and sigma of the action distribution in the same manner as the feed-forward policy.
We provide the mask together with the input for the recurrent cell and change its shape to correspond to the shape of the cell's hidden state, which is defined by the number of units (Figure 22, Line 73 - 77). After that, we compute the values of the gates and the current hidden state as given by the GRU equations above.
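The core of such a mask-aware GRU cell could be sketched like this. It follows the GRU equations above and is deliberately simplified (no biases, not the standard Keras RNN-cell interface); the masking of the previous hidden state is the part that matters here:

```python
import tensorflow as tf

class MaskedGRUCell(tf.keras.layers.Layer):
    """GRU cell whose previous hidden state is zeroed at terminal timesteps."""

    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units

    def build(self, input_shape):
        dim = input_shape[-1]
        init = "glorot_uniform"
        self.w_z = self.add_weight(name="w_z", shape=(dim, self.units), initializer=init)
        self.u_z = self.add_weight(name="u_z", shape=(self.units, self.units), initializer=init)
        self.w_r = self.add_weight(name="w_r", shape=(dim, self.units), initializer=init)
        self.u_r = self.add_weight(name="u_r", shape=(self.units, self.units), initializer=init)
        self.w_h = self.add_weight(name="w_h", shape=(dim, self.units), initializer=init)
        self.u_h = self.add_weight(name="u_h", shape=(self.units, self.units), initializer=init)

    def call(self, x_t, h_prev, mask_t):
        # mask_t contains 0 for terminal timesteps and 1 otherwise,
        # broadcast over the hidden units so new episodes start from h = 0
        h_prev = h_prev * mask_t

        z_t = tf.sigmoid(tf.matmul(x_t, self.w_z) + tf.matmul(h_prev, self.u_z))
        r_t = tf.sigmoid(tf.matmul(x_t, self.w_r) + tf.matmul(h_prev, self.u_r))
        h_candidate = tf.tanh(tf.matmul(x_t, self.w_h) + tf.matmul(r_t * h_prev, self.u_h))
        return z_t * h_candidate + (1.0 - z_t) * h_prev
```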
Having read our implementation, we hope you are now eager to try it out!
We have included our trained models, which were saved using tf.train.Checkpoint. These can be tested by calling python3 main.py --test
, which will automatically load the trained models and render the environment. We have also added further arguments to the command line parser, which can be viewed using python3 main.py --help
. Most notably the policy network type can be changed here (--network_type "mlp"
or --network_type "gru"
). The number of agents for training can be changed as well, e.g. --num_agents 12
. To train the model, use --train
.
Note that this will create new checkpoints throughout the training, meaning that when running --test
afterwards it will load the new checkpoints and not our trained model anymore.
In this section we will present the results of our experiment. In particular, we will highlight the performance of the actor with different policy networks. The training runs depicted here were performed with the default parameters given in the code.
On the leaderboards for the OpenAI Gym environments, LunarLanderContinuous-v2 is defined as being "solved" when getting an average reward of 200 over 100 consecutive trials. Both our MLP and GRU networks were able to achieve this. To test this we built a function tracking the reward for each trial into our main.py --test
. After 100 trials, the program will end and output the average reward over those trials. This has been around 240 for our MLP network after having trained for 4684 episodes (summing the number of episodes each agent performed). Our GRU network has achieved a slightly lower average reward of 229, but it has reached this level of performance after having trained for only 2472 episodes.
Now we will take a closer look at the performance of the MLP policy during a successful training run.
In Figure 23 the cumulative return of one agent is plotted against the number of steps taken in parallel after each episode. At the beginning of the training run the variability and the number of datapoints are quite high. This higher data point density arises because the agent crashes the Lunar Lander after a small number of timesteps, whereas later on, when performance has improved, the Lunar Lander descends more slowly to the surface of the moon, resulting in fewer data points per timestep. With even greater performance, the agent learns to balance the tradeoff between a safe landing and fuel consumption, which incurs a negative reward. Thus, the agent tries to land safely with as little fuel as possible, resulting again in a quicker descent and therefore an increasing number of datapoints per step. The mean cumulative return converges to between 200 and 300; after about 120,000 timesteps it stops increasing notably, and at roughly 200,000 steps the policy starts to overfit. The overfitting after 200,000 timesteps causes the average cumulative return across 100 trials to be 209, which is 31 points lower than that of our model after 120,000 steps. In this case, taking a model from an earlier checkpoint (i.e. early stopping) improved performance.
The actor loss (Figure 24, 25), given by $\mathbb{E}_{\pi}[A_{w}^{\pi} \ln{\pi_{\theta}(a|s)}]$, oscillates around 0. While the log probability converges towards zero, the advantage estimates keep fluctuating around zero, so their product does as well.
Similarly, the critic loss (Figure 26) converges to zero, as is usual when using the Mean Squared Error as the loss function. Of interest here is that the return estimate depends on the policy. That is why the variability of the loss is so high: since the policy steadily changes, so does the value that the critic tries to estimate. It is therefore only possible for the critic loss to converge near zero once the policy converges and the actions performed in each state become similar.
In this section we will quickly highlight the performance of the GRU policy during training.
We were able to solve the environment with the GRU policy and even faster than with the MLP policy. Nevertheless, with our current configuration training is not stable. The performance of the agent decreases quickly after it starts converging and the cumulative return (Figure 26) drops far below zero.
The reason why the policy collapses might be that it changes fast and in large steps. Since the critic depends on the policy (return) and vice versa (advantage), too large steps might lead to a downward spiral in which the performance decreases rapidly because actor (Figure 27, 28) and critic (Figure 29) are not able to adapt quickly enough and correct for it. The policy loss even becomes NaN at the end because it drops below the smallest computationally representable value.
This section will cover our discussion of the project as a whole.
Looking back, we are proud to have trained a model using Reinforcement Learning and implemented the A2C algorithm with efficient parallelization. We could test our understanding of a RL-paper and even improve upon the algorithm using our GRU implementation. Working as a team and collaborating with git has been fun and effective.
For future projects we would put a greater emphasis on storing summary statistics of training runs more systematically to keep a record of how our model evolves over time. In addition, the implementation could still be improved and extended upon, e.g. by including experience replay to increase sample efficiency 5. Also, integrating trust region policy optimization from the TRPO algorithm might improve training stability by not changing the policy parameters too much in a single step 6. Furthermore, the GRU network could be replaced by an attention model, e.g. a transformer. Attention networks outperform recurrent networks in many fields, since they maintain utilization of past information by directing their "attention" towards particular subsets of the data. This also enables better parallelization, because the data/timesteps do not have to be forwarded sequentially 7.
Footnotes

1. Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction.
2. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., ... & Kavukcuoglu, K. (2016, June). Asynchronous methods for deep reinforcement learning. In International conference on machine learning (pp. 1928-1937).
3. Weng, Lilian. "Policy Gradient Algorithms". In lilianweng.github.io/lil-log. https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html (2018).
4. Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
5. Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., & de Freitas, N. (2016). Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224.
6. Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015, June). Trust region policy optimization. In International conference on machine learning (pp. 1889-1897).
7. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).