Rainbow: Combining Improvements in Deep Reinforcement Learning

This post describes our recent work on Deep Reinforcement Learning (DRL) on the Atari benchmark. DRL appears today as one of the closest paradigms to Artificial General Intelligence, and much of the progress in this field has been enabled by the popular Atari benchmark. However, training and evaluation protocols on Atari vary across papers, leading to biased comparisons and to difficulty in reproducing results and estimating the true contributions of each method. Here we attempt to mitigate this problem with SABER: a Standardized Atari BEnchmark for general Reinforcement learning algorithms. SABER allows us to compare multiple methods under the same conditions against a human baseline, and to observe that previous claims of superhuman performance of DRL do not hold. Finally, we propose a new state-of-the-art algorithm, R-IQN, combining Rainbow with Implicit Quantile Networks (IQN). Code is available.

Deep Reinforcement Learning is a learning scheme based on trial and error, in which an agent learns an optimal policy from its own experiments and a reward signal. The goal of the agent is to maximize the sum of accumulated rewards, so the agent needs to reason about sequences of actions rather than instantaneous ones. The Atari benchmark is valuable for evaluating general AI algorithms as it includes more than 50 games with highly variable tasks to solve, ranging from simple paddle control in the ball game Pong to complex labyrinth exploration in Montezuma's Revenge, which remains unsolved by general algorithms to this day.
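For reference, this objective is usually written as the expected (discounted) return; the standard textbook formulation below is only a reminder, and the discount factor $\gamma$ is an assumption since it is not discussed in this post:

```math
G_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1},
\qquad
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ G_0 \right]
```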

However, we noticed that training and evaluation procedures on Atari can differ from paper to paper, leading to biased comparisons. Moreover, this makes it difficult to reproduce the results of published works, as some training or evaluation parameters are barely explained or sometimes not mentioned at all. To allow more reproducible and comparable DRL, we introduce SABER: a Standardized Atari BEnchmark for general Reinforcement learning algorithms. Furthermore, we introduce a human world record baseline and argue that previous claims of superhuman performance of DRL might not be accurate. Finally, we propose a new state-of-the-art algorithm, R-IQN, by combining the current state-of-the-art Rainbow with Implicit Quantile Networks (IQN). Last but not least, we released an open-source implementation of distributed R-IQN following the ideas of the Ape-X paper!

DQN's human baseline vs human world record on Atari Games

A common way to evaluate AI for games is to let agents compete against the best humans. Recent examples in DRL include the victory of AlphaGo over Lee Sedol at Go, OpenAI Five at Dota 2, and AlphaStar versus MaNa at StarCraft 2. That is why one of the most widely used metrics for evaluating RL agents on Atari is the comparison to the human baseline introduced in DQN.

Previous works use the normalized human score, i.e. 0% is the score of a random player and 100% is the score of the human baseline, which makes it possible to summarize performance on the whole Atari set in one number instead of comparing raw scores individually for each of the 61 games. However, we argue that this human baseline is far from representative of the best human players, which means that using it to claim superhuman performance is misleading.

The current world records are available online for 58 of the 61 evaluated Atari games. On VideoPinball for example, the world record is 50,000 times higher than DQN's human baseline! Evaluating these world record scores with the usual human-normalized score gives a median of 4,400% and a mean of 99,300% (see the figure below for details on each game), to be compared with the 200% and 800% of the current state-of-the-art Rainbow!

[Figure: human-normalized score of the world record for each Atari game]
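As a concrete illustration, here is a minimal sketch of the normalized-score computation used above; the per-game numbers are made up for illustration and are not the actual reported baselines or records:

```python
def normalized_score(agent_score, random_score, reference_score):
    """Score normalized so that 0% = random play and 100% = the reference baseline."""
    return 100.0 * (agent_score - random_score) / (reference_score - random_score)

# Illustrative, made-up numbers for one game (not actual reported scores):
random_score = 200.0
dqn_human_baseline = 15_000.0
world_record = 900_000.0
agent_score = 30_000.0

print(normalized_score(agent_score, random_score, dqn_human_baseline))  # ~201%: looks "superhuman"
print(normalized_score(agent_score, random_score, world_record))        # ~3.3%: far from the world record
```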

We think that evaluating algorithms against the world record baseline instead of the DQN human baseline gives a better view of the gap remaining between the best human players and DRL agents!
And the results are clear: Rainbow reaches a median human-normalized score of only 3% (see Figure 2 below), meaning that on half of the Atari games the agent does not even cover 3% of the way from random play to the best human run ever!

We made a video with agents previously claimed to be above human level but still far from the world record. By taking a closer look at the AI playing, we discovered that on most Atari games DRL agents fail to understand the goal of the game. Sometimes they simply never explore levels beyond the initial one; sometimes they are stuck in a loop yielding a small amount of reward, etc. A compelling example is the game Riverraid. In this game, the agent must shoot everything and collect fuel to survive: the agent dies if it collides with an enemy or runs out of fuel. But since shooting fuel tanks also gives points, the agent never understands that it could play much longer and score far more points by collecting the fuel instead of shooting it!

SABER: a Standardized Atari BEnchmark for general Reinforcement learning algorithms

In our paper we extended the recommendations introduced in Revisiting the ALE. The authors of that work already pointed out divergences in how agents are trained and evaluated on Atari. They proposed a set of recommendations for more reproducible and comparable RL, including sticky actions, ignoring the life signal and using the full action set.
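For readers who want to reproduce these settings, the sketch below uses the ale-py / Gymnasium interface; this is an illustrative assumption, not the exact wrapper stack used in this repository:

```python
import gymnasium as gym  # assumes ale-py is installed and its environments are registered

# Sticky actions (25% chance of repeating the previous action) and the full
# 18-action set, identical for every game.
env = gym.make(
    "ALE/MsPacman-v5",               # any Atari game registered by ale-py
    repeat_action_probability=0.25,  # sticky actions
    full_action_space=True,          # full action set
    frameskip=4,
)
# "Ignoring the life signal" simply means not wrapping the env with an
# episodic-life wrapper: only the true game-over signal ends an episode.
```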

One major parameter is left out of that work: the maximum number of frames allowed per episode. This parameter ends the episode after a fixed number of time steps even if the game is not over. In most recent works it is set to 30 minutes of game play (108k frames), but in some others to only 5 minutes (18k frames). This means that the reported scores cannot be compared fairly. For example, in easy games (e.g. Atlantis), the agent never dies and the score is roughly linear in the allowed time: the reported score will be 6 times higher when capped at 30 minutes instead of 5 minutes.
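To make the two conventions concrete, here is a minimal sketch using Gymnasium's generic TimeLimit wrapper; this is an assumption for illustration, not necessarily how the compared papers implement the cap (and the registered envs may already carry a default limit):

```python
import gymnasium as gym
from gymnasium.wrappers import TimeLimit

# 108k raw frames = 30 min of game play at 60 fps; with a frameskip of 4 that is
# 27,000 agent steps. 18k raw frames = 5 min, i.e. 4,500 agent steps.
env_30min = TimeLimit(gym.make("ALE/Atlantis-v5", frameskip=4), max_episode_steps=108_000 // 4)
env_5min = TimeLimit(gym.make("ALE/Atlantis-v5", frameskip=4), max_episode_steps=18_000 // 4)
```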

Another issue with this time cap comes from the fact that some games are designed for much longer gameplay than 5 or 30 minutes. On those games (e.g. Atlantis, Video Pinball, Enduro) the scores reported for Ape-X, Rainbow and IQN are almost exactly the same. This is because all agents reach the time limit and obtain the maximum score possible in 30 minutes: the differences in score are due to minor variations, not algorithmic differences, and thus the comparison is not meaningful. As a consequence, the more successful agents become, the more games become incomparable, because agents reach the maximum score possible within the time cap while still being far behind the human world record.

This parameter can also be a source of ambiguity and error. The best score on Atlantis (2,311,815) is reported by Proximal Policy Optimization by Schulman et al., but this score is almost certainly wrong: it seems impossible to reach in only 30 minutes! The first distributional paper by Bellemare et al. made the same mistake and reported wrong results before adding an erratum in a later version on arXiv.

Moreover, the human high scores on Atari games were achieved over several hours of play and would have been unreachable within a 30-minute limit, making it impossible for agents to reach superhuman performance. That is why we advocate using episodes of unbounded length, terminating only when the game is over or the agent is stuck. This leads to the SABER benchmark, a set of training and evaluation procedures allowing fair comparison and reproducibility, whose parameters are summarized below.

[Table: summary of SABER training and evaluation parameters]
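The precise "agent is stuck" criterion is defined in the paper; the sketch below shows only one simple possibility (ending the episode after a fixed number of steps without reward), as an illustration of how unbounded episodes can still terminate in practice:

```python
import gymnasium as gym

class TerminateWhenStuck(gym.Wrapper):
    """Illustrative wrapper: end the episode if no reward is collected for
    `patience` consecutive agent steps (one possible notion of "stuck")."""

    def __init__(self, env, patience=18_000 // 4):  # e.g. 5 minutes of game play
        super().__init__(env)
        self.patience = patience
        self.steps_without_reward = 0

    def reset(self, **kwargs):
        self.steps_without_reward = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.steps_without_reward = 0 if reward != 0 else self.steps_without_reward + 1
        if self.steps_without_reward >= self.patience:
            truncated = True  # stop the rollout without treating it as a game over
        return obs, reward, terminated, truncated, info
```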

Rainbow-IQN: Comparison with Rainbow on SABER

We first re-implemented the current state-of-the-art Rainbow and evaluated it on SABER. We found that the same algorithm evaluated under different settings can lead to significantly different results, which again shows the need for a common, standardized benchmark; more details can be found in our paper.

Then we implemented a new algorithm, R-IQN, by replacing the C51 algorithm (one of the 6 components of Rainbow) with Implicit Quantile Networks (IQN). Both C51 and IQN belong to the field of Distributional RL, which aims to predict the full distribution of the Q-value function instead of just its mean. The fundamental difference between these two algorithms is how they parametrize the Q-value distribution. C51, the first Distributional RL algorithm, approximates the Q-value as a categorical distribution with a fixed support and only learns the mass to attribute to each category. IQN, on the other hand, approximates the Q-value with quantile regression, where both the support and the mass are learned, resulting in a major improvement in performance and data-efficiency over C51. Since IQN performs much better than C51 while being designed for the same goal (predicting the full distribution of the Q-function), combining Rainbow with IQN is relevant and natural.
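For readers unfamiliar with IQN's parametrization, here is a simplified PyTorch sketch of its core building block (not the exact code of this repository): quantile fractions tau are sampled, embedded through a cosine basis, and merged with the state features by an element-wise product, so that a downstream head can predict one Q-value estimate per sampled quantile.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantileEmbedding(nn.Module):
    """Sketch of IQN's quantile embedding (Dabney et al., 2018)."""

    def __init__(self, feature_dim, n_cos=64):
        super().__init__()
        self.n_cos = n_cos
        self.proj = nn.Linear(n_cos, feature_dim)

    def forward(self, state_features, n_tau=8):
        batch = state_features.size(0)
        # Sample quantile fractions tau ~ U(0, 1): the support is learned implicitly.
        tau = torch.rand(batch, n_tau, 1, device=state_features.device)
        i = torch.arange(1, self.n_cos + 1, device=state_features.device).float()
        cos = torch.cos(math.pi * i * tau)              # (batch, n_tau, n_cos)
        phi = F.relu(self.proj(cos))                    # (batch, n_tau, feature_dim)
        # Each sampled quantile gets its own copy of the state features, modulated by phi(tau).
        return state_features.unsqueeze(1) * phi, tau.squeeze(-1)
```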

As shown in the graph below, R-IQN performs better than Rainbow and thus becomes the new state of the art on Atari. However, to make a more confident SOTA claim, it should be run multiple times with different seeds.

[Figure: performance comparison of Rainbow and Rainbow-IQN on SABER]

Open-source implementation of distributed Rainbow-IQN: R-IQN Ape-X

We released the code of a distributed version of Rainbow-IQN following ideas from the paper Ape-X: Distributed Prioritized Experience Replay.
The distributed part is our main practical advantage over some existing DRL repositories (particularly Dopamine, a popular open-source implementation of DQN, C51, IQN and a smaller Rainbow, in which all algorithms are single-worker).
Indeed, when using DRL algorithms for tasks other than Atari and MuJoCo, a major bottleneck is the speed of the environment. DRL algorithms often need a huge amount of data before reaching reasonable performance, and this amount may be practically impossible to collect if the environment runs in real time and data cannot be gathered from multiple environments at once.
Following the ideas of the Ape-X paper, we use Redis as a side server to store the replay memory. Multiple actors act in their own instances of the environment to fill the replay memory as fast as they can. The learner samples from this replay memory (the learner is completely independent of the environment) to perform the backpropagation. The learner also periodically updates the weights of each actor, as shown in the schema below (image taken from the original Ape-X paper).

[Schema: Ape-X distributed architecture with actors, shared replay memory and learner (image from the original Ape-X paper)]
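The sketch below shows the idea in its simplest form, using the redis-py client with uniform (non-prioritized) sampling and a made-up key name; the actual implementation uses prioritized replay and more careful serialization, so treat this only as an outline of the actor/learner split:

```python
import pickle
import random
import redis  # redis-py client; assumes a Redis server is running locally

r = redis.Redis(host="localhost", port=6379)
REPLAY_KEY = "replay_memory"  # illustrative key name

# --- Actor side: act in its own environment instance and push transitions. ---
def push_transition(state, action, reward, next_state, done):
    r.rpush(REPLAY_KEY, pickle.dumps((state, action, reward, next_state, done)))

# --- Learner side: sample transitions independently of any environment. ---
def sample_batch(batch_size=32):
    size = r.llen(REPLAY_KEY)  # assumes the buffer already holds enough transitions
    indices = [random.randrange(size) for _ in range(batch_size)]
    return [pickle.loads(r.lindex(REPLAY_KEY, i)) for i in indices]

# --- Weight synchronisation: the learner publishes, the actors poll. ---
def publish_weights(state_dict):
    r.set("learner_weights", pickle.dumps(state_dict))

def fetch_weights():
    blob = r.get("learner_weights")
    return pickle.loads(blob) if blob is not None else None
```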

This scheme allowed us to use R-IQN Ape-X for the task of autonomous driving with CARLA as the environment. It led us to win the CARLA Challenge on Track 2 "Cameras Only", showing the strength of R-IQN Ape-X as a general algorithm.
