The aim of my work is to improve the reinforcement learning agent from [1], which achieves state-of-the-art results in learning sensorimotor control from raw sensory input in complex and dynamic three-dimensional environments, directly from experience. The improvement consists in providing the agent with additional information, such as a semantic segmentation or a depth map of the raw image.
The experiments are conducted on the FPS game Doom through ViZDoom, an AI research platform for reinforcement learning.
An agent trained to play ViZDoom on a particular environment, helped by segmentation and depth detection.
Before describing the core of the work, I have to acknowledge [3]. That paper describes similar experiments, and its main ideas inspired me and helped me understand the problem and how to approach it.
In this section, we briefly present the approach taken by [1] to the problem of learning sensorimotor control.
The usual paradigm in RL consists of an agent (e.g., a player) interacting with an environment (e.g., a Doom map). The agent can perform a known set of actions (e.g., shoot, go left, go right, ...). The interaction between the agent and the environment is materialized by a (stochastic) reward (e.g., +1 for each new time step the agent survives) and an observation (e.g., the raw image of the game), which is used to choose the next action.
The goal of this approach is to maximize the (expected) sum of future rewards.
This approach may not be well suited to learning sensorimotor control from raw sensory input in three-dimensional environments: scalar rewards cannot fully describe multi-dimensional environments. Instead, we focus on the measurements provided to the agent (e.g., the number of frags and the health).
The approach developed in [1] is supervised learning of future measurements, referred to below as DFP (direct future prediction). The goal is to predict the measurements available to the agent. For this purpose, it is assumed that the reward can be expressed as a (linear) function of the measurements. In this setting, the role of the agent is to predict, from the raw input and the present measurements, the future measurements (at multiple future time steps) implied by each possible action. From each predicted set of future measurements, the expected reward is deduced.
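Schematically, in the notation of [1] (up to minor differences), with $m_t$ the measurement vector at time $t$ and $\mathbf{g}$ a goal vector:

$$ u(\mathbf{f};\mathbf{g}) = \mathbf{g}^{\top}\mathbf{f}, \qquad \mathbf{f} = \big(m_{t+\tau_1}-m_t,\ \ldots,\ m_{t+\tau_n}-m_t\big), $$

where $\tau_1,\dots,\tau_n$ are the future temporal offsets; the agent then selects the action whose predicted $\mathbf{f}$ maximizes $\mathbf{g}^{\top}\mathbf{f}$.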
Ultimately, the main goal of this approach is to learn a predictor function, i.e. a function able to predict future measurements from the raw input and the present measurements for each possible action. This function is approximated by a neural network.
See [1] for the details of the neural network, the results, and comparisons to other methods (e.g., DQN and A3C).
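As a rough illustration (this is not the actual code of [1]; `predictor` and its interface are placeholders), greedy action selection with such a predictor could look like:

```python
import numpy as np

def choose_action(predictor, image, measurements, goal):
    """Greedy action choice for a DFP-style agent (illustrative sketch).

    `predictor` is assumed to return one vector of predicted future
    measurement changes per candidate action, with shape (n_actions, dim).
    """
    predictions = predictor(image, measurements, goal)  # (n_actions, dim)
    # The expected reward of each action is a linear function of its
    # predicted future measurement changes, weighted by the goal vector.
    scores = predictions @ goal                          # (n_actions,)
    return int(np.argmax(scores))
```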
In this section, we introduce visual tools which will provide intermediate representations to the DFP agent in order to help it choose the best actions.
The main focus of my project is to find out whether providing intermediate representations to the DFP agent helps it learn better policies. For this purpose we focus on semantic segmentation and depth detection.
Semantic segmentation is the partition of an image into coherent parts, for example classifying each pixel as belonging to a person, a car, a tree, or any other entity in a dataset.
Here is an example of semantic segmentation provided by the ViZDoom environment:
Depth can be defined as the distance from the camera for each pixel in the image frame.
Here is a color map of depth provided by the ViZDoom environment.
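Both intermediate representations above come directly from ViZDoom, which can expose ground-truth depth and label buffers. A minimal sketch of how they can be enabled (the config file name is a placeholder; buffer names follow the ViZDoom Python API as I understand it):

```python
import vizdoom as vzd

game = vzd.DoomGame()
game.load_config("D3_battle.cfg")       # config name assumed for illustration
game.set_depth_buffer_enabled(True)     # per-pixel depth map
game.set_labels_buffer_enabled(True)    # per-pixel object labels (segmentation)
game.init()

state = game.get_state()
rgb   = state.screen_buffer    # raw frame
depth = state.depth_buffer     # ground-truth depth
seg   = state.labels_buffer    # ground-truth label map
```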
U-Net [2]: an encoder-decoder architecture to predict intermediate representations
To automate segmentation and depth detection of frames, we train a neural network. As in [3], the architecture I use is U-Net [2]. The architecture contains two paths: the encoder, which captures the context in the image, and the decoder, which enables precise localization. In the original U-Net, the output tensor has a smaller height and width than the input image, but here the output must have the same dimensions as the input so that it can be concatenated with the raw image and provided to the DFP agent. I therefore modified the original architecture accordingly.
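To make this concrete, here is a minimal U-Net-style encoder-decoder in PyTorch with padded convolutions so that the output keeps the input resolution. It is a simplified sketch of the kind of modification described above, not the exact network I trained:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with padding=1, so the spatial size is preserved.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class SmallUNet(nn.Module):
    """Simplified U-Net: the encoder captures context, the decoder restores resolution."""
    def __init__(self, in_ch=3, out_ch=6):   # e.g. 6 output channels for segmentation
        super().__init__()
        self.enc1 = conv_block(in_ch, 32)
        self.enc2 = conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(64, 128)
        self.up2  = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = conv_block(128, 64)
        self.up1  = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(64, 32)
        self.head = nn.Conv2d(32, out_ch, 1)  # per-pixel scores (1 channel for depth)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b  = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.head(d1)  # same height and width as the input
```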
In this subsection, I present qualitative results of U-Net on the segmentation and depth detection tasks. For these tests, very few training images (2,000 examples) and very few epochs (4) were used. This illustrates the efficiency of the model, since excellent results can already be observed. The optimizer used is stochastic gradient descent, and the implementation is done in PyTorch.
For semantic segmentation, the goal of the model is to predict the category of each pixel. The categories considered in this project are: WALL, CEILING, FLOOR, ENEMY, ITEM, OTHER. For each category and each pixel, the probability that the pixel belongs to the category is predicted, and the cross-entropy loss is minimized.
Input | Prediction | Ground Truth
For depth detection, the goal is to predict the distance from the Doom player to the object represented by each pixel. The mean squared error is therefore minimized.
Input | Prediction | Ground Truth
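A minimal sketch of the two training objectives, assuming the `SmallUNet` sketch above and a `loader` yielding (image, label_map, depth_map) batches (both names are placeholders):

```python
import torch
import torch.nn as nn

seg_net   = SmallUNet(in_ch=3, out_ch=6)   # one output channel per category
depth_net = SmallUNet(in_ch=3, out_ch=1)   # a single depth channel

seg_loss_fn   = nn.CrossEntropyLoss()      # class scores vs. integer label map
depth_loss_fn = nn.MSELoss()

opt = torch.optim.SGD(list(seg_net.parameters()) + list(depth_net.parameters()), lr=0.01)

for epoch in range(4):                              # only 4 epochs, as in the report
    for image, label_map, depth_map in loader:      # label_map: (B, H, W) long
        seg_loss   = seg_loss_fn(seg_net(image), label_map)
        depth_loss = depth_loss_fn(depth_net(image), depth_map)  # depth_map: (B, 1, H, W)
        opt.zero_grad()
        (seg_loss + depth_loss).backward()
        opt.step()
```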
Eventually, based on the U-Net architecture and the DFP agent, we build a model that takes as input the raw image, the present measurements, and the goal, predicts the future measurements, and takes an action, exactly like the DFP agent. The main difference is an intermediate step, composed of a segmentation predictor and a depth predictor, which takes the raw image as input and outputs its segmentation and depth representations. These three images are concatenated and provided to the DFP agent.
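The combination step itself is a simple channel-wise concatenation; roughly (shapes are illustrative):

```python
import numpy as np

def build_agent_input(raw, seg, depth):
    """Stack the raw frame, segmentation maps and depth map along the channel axis.

    raw: (H, W, 3), seg: (H, W, 6), depth: (H, W, 1) -- shapes are illustrative.
    """
    return np.concatenate([raw, seg, depth], axis=-1)   # (H, W, 10)
```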
- python3.6
- tensorflow==2.0.0
The authors of [1] used TensorFlow 1, but for CUDA compatibility reasons I use TensorFlow 2 and replace the following line in the code:
import tensorflow as tf
by
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
- pytorch==1.3.1
- scikit-video==1.1.11
- numpy==1.17.4
To run the code:
$ cd examples/D3_battle
You can replace `D3_battle` with any other folder in `examples` (other configurations; see [1] for more details). To run without any intermediate representations:
$ python run_exp.py
To run with predicted intermediate representations:
$ python run_exp.py --vision 1
To run with ground-truth intermediate representations:
$ python run_exp.py --vision 1 --groundtruth 1
I reuse the code provided with [1]: I retrieve the DFP agent and modify it to integrate the computer vision models and the intermediate representations. The architecture of the code is mainly composed of the DFP agent interacting with a memory. The memory stores a certain number of states together with the ground-truth future measurements; it feeds the agent, which learns the policy. Conversely, the agent takes actions according to an epsilon-greedy policy (with epsilon decreasing over time) in order to provide new states to the memory.
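Schematically, one interaction step looks like the following sketch (function and method names are mine, not those of the original code):

```python
import random

def interaction_step(game, agent, memory, epsilon, n_actions):
    """One environment step: epsilon-greedy action, then store the transition."""
    state = build_state(game)                  # raw image (+ vision channels) and measurements
    if random.random() < epsilon:
        action = random.randrange(n_actions)   # explore
    else:
        action = agent.best_action(state)      # exploit the current predictor
    game.make_action(action_to_buttons(action))
    # The memory later pairs this state/action with the measurements actually
    # observed at future time steps -- the ground truth the predictor learns from.
    memory.add(state, action)
```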
As the agent learns and its policy improves, it is likely to discover new states, and the distribution of states may change considerably over time. This is a problem for the trained computer vision models: if they are trained only at the beginning, they will very likely be wrong on never-seen states. This is known as "covariate shift" in behavioral cloning. To handle this problem, I interrupt the learning of the agent at regular intervals so that the depth model and the segmentation model can be retrained on the new states.
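In pseudocode, the overall training schedule alternates between the agent and the vision models; the update count below matches the parameter table (10 vision updates over 50,000 iterations), while the helper functions are placeholders:

```python
TOTAL_ITERATIONS    = 50000
VISION_UPDATES      = 10
VISION_UPDATE_EVERY = TOTAL_ITERATIONS // VISION_UPDATES

for it in range(TOTAL_ITERATIONS):
    train_agent_one_minibatch()                    # usual DFP update
    if (it + 1) % VISION_UPDATE_EVERY == 0:
        # Retrain U-Net on states collected by the *current* policy, so the
        # vision models follow the shifting state distribution.
        retrain_vision_models(recent_states_from_memory())
```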
In this section, conditions and parameters of the experiments are described. The experiments were conducted on the Doom environment D3 from [1].
The agent has a first-person view of the environment and acts on the raw image that is shown to human players (potentially augmented with the visual tools). In this scenario, the goal of the player is to defend against attacking enemies. Health and ammunition can be gathered. The measurements available to the agent are: health, frags, and ammunition. The possible actions are: move forward, move backward, turn left, turn right, strafe left, strafe right, run, and shoot. Any combination of these actions is also an action.
In the original paper, the agent is trained over 800,000 mini-batch iterations. For resource reasons, I reduce training to 50,000 mini-batch iterations. To collect examples, the agent runs episodes and chooses actions following an epsilon-greedy policy, with epsilon decreasing with the training step. To speed up convergence, I make epsilon decrease faster, so that exploration is reduced. The resulting policy is therefore certainly sub-optimal. Nevertheless, all experiments are conducted under the exact same conditions, so this is a fair basis to evaluate the improvements.
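For illustration, an exploration schedule of the kind described could be written as follows (the constants are placeholders, not the values actually used):

```python
def epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=20000):
    """Linearly anneal exploration from eps_start to eps_end over decay_steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```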
Three different agents are trained under the conditions described above. The first one is trained without any intermediate representations. The second one is trained with ground-truth segmentation and depth. The last one is trained with U-Net predictions of segmentation and depth.
The agent is evaluated multiple times during training: every 1,000 mini-batch iterations, the learned policy (i.e., the predictor function combined with the greedy action choice) is tested over 20,000 steps (a step corresponds to one action). The mean measurements and rewards over these test episodes are then reported.
The following table summarizes the parameters of the experiments :
Parameter | Value |
---|---|
Future steps predicted | [1, 2, 4, 8, 16, 32] |
Measurements to predict | frags, health, ammo |
Resolution of the raw input | (160 px, 120 px) |
Training memory capacity | 20,000 |
New steps per iteration | 64 |
Batch size | 64 |
Policy tested every | 1,000 mini-batch iterations |
Mini-batch iterations | 50,000 |
Test number of steps | 20,000 |
Vision model updates (if used) | 10 |
Several comments can be made from the figures above. Comparing the reward and frag curves with and without intermediate representations, we notice several characteristics suggesting that the initial model is improved thanks to depth detection and segmentation:
- A better optimum is reached in both reward and frags when using vision.
- Learning is faster : the optimum is reached sooner when using vision.
- Learning is more stable : we observe less oscillation when using vision. Learning is smoother.
The following table summarizes the quantitative results of the experiments :
Agent | Frags |
---|---|
No vision | Max: 10.5; Mean: 6.75 |
Vision and ground truth | Max: 12; Mean: 8.20 |
Vision and predictions | Max: 13.20; Mean: 7.82 |
- Computer vision does improve learning sensorimotor control from raw sensory input in complex and dynamic three-dimensional environments. Indeed, we have observed that these intermediate representations have multiple benefits for the agent: in particular, better and more stable learning.
- The agent learns significantly more slowly when using vision (one iteration takes more time).
- It is also slower to make decisions: this might be a problem when the agent is used at real-time speed.
- Improve the vision models: for instance, try to fine-tune pre-trained models.
- Test the generalisation power of this model: try other maps (change the textures, for instance).
- Apply to other tasks : end-to-end learning for autonomous driving or navigation in indoor environments.
[1] A. Dosovitskiy and V. Koltun, Learning to Act by Predicting the Future, International Conference on Learning Representations (ICLR), 2017.
[2] O. Ronneberger, P. Fischer, and T. Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation, MICCAI, 2015.
[3] B. Zhou, P. Krähenbühl, and V. Koltun, Does Computer Vision Matter for Action?, Science Robotics, 2019.