Tejas D. Kulkarni, Karthik R. Narasimhan, Ardavan Saeedi, Joshua B. Tenenbaum
-
Hierarchical-DQN (h-DQN): framework for integrating hierarchical value functions with intrinsically motivated deep reinforcement learning.
-
The agent is motivated to solve intrinsic goals (by learning skills), which improves exploration and helps alleviate the sparse-feedback problem. Moreover, goals can significantly constrain the exploration space.
-
Learning modules work at different time-scales. There are two levels of hierarchy (sketched in code after this list):
- Meta controller (top level module): takes in the state and picks a new goal.
- Controller (low level module): uses both the state and the provided goal to select primitive actions until either the goal is reached or the episode terminates.
- Critic: an internal critic is responsible for evaluating whether the goal has been reached; it gives the controller an appropriate (intrinsic) reward based on its actions.
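The interplay between these modules can be sketched roughly as follows. This is a minimal sketch, assuming a gym-style `env` and illustrative `meta_controller`, `controller`, and `critic` objects; none of these names come from the authors' code.

```python
# Illustrative sketch of the h-DQN control loop (not the authors' implementation).
def run_episode(env, meta_controller, controller, critic):
    state = env.reset()
    done = False
    while not done:
        start_state = state
        # Meta controller: pick a new goal from the current state (slow time-scale).
        goal = meta_controller.select_goal(state)
        extrinsic_return = 0.0
        goal_reached = False
        # Controller: take primitive actions until the goal is reached
        # or the episode terminates (fast time-scale).
        while not (done or goal_reached):
            action = controller.select_action(state, goal)
            next_state, f, done, _ = env.step(action)
            # Critic: check whether the goal has been reached and hand out
            # the intrinsic reward accordingly.
            goal_reached = critic.goal_reached(next_state, goal)
            r = 1.0 if goal_reached else 0.0
            controller.store((state, action, goal, r, next_state))
            extrinsic_return += f          # accumulate external reward for the meta level
            state = next_state
        # One meta-level transition spans all controller steps taken above.
        meta_controller.store((start_state, goal, extrinsic_return, state))
```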
-
Objective functions (restated in symbols after this list):
- Meta controller: to maximize the cumulative extrinsic reward received from the environment.
- Controller: to maximize the cumulative intrinsic reward provided by the critic.
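In symbols (my own hedged restatement, with gamma the discount factor, f the extrinsic and r the intrinsic reward):

```latex
% Meta controller: maximize the cumulative extrinsic reward
F_t = \sum_{t' \geq t} \gamma^{t'-t} f_{t'}
% Controller: maximize the cumulative intrinsic reward for the current goal g
R_t(g) = \sum_{t' \geq t} \gamma^{t'-t} r_{t'}(g)
```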
-
The authors use the DQN framework to learn policies for both modules, Q1 and Q2. The main difference, apart from the reward, lies in the transitions generated by the two policies (sketched in code after this list):
- Meta controller (Q2): transition = (s_t, g_t, f_t, s_{t+N}), where f represents the external reward and N denotes the number of time steps the controller takes to reach the current goal. These transitions run at a slower time-scale.
- Controller (Q1): transition = (s_t, a_t, g_t, r_t, s_{t+1}).
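A rough sketch of how the two experience memories and their one-step Q-learning targets might be organized. The D1/D2 naming echoes the paper's two disjoint replay memories; everything else (buffer sizes, the q1/q2 interfaces) is my assumption.

```python
import random
from collections import deque

# Two replay memories, one per time-scale.  Sizes are arbitrary.
D1 = deque(maxlen=100_000)  # controller transitions      (s_t, a_t, g_t, r_t, s_{t+1})
D2 = deque(maxlen=10_000)   # meta-controller transitions (s_t, g_t, f_t, s_{t+N})

def controller_targets(batch, q1, gamma=0.99):
    """Target for Q1: intrinsic reward + discounted max over primitive actions.
    `q1(s, g)` is assumed to return a list of Q-values, one per action."""
    return [r + gamma * max(q1(s_next, g)) for (s, a, g, r, s_next) in batch]

def meta_targets(batch, q2, gamma=0.99):
    """Target for Q2: extrinsic reward accumulated over the N controller steps
    + discounted max over the next goal.  `q2(s)` returns one value per goal."""
    return [f + gamma * max(q2(s_next)) for (s, g, f, s_next) in batch]

# Each network is trained on mini-batches drawn from its own memory, e.g.:
# batch1 = random.sample(D1, 32); batch2 = random.sample(D2, 32)
```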
I am going to focus on the experiments they performed on the Atari game Montezuma's Revenge. This environment is challenging because of its sparse, delayed rewards.
-
In their setup, the authors built a custom object detector that provides plausible object candidates, which are then used as goals for the Controller. The detector can be seen as a "hard-coded" custom module that provides the locations of predefined objects in the image, such as doors, ladders and keys.
-
One of these objects is chosen as a goal by the Meta controller and passed on to the Controller in the form of a binary mask of the goal location in image space.
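As an illustration, turning a detected object's bounding box into such a mask might look like the snippet below; the detector interface, box format and array shapes are assumptions of mine.

```python
import numpy as np

def goal_mask(frame_shape, bbox):
    """Binary mask in image space marking the chosen goal object.

    frame_shape: (height, width) of the game frame.
    bbox: (x, y, w, h) bounding box returned by the hand-crafted detector.
    """
    mask = np.zeros(frame_shape, dtype=np.float32)
    x, y, w, h = bbox
    mask[y:y + h, x:x + w] = 1.0
    return mask

# The mask can then be stacked with the frame as an extra input channel
# for the controller, e.g.:
# controller_input = np.concatenate([frame, goal_mask(frame.shape[:2], bbox)[..., None]], axis=-1)
```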
-
The internal critic is defined in the space <entity1, relation, entity2>. It essentially verifies whether entity1 has reached entity2. One of these entities is provided in the form of a goal (chosen by the Meta controller), while the other corresponds to the agent itself (moved by the Controller).
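A toy version of such a check, assuming the detector yields pixel bounding boxes for both entities; the overlap test and tolerance are my simplification, not something the paper specifies.

```python
def reached(agent_box, goal_box, tolerance=2):
    """Critic check for <agent, reached, goal-object>: returns the intrinsic
    reward (1 if the two bounding boxes overlap, 0 otherwise).
    Boxes are (x, y, w, h) in pixels."""
    ax, ay, aw, ah = agent_box
    gx, gy, gw, gh = goal_box
    overlap_x = ax < gx + gw + tolerance and gx < ax + aw + tolerance
    overlap_y = ay < gy + gh + tolerance and gy < ay + ah + tolerance
    return 1.0 if (overlap_x and overlap_y) else 0.0
```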
-
h-DQN easily outperforms standard DQN approaches in the tested environments. The model gradually learns to select proper goals and is thus able to solve the first stage of Montezuma's Revenge.
-
In order to scale up and solve the entire game, the system should be able to automatically identify or discover objects in the environment (new goals).
-
Memory is required for the model to remember the sequence of actions/goals it has performed.
-
An additional hierarchical level could be introduced below the current Controller in order to make use of skills when trying to reach the current goal.