Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, Koray Kavukcuoglu
-
This paper introduces FeUdal Networks (FuN).
-
Two levels of hierarchy:
- Manager (top-level module): sets goals at a lower temporal resolution. It receives its learning signal from the environment alone.
- Worker (low-level module): operates at a higher temporal resolution and produces primitive actions, conditioned on the goals received from the Manager. It is motivated by an intrinsic reward.
-
The figure above represents the architecture of the FuN model. Both modules share a perceptual module which takes the current observation $x_t$ and outputs an intermediate representation $z_t$.
-
Both modules are recurrent.
-
$\phi$ represents a linear transform mapping the goal $g_t$ into an embedding vector $w_t$.
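To make the wiring concrete, below is a minimal sketch (in PyTorch) of how these pieces could fit together. It is not the authors' implementation: the dimensions, the use of plain LSTM cells (the paper uses a dilated LSTM for the Manager), emitting the goal directly from the Manager's hidden state, and feeding only the current goal to $\phi$ (the paper pools the goals of the last $c$ steps) are simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn

class FuNSketch(nn.Module):
    """Simplified FuN skeleton: shared perception, recurrent Manager and Worker."""

    def __init__(self, obs_dim, d=256, k=16, n_actions=6):
        super().__init__()
        self.k, self.n_actions = k, n_actions
        self.percept = nn.Linear(obs_dim, d)               # z_t = f_percept(x_t), shared
        self.manager_rnn = nn.LSTMCell(d, d)               # recurrent Manager (paper: dilated LSTM)
        self.worker_rnn = nn.LSTMCell(d, n_actions * k)    # recurrent Worker
        self.phi = nn.Linear(d, k, bias=False)             # phi: goal g_t -> embedding w_t

    def forward(self, x_t, manager_state=None, worker_state=None):
        z_t = torch.relu(self.percept(x_t))                # shared intermediate representation
        # Manager: emit a unit-norm goal direction g_t in its latent space.
        m_h, m_c = self.manager_rnn(z_t, manager_state)
        g_t = m_h / (m_h.norm(dim=-1, keepdim=True) + 1e-8)
        w_t = self.phi(g_t)                                # k-dimensional goal embedding
        # Worker: per-action embeddings, modulated by the goal embedding w_t.
        w_h, w_c = self.worker_rnn(z_t, worker_state)
        U_t = w_h.view(-1, self.n_actions, self.k)
        logits = torch.bmm(U_t, w_t.unsqueeze(-1)).squeeze(-1)  # action scores; softmax -> policy
        return logits, g_t, (m_h, m_c), (w_h, w_c)
```

The structural point is that the goal reaches the Worker only through the low-dimensional embedding $w_t$.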
FuN is fully differentiable, so it could in principle be trained end-to-end with a policy gradient algorithm operating on the actions taken by the Worker, propagating the gradients from the Worker back to the Manager. The authors argue, however, that this standard approach would deprive the Manager's goals $g$ of any semantic meaning, reducing them to internal latent variables of the model. They therefore propose a different training setup:
- Manager: trained to predict advantageous directions in the latent state space and to intrinsically reward the Worker for following these directions (see the sketch after this list). Update rule:
$\nabla g_t = A^M_t \nabla_\theta d_{\cos}(s_{t+c} - s_t, g_t(\theta))$
where:
- $A^M_t = R_t - V^M_t(x_t, \theta)$ is the _Manager's_ advantage function
- $d_{\cos}(\alpha, \beta) = \alpha^T \beta / (|\alpha| |\beta|)$ is the cosine similarity between two vectors
$g_t$ thereby acquires a semantic meaning as an advantageous direction in the state space at a horizon $c$.
- Worker: trained to maximize a weighted sum of external and intrinsic rewards, $R_t + \alpha R^I_t$. The intrinsic reward encourages the Worker to follow the goals and is defined as (see the sketch after this list):
$r^I_t = \frac{1}{c} \sum_{i=1}^{c} d_{\cos}(s_t - s_{t-i}, g_{t-i}(\theta))$
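The two update rules above can be turned into concrete loss/reward computations. Below is a minimal sketch, assuming PyTorch tensors and that $s$ holds the Manager's latent states over a rollout; the function names and shapes are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def manager_loss(s, g, advantage_m, c):
    """Transition policy gradient term for the Manager.

    s           : (T, d) latent states (no gradient flows through them here)
    g           : (T, d) goals g_t(theta) produced by the Manager
    advantage_m : (T,)  A^M_t = R_t - V^M_t(x_t), treated as a constant weight
    c           : horizon over which directions are measured
    """
    direction = (s[c:] - s[:-c]).detach()                 # s_{t+c} - s_t
    cos = F.cosine_similarity(direction, g[:-c], dim=1)   # d_cos(s_{t+c} - s_t, g_t)
    # Minimizing this loss follows  A^M_t * grad_theta d_cos(s_{t+c} - s_t, g_t(theta)).
    return -(advantage_m[:-c].detach() * cos).mean()

def intrinsic_reward(s, g, t, c):
    """Worker's intrinsic reward r^I_t = 1/c * sum_{i=1..c} d_cos(s_t - s_{t-i}, g_{t-i})."""
    sims = [F.cosine_similarity(s[t] - s[t - i], g[t - i], dim=0) for i in range(1, c + 1)]
    return torch.stack(sims).mean()
```

The Worker is then trained with an ordinary policy gradient objective on $R_t + \alpha R^I_t$, while the Manager's parameters receive only the gradient defined above.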
The authors evaluated the approach in several environments: Montezuma's Revenge, other Atari games, and DeepMind's Labyrinth environment.
FuN achieves very good performance in most of the tested environments. However, one of the most interesting results, in my opinion, comes from Montezuma's Revenge.
The approach is quite similar to the one proposed by Kulkarni et al. (summarized here), but whereas Kulkarni's sub-goals are handcrafted, FuN learns its goals, and the goals it generates turn out to be very similar to the handcrafted ones. It is interesting to see that the model is capable of learning useful goals by itself.
- Future work suggested by the authors: deeper hierarchies, with goals set at multiple time scales.