Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, Koray Kavukcuoglu
-
This paper introduces FeUdal Networks (FuN).
-
Two levels of hierarchy:
- Manager (top-level module): sets goals at a lower temporal resolution. It receives its learning signal from the environment alone.
- Worker (low-level module): operates at a higher temporal resolution and produces primitive actions, conditioned on the goals received from the Manager. It is motivated by an intrinsic reward.
-
The figure above represents the architecture of the FuN model. Both modules share a perceptual module which takes the current observation $x_t$ and outputs an intermediate representation $z_t$.
-
Both modules are recurrent.
-
$\phi$ represents a linear transform mapping the goal $g_t$ into an embedding vector $w_t$.
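To make the wiring concrete, below is a minimal sketch (in PyTorch) of how these pieces could fit together. It is not the authors' implementation: the dimensions, the use of plain LSTM cells (the paper uses a dilated LSTM for the Manager), emitting the goal directly from the Manager's hidden state, and feeding only the current goal to $\phi$ (the paper pools the goals of the last $c$ steps) are simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn

class FuNSketch(nn.Module):
    """Simplified FuN skeleton: shared perception, recurrent Manager and Worker."""

    def __init__(self, obs_dim, d=256, k=16, n_actions=6):
        super().__init__()
        self.k, self.n_actions = k, n_actions
        self.percept = nn.Linear(obs_dim, d)               # z_t = f_percept(x_t), shared
        self.manager_rnn = nn.LSTMCell(d, d)               # recurrent Manager (paper: dilated LSTM)
        self.worker_rnn = nn.LSTMCell(d, n_actions * k)    # recurrent Worker
        self.phi = nn.Linear(d, k, bias=False)             # phi: goal g_t -> embedding w_t

    def forward(self, x_t, manager_state=None, worker_state=None):
        z_t = torch.relu(self.percept(x_t))                # shared intermediate representation
        # Manager: emit a unit-norm goal direction g_t in its latent space.
        m_h, m_c = self.manager_rnn(z_t, manager_state)
        g_t = m_h / (m_h.norm(dim=-1, keepdim=True) + 1e-8)
        w_t = self.phi(g_t)                                # k-dimensional goal embedding
        # Worker: per-action embeddings, modulated by the goal embedding w_t.
        w_h, w_c = self.worker_rnn(z_t, worker_state)
        U_t = w_h.view(-1, self.n_actions, self.k)
        logits = torch.bmm(U_t, w_t.unsqueeze(-1)).squeeze(-1)  # action scores; softmax -> policy
        return logits, g_t, (m_h, m_c), (w_h, w_c)
```

The structural point is that the goal reaches the Worker only through the low-dimensional embedding $w_t$.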
FuN is fully differentiable, so it could in principle be trained end-to-end with a policy gradient algorithm operating on the actions taken by the Worker, propagating the gradients from the Worker back to the Manager. The authors argue, however, that this standard approach would deprive the Manager's goals $g$ of any semantic meaning, reducing them to internal latent variables of the model. They therefore propose a different training setup:
- Manager: trained to predict advantageous directions in the latent state space and to intrinsically reward the Worker for following these directions (see the sketch after this list). Update rule:
$\nabla g_t = A^M_t \nabla_\theta d_{\cos}(s_{t+c} - s_t, g_t(\theta))$
where:
- $A^M_t = R_t - V^M_t(x_t, \theta)$ is the _Manager's_ advantage function
- $d_{\cos}(\alpha, \beta) = \alpha^T \beta / (|\alpha| |\beta|)$ is the cosine similarity between two vectors
$g_t$ thereby acquires a semantic meaning as an advantageous direction in the state space at a horizon $c$.
- Worker: trained to maximize a weighted sum of external and intrinsic rewards, $R_t + \alpha R^I_t$. The intrinsic reward encourages the Worker to follow the goals and is defined as (see the sketch after this list):
$r^I_t = \frac{1}{c} \sum_{i=1}^{c} d_{\cos}(s_t - s_{t-i}, g_{t-i}(\theta))$
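The two update rules above can be turned into concrete loss/reward computations. Below is a minimal sketch, assuming PyTorch tensors and that $s$ holds the Manager's latent states over a rollout; the function names and shapes are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def manager_loss(s, g, advantage_m, c):
    """Transition policy gradient term for the Manager.

    s           : (T, d) latent states (no gradient flows through them here)
    g           : (T, d) goals g_t(theta) produced by the Manager
    advantage_m : (T,)  A^M_t = R_t - V^M_t(x_t), treated as a constant weight
    c           : horizon over which directions are measured
    """
    direction = (s[c:] - s[:-c]).detach()                 # s_{t+c} - s_t
    cos = F.cosine_similarity(direction, g[:-c], dim=1)   # d_cos(s_{t+c} - s_t, g_t)
    # Minimizing this loss follows  A^M_t * grad_theta d_cos(s_{t+c} - s_t, g_t(theta)).
    return -(advantage_m[:-c].detach() * cos).mean()

def intrinsic_reward(s, g, t, c):
    """Worker's intrinsic reward r^I_t = 1/c * sum_{i=1..c} d_cos(s_t - s_{t-i}, g_{t-i})."""
    sims = [F.cosine_similarity(s[t] - s[t - i], g[t - i], dim=0) for i in range(1, c + 1)]
    return torch.stack(sims).mean()
```

The Worker is then trained with an ordinary policy gradient objective on $R_t + \alpha R^I_t$, while the Manager's parameters receive only the gradient defined above.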
The authors evaluated the approach in several environments: Montezuma's Revenge, other Atari games, and DeepMind's Labyrinth environment.
FuN achieves very good performance in most of the tested environments. However, one of the most interesting results, in my opinion, comes from Montezuma's Revenge.
The approach is quite similar to the one proposed by Kulkarni et al. (summarized here), but whereas Kulkarni's sub-goals are handcrafted, FuN learns its goals, and the goals it generates turn out to be very similar to the handcrafted ones. It is interesting to see that the model is capable of learning useful goals by itself.
- Future work suggested by the authors: deeper hierarchies, with goals set at multiple time scales.