ALGORITHMS DEVELOPMENT

Questions related to development of RL algorithms intended to solve btgym tasks.

Note: imitation, guided and meta-learning are active research areas aimed at finding policies that generalize well and are robust to task shifts; they fuse model-free and model-based RL methods, representation-learning techniques and implementation options, so the research topics outlined here may be closely interrelated.


State feature search:

Idea: Learn state space embedding with relevant features:

  • beta-VAE autoencoder -- learning disentangled generative factors (DARLA).

  • structuring convolution encoder to promote relevant features

Pros: can learn efficient general state representations → find generalizable policies

Cons: can learn irrelevant features; usually trained with squared reconstruction error, while our specific goal is finding local price minima and maxima (+something else?);

Maybe:

  • construct additional domain-specific terms for the encoder training loss to promote relevant features (see the loss sketch after this list);

  • construct the convolution encoder such that learnt features represent local price minima/maxima;
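
A minimal sketch of such a composite loss, assuming a PyTorch beta-VAE with a Gaussian posterior (mu, logvar); the minmax_* arguments are hypothetical labels marking local price extrema, not part of btgym:

    import torch
    import torch.nn.functional as F

    def beta_vae_loss(recon_x, x, mu, logvar, beta=4.0,
                      minmax_logits=None, minmax_targets=None, aux_weight=1.0):
        # Standard beta-VAE objective: reconstruction + beta-weighted KL.
        recon_loss = F.mse_loss(recon_x, x, reduction='sum')
        kl = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp())
        loss = recon_loss + beta * kl
        if minmax_logits is not None:
            # Hypothetical domain-specific term: classify each step as local
            # min / max / neither to push the encoder toward such features.
            loss = loss + aux_weight * F.cross_entropy(minmax_logits, minmax_targets)
        return loss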

Links:

DARLA: Improving Zero-Shot Transfer in Reinforcement Learning

beta-VAE paper and its SCAN extension

and related Deep Variational Information Bottleneck paper;

Deep Spatial Autoencoders for Visuomotor Learning

Chelsea Finn CS294-112 lecture video - excellent topic intro

Aux: build a learnt-features visualization applet and add the images to TensorBoard (see the summary sketch below): [simple solution from keras blog]
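
A minimal TF1-style sketch (btgym is TensorFlow-based); conv_out is an assumed [batch, height, width, channels] encoder activation tensor:

    import tensorflow as tf

    def feature_map_summary(conv_out, name='encoder_features', n_maps=4):
        # Log the first example's first `n_maps` channels as grayscale images.
        maps = tf.transpose(conv_out[:1], [3, 1, 2, 0])  # -> [channels, h, w, 1]
        return tf.summary.image(name, maps[:n_maps], max_outputs=n_maps)

The returned summary op can be merged and written via the usual tf.summary.FileWriter, so the maps show up in TensorBoard's Images tab.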


Reward shaping search:

Links:

CS 294-112 spring'17 Lecture 6 slides and video
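
A safe reference point here is potential-based shaping (Ng et al., 1999), which provably leaves the optimal policy unchanged; a minimal sketch, where phi is any hypothetical state potential (e.g. unrealized PnL of the open position):

    def shaped_reward(env_reward, phi_s, phi_s_next, gamma=0.99):
        # Potential-based shaping term F(s, s') = gamma * phi(s') - phi(s);
        # adding F to the environment reward preserves the optimal policy.
        return env_reward + gamma * phi_s_next - phi_s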


Meta-learning:

MAML

Idea: MAML with an asynchronous setup (e.g. A3C)

Pros: finds a generalizable policy;

Cons: active research area; the generic MAML algorithm may not scale to our domain; may need implementation tricks such as those in One-Shot Visual Imitation Learning via Meta-Learning. A toy serial sketch of the core update follows.
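
A toy, serial MAML sketch on 1-D linear-regression tasks y = a * x, written in PyTorch for brevity; alpha is the inner (adaptation) rate, beta the outer (meta) rate, and every name is illustrative rather than btgym code:

    import torch

    def loss_fn(w, x, y):
        return ((x * w - y) ** 2).mean()

    w = torch.zeros(1, requires_grad=True)      # meta-parameters
    alpha, beta = 0.1, 0.01
    meta_opt = torch.optim.SGD([w], lr=beta)

    for step in range(1000):
        meta_loss = 0.0
        for _ in range(4):                      # batch of sampled tasks
            a = torch.rand(1) * 4.0 - 2.0       # task parameter
            x_tr, x_te = torch.randn(10), torch.randn(10)
            # Inner step: adapt w on the task's train split, keeping the
            # graph so the meta-gradient can flow through the adaptation.
            g, = torch.autograd.grad(loss_fn(w, x_tr, a * x_tr), w,
                                     create_graph=True)
            w_adapted = w - alpha * g
            # Outer objective: adapted parameters scored on the test split.
            meta_loss = meta_loss + loss_fn(w_adapted, x_te, a * x_te)
        meta_opt.zero_grad()
        meta_loss.backward()
        meta_opt.step()

In an asynchronous (A3C-style) variant, the inner task loop would be distributed over workers, each adapting a copy of the shared meta-parameters on its own trial.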

Links:

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML)

Meta-SGD: Learning to Learn Quickly for Few-Shot Learning

Learning To Reinforcement Learn (RL^2 simple framework idea)

Learning to Learn: Meta-Critic Networks for Sample Efficient Learning

Meta-Learning with Temporal Convolutions

Learning to Generalize: Meta-Learning for Domain Generalization

Guided policy search + meta-learning:

Idea: fit local policies (to a single episode or several episodes of data) and use them as experts demonstrating correct actions. Use a direct action-imitation loss (sketched below). It can also serve as a meta-learning loss (~MAML) on target (~trial test) data;

Pros: speeds up the learning process, cuts off irrelevant regions of policy space

Cons: fitting local models costs computation time. Unclear: is it better to use a direct action-imitation loss or just to test the model on target data (as in the original MAML formulation)? How should local models be parameterized?
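
A minimal sketch of the direct action-imitation loss, assuming a discrete action space; policy_logits (the learner's action logits) and expert_actions (actions the locally fitted policy took on the same states) are illustrative names:

    import torch.nn.functional as F

    def imitation_loss(policy_logits, expert_actions):
        # Behavioral-cloning style loss: cross-entropy between the learner's
        # action distribution and the local expert's chosen actions.
        return F.cross_entropy(policy_logits, expert_actions)

Used as a meta-learning loss, the agent would adapt on trial-train data with this loss and be scored on trial-test data for the outer update.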

Links:

Overcoming Exploration in Reinforcement Learning with Demonstrations

End-to-End Training of Deep Visuomotor Policies -- links to optimal control theory, notation shortlist


Agent architecture search:

Idea: fit different algorithm implementations to the btgym domain;

ACER

ACKTR -- does the K-FAC optimizer support LSTM layers?

Etc.

Pros: may perform better;
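
Of the listed candidates, PPO has the simplest core; a minimal PyTorch sketch of its clipped surrogate objective, with illustrative names: logp / logp_old are log-probabilities of the taken actions under the current and behavior policies, adv the advantage estimates:

    import torch

    def ppo_clip_loss(logp, logp_old, adv, eps=0.2):
        # Clipped surrogate objective from the PPO paper (to be minimized).
        ratio = torch.exp(logp - logp_old)
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
        return -torch.min(ratio * adv, clipped).mean()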

Links:

Asynchronous Methods for Deep Reinforcement Learning - a modern classic from DeepMind

Reinforcement Learning with Unsupervised Auxiliary Tasks

Learning To Navigate In Complex Environments

Proximal Policy Optimization Algorithms