Reinforcement learning environments for Torch7, inspired by RL-Glue [1] and conforming to the OpenAI Gym API [2]. Supported environments:
- rlenvs.Acrobot [3]
- rlenvs.Atari (Arcade Learning Environment)* [4]
- rlenvs.Blackjack [5]
- rlenvs.CartPole [6]
- rlenvs.Catch [7]
- rlenvs.CliffWalking [8]
- rlenvs.DynaMaze [9]
- rlenvs.GridWorld [10]
- rlenvs.JacksCarRental [8]
- rlenvs.Minecraft (Project Malmö)* [11]
- rlenvs.MountainCar [12]
- rlenvs.MultiArmedBandit [13, 14]
- rlenvs.RandomWalk [15]
- rlenvs.Taxi [16]
- rlenvs.WindyWorld [8]
- rlenvs.XOWorld [17]
Run `th experiment.lua` (or `qlua experiment.lua`) to run a demo of a random agent playing Catch.
* Environments with other dependencies are installed only if those dependencies are available.
## Installation

```sh
luarocks install https://raw.githubusercontent.com/Kaixhin/rlenvs/master/rocks/rlenvs-scm-2.rockspec
```
The old API can be installed with the v1 rockspec:

```sh
luarocks install https://raw.githubusercontent.com/Kaixhin/rlenvs/master/rocks/rlenvs-scm-1.rockspec
```
### Atari

```sh
luarocks install https://raw.githubusercontent.com/lake4790k/xitari/master/xitari-0-0.rockspec
luarocks install https://raw.githubusercontent.com/Kaixhin/alewrap/master/alewrap-0-0.rockspec
```

Requires a supported Atari ROM to run.
### Minecraft

```sh
luarocks install luasocket
```

Requires Malmö (which includes Minecraft), extracted with the directory name `MalmoPlatform`. `libMalmoLua.so` should be added to `LUA_CPATH`, and the level schemas should be exported to `MALMO_XSD_PATH`. For example, if `MalmoPlatform` is in `/home/username`, add the following to the end of your `~/.bashrc`:

```sh
export LUA_CPATH='/home/username/MalmoPlatform/Torch_Examples/?.so;'$LUA_CPATH
export MALMO_XSD_PATH=/home/username/MalmoPlatform
```

The Malmö client (`launchClient.sh`) must be running in order to use the environment.
## Usage

To use an environment, `require` it and then create a new instance:

```lua
local MountainCar = require 'rlenvs.MountainCar'

local env = MountainCar()
local observation = env:start()
```
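A complete episode with a random agent might then look like the following sketch, which assumes a discrete action space whose specification takes the `{'int', 1, {min, max}}` form described in the API section below:

```lua
require 'torch'
local MountainCar = require 'rlenvs.MountainCar'

local env = MountainCar()
local observation = env:start()

-- Assumes a discrete action spec of the form {'int', 1, {min, max}}
local actionSpec = env:getActionSpec()
local minAction, maxAction = actionSpec[3][1], actionSpec[3][2]

local reward, terminal = 0, false
local totalReward = 0
while not terminal do
  local action = torch.random(minAction, maxAction) -- random agent
  reward, observation, terminal = env:step(action)
  totalReward = totalReward + reward
end
print('Episode return: ' .. totalReward)
```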
Note that the API is under development and may be subject to change.
## API

#### rlenvs.envs

A table of all environments available in `rlenvs`.
#### observation = env:start([opts])

Starts a new episode in the environment and returns the first `observation`. May take `opts`.

Note that environments must actually implement this as `_start`.
#### reward, observation, terminal, [actionTaken] = env:step(action)

Performs a step in the environment using `action` (which may be a list - see below), and returns the `reward`, the `observation` of the state transitioned to, and a `terminal` flag. Optionally provides `actionTaken`, if the environment provides supervision in the form of the actual action taken by the agent in spite of the provided action.

Note that environments must actually implement this as `_step`.
#### stateSpec = env:getStateSpec()

Returns a state specification as a list with 3 elements:

| Type | Dimensionality | Range |
|---|---|---|
| `'int'` | 1 for a single value, or a table of dimensions for a Tensor | 2-element list with min and max values (inclusive) |
| `'real'` | 1 for a single value, or a table of dimensions for a Tensor | 2-element list with min and max values (inclusive) |
| `'string'` | TODO | List of accepted strings |

If several states are returned, `stateSpec` is itself a list of state specifications. Ranges may use `nil` if unknown.
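As an illustration, the spec for a pixel-based environment might be inspected as follows (the spec values shown in the comments are hypothetical):

```lua
local stateSpec = env:getStateSpec()
-- e.g. {'real', {3, 210, 160}, {0, 1}}: a real-valued 3x210x160 Tensor in [0, 1]
print(stateSpec[1]) -- type, e.g. 'real'
print(stateSpec[2]) -- dimensionality, e.g. {3, 210, 160}
print(stateSpec[3]) -- range, e.g. {0, 1}
```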
#### actionSpec = env:getActionSpec()

Returns an action specification, with the same structure as used for state specifications.
#### minReward, maxReward = env:getRewardSpec()

Returns the minimum and maximum rewards produced by the environment. Values may be `nil` if unknown.
The following are optional parts of the API.
Changes settings for a "training mode", analogous to neural network modules.
Changes settings for an "evaluation mode", analogous to neural network modules.
#### displaySpec = env:getDisplaySpec()

Returns an RGB display specification, with the same structure as used for state specifications. Hence of the form `{<int/real>, {3, <height>, <width>}, {<range>}}`.
#### display = env:getDisplay()

Returns an RGB display tensor for visualising the state of the environment. Note that this may not be the same as the state provided for the agent.
#### env:render()

Displays the environment using `image`. Requires the code to be run with `qlua` (rather than `th`) and `getDisplay` to be implemented by the environment.
## Development

Environments must inherit from `Env` and therefore implement the above methods (as well as a constructor). `experiment.lua` can easily be adapted for testing different environments. New environments should be added to `rlenvs/init.lua` and `rocks/rlenvs-scm-1.rockspec`, and be listed in this readme with an appropriate reference. For an example of a more complex environment that will only be installed if its optional dependencies are satisfied, see `rlenvs/Atari.lua`.
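As a sketch, a minimal new environment might look like the following. This assumes the `classic` OOP library and an `Env` base class required as `rlenvs/Env`; the environment itself ("guess the coin flip") and its specs are purely illustrative:

```lua
require 'torch'
local classic = require 'classic'

-- Hypothetical one-step environment: guess the outcome of a coin flip
local CoinFlip, super = classic.class('CoinFlip', require 'rlenvs/Env')

function CoinFlip:_init(opts)
  super._init(self, opts)
end

function CoinFlip:getStateSpec()
  return {'int', 1, {0, 1}} -- single binary observation
end

function CoinFlip:getActionSpec()
  return {'int', 1, {0, 1}} -- guess heads (0) or tails (1)
end

function CoinFlip:getRewardSpec()
  return 0, 1
end

function CoinFlip:_start()
  self.coin = torch.random(0, 1)
  return self.coin
end

function CoinFlip:_step(action)
  local reward = action == self.coin and 1 or 0
  self.coin = torch.random(0, 1)
  return reward, self.coin, true -- episodes last a single step
end

return CoinFlip
```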
## References

[1] Tanner, B., & White, A. (2009). RL-Glue: Language-independent software for reinforcement-learning experiments. The Journal of Machine Learning Research, 10, 2133-2136.
[2] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
[3] DeJong, G., & Spong, M. W. (1994, June). Swinging up the acrobot: An example of intelligent control. In American Control Conference, 1994 (Vol. 2, pp. 2158-2162). IEEE.
[4] Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2012). The arcade learning environment. Journal of Artificial Intelligence Research, 47, 253-279.
[5] Pérez-Uribe, A., & Sanchez, E. (1998, May). Blackjack as a test bed for learning strategies in neural networks. In Neural Networks Proceedings, 1998. IEEE World Congress on Computational Intelligence. The 1998 IEEE International Joint Conference on (Vol. 3, pp. 2022-2027). IEEE.
[6] Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. Systems, Man and Cybernetics, IEEE Transactions on, (5), 834-846.
[7] Mnih, V., Heess, N., & Graves, A. (2014). Recurrent models of visual attention. In Advances in Neural Information Processing Systems (pp. 2204-2212).
[8] Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction (Vol. 1, No. 1). Cambridge: MIT press.
[9] Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning (pp. 216-224).
[10] Boyan, J., & Moore, A. W. (1995). Generalization in reinforcement learning: Safely approximating the value function. Advances in Neural Information Processing Systems, 369-376.
[11] Johnson, M., Hofmann, K., Hutton, T., & Bignell, D. (2016). The Malmo platform for artificial intelligence experimentation. In International Joint Conference on Artificial Intelligence.
[12] Singh, S. P., & Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, 22(1-3), 123-158.
[13] Robbins, H. (1985). Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers (pp. 169-177). Springer New York.
[14] Whittle, P. (1988). Restless bandits: Activity allocation in a changing world. Journal of Applied probability, 287-298.
[15] Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9-44.
[16] Dietterich, T. G. (2000). Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. In Journal of Artificial Intelligence Research.
[17] Garnelo, M., Arulkumaran, K., & Shanahan, M. (2016). Towards Deep Symbolic Reinforcement Learning. In Workshop on Deep Reinforcement Learning, NIPS 2016.