Twrl-rebased #14

Merged 11 commits on Dec 15, 2016
92 changes: 52 additions & 40 deletions README.md
@@ -1,23 +1,23 @@
# rlenvs

Reinforcement learning environments for Torch7, inspired by RL-Glue [[1]](#references). Supported environments:

- rlenvs.Acrobot [[2]](#references)
- rlenvs.Atari (Arcade Learning Environment)\* [[3]](#references)
- rlenvs.Blackjack [[4]](#references)
- rlenvs.CartPole [[5]](#references)
- rlenvs.Catch [[6]](#references)
- rlenvs.CliffWalking [[7]](#references)
- rlenvs.DynaMaze [[8]](#references)
- rlenvs.GridWorld [[9]](#references)
- rlenvs.JacksCarRental [[7]](#references)
- rlenvs.Minecraft (Project Malmö)\* [[10]](#references)
- rlenvs.MountainCar [[11]](#references)
- rlenvs.MultiArmedBandit [[12, 13]](#references)
- rlenvs.RandomWalk [[14]](#references)
- rlenvs.Taxi [[15]](#references)
- rlenvs.WindyWorld [[7]](#references)
- rlenvs.XOWorld [[16]](#references)
Reinforcement learning environments for Torch7, inspired by [RL-Glue](http://glue.rl-community.org/wiki/Main_Page) [[1]](#references) and conforming to the [OpenAI Gym API](https://gym.openai.com/docs) [[2]](#references). Supported environments:

- rlenvs.Acrobot [[3]](#references)
- rlenvs.Atari (Arcade Learning Environment)\* [[4]](#references)
- rlenvs.Blackjack [[5]](#references)
- rlenvs.CartPole [[6]](#references)
- rlenvs.Catch [[7]](#references)
- rlenvs.CliffWalking [[8]](#references)
- rlenvs.DynaMaze [[9]](#references)
- rlenvs.GridWorld [[10]](#references)
- rlenvs.JacksCarRental [[8]](#references)
- rlenvs.Minecraft (Project Malmö)\* [[11]](#references)
- rlenvs.MountainCar [[12]](#references)
- rlenvs.MultiArmedBandit [[13, 14]](#references)
- rlenvs.RandomWalk [[15]](#references)
- rlenvs.Taxi [[16]](#references)
- rlenvs.WindyWorld [[8]](#references)
- rlenvs.XOWorld [[17]](#references)

Run `th experiment.lua` (or `qlua experiment.lua`) for a demo of a random agent playing Catch.

@@ -44,10 +44,11 @@ Requires a [supported](https://github.com/Kaixhin/Atari/blob/master/roms/README.
luarocks install luasocket
```

Requires [Malmö](https://github.com/Microsoft/malmo) (includes Minecraft), extracted with directory name `MalmoPlatform`. `libMalmoLua.so` should be added to `LUA_CPATH`. For example, if `MalmoPlatform` is in your home directory, add the following to the end of your `~/.bashrc`:
Requires [Malmö](https://github.com/Microsoft/malmo) (includes Minecraft), extracted with directory name `MalmoPlatform`. `libMalmoLua.so` should be added to `LUA_CPATH`, and the level schemas should be exported to `MALMO_XSD_PATH`. For example, if `MalmoPlatform` is in `/home/username`, add the following to the end of your `~/.bashrc`:

```sh
export LUA_CPATH=~/MalmoPlatform/Torch_Examples/libMalmoLua.so;$LUA_CPATH
export LUA_CPATH='/home/username/MalmoPlatform/Torch_Examples/?.so;'$LUA_CPATH
export MALMO_XSD_PATH=/home/username/MalmoPlatform
```

The Malmö client (`launchClient.sh`) must be operating to run.
@@ -66,15 +67,21 @@ local observation = env:start()

**Note that the API is under development and may be subject to change**

### rlenvs.envs

A table of all environments available in `rlenvs`.
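
For example, this table can be used to enumerate the installed environments (a minimal sketch, assuming the table is keyed by environment name; optional environments such as Atari or Minecraft only appear if their dependencies are installed):

```lua
local rlenvs = require 'rlenvs'

-- Print the name of every environment available in this installation
for name, _ in pairs(rlenvs.envs) do
  print(name)
end
```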

### observation = env:start([opts])

Starts a new episode in the environment and returns the first `observation`. May take `opts`.
Starts a new episode in the environment and returns the first `observation`. May take `opts`.
Note that environments must actually implement this as `_start`.

### reward, observation, terminal, [actionTaken] = env:step(action)

Performs a step in the environment using `action` (which may be a list - see below), and returns the `reward`, the `observation` of the state transitioned to, and a `terminal` flag. Optionally provides `actionTaken`, if the environment provides supervision in the form of the actual action taken by the agent in spite of the provided action.
Performs a step in the environment using `action` (which may be a list - see below), and returns the `reward`, the `observation` of the state transitioned to, and a `terminal` flag. Optionally provides `actionTaken` if the environment provides supervision in the form of the actual action taken by the agent, which may differ from the provided action.
Note that environments must actually implement this as `_step`.
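
Together, `start` and `step` support the usual agent-environment loop. The following sketch runs one episode of Catch with a random policy, assuming the `Discrete` action space table used elsewhere in this diff (actions numbered `0` to `n - 1`):

```lua
require 'torch'
local Catch = require 'rlenvs.Catch'

local env = Catch({level = 2})
local actionSpace = env:getActionSpace()

local observation = env:start()
local terminal, totalReward = false, 0

while not terminal do
  -- Pick a uniformly random action and take one step in the environment
  local action = torch.random(0, actionSpace.n - 1)
  local reward
  reward, observation, terminal = env:step(action)
  totalReward = totalReward + reward
end

print('Episode reward: ' .. totalReward)
```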

### stateSpec = env:getStateSpec()
### stateSpace = env:getStateSpace()

Returns a state specification as a list with 3 elements:

@@ -86,11 +93,11 @@ Returns a state specification as a list with 3 elements:

If several states are returned, `stateSpec` is itself a list of state specifications. Ranges may use `nil` if unknown.
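
Note that the description above still refers to the older 3-element specification; the environments updated in this diff (e.g. Acrobot) instead return a `Box`-style table with `name`, `shape`, `low` and `high` fields. A sketch of inspecting it, assuming that format:

```lua
local Acrobot = require 'rlenvs.Acrobot'

local env = Acrobot()
local stateSpace = env:getStateSpace()

print(stateSpace.name)                        -- 'Box'
print(table.concat(stateSpace.shape, 'x'))    -- '4'
print(stateSpace.low[1], stateSpace.high[1])  -- bounds of the first state variable (joint 1 angle)
```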

### actionSpec = env:getActionSpec()
### actionSpace = env:getActionSpace()

Returns an action specification, with the same structure as used for state specifications.

### minReward, maxReward = env:getRewardSpec()
### minReward, maxReward = env:getRewardSpace()

Returns the minimum and maximum rewards produced by the environment. Values may be `nil` if unknown.
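
A companion sketch for the action and reward queries, using Catch and assuming the `Discrete` table and `minReward, maxReward` pair used elsewhere in this diff (either bound may be `nil`):

```lua
local Catch = require 'rlenvs.Catch'

local env = Catch({level = 2})

local actionSpace = env:getActionSpace()
print('Number of discrete actions: ' .. actionSpace.n)

local minReward, maxReward = env:getRewardSpace()
print('Reward range:', minReward, maxReward)
```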

@@ -114,25 +121,30 @@ Returns an RGB display specification, with the same structure as used for state

Returns an RGB display tensor for visualising the state of the environment. Note that this may not be the same as the state provided for the agent.

### env:render()

Displays the environment using `image`. Requires the code to be run with `qlua` (rather than `th`) and `getDisplay` to be implemented by the environment.
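
For example, with the rendering options passed to Catch in the updated `experiment.lua` (a sketch; run with `qlua`):

```lua
local Catch = require 'rlenvs.Catch'

-- render and zoom are the options used by experiment.lua in this diff
local env = Catch({level = 2, render = true, zoom = 10})
local observation = env:start()

-- Draws the current state in an image window (requires qlua and getDisplay)
env:render()
```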

## Development

Environments must inherit from `Env` and therefore implement the above methods (as well as a constructor). `experiment.lua` can be easily adapted for testing different environments. New environments should be added to `rlenvs/init.lua`, `rocks/rlenvs-scm-1.rockspec`, and be listed in this readme with an appropriate reference. For an example of a more complex environment that will only be installed if its optional dependencies are satisfied, see `rlenvs/Atari.lua`.
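
A minimal skeleton, modelled on the Acrobot changes in this diff, might look as follows (a sketch only: the environment, its dynamics and its spaces are hypothetical, and the exact `Env` base-class contract should be checked against `rlenvs/Env.lua`):

```lua
local classic = require 'classic'

-- Env is expected to be available, as in rlenvs/Acrobot.lua
local MyEnv, super = classic.class('MyEnv', Env)
MyEnv.timeStepLimit = 100

function MyEnv:_init(opts)
  opts = opts or {}
  opts.timeStepLimit = MyEnv.timeStepLimit
  super._init(self, opts)
  self.size = opts.size or 10 -- hypothetical environment parameter
end

function MyEnv:getStateSpace()
  return {name = 'Box', shape = {1}, low = {0}, high = {self.size}}
end

function MyEnv:getActionSpace()
  return {name = 'Discrete', n = 2} -- 0 = left, 1 = right
end

function MyEnv:getRewardSpace()
  return 0, 1
end

-- Note that start/step are implemented as _start/_step, as described above
function MyEnv:_start()
  self.position = 0
  return self.position
end

function MyEnv:_step(action)
  -- Move left or right along a line, clamped to [0, size]
  self.position = math.min(math.max(self.position + (action == 1 and 1 or -1), 0), self.size)
  local terminal = self.position == self.size
  local reward = terminal and 1 or 0
  return reward, self.position, terminal
end

return MyEnv
```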

## References

[1] Tanner, B., & White, A. (2009). RL-Glue: Language-independent software for reinforcement-learning experiments. *The Journal of Machine Learning Research, 10*, 2133-2136.
[2] DeJong, G., & Spong, M. W. (1994, June). Swinging up the acrobot: An example of intelligent control. In *American Control Conference, 1994* (Vol. 2, pp. 2158-2162). IEEE.
[3] Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2012). The arcade learning environment. *J. Artificial Intelligence Res, 47*, 253-279.
[4] Pérez-Uribe, A., & Sanchez, E. (1998, May). Blackjack as a test bed for learning strategies in neural networks. In *Neural Networks Proceedings, 1998. IEEE World Congress on Computational Intelligence. The 1998 IEEE International Joint Conference on* (Vol. 3, pp. 2022-2027). IEEE.
[5] Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. *Systems, Man and Cybernetics, IEEE Transactions on*, (5), 834-846.
[6] Mnih, V., Heess, N., & Graves, A. (2014). Recurrent models of visual attention. In *Advances in Neural Information Processing Systems* (pp. 2204-2212).
[7] Sutton, R. S., & Barto, A. G. (1998). *Reinforcement learning: An introduction* (Vol. 1, No. 1). Cambridge: MIT press.
[8] Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In *Proceedings of the seventh international conference on machine learning* (pp. 216-224).
[9] Boyan, J., & Moore, A. W. (1995). Generalization in reinforcement learning: Safely approximating the value function. *Advances in neural information processing systems*, 369-376.
[10] Johnson, M., Hofmann, K., Hutton, T., & Bignell, D. (2016). The Malmo platform for artificial intelligence experimentation. In *International joint conference on artificial intelligence (IJCAI)*.
[11] Singh, S. P., & Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. *Machine learning, 22*(1-3), 123-158.
[12] Robbins, H. (1985). Some aspects of the sequential design of experiments. In *Herbert Robbins Selected Papers* (pp. 169-177). Springer New York.
[13] Whittle, P. (1988). Restless bandits: Activity allocation in a changing world. *Journal of applied probability*, 287-298.
[14] Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. *Machine learning, 3*(1), 9-44.
[15] Dietterich, T. G. (2000). Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. In *Journal of Artificial Intelligence Research*.
[16] Garnelo, M., Arulkumaran, K., & Shanahan, M. (2016). Towards Deep Symbolic Reinforcement Learning. *arXiv preprint arXiv:1609.05518*.
[2] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W. (2016). OpenAI Gym. *arXiv preprint arXiv:1606.01540*.
[3] DeJong, G., & Spong, M. W. (1994, June). Swinging up the acrobot: An example of intelligent control. In *American Control Conference, 1994* (Vol. 2, pp. 2158-2162). IEEE.
[4] Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2012). The arcade learning environment. *Journal of Artificial Intelligence Research, 47*, 253-279.
[5] Pérez-Uribe, A., & Sanchez, E. (1998, May). Blackjack as a test bed for learning strategies in neural networks. In *Neural Networks Proceedings, 1998. IEEE World Congress on Computational Intelligence. The 1998 IEEE International Joint Conference on* (Vol. 3, pp. 2022-2027). IEEE.
[6] Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. *Systems, Man and Cybernetics, IEEE Transactions on*, (5), 834-846.
[7] Mnih, V., Heess, N., & Graves, A. (2014). Recurrent models of visual attention. In *Advances in Neural Information Processing Systems* (pp. 2204-2212).
[8] Sutton, R. S., & Barto, A. G. (1998). *Reinforcement learning: An introduction* (Vol. 1, No. 1). Cambridge: MIT press.
[9] Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In *Proceedings of the Seventh International Conference on Machine Learning* (pp. 216-224).
[10] Boyan, J., & Moore, A. W. (1995). Generalization in reinforcement learning: Safely approximating the value function. *Advances in Neural Information Processing Systems*, 369-376.
[11] Johnson, M., Hofmann, K., Hutton, T., & Bignell, D. (2016). The Malmo platform for artificial intelligence experimentation. In *International Joint Conference on Artificial Intelligence*.
[12] Singh, S. P., & Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. *Machine Learning, 22*(1-3), 123-158.
[13] Robbins, H. (1985). Some aspects of the sequential design of experiments. In *Herbert Robbins Selected Papers* (pp. 169-177). Springer New York.
[14] Whittle, P. (1988). Restless bandits: Activity allocation in a changing world. *Journal of Applied Probability*, 287-298.
[15] Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. *Machine Learning, 3*(1), 9-44.
[16] Dietterich, T. G. (2000). Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. *Journal of Artificial Intelligence Research*.
[17] Garnelo, M., Arulkumaran, K., & Shanahan, M. (2016). Towards Deep Symbolic Reinforcement Learning. In *Workshop on Deep Reinforcement Learning, NIPS 2016*.
41 changes: 17 additions & 24 deletions experiment.lua
@@ -1,38 +1,31 @@
local image = require 'image'
local Catch = require 'rlenvs/Catch'

-- Detect QT for image display
local qt = pcall(require, 'qt')
local Catch = require 'rlenvs.Catch'

-- Initialise and start environment
local env = Catch({level = 2})
local stateSpec = env:getStateSpec()
local actionSpec = env:getActionSpec()
local env = Catch({level = 2, render = true, zoom = 10})
local actionSpace = env:getActionSpace()
local observation = env:start()

local reward, terminal
local reward, terminal = 0, false
local episodes, totalReward = 0, 0
local nSteps = 1000 * (stateSpec[2][2] - 1) -- Run for 1000 episodes
local nEpisodes = 1000

-- Display
local window = qt and image.display({image=observation, zoom=10})
env:render()

for i = 1, nSteps do
-- Pick random action and execute it
local action = torch.random(actionSpec[3][1], actionSpec[3][2])
reward, observation, terminal = env:step(action)
totalReward = totalReward + reward
for i = 1, nEpisodes do
while not terminal do
-- Pick random action and execute it
local action = torch.random(0, actionSpace['n'] - 1)
reward, observation, terminal = env:step(action)
totalReward = totalReward + reward

-- Display
if qt then
image.display({image=observation, zoom=10, win=window})
-- Display
env:render()
end

-- If game finished, start again
if terminal then
episodes = episodes + 1
observation = env:start()
end
episodes = episodes + 1
observation = env:start()
terminal = false
end
print('Episodes: ' .. episodes)
print('Total Reward: ' .. totalReward)
65 changes: 40 additions & 25 deletions rlenvs/Acrobot.lua
@@ -1,11 +1,14 @@
local classic = require 'classic'

local Acrobot, super = classic.class('Acrobot', Env)
Acrobot.timeStepLimit = 500

-- Constructor
function Acrobot:_init(opts)
opts = opts or {}

opts.timeStepLimit = Acrobot.timeStepLimit
super._init(self, opts)

-- Constants
self.g = opts.g or 9.8
self.m1 = opts.m1 or 1 -- Mass of link 1
@@ -21,27 +24,40 @@ function Acrobot:_init(opts)
end

-- 4 states returned, of type 'real', of dimensionality 1, with differing ranges
function Acrobot:getStateSpec()
return {
{'real', 1, {-math.pi, math.pi}}, -- Joint 1 angle
{'real', 1, {-math.pi, math.pi}}, -- Joint 2 angle
{'real', 1, {-4*math.pi, 4*math.pi}}, -- Joint 1 angular velocity
{'real', 1, {-9*math.pi, 9*math.pi}} -- Joint 2 angular velocity
function Acrobot:getStateSpace()
local state = {}
state['name'] = 'Box'
state['shape'] = {4}
state['low'] = {
-math.pi, -- Joint 1 angle
-math.pi, -- Joint 2 angle
-4 * math.pi, -- Joint 1 angular velocity
-9 * math.pi -- Joint 2 angular velocity
}
state['high'] = {
math.pi, -- Joint 1 angle
math.pi, -- Joint 2 angle
4 * math.pi, -- Joint 1 angular velocity
9 * math.pi -- Joint 2 angular velocity
}
return state
end

-- 1 action required, of type 'int', of dimensionality 1, with second torque joint in {-1, 0, 1}
function Acrobot:getActionSpec()
return {'int', 1, {-1, 1}}
function Acrobot:getActionSpace()
local action = {}
action['name'] = 'Discrete'
action['n'] = 3
return action
end

-- Min and max reward
function Acrobot:getRewardSpec()
function Acrobot:getRewardSpace()
return -1, 0
end

-- Resets the cart
function Acrobot:start()
function Acrobot:_start()
-- Reset angles and velocities
self.q1 = 0 -- Joint 1 angle
self.q2 = 0 -- Joint 2 angle
@@ -52,20 +68,19 @@ function Acrobot:start()
end

-- Swings the pole via torque on second joint
function Acrobot:step(action)
function Acrobot:_step(action)
action = action - 1 -- rescale the action from {0, 1, 2} to a joint 2 torque in {-1, 0, 1}
local reward = -1
local terminal = false

for t = 1, self.steps do
-- Calculate motion of system
local d1 = self.m1*math.pow(self.lc1, 2) + self.m2*(math.pow(self.l1, 2) + math.pow(self.lc2, 2) + 2*self.l1*self.lc2*math.cos(self.q2)) + self.I1 + self.I2
local d2 = self.m2*(math.pow(self.lc2, 2) + self.l1*self.lc2*math.cos(self.q2)) + self.I2
local phi2 = self.m2*self.lc2*self.g*math.cos(self.q1 + self.q2 - math.pi/2)
local phi1 = -self.m2*self.l1*self.lc2*math.pow(self.q2Dot, 2)*math.sin(self.q2) - 2*self.m2*self.l1*self.lc2*self.q2Dot*self.q1Dot*math.sin(self.q2) +
(self.m1*self.lc1 + self.m2*self.l1)*self.g*math.cos(self.q1 - math.pi/2) + phi2
local q2DotDot = (action + d2/d1*phi1 - self.m2*self.l1*self.lc2*math.pow(self.q1Dot, 2)*math.sin(self.q2) - phi2) /
(self.m2*math.pow(self.lc2, 2) + self.I2 - math.pow(d2, 2)/d1)
local q1DotDot = -(d2/q2DotDot + phi1)/d1
local d1 = self.m1 * math.pow(self.lc1, 2) + self.m2 * (math.pow(self.l1, 2) + math.pow(self.lc2, 2) + 2 * self.l1 * self.lc2 * math.cos(self.q2)) + self.I1 + self.I2
local d2 = self.m2 * (math.pow(self.lc2, 2) + self.l1 * self.lc2 * math.cos(self.q2)) + self.I2
local phi2 = self.m2 * self.lc2 * self.g * math.cos(self.q1 + self.q2 - math.pi/2)
local phi1 = -self.m2 * self.l1 * self.lc2 * math.pow(self.q2Dot, 2) * math.sin(self.q2) - 2 * self.m2 * self.l1 * self.lc2 * self.q2Dot * self.q1Dot * math.sin(self.q2) + (self.m1 * self.lc1 + self.m2 * self.l1) * self.g * math.cos(self.q1 - math.pi / 2) + phi2
local q2DotDot = (action + d2 / d1 * phi1 - self.m2 * self.l1 * self.lc2 * math.pow(self.q1Dot, 2) * math.sin(self.q2) - phi2) / (self.m2 * math.pow(self.lc2, 2) + self.I2 - math.pow(d2, 2) / d1)
local q1DotDot = -(d2 / q2DotDot + phi1) / d1

-- Update state using Euler's method
self.q1Dot = self.q1Dot + self.tau * q1DotDot
@@ -86,13 +101,13 @@ function Acrobot:step(action)
self.q2 = math.pi - (self.q2 % -math.pi)
end
-- Limit velocities
self.q1Dot = math.max(self.q1Dot, -4*math.pi)
self.q1Dot = math.min(self.q1Dot, 4*math.pi)
self.q2Dot = math.max(self.q2Dot, -9*math.pi)
self.q2Dot = math.min(self.q2Dot, 9*math.pi)
self.q1Dot = math.max(self.q1Dot, -4 * math.pi)
self.q1Dot = math.min(self.q1Dot, 4 * math.pi)
self.q2Dot = math.max(self.q2Dot, -9 * math.pi)
self.q2Dot = math.min(self.q2Dot, 9 * math.pi)

-- Terminate if second joint's height is greater than height of first joint (relative to origin)
local h = -self.l1*math.cos(self.q1) - self.l2*math.sin(math.pi/2 - self.q1 - self.q2)
local h = -self.l1 * math.cos(self.q1) - self.l2 * math.sin(math.pi / 2 - self.q1 - self.q2)
if h > self.l1 then
reward = 0
terminal = true
30 changes: 23 additions & 7 deletions rlenvs/Atari.lua
@@ -6,11 +6,15 @@ if not hasALEWrap then
end

local Atari, super = classic.class('Atari', Env)
Atari.timeStepLimit = 100000

-- Constructor
function Atari:_init(opts)
-- Create ALEWrap options from opts
opts = opts or {}
opts.timeStepLimit = Atari.timeStepLimit
super._init(self, opts)

if opts.lifeLossTerminal == nil then
opts.lifeLossTerminal = true
end
@@ -44,13 +48,25 @@ function Atari:_init(opts)
end

-- 1 state returned, of type 'real', of dimensionality 3 x 210 x 160, between 0 and 1
function Atari:getStateSpec()
return {'real', {3, 210, 160}, {0, 1}}
function Atari:getStateSpace()
local state = {}
state['name'] = 'Box'
state['shape'] = {3, 210, 160}
state['low'] = {
0
}
state['high'] = {
1
}
return state
end

-- 1 action required, of type 'int', of dimensionality 1, between 1 and 18 (max)
function Atari:getActionSpec()
return {'int', 1, {1, #self.actions}}
function Atari:getActionSpace()
local action = {}
action['name'] = 'Discrete'
action['n'] = #self.actions
return action
end

-- RGB screen of height 210 and width 160
@@ -59,12 +75,12 @@ function Atari:getDisplaySpec()
end

-- Min and max reward (unknown)
function Atari:getRewardSpec()
function Atari:getRewardSpace()
return nil, nil
end

-- Starts a new game, possibly with a random number of no-ops
function Atari:start()
function Atari:_start()
local screen, reward, terminal

if self.gameEnv._random_starts > 0 then
@@ -77,7 +93,7 @@ end
end

-- Steps in a game
function Atari:step(action)
function Atari:_step(action)
-- Map action index to action for game
action = self.actions[action]
