Twrl-rebased #14

Merged 11 commits on Dec 15, 2016
92 changes: 52 additions & 40 deletions README.md
@@ -1,23 +1,23 @@
# rlenvs

Reinforcement learning environments for Torch7, inspired by RL-Glue [[1]](#references). Supported environments:

- rlenvs.Acrobot [[2]](#references)
- rlenvs.Atari (Arcade Learning Environment)\* [[3]](#references)
- rlenvs.Blackjack [[4]](#references)
- rlenvs.CartPole [[5]](#references)
- rlenvs.Catch [[6]](#references)
- rlenvs.CliffWalking [[7]](#references)
- rlenvs.DynaMaze [[8]](#references)
- rlenvs.GridWorld [[9]](#references)
- rlenvs.JacksCarRental [[7]](#references)
- rlenvs.Minecraft (Project Malmö)\* [[10]](#references)
- rlenvs.MountainCar [[11]](#references)
- rlenvs.MultiArmedBandit [[12, 13]](#references)
- rlenvs.RandomWalk [[14]](#references)
- rlenvs.Taxi [[15]](#references)
- rlenvs.WindyWorld [[7]](#references)
- rlenvs.XOWorld [[16]](#references)
Reinforcement learning environments for Torch7, inspired by [RL-Glue](http://glue.rl-community.org/wiki/Main_Page) [[1]](#references) and conforming to the [OpenAI Gym API](https://gym.openai.com/docs) [[2]](#references). Supported environments:

- rlenvs.Acrobot [[3]](#references)
- rlenvs.Atari (Arcade Learning Environment)\* [[4]](#references)
- rlenvs.Blackjack [[5]](#references)
- rlenvs.CartPole [[6]](#references)
- rlenvs.Catch [[7]](#references)
- rlenvs.CliffWalking [[8]](#references)
- rlenvs.DynaMaze [[9]](#references)
- rlenvs.GridWorld [[10]](#references)
- rlenvs.JacksCarRental [[8]](#references)
- rlenvs.Minecraft (Project Malmö)\* [[11]](#references)
- rlenvs.MountainCar [[12]](#references)
- rlenvs.MultiArmedBandit [[13, 14]](#references)
- rlenvs.RandomWalk [[15]](#references)
- rlenvs.Taxi [[16]](#references)
- rlenvs.WindyWorld [[8]](#references)
- rlenvs.XOWorld [[17]](#references)

Run `th experiment.lua` (or `qlua experiment.lua`) for a demo of a random agent playing Catch.

@@ -44,10 +44,11 @@ Requires a [supported](https://github.com/Kaixhin/Atari/blob/master/roms/README.
luarocks install luasocket
```

Requires [Malmö](https://github.com/Microsoft/malmo) (includes Minecraft), extracted with directory name `MalmoPlatform`. `libMalmoLua.so` should be added to `LUA_CPATH`. For example, if `MalmoPlatform` is in your home directory, add the following to the end of your `~/.bashrc`:
Requires [Malmö](https://github.com/Microsoft/malmo) (includes Minecraft), extracted with directory name `MalmoPlatform`. `libMalmoLua.so` should be added to `LUA_CPATH`, and the level schemas should be exported to `MALMO_XSD_PATH`. For example, if `MalmoPlatform` is in `/home/username`, add the following to the end of your `~/.bashrc`:

```sh
export LUA_CPATH=~/MalmoPlatform/Torch_Examples/libMalmoLua.so;$LUA_CPATH
export LUA_CPATH='/home/username/MalmoPlatform/Torch_Examples/?.so;'$LUA_CPATH
export MALMO_XSD_PATH=/home/username/MalmoPlatform
```

The Malmö client (`launchClient.sh`) must be operating to run.
@@ -66,15 +67,21 @@ local observation = env:start()

**Note that the API is under development and may be subject to change**

### rlenvs.envs

A table of all environments available in `rlenvs`.
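
For example, this table can be used to enumerate the installed environments (a minimal sketch, assuming the table is keyed by environment name; optional environments such as Atari or Minecraft only appear if their dependencies are installed):

```lua
local rlenvs = require 'rlenvs'

-- Print the name of every environment available in this installation
for name, _ in pairs(rlenvs.envs) do
  print(name)
end
```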

### observation = env:start([opts])

Starts a new episode in the environment and returns the first `observation`. May take `opts`.
Starts a new episode in the environment and returns the first `observation`. May take `opts`.
Note that environments must actually implement this as `_start`.

### reward, observation, terminal, [actionTaken] = env:step(action)

Performs a step in the environment using `action` (which may be a list - see below), and returns the `reward`, the `observation` of the state transitioned to, and a `terminal` flag. Optionally provides `actionTaken`, if the environment provides supervision in the form of the actual action taken by the agent in spite of the provided action.
Performs a step in the environment using `action` (which may be a list - see below), and returns the `reward`, the `observation` of the state transitioned to, and a `terminal` flag. Optionally provides `actionTaken` if the environment provides supervision in the form of the actual action taken by the agent, which may differ from the provided action.
Note that environments must actually implement this as `_step`.
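
Together, `start` and `step` support the usual agent-environment loop. The following sketch runs one episode of Catch with a random policy, assuming the `Discrete` action space table used elsewhere in this diff (actions numbered `0` to `n - 1`):

```lua
require 'torch'
local Catch = require 'rlenvs.Catch'

local env = Catch({level = 2})
local actionSpace = env:getActionSpace()

local observation = env:start()
local terminal, totalReward = false, 0

while not terminal do
  -- Pick a uniformly random action and take one step in the environment
  local action = torch.random(0, actionSpace.n - 1)
  local reward
  reward, observation, terminal = env:step(action)
  totalReward = totalReward + reward
end

print('Episode reward: ' .. totalReward)
```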

### stateSpec = env:getStateSpec()
### stateSpace = env:getStateSpace()

Returns a state specification as a list with 3 elements:

@@ -86,11 +93,11 @@ Returns a state specification as a list with 3 elements:

If several states are returned, `stateSpec` is itself a list of state specifications. Ranges may use `nil` if unknown.
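
Note that the description above still refers to the older 3-element specification; the environments updated in this diff (e.g. Acrobot) instead return a `Box`-style table with `name`, `shape`, `low` and `high` fields. A sketch of inspecting it, assuming that format:

```lua
local Acrobot = require 'rlenvs.Acrobot'

local env = Acrobot()
local stateSpace = env:getStateSpace()

print(stateSpace.name)                        -- 'Box'
print(table.concat(stateSpace.shape, 'x'))    -- '4'
print(stateSpace.low[1], stateSpace.high[1])  -- bounds of the first state variable (joint 1 angle)
```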

### actionSpec = env:getActionSpec()
### actionSpace = env:getActionSpace()

Returns an action specification, with the same structure as used for state specifications.

### minReward, maxReward = env:getRewardSpec()
### minReward, maxReward = env:getRewardSpace()

Returns the minimum and maximum rewards produced by the environment. Values may be `nil` if unknown.
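
A companion sketch for the action and reward queries, using Catch and assuming the `Discrete` table and `minReward, maxReward` pair used elsewhere in this diff (either bound may be `nil`):

```lua
local Catch = require 'rlenvs.Catch'

local env = Catch({level = 2})

local actionSpace = env:getActionSpace()
print('Number of discrete actions: ' .. actionSpace.n)

local minReward, maxReward = env:getRewardSpace()
print('Reward range:', minReward, maxReward)
```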

@@ -114,25 +121,30 @@ Returns an RGB display specification, with the same structure as used for state

Returns an RGB display tensor for visualising the state of the environment. Note that this may not be the same as the state provided for the agent.

### env:render()

Displays the environment using `image`. Requires the code to be run with `qlua` (rather than `th`) and `getDisplay` to be implemented by the environment.
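
For example, with the rendering options passed to Catch in the updated `experiment.lua` (a sketch; run with `qlua`):

```lua
local Catch = require 'rlenvs.Catch'

-- render and zoom are the options used by experiment.lua in this diff
local env = Catch({level = 2, render = true, zoom = 10})
local observation = env:start()

-- Draws the current state in an image window (requires qlua and getDisplay)
env:render()
```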

## Development

Environments must inherit from `Env` and therefore implement the above methods (as well as a constructor). `experiment.lua` can be easily adapted for testing different environments. New environments should be added to `rlenvs/init.lua`, `rocks/rlenvs-scm-1.rockspec`, and be listed in this readme with an appropriate reference. For an example of a more complex environment that will only be installed if its optional dependencies are satisfied, see `rlenvs/Atari.lua`.
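
A minimal skeleton, modelled on the Acrobot changes in this diff, might look as follows (a sketch only: the environment, its dynamics and its spaces are hypothetical, and the exact `Env` base-class contract should be checked against `rlenvs/Env.lua`):

```lua
local classic = require 'classic'

-- Env is expected to be available, as in rlenvs/Acrobot.lua
local MyEnv, super = classic.class('MyEnv', Env)
MyEnv.timeStepLimit = 100

function MyEnv:_init(opts)
  opts = opts or {}
  opts.timeStepLimit = MyEnv.timeStepLimit
  super._init(self, opts)
  self.size = opts.size or 10 -- hypothetical environment parameter
end

function MyEnv:getStateSpace()
  return {name = 'Box', shape = {1}, low = {0}, high = {self.size}}
end

function MyEnv:getActionSpace()
  return {name = 'Discrete', n = 2} -- 0 = left, 1 = right
end

function MyEnv:getRewardSpace()
  return 0, 1
end

-- Note that start/step are implemented as _start/_step, as described above
function MyEnv:_start()
  self.position = 0
  return self.position
end

function MyEnv:_step(action)
  -- Move left or right along a line, clamped to [0, size]
  self.position = math.min(math.max(self.position + (action == 1 and 1 or -1), 0), self.size)
  local terminal = self.position == self.size
  local reward = terminal and 1 or 0
  return reward, self.position, terminal
end

return MyEnv
```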

## References

[1] Tanner, B., & White, A. (2009). RL-Glue: Language-independent software for reinforcement-learning experiments. *The Journal of Machine Learning Research, 10*, 2133-2136.
[2] DeJong, G., & Spong, M. W. (1994, June). Swinging up the acrobot: An example of intelligent control. In *American Control Conference, 1994* (Vol. 2, pp. 2158-2162). IEEE.
[3] Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2012). The arcade learning environment. *J. Artificial Intelligence Res, 47*, 253-279.
[4] Pérez-Uribe, A., & Sanchez, E. (1998, May). Blackjack as a test bed for learning strategies in neural networks. In *Neural Networks Proceedings, 1998. IEEE World Congress on Computational Intelligence. The 1998 IEEE International Joint Conference on* (Vol. 3, pp. 2022-2027). IEEE.
[5] Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. *Systems, Man and Cybernetics, IEEE Transactions on*, (5), 834-846.
[6] Mnih, V., Heess, N., & Graves, A. (2014). Recurrent models of visual attention. In *Advances in Neural Information Processing Systems* (pp. 2204-2212).
[7] Sutton, R. S., & Barto, A. G. (1998). *Reinforcement learning: An introduction* (Vol. 1, No. 1). Cambridge: MIT press.
[8] Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In *Proceedings of the seventh international conference on machine learning* (pp. 216-224).
[9] Boyan, J., & Moore, A. W. (1995). Generalization in reinforcement learning: Safely approximating the value function. *Advances in neural information processing systems*, 369-376.
[10] Johnson, M., Hofmann, K., Hutton, T., & Bignell, D. (2016). The Malmo platform for artificial intelligence experimentation. In *International joint conference on artificial intelligence (IJCAI)*.
[11] Singh, S. P., & Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. *Machine learning, 22*(1-3), 123-158.
[12] Robbins, H. (1985). Some aspects of the sequential design of experiments. In *Herbert Robbins Selected Papers* (pp. 169-177). Springer New York.
[13] Whittle, P. (1988). Restless bandits: Activity allocation in a changing world. *Journal of applied probability*, 287-298.
[14] Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. *Machine learning, 3*(1), 9-44.
[15] Dietterich, T. G. (2000). Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. In *Journal of Artificial Intelligence Research*.
[16] Garnelo, M., Arulkumaran, K., & Shanahan, M. (2016). Towards Deep Symbolic Reinforcement Learning. *arXiv preprint arXiv:1609.05518*.
[2] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W. (2016). OpenAI Gym. *arXiv preprint arXiv:1606.01540*.
[3] DeJong, G., & Spong, M. W. (1994, June). Swinging up the acrobot: An example of intelligent control. In *American Control Conference, 1994* (Vol. 2, pp. 2158-2162). IEEE.
[4] Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2012). The arcade learning environment. *Journal of Artificial Intelligence Research, 47*, 253-279.
[5] Pérez-Uribe, A., & Sanchez, E. (1998, May). Blackjack as a test bed for learning strategies in neural networks. In *Neural Networks Proceedings, 1998. IEEE World Congress on Computational Intelligence. The 1998 IEEE International Joint Conference on* (Vol. 3, pp. 2022-2027). IEEE.
[6] Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. *Systems, Man and Cybernetics, IEEE Transactions on*, (5), 834-846.
[7] Mnih, V., Heess, N., & Graves, A. (2014). Recurrent models of visual attention. In *Advances in Neural Information Processing Systems* (pp. 2204-2212).
[8] Sutton, R. S., & Barto, A. G. (1998). *Reinforcement learning: An introduction* (Vol. 1, No. 1). Cambridge: MIT press.
[9] Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In *Proceedings of the Seventh International Conference on Machine Learning* (pp. 216-224).
[10] Boyan, J., & Moore, A. W. (1995). Generalization in reinforcement learning: Safely approximating the value function. *Advances in Neural Information Processing Systems*, 369-376.
[11] Johnson, M., Hofmann, K., Hutton, T., & Bignell, D. (2016). The Malmo platform for artificial intelligence experimentation. In *International Joint Conference on Artificial Intelligence*.
[12] Singh, S. P., & Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. *Machine Learning, 22*(1-3), 123-158.
[13] Robbins, H. (1985). Some aspects of the sequential design of experiments. In *Herbert Robbins Selected Papers* (pp. 169-177). Springer New York.
[14] Whittle, P. (1988). Restless bandits: Activity allocation in a changing world. *Journal of Applied Probability*, 287-298.
[15] Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. *Machine Learning, 3*(1), 9-44.
[16] Dietterich, T. G. (2000). Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. *Journal of Artificial Intelligence Research*.
[17] Garnelo, M., Arulkumaran, K., & Shanahan, M. (2016). Towards Deep Symbolic Reinforcement Learning. In *Workshop on Deep Reinforcement Learning, NIPS 2016*.
41 changes: 17 additions & 24 deletions experiment.lua
@@ -1,38 +1,31 @@
local image = require 'image'
local Catch = require 'rlenvs/Catch'

-- Detect QT for image display
local qt = pcall(require, 'qt')
local Catch = require 'rlenvs.Catch'

-- Initialise and start environment
local env = Catch({level = 2})
local stateSpec = env:getStateSpec()
local actionSpec = env:getActionSpec()
local env = Catch({level = 2, render = true, zoom = 10})
local actionSpace = env:getActionSpace()
local observation = env:start()

local reward, terminal
local reward, terminal = 0, false
local episodes, totalReward = 0, 0
local nSteps = 1000 * (stateSpec[2][2] - 1) -- Run for 1000 episodes
local nEpisodes = 1000

-- Display
local window = qt and image.display({image=observation, zoom=10})
env:render()

for i = 1, nSteps do
-- Pick random action and execute it
local action = torch.random(actionSpec[3][1], actionSpec[3][2])
reward, observation, terminal = env:step(action)
totalReward = totalReward + reward
for i = 1, nEpisodes do
while not terminal do
-- Pick random action and execute it
local action = torch.random(0, actionSpace['n'] - 1)
reward, observation, terminal = env:step(action)
totalReward = totalReward + reward

-- Display
if qt then
image.display({image=observation, zoom=10, win=window})
-- Display
env:render()
end

-- If game finished, start again
if terminal then
episodes = episodes + 1
observation = env:start()
end
episodes = episodes + 1
observation = env:start()
terminal = false
end
print('Episodes: ' .. episodes)
print('Total Reward: ' .. totalReward)
65 changes: 40 additions & 25 deletions rlenvs/Acrobot.lua
@@ -1,11 +1,14 @@
local classic = require 'classic'

local Acrobot, super = classic.class('Acrobot', Env)
Acrobot.timeStepLimit = 500

-- Constructor
function Acrobot:_init(opts)
opts = opts or {}

opts.timeStepLimit = Acrobot.timeStepLimit
super._init(self, opts)

-- Constants
self.g = opts.g or 9.8
self.m1 = opts.m1 or 1 -- Mass of link 1
@@ -21,27 +24,40 @@ function Acrobot:_init(opts)
end

-- 4 states returned, of type 'real', of dimensionality 1, with differing ranges
function Acrobot:getStateSpec()
return {
{'real', 1, {-math.pi, math.pi}}, -- Joint 1 angle
{'real', 1, {-math.pi, math.pi}}, -- Joint 2 angle
{'real', 1, {-4*math.pi, 4*math.pi}}, -- Joint 1 angular velocity
{'real', 1, {-9*math.pi, 9*math.pi}} -- Joint 2 angular velocity
function Acrobot:getStateSpace()
local state = {}
state['name'] = 'Box'
state['shape'] = {4}
state['low'] = {
-math.pi, -- Joint 1 angle
-math.pi, -- Joint 2 angle
-4 * math.pi, -- Joint 1 angular velocity
-9 * math.pi -- Joint 2 angular velocity
}
state['high'] = {
math.pi, -- Joint 1 angle
math.pi, -- Joint 2 angle
4 * math.pi, -- Joint 1 angular velocity
9 * math.pi -- Joint 2 angular velocity
}
return state
end

-- 1 action required, of type 'int', of dimensionality 1, with second torque joint in {-1, 0, 1}
function Acrobot:getActionSpec()
return {'int', 1, {-1, 1}}
function Acrobot:getActionSpace()
local action = {}
action['name'] = 'Discrete'
action['n'] = 3
return action
end

-- Min and max reward
function Acrobot:getRewardSpec()
function Acrobot:getRewardSpace()
return -1, 0
end

-- Resets the cart
function Acrobot:start()
function Acrobot:_start()
-- Reset angles and velocities
self.q1 = 0 -- Joint 1 angle
self.q2 = 0 -- Joint 2 angle
@@ -52,20 +68,19 @@ function Acrobot:start()
end

-- Swings the pole via torque on second joint
function Acrobot:step(action)
function Acrobot:_step(action)
action = action - 1 -- rescale the action from {0, 1, 2} to a joint 2 torque in {-1, 0, 1}
local reward = -1
local terminal = false

for t = 1, self.steps do
-- Calculate motion of system
local d1 = self.m1*math.pow(self.lc1, 2) + self.m2*(math.pow(self.l1, 2) + math.pow(self.lc2, 2) + 2*self.l1*self.lc2*math.cos(self.q2)) + self.I1 + self.I2
local d2 = self.m2*(math.pow(self.lc2, 2) + self.l1*self.lc2*math.cos(self.q2)) + self.I2
local phi2 = self.m2*self.lc2*self.g*math.cos(self.q1 + self.q2 - math.pi/2)
local phi1 = -self.m2*self.l1*self.lc2*math.pow(self.q2Dot, 2)*math.sin(self.q2) - 2*self.m2*self.l1*self.lc2*self.q2Dot*self.q1Dot*math.sin(self.q2) +
(self.m1*self.lc1 + self.m2*self.l1)*self.g*math.cos(self.q1 - math.pi/2) + phi2
local q2DotDot = (action + d2/d1*phi1 - self.m2*self.l1*self.lc2*math.pow(self.q1Dot, 2)*math.sin(self.q2) - phi2) /
(self.m2*math.pow(self.lc2, 2) + self.I2 - math.pow(d2, 2)/d1)
local q1DotDot = -(d2/q2DotDot + phi1)/d1
local d1 = self.m1 * math.pow(self.lc1, 2) + self.m2 * (math.pow(self.l1, 2) + math.pow(self.lc2, 2) + 2 * self.l1 * self.lc2 * math.cos(self.q2)) + self.I1 + self.I2
local d2 = self.m2 * (math.pow(self.lc2, 2) + self.l1 * self.lc2 * math.cos(self.q2)) + self.I2
local phi2 = self.m2 * self.lc2 * self.g * math.cos(self.q1 + self.q2 - math.pi/2)
local phi1 = -self.m2 * self.l1 * self.lc2 * math.pow(self.q2Dot, 2) * math.sin(self.q2) - 2 * self.m2 * self.l1 * self.lc2 * self.q2Dot * self.q1Dot * math.sin(self.q2) + (self.m1 * self.lc1 + self.m2 * self.l1) * self.g * math.cos(self.q1 - math.pi / 2) + phi2
local q2DotDot = (action + d2 / d1 * phi1 - self.m2 * self.l1 * self.lc2 * math.pow(self.q1Dot, 2) * math.sin(self.q2) - phi2) / (self.m2 * math.pow(self.lc2, 2) + self.I2 - math.pow(d2, 2) / d1)
local q1DotDot = -(d2 / q2DotDot + phi1) / d1

-- Update state using Euler's method
self.q1Dot = self.q1Dot + self.tau * q1DotDot
@@ -86,13 +101,13 @@ function Acrobot:step(action)
self.q2 = math.pi - (self.q2 % -math.pi)
end
-- Limit velocities
self.q1Dot = math.max(self.q1Dot, -4*math.pi)
self.q1Dot = math.min(self.q1Dot, 4*math.pi)
self.q2Dot = math.max(self.q2Dot, -9*math.pi)
self.q2Dot = math.min(self.q2Dot, 9*math.pi)
self.q1Dot = math.max(self.q1Dot, -4 * math.pi)
self.q1Dot = math.min(self.q1Dot, 4 * math.pi)
self.q2Dot = math.max(self.q2Dot, -9 * math.pi)
self.q2Dot = math.min(self.q2Dot, 9 * math.pi)

-- Terminate if second joint's height is greater than height of first joint (relative to origin)
local h = -self.l1*math.cos(self.q1) - self.l2*math.sin(math.pi/2 - self.q1 - self.q2)
local h = -self.l1 * math.cos(self.q1) - self.l2 * math.sin(math.pi / 2 - self.q1 - self.q2)
if h > self.l1 then
reward = 0
terminal = true
30 changes: 23 additions & 7 deletions rlenvs/Atari.lua
@@ -6,11 +6,15 @@ if not hasALEWrap then
end

local Atari, super = classic.class('Atari', Env)
Atari.timeStepLimit = 100000

-- Constructor
function Atari:_init(opts)
-- Create ALEWrap options from opts
opts = opts or {}
opts.timeStepLimit = Atari.timeStepLimit
super._init(self, opts)

if opts.lifeLossTerminal == nil then
opts.lifeLossTerminal = true
end
@@ -44,13 +48,25 @@ function Atari:_init(opts)
end

-- 1 state returned, of type 'real', of dimensionality 3 x 210 x 160, between 0 and 1
function Atari:getStateSpec()
return {'real', {3, 210, 160}, {0, 1}}
function Atari:getStateSpace()
local state = {}
state['name'] = 'Box'
state['shape'] = {3, 210, 160}
state['low'] = {
0
}
state['high'] = {
1
}
return state
end

-- 1 action required, of type 'int', of dimensionality 1, between 1 and 18 (max)
function Atari:getActionSpec()
return {'int', 1, {1, #self.actions}}
function Atari:getActionSpace()
local action = {}
action['name'] = 'Discrete'
action['n'] = #self.actions
return action
end

-- RGB screen of height 210 and width 160
@@ -59,12 +75,12 @@ function Atari:getDisplaySpec()
end

-- Min and max reward (unknown)
function Atari:getRewardSpec()
function Atari:getRewardSpace()
return nil, nil
end

-- Starts a new game, possibly with a random number of no-ops
function Atari:start()
function Atari:_start()
local screen, reward, terminal

if self.gameEnv._random_starts > 0 then
@@ -77,7 +93,7 @@ end
end

-- Steps in a game
function Atari:step(action)
function Atari:_step(action)
-- Map action index to action for game
action = self.actions[action]
