It's not feasible to update the Q-based value agent in large steps for the RandomWalk1D() environment. #1068
Comments
The same happened to me. I think this is caused by how the algorithm handles terminating environments. Here is an example trace from this environment:

So you can see, the agent records an extra step after the terminal state with a dummy action. I don't know what the reason was to include these intermediate steps in the trace.
I see, I think that's because a DummySampler is used. The "0" actions are dummy actions pushed to the replay buffer to keep the traces in sync (you have more states than actions in an episode). These time steps should not be sampleable, as they are not meaningful. There should be an alternative to DummySampler that samples the whole buffer without the invalid time steps.
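As a stopgap, something along these lines could pull only the valid time steps out of the buffer (a sketch relying on the sampleable_inds field shown further down in this thread; valid_steps is a made-up helper, not part of ReinforcementLearningTrajectories):

# Collect only the time steps the EpisodesBuffer marks as sampleable,
# i.e. drop the dummy/padding steps that carry action == 0.
valid_steps(eb) = [eb[i] for i in 1:length(eb) if eb.sampleable_inds[i]]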
But since they are not sampleable, why are they included in the buffer at all?

Or are you saying that, implementation-wise, they have to be stored together with the valid steps?
Yes, they are stored in the same memory space, that's why.
@HenriDeh Isn't the issue here?

function RLBase.optimise!(learner::TDLearner, stage::AbstractStage, trajectory::Trajectory)
    for batch in trajectory.container
        optimise!(learner, stage, batch)
    end
end

e.g. if the unsampleable trajectory observations were not available in the container iteration, the learner would never see the invalid time steps.
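A minimal sketch of that idea, assuming the sampleable_inds field of the EpisodesBuffer shown further down in this thread (optimise_valid! is an illustrative name, not the package's API):

# Hypothetical variant: consult sampleable_inds so the dummy/padding time steps
# never reach the learner.
function optimise_valid!(learner::TDLearner, stage::AbstractStage, trajectory::Trajectory)
    eb = trajectory.container                 # EpisodesBuffer wrapping the traces
    for (i, step) in enumerate(eb)
        eb.sampleable_inds[i] || continue     # skip unsampleable steps (e.g. action == 0)
        optimise!(learner, stage, step)
    end
end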
To help keep track of things, here's the full stack trace for the above example:

julia> run(agentRW, envRW, StopAfterNEpisodes(10), TotalRewardPerEpisode())
ERROR: BoundsError: attempt to access 2×7 Matrix{Float64} at index [0, 1]
Stacktrace:
[1] getindex
@ ./essentials.jl:14 [inlined]
[2] maybeview
@ ./views.jl:149 [inlined]
[3] forward
@ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/tabular_approximator.jl:50 [inlined]
[4] Q
@ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:39 [inlined]
[5] bellman_update!
@ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:59 [inlined]
[6] _optimise!
@ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:75 [inlined]
[7] optimise!
@ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:82 [inlined]
[8] optimise!
@ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:92 [inlined]
[9] optimise!(learner::TDLearner{…}, stage::PostActStage, trajectory::Trajectory{…})
@ ReinforcementLearningCore /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:87
[10] optimise!
@ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/q_based_policy.jl:42 [inlined]
[11] optimise!
@ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/agent/agent_base.jl:35 [inlined]
[12] optimise!
@ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/agent/agent_base.jl:34 [inlined]
[13] macro expansion
@ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
[14] _run(policy::Agent{…}, env::RandomWalk1D, stop_condition::StopAfterNEpisodes{…}, hook::TotalRewardPerEpisode{…}, reset_condition::ResetIfEnvTerminated)
@ ReinforcementLearningCore /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/core/run.jl:61
[15] run
@ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/core/run.jl:30 [inlined]
[16] run(policy::Agent{…}, env::RandomWalk1D, stop_condition::StopAfterNEpisodes{…}, hook::TotalRewardPerEpisode{…})
@ ReinforcementLearningCore /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/core/run.jl:29
[17] top-level scope
@ REPL[16]:1
Some type information was truncated. Use `show(err)` to see complete types.
@HenriDeh (Just reread your RLTraj.jl issue and see you already proposed this solution.) 🙈
I experimented with this again, but I get the same error when using the following setup:

using ReinforcementLearning
using ReinforcementLearningTrajectories
env = RandomWalk1D()
policy = QBasedPolicy(
learner=TDLearner(
TabularQApproximator(
n_state=length(state_space(env)),
n_action=length(action_space(env)),
),
:SARS
),
explorer=EpsilonGreedyExplorer(0.1)
)
trajectory = Trajectory(
ElasticArraySARTSTraces(;
state=Int64 => (),
action=Int64 => (),
reward=Float64 => (),
terminal=Bool => (),
),
BatchSampler(5),
# DummySampler(),
InsertSampleRatioController(),
)
agent = Agent(
policy=policy,
trajectory=trajectory
)
run(agent, env, StopAfterNEpisodes(10), TotalRewardPerEpisode())

This produces the following error. Is there a working way to use this package right now?

ERROR: BoundsError: attempt to access 2×7 Matrix{Float64} at index [0, 1]
Stacktrace:
[1] getindex
@ ./essentials.jl:14 [inlined]
[2] maybeview
@ ./views.jl:149 [inlined]
[3] forward
@ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/tabular_approximator.jl:50 [inlined]
[4] Q
@ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:39 [inlined]
[5] bellman_update!
@ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:59 [inlined]
[6] _optimise!
@ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:75 [inlined]
[7] optimise!
@ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:82 [inlined]
[8] optimise!
@ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:92 [inlined]
[9] optimise!(learner::TDLearner{…}, stage::PostActStage, trajectory::Trajectory{…})
@ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:87
[10] optimise!
@ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/q_based_policy.jl:42 [inlined]
[11] optimise!
@ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/agent/agent_base.jl:35 [inlined]
[12] optimise!
@ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/agent/agent_base.jl:34 [inlined]
[13] macro expansion
@ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
[14] _run(policy::Agent{…}, env::RandomWalk1D, stop_condition::StopAfterNEpisodes{…}, hook::TotalRewardPerEpisode{…}, reset_condition::ResetIfEnvTerminated)
@ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/nftAp/src/core/run.jl:61
[15] run
@ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/core/run.jl:30 [inlined]
[16] run(policy::Agent{…}, env::RandomWalk1D, stop_condition::StopAfterNEpisodes{…}, hook::TotalRewardPerEpisode{…})
@ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/nftAp/src/core/run.jl:29
[17] top-level scope
@ ~/dev/jl/RLAlgorithms/scripts/investigate_traces.jl:34
Some type information was truncated. Use `show(err)` to see complete types.
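For what it's worth, the [0, 1] index in both traces comes from the dummy action: the Q-table here is a 2×7 matrix (2 actions × 7 states for RandomWalk1D), and the padding step stores action = 0, so the lookup reads row 0. A minimal standalone reproduction of just that access (the variable names are illustrative):

Q = zeros(2, 7)   # same shape as the TabularQApproximator's table above
a, s = 0, 1       # a dummy step carries action 0
Q[a, s]           # BoundsError: attempt to access 2×7 Matrix{Float64} at index [0, 1]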
Some more info. First, collect traces with a random policy:

agent = Agent(
policy=RandomPolicy(),
trajectory=trajectory
)
run(agent, env, StopAfterNEpisodes(2), TotalRewardPerEpisode())

julia> trajectory.container
EpisodesBuffer containing
Traces with 5 entries:
:state => 9-element RelativeTrace
:next_state => 9-element RelativeTrace
:action => 9-elements Trace{ElasticArrays.ElasticVector{Int64, Vector{Int64}}}
:reward => 9-elements Trace{ElasticArrays.ElasticVector{Float64, Vector{Float64}}}
:terminal => 9-elements Trace{ElasticArrays.ElasticVector{Bool, Vector{Bool}}}
julia> l = length(trajectory.container)
9
julia> traces = trajectory.container[1:l]
(state = [4, 5, 6, 7, 4, 3, 4, 3, 2], next_state = [5, 6, 7, 4, 3, 4, 3, 2, 1], action = [2, 2, 2, 0, 1, 2, 1, 1, 1], reward = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0], terminal = Bool[0, 0, 1, 0, 0, 0, 0, 0, 1])
julia> sampl = trajectory.container.sampleable_inds[1:l]
9-element BitVector:
1
1
1
0
1
1
1
1
1
julia> hcat(traces.state, traces.terminal, traces.action, traces.next_state, sampl)
9×5 ElasticArrays.ElasticMatrix{Int64, Vector{Int64}}:
4 0 2 5 1
5 0 2 6 1
6 1 2 7 1
7 0 0 4 0
4 0 1 3 1
3 0 2 4 1
4 0 1 3 1
3 0 1 2 1
2 1 1 1 1

Iterating over the container vs. iterating over the trajectory:

julia> @which iterate(trajectory.container)
iterate(A::AbstractArray)
@ Base abstractarray.jl:1214
julia> for data in trajectory.container
@show data
end
data = (state = 4, next_state = 5, action = 2, reward = 0.0, terminal = false)
data = (state = 5, next_state = 6, action = 2, reward = 0.0, terminal = false)
data = (state = 6, next_state = 7, action = 2, reward = 1.0, terminal = true)
data = (state = 7, next_state = 4, action = 0, reward = 0.0, terminal = false)
data = (state = 4, next_state = 3, action = 1, reward = 0.0, terminal = false)
data = (state = 3, next_state = 4, action = 2, reward = 0.0, terminal = false)
data = (state = 4, next_state = 3, action = 1, reward = 0.0, terminal = false)
data = (state = 3, next_state = 2, action = 1, reward = 0.0, terminal = false)
data = (state = 2, next_state = 1, action = 1, reward = -1.0, terminal = true)
julia> @which iterate(trajectory)
iterate(t::Trajectory, args...)
@ ReinforcementLearningTrajectories ~/dev/jl/RLAlgorithms/dev/ReinforcementLearningTrajectories/src/trajectory.jl:132
julia> for batch in trajectory
@show batch
end
batch = (state = [4, 5, 5], next_state = [3, 6, 6], action = [1, 2, 2], reward = [0.0, 0.0, 0.0], terminal = Bool[0, 0, 0])
batch = (state = [2, 4, 6], next_state = [1, 3, 7], action = [1, 1, 2], reward = [-1.0, 0.0, 1.0], terminal = Bool[1, 0, 1])
batch = (state = [2, 4, 4], next_state = [1, 3, 3], action = [1, 1, 1], reward = [-1.0, 0.0, 0.0], terminal = Bool[1, 0, 0])
batch = (state = [3, 2, 3], next_state = [4, 1, 2], action = [2, 1, 1], reward = [0.0, -1.0, 0.0], terminal = Bool[0, 1, 0])
batch = (state = [4, 2, 3], next_state = [3, 1, 4], action = [1, 1, 2], reward = [0.0, -1.0, 0.0], terminal = Bool[0, 1, 0])
batch = (state = [5, 2, 5], next_state = [6, 1, 6], action = [2, 1, 2], reward = [0.0, -1.0, 0.0], terminal = Bool[0, 1, 0])
batch = (state = [4, 4, 2], next_state = [3, 3, 1], action = [1, 1, 1], reward = [0.0, 0.0, -1.0], terminal = Bool[0, 0, 1])
batch = (state = [4, 3, 6], next_state = [5, 4, 7], action = [2, 2, 2], reward = [0.0, 0.0, 1.0], terminal = Bool[0, 0, 1])
batch = (state = [3, 2, 4], next_state = [2, 1, 3], action = [1, 1, 1], reward = [0.0, -1.0, 0.0], terminal = Bool[0, 1, 0])
batch = (state = [4, 4, 3], next_state = [3, 3, 4], action = [1, 1, 2], reward = [0.0, 0.0, 0.0], terminal = Bool[0, 0, 0])

So when iterating over trajectory.container, every stored step is returned, including the unsampleable dummy step with action = 0, whereas iterating over the trajectory itself goes through the BatchSampler and only returns valid steps.

So does that mean that

function RLBase.optimise!(learner::TDLearner, stage::AbstractStage, trajectory::Trajectory)

needs to iterate over the trajectory instead of trajectory.container?

Apart from that, it seems a bit odd to me that iterating over a Trajectory yields randomly sampled batches rather than the stored transitions. It also seems inconsistent to me that iterating over the container and iterating over the trajectory behave so differently.
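If that reading is right, the change would look roughly like this (a sketch, not a tested fix; whether bellman_update! can handle whole sampled batches is a separate question):

# Iterate over the Trajectory itself so the sampler decides which steps the
# learner sees, instead of walking the raw container that still holds dummy steps.
function RLBase.optimise!(learner::TDLearner, stage::AbstractStage, trajectory::Trajectory)
    for batch in trajectory
        optimise!(learner, stage, batch)
    end
end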
I followed the RandomWalk1D() example in the tutorial and wanted to update the agent, but the run function returns

BoundsError: attempt to access 2×7 Matrix{Float64} at index [0, 1]

if I use the TDLearner. The code works if I stop the simulation early, i.e., if I specify StopAfterNSteps(3). It also works for RandomPolicy().
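For reference, a sketch of the two variants reported as working, assuming the agent/env setup shown in the comments above:

# Reported to work: stop the simulation early
run(agent, env, StopAfterNSteps(3), TotalRewardPerEpisode())

# Reported to work: drive the environment with a RandomPolicy instead of the
# TDLearner-based QBasedPolicy
random_agent = Agent(policy=RandomPolicy(), trajectory=trajectory)
run(random_agent, env, StopAfterNEpisodes(10), TotalRewardPerEpisode())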