It's not feasible to update the Q-based value agent for more than a few steps in the RandomWalk1D() environment. #1068

Open
Van314159 opened this issue Apr 16, 2024 · 10 comments

Comments

@Van314159

Van314159 commented Apr 16, 2024

I followed the RandomWalk1D() example in the tutorial and wanted to update the agent, but the run function returns BoundsError: attempt to access 2×7 Matrix{Float64} at index [0, 1] when I use the TDLearner. My code is

using ReinforcementLearning
using ReinforcementLearningTrajectories

envRW = RandomWalk1D()
NS = length(state_space(envRW))
NA = length(action_space(envRW))
agentRW = Agent(
    policy = QBasedPolicy(
        learner = TDLearner(
            TabularQApproximator(
                n_state = NS,
                n_action = NA,
            ),
            :SARS
        ),
        explorer = EpsilonGreedyExplorer(0.1)
    ),
    trajectory = Trajectory(
        ElasticArraySARTSTraces(;
            state = Int64 => (),
            action = Int64 => (),
            reward = Float64 => (),
            terminal = Bool => (),
        ),
        DummySampler(),
        InsertSampleRatioController(),
    )
)

run(agentRW, envRW, StopAfterNEpisodes(10), TotalRewardPerEpisode())

It returns

BoundsError: attempt to access 2×7 Matrix{Float64} at index [0, 1]

The above code works if I stop the simulation early, e.g., by specifying StopAfterNSteps(3).
It also works with RandomPolicy().

@johannes-fischer
Contributor

The same happened to me; I think this is caused by how the algorithm handles a terminating environment. Here is an example trace from this environment:

:state, :action, :terminal, :next_state
 4  1  0  3
 3  1  0  2
 2  2  0  3
 3  1  0  2
 2  1  1  1
 1  0  0  4
 4  2  0  5
 5  2  0  6
 6  2  1  7
 7  0  0  4
 4  1  0  3

As you can see, the agent marks a step as :terminal if its :next_state is a terminal state (1 or 7 in this env). After such a terminal step, there is another step that has the actual terminal state as :state and the new initial state of the next episode as :next_state. This weird intermediate step has :action=0, which is not a valid action in this env and of course cannot be used to index the Q-table.

I don't know why these intermediate steps with :action=0 are included in the trace, but they need to be removed somehow for learning.
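
For illustration, a minimal way to drop them from a collected trace, assuming the NamedTuple-of-vectors layout that trajectory.container[1:l] returns (shown further down in this thread); this is a workaround sketch, not a library fix:

l = length(trajectory.container)
traces = trajectory.container[1:l]       # NamedTuple of trace vectors
keep = traces.action .!= 0               # mask out the dummy steps with action 0
clean = map(col -> col[keep], traces)    # same NamedTuple, invalid steps removed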

@HenriDeh
Member

I see, I think that's because a DummySampler is used. The "0" actions are dummy actions pushed to the replay buffer to keep the traces in sync (you have more states than actions in an episode). These time steps should not be sampleable as they are not meaningful. There should be an alternative to DummySampler that samples the whole buffer without the invalid time steps.
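
A minimal sketch of what such an alternative could look like, assuming the EpisodesBuffer exposes the sampleable_inds mask that is demonstrated further down in this thread; sample_valid_steps is a hypothetical helper name, not part of ReinforcementLearningTrajectories:

function sample_valid_steps(trajectory)
    l = length(trajectory.container)
    mask = trajectory.container.sampleable_inds[1:l]   # false on the dummy time steps
    traces = trajectory.container[1:l]                  # NamedTuple of trace vectors
    return map(col -> col[mask], traces)                # whole buffer minus invalid steps
end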

@johannes-fischer
Contributor

But since :next_state is part of the trace, why are those intermediate time steps necessary to keep the traces in sync? One could drop either these time steps or the :next_state trace without losing any information.

@johannes-fischer
Contributor

Or are you saying that, implementation-wise, state and next_state are views onto the same memory and hence these steps cannot be dropped?

@HenriDeh
Member

Yes, they are stored in the same memory space; that's why.

@jeremiahpslewis
Member

jeremiahpslewis commented Apr 17, 2024

@HenriDeh Isn't the issue here?

function RLBase.optimise!(learner::TDLearner, stage::AbstractStage, trajectory::Trajectory)
    for batch in trajectory.container
        optimise!(learner, stage, batch)
    end
end

e.g. if unsampleable trajectory observations were not available in the iterate method of the trajectory, things should just work?
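
A hedged sketch of that idea inside the loop above, relying on the EpisodesBuffer's sampleable_inds mask (demonstrated further down in this thread); a workaround sketch, not the library's implementation:

valid_steps = (step for (i, step) in enumerate(trajectory.container)
               if trajectory.container.sampleable_inds[i])
for step in valid_steps
    optimise!(learner, stage, step)   # same call as above, but the dummy steps never appear
end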

@jeremiahpslewis
Member

To help keep track of things, here's the full stack trace for the above example:

julia> run(agentRW, envRW, StopAfterNEpisodes(10), TotalRewardPerEpisode())
ERROR: BoundsError: attempt to access 2×7 Matrix{Float64} at index [0, 1]
Stacktrace:
  [1] getindex
    @ ./essentials.jl:14 [inlined]
  [2] maybeview
    @ ./views.jl:149 [inlined]
  [3] forward
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/tabular_approximator.jl:50 [inlined]
  [4] Q
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:39 [inlined]
  [5] bellman_update!
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:59 [inlined]
  [6] _optimise!
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:75 [inlined]
  [7] optimise!
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:82 [inlined]
  [8] optimise!
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:92 [inlined]
  [9] optimise!(learner::TDLearner{…}, stage::PostActStage, trajectory::Trajectory{…})
    @ ReinforcementLearningCore /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:87
 [10] optimise!
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/q_based_policy.jl:42 [inlined]
 [11] optimise!
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/agent/agent_base.jl:35 [inlined]
 [12] optimise!
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/agent/agent_base.jl:34 [inlined]
 [13] macro expansion
    @ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
 [14] _run(policy::Agent{…}, env::RandomWalk1D, stop_condition::StopAfterNEpisodes{…}, hook::TotalRewardPerEpisode{…}, reset_condition::ResetIfEnvTerminated)
    @ ReinforcementLearningCore /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/core/run.jl:61
 [15] run
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/core/run.jl:30 [inlined]
 [16] run(policy::Agent{…}, env::RandomWalk1D, stop_condition::StopAfterNEpisodes{…}, hook::TotalRewardPerEpisode{…})
    @ ReinforcementLearningCore /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/core/run.jl:29
 [17] top-level scope
    @ REPL[16]:1
Some type information was truncated. Use `show(err)` to see complete types.

@jeremiahpslewis
Member

@HenriDeh (I just reread your RLTraj.jl issue and see you already proposed this solution.) 🙈

@johannes-fischer
Contributor

johannes-fischer commented Jul 22, 2024

I see, I think that's because a DummySampler is used.

I experimented with this again, but I get the same error when using BatchSampler:

using ReinforcementLearning
using ReinforcementLearningTrajectories
env = RandomWalk1D()
policy = QBasedPolicy(
    learner=TDLearner(
        TabularQApproximator(
            n_state=length(state_space(env)),
            n_action=length(action_space(env)),
        ),
        :SARS
    ),
    explorer=EpsilonGreedyExplorer(0.1)
)
trajectory = Trajectory(
    ElasticArraySARTSTraces(;
        state=Int64 => (),
        action=Int64 => (),
        reward=Float64 => (),
        terminal=Bool => (),
    ),
    BatchSampler(5),
    # DummySampler(),
    InsertSampleRatioController(),
)
agent = Agent(
    policy=policy,
    trajectory=trajectory
)
run(agent, env, StopAfterNEpisodes(10), TotalRewardPerEpisode())

This produces the following error. Is there a working way to use this package right now?

ERROR: BoundsError: attempt to access 2×7 Matrix{Float64} at index [0, 1]
Stacktrace:
  [1] getindex
    @ ./essentials.jl:14 [inlined]
  [2] maybeview
    @ ./views.jl:149 [inlined]
  [3] forward
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/tabular_approximator.jl:50 [inlined]
  [4] Q
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:39 [inlined]
  [5] bellman_update!
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:59 [inlined]
  [6] _optimise!
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:75 [inlined]
  [7] optimise!
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:82 [inlined]
  [8] optimise!
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:92 [inlined]
  [9] optimise!(learner::TDLearner{…}, stage::PostActStage, trajectory::Trajectory{…})
    @ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:87
 [10] optimise!
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/q_based_policy.jl:42 [inlined]
 [11] optimise!
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/agent/agent_base.jl:35 [inlined]
 [12] optimise!
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/agent/agent_base.jl:34 [inlined]
 [13] macro expansion
    @ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
 [14] _run(policy::Agent{…}, env::RandomWalk1D, stop_condition::StopAfterNEpisodes{…}, hook::TotalRewardPerEpisode{…}, reset_condition::ResetIfEnvTerminated)
    @ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/nftAp/src/core/run.jl:61
 [15] run
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/core/run.jl:30 [inlined]
 [16] run(policy::Agent{…}, env::RandomWalk1D, stop_condition::StopAfterNEpisodes{…}, hook::TotalRewardPerEpisode{…})
    @ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/nftAp/src/core/run.jl:29
 [17] top-level scope
    @ ~/dev/jl/RLAlgorithms/scripts/investigate_traces.jl:34
Some type information was truncated. Use `show(err)` to see complete types.

@johannes-fischer
Contributor

johannes-fischer commented Jul 22, 2024

Some more info: first, collect traces with a random policy:

agent = Agent(
    policy=RandomPolicy(),
    trajectory=trajectory
)
run(agent, env, StopAfterNEpisodes(2), TotalRewardPerEpisode())
julia> trajectory.container
EpisodesBuffer containing
Traces with 5 entries:
  :state => 9-element RelativeTrace
  :next_state => 9-element RelativeTrace
  :action => 9-elements Trace{ElasticArrays.ElasticVector{Int64, Vector{Int64}}}
  :reward => 9-elements Trace{ElasticArrays.ElasticVector{Float64, Vector{Float64}}}
  :terminal => 9-elements Trace{ElasticArrays.ElasticVector{Bool, Vector{Bool}}}
julia> l = length(trajectory.container)
9

julia> traces = trajectory.container[1:l]
(state = [4, 5, 6, 7, 4, 3, 4, 3, 2], next_state = [5, 6, 7, 4, 3, 4, 3, 2, 1], action = [2, 2, 2, 0, 1, 2, 1, 1, 1], reward = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0], terminal = Bool[0, 0, 1, 0, 0, 0, 0, 0, 1])

julia> sampl = trajectory.container.sampleable_inds[1:l]
9-element BitVector:
 1
 1
 1
 0
 1
 1
 1
 1
 1

julia> hcat(traces.state, traces.terminal, traces.action, traces.next_state, sampl)
9×5 ElasticArrays.ElasticMatrix{Int64, Vector{Int64}}:
 4  0  2  5  1
 5  0  2  6  1
 6  1  2  7  1
 7  0  0  4  0
 4  0  1  3  1
 3  0  2  4  1
 4  0  1  3  1
 3  0  1  2  1
 2  1  1  1  1

Iterating over container vs iterating over trajectory:

julia> @which iterate(trajectory.container)
iterate(A::AbstractArray)
     @ Base abstractarray.jl:1214

julia> for data in trajectory.container
           @show data
       end
data = (state = 4, next_state = 5, action = 2, reward = 0.0, terminal = false)
data = (state = 5, next_state = 6, action = 2, reward = 0.0, terminal = false)
data = (state = 6, next_state = 7, action = 2, reward = 1.0, terminal = true)
data = (state = 7, next_state = 4, action = 0, reward = 0.0, terminal = false)
data = (state = 4, next_state = 3, action = 1, reward = 0.0, terminal = false)
data = (state = 3, next_state = 4, action = 2, reward = 0.0, terminal = false)
data = (state = 4, next_state = 3, action = 1, reward = 0.0, terminal = false)
data = (state = 3, next_state = 2, action = 1, reward = 0.0, terminal = false)
data = (state = 2, next_state = 1, action = 1, reward = -1.0, terminal = true)

julia> @which iterate(trajectory)
iterate(t::Trajectory, args...)
     @ ReinforcementLearningTrajectories ~/dev/jl/RLAlgorithms/dev/ReinforcementLearningTrajectories/src/trajectory.jl:132

julia> for batch in trajectory
           @show batch
       end
batch = (state = [4, 5, 5], next_state = [3, 6, 6], action = [1, 2, 2], reward = [0.0, 0.0, 0.0], terminal = Bool[0, 0, 0])
batch = (state = [2, 4, 6], next_state = [1, 3, 7], action = [1, 1, 2], reward = [-1.0, 0.0, 1.0], terminal = Bool[1, 0, 1])
batch = (state = [2, 4, 4], next_state = [1, 3, 3], action = [1, 1, 1], reward = [-1.0, 0.0, 0.0], terminal = Bool[1, 0, 0])
batch = (state = [3, 2, 3], next_state = [4, 1, 2], action = [2, 1, 1], reward = [0.0, -1.0, 0.0], terminal = Bool[0, 1, 0])
batch = (state = [4, 2, 3], next_state = [3, 1, 4], action = [1, 1, 2], reward = [0.0, -1.0, 0.0], terminal = Bool[0, 1, 0])
batch = (state = [5, 2, 5], next_state = [6, 1, 6], action = [2, 1, 2], reward = [0.0, -1.0, 0.0], terminal = Bool[0, 1, 0])
batch = (state = [4, 4, 2], next_state = [3, 3, 1], action = [1, 1, 1], reward = [0.0, 0.0, -1.0], terminal = Bool[0, 0, 1])
batch = (state = [4, 3, 6], next_state = [5, 4, 7], action = [2, 2, 2], reward = [0.0, 0.0, 1.0], terminal = Bool[0, 0, 1])
batch = (state = [3, 2, 4], next_state = [2, 1, 3], action = [1, 1, 1], reward = [0.0, -1.0, 0.0], terminal = Bool[0, 1, 0])
batch = (state = [4, 4, 3], next_state = [3, 3, 4], action = [1, 1, 2], reward = [0.0, 0.0, 0.0], terminal = Bool[0, 0, 0])

So when iterating over trajectory.container, the dummy action 0 is part of it, whereas when iterating over the trajectory object itself, action 0 is never sampled (I also tried this with a larger buffer).

So does that mean that

function RLBase.optimise!(learner::TDLearner, stage::AbstractStage, trajectory::Trajectory)

needs to iterate over trajectory instead of over trajectory.container, as you hinted above, @jeremiahpslewis?
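
A minimal sketch of that change, assuming the learner's optimise! can consume the batches produced by the sampler (this is the idea under discussion, not a merged fix):

function RLBase.optimise!(learner::TDLearner, stage::AbstractStage, trajectory::Trajectory)
    for batch in trajectory          # was: trajectory.container
        optimise!(learner, stage, batch)
    end
end

Iterating the Trajectory goes through the sampler, so the dummy steps with action 0 would never reach the learner.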


Apart from that, it seems a bit odd to me that iterating over a trajectory with container length 9 and BatchSampler(3) produces 10 batches of 3 samples each, totalling 30 examples (with repetitions). I would have expected it to produce N disjoint batches that together cover the sampleable data without repetitions. But I have not fully understood how the sampler and controller of the trajectory work yet; maybe this behavior can be adjusted through them?
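
For illustration, disjoint batching over the sampleable steps would look roughly like this (built on the sampleable_inds mask from above; not a claim about how BatchSampler is supposed to behave):

l = length(trajectory.container)
valid = findall(trajectory.container.sampleable_inds[1:l])    # indices of the valid steps
for idxs in Iterators.partition(valid, 3)                     # disjoint batches of up to 3
    batch = map(col -> col[idxs], trajectory.container[1:l])  # each valid step used once
end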


It also seems inconsistent to me that length(trajectory.container) == 9. The container contains 8 sampleable states and 10 actual states. For some reason the last dummy transition with action 0 is not considered part of the trace, but the other dummy actions are (length(trajectory.container.sampleable_inds) == 10).
