It's not feasible to update the Q-based value agent for more than a few steps in the RandomWalk1D() environment. #1068

Open
Van314159 opened this issue Apr 16, 2024 · 10 comments

Comments

@Van314159

Van314159 commented Apr 16, 2024

I followed the RandomWalk1D() example in the tutorial and wanted to update the agent, but the run function returns BoundsError: attempt to access 2×7 Matrix{Float64} at index [0, 1] when I use the TDLearner. My code is

using ReinforcementLearning
using ReinforcementLearningTrajectories

envRW = RandomWalk1D()
NS = length(state_space(envRW))
NA = length(action_space(envRW))
agentRW = Agent(
    policy = QBasedPolicy(
        learner = TDLearner(
            TabularQApproximator(
                n_state = NS,
                n_action = NA,
            ),
            :SARS
        ),
        explorer = EpsilonGreedyExplorer(0.1)
    ),
    trajectory = Trajectory(
        ElasticArraySARTSTraces(;
            state = Int64 => (),
            action = Int64 => (),
            reward = Float64 => (),
            terminal = Bool => (),
        ),
        DummySampler(),
        InsertSampleRatioController(),
    )
)

run(agentRW, envRW, StopAfterNEpisodes(10), TotalRewardPerEpisode())

It returns

BoundsError: attempt to access 2×7 Matrix{Float64} at index [0, 1]

The above code works if I stop the simulation early, e.g., by specifying StopAfterNSteps(3).
It also works with RandomPolicy().

@johannes-fischer
Contributor

The same happened to me; I think this is caused by how the algorithm handles a terminating environment. Here is an example trace from this environment:

:state, :action, :terminal, :next_state
 4  1  0  3
 3  1  0  2
 2  2  0  3
 3  1  0  2
 2  1  1  1
 1  0  0  4
 4  2  0  5
 5  2  0  6
 6  2  1  7
 7  0  0  4
 4  1  0  3

As you can see, the agent marks a step as :terminal if its :next_state is a terminal state (1 or 7 in this env). After such a terminal step, there is another step that has the actual terminal state as :state and the new initial state of the next episode as :next_state. This weird intermediate step has :action=0, which is not a valid action in this env and of course cannot be used to index the Q-table.

I don't know why these intermediate steps with :action=0 are included in the trace, but they need to be removed somehow for learning.
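
For illustration, a minimal way to drop them from a collected trace, assuming the NamedTuple-of-vectors layout that trajectory.container[1:l] returns (shown further down in this thread); this is a workaround sketch, not a library fix:

l = length(trajectory.container)
traces = trajectory.container[1:l]       # NamedTuple of trace vectors
keep = traces.action .!= 0               # mask out the dummy steps with action 0
clean = map(col -> col[keep], traces)    # same NamedTuple, invalid steps removed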

@HenriDeh
Member

I see, I think that's because a DummySampler is used. The "0" actions are dummy actions pushed to the replay buffer to keep the traces in sync (you have more states than actions in an episode). These time steps should not be sampleable as they are not meaningful. There should be an alternative to DummySampler that samples the whole buffer without the invalid time steps.
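
A minimal sketch of what such an alternative could look like, assuming the EpisodesBuffer exposes the sampleable_inds mask that is demonstrated further down in this thread; sample_valid_steps is a hypothetical helper name, not part of ReinforcementLearningTrajectories:

function sample_valid_steps(trajectory)
    l = length(trajectory.container)
    mask = trajectory.container.sampleable_inds[1:l]   # false on the dummy time steps
    traces = trajectory.container[1:l]                  # NamedTuple of trace vectors
    return map(col -> col[mask], traces)                # whole buffer minus invalid steps
end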

@johannes-fischer
Contributor

But since :next_state is part of the trace, why are those intermediate time steps necessary to keep the traces in sync? One could drop either these time steps or the :next_state trace without losing any information.

@johannes-fischer
Contributor

Or are you saying that, implementation-wise, state and next_state are views onto the same memory and hence these steps cannot be dropped?

@HenriDeh
Member

Yes, they are stored in the same memory space; that's why.

@jeremiahpslewis
Member

jeremiahpslewis commented Apr 17, 2024

@HenriDeh Isn't the issue here?

function RLBase.optimise!(learner::TDLearner, stage::AbstractStage, trajectory::Trajectory)
    for batch in trajectory.container
        optimise!(learner, stage, batch)
    end
end

e.g. if unsampleable trajectory observations were not available in the iterate method of the trajectory, things should just work?
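
A hedged sketch of that idea inside the loop above, relying on the EpisodesBuffer's sampleable_inds mask (demonstrated further down in this thread); a workaround sketch, not the library's implementation:

valid_steps = (step for (i, step) in enumerate(trajectory.container)
               if trajectory.container.sampleable_inds[i])
for step in valid_steps
    optimise!(learner, stage, step)   # same call as above, but the dummy steps never appear
end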

@jeremiahpslewis
Member

To help keep track of things, here's the full stack trace for the above example:

julia> run(agentRW, envRW, StopAfterNEpisodes(10), TotalRewardPerEpisode())
ERROR: BoundsError: attempt to access 2×7 Matrix{Float64} at index [0, 1]
Stacktrace:
  [1] getindex
    @ ./essentials.jl:14 [inlined]
  [2] maybeview
    @ ./views.jl:149 [inlined]
  [3] forward
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/tabular_approximator.jl:50 [inlined]
  [4] Q
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:39 [inlined]
  [5] bellman_update!
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:59 [inlined]
  [6] _optimise!
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:75 [inlined]
  [7] optimise!
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:82 [inlined]
  [8] optimise!
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:92 [inlined]
  [9] optimise!(learner::TDLearner{…}, stage::PostActStage, trajectory::Trajectory{…})
    @ ReinforcementLearningCore /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:87
 [10] optimise!
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/q_based_policy.jl:42 [inlined]
 [11] optimise!
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/agent/agent_base.jl:35 [inlined]
 [12] optimise!
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/agent/agent_base.jl:34 [inlined]
 [13] macro expansion
    @ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
 [14] _run(policy::Agent{…}, env::RandomWalk1D, stop_condition::StopAfterNEpisodes{…}, hook::TotalRewardPerEpisode{…}, reset_condition::ResetIfEnvTerminated)
    @ ReinforcementLearningCore /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/core/run.jl:61
 [15] run
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/core/run.jl:30 [inlined]
 [16] run(policy::Agent{…}, env::RandomWalk1D, stop_condition::StopAfterNEpisodes{…}, hook::TotalRewardPerEpisode{…})
    @ ReinforcementLearningCore /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/core/run.jl:29
 [17] top-level scope
    @ REPL[16]:1
Some type information was truncated. Use `show(err)` to see complete types.

@jeremiahpslewis
Member

@HenriDeh (I just reread your RLTraj.jl issue and see you already proposed this solution.) 🙈

@johannes-fischer
Contributor

johannes-fischer commented Jul 22, 2024

I see, I think that's because a DummySampler is used.

I experimented with this again, but I get the same error when using BatchSampler:

using ReinforcementLearning
using ReinforcementLearningTrajectories
env = RandomWalk1D()
policy = QBasedPolicy(
    learner=TDLearner(
        TabularQApproximator(
            n_state=length(state_space(env)),
            n_action=length(action_space(env)),
        ),
        :SARS
    ),
    explorer=EpsilonGreedyExplorer(0.1)
)
trajectory = Trajectory(
    ElasticArraySARTSTraces(;
        state=Int64 => (),
        action=Int64 => (),
        reward=Float64 => (),
        terminal=Bool => (),
    ),
    BatchSampler(5),
    # DummySampler(),
    InsertSampleRatioController(),
)
agent = Agent(
    policy=policy,
    trajectory=trajectory
)
run(agent, env, StopAfterNEpisodes(10), TotalRewardPerEpisode())

This produces the following error. Is there a working way to use this package right now?

ERROR: BoundsError: attempt to access 2×7 Matrix{Float64} at index [0, 1]
Stacktrace:
  [1] getindex
    @ ./essentials.jl:14 [inlined]
  [2] maybeview
    @ ./views.jl:149 [inlined]
  [3] forward
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/tabular_approximator.jl:50 [inlined]
  [4] Q
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:39 [inlined]
  [5] bellman_update!
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:59 [inlined]
  [6] _optimise!
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:75 [inlined]
  [7] optimise!
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:82 [inlined]
  [8] optimise!
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:92 [inlined]
  [9] optimise!(learner::TDLearner{…}, stage::PostActStage, trajectory::Trajectory{…})
    @ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:87
 [10] optimise!
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/q_based_policy.jl:42 [inlined]
 [11] optimise!
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/agent/agent_base.jl:35 [inlined]
 [12] optimise!
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/agent/agent_base.jl:34 [inlined]
 [13] macro expansion
    @ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
 [14] _run(policy::Agent{…}, env::RandomWalk1D, stop_condition::StopAfterNEpisodes{…}, hook::TotalRewardPerEpisode{…}, reset_condition::ResetIfEnvTerminated)
    @ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/nftAp/src/core/run.jl:61
 [15] run
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/core/run.jl:30 [inlined]
 [16] run(policy::Agent{…}, env::RandomWalk1D, stop_condition::StopAfterNEpisodes{…}, hook::TotalRewardPerEpisode{…})
    @ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/nftAp/src/core/run.jl:29
 [17] top-level scope
    @ ~/dev/jl/RLAlgorithms/scripts/investigate_traces.jl:34
Some type information was truncated. Use `show(err)` to see complete types.

@johannes-fischer
Contributor

johannes-fischer commented Jul 22, 2024

Some more info: first, collect traces with a random policy:

agent = Agent(
    policy=RandomPolicy(),
    trajectory=trajectory
)
run(agent, env, StopAfterNEpisodes(2), TotalRewardPerEpisode())
julia> trajectory.container
EpisodesBuffer containing
Traces with 5 entries:
  :state => 9-element RelativeTrace
  :next_state => 9-element RelativeTrace
  :action => 9-elements Trace{ElasticArrays.ElasticVector{Int64, Vector{Int64}}}
  :reward => 9-elements Trace{ElasticArrays.ElasticVector{Float64, Vector{Float64}}}
  :terminal => 9-elements Trace{ElasticArrays.ElasticVector{Bool, Vector{Bool}}}
julia> l = length(trajectory.container)
9

julia> traces = trajectory.container[1:l]
(state = [4, 5, 6, 7, 4, 3, 4, 3, 2], next_state = [5, 6, 7, 4, 3, 4, 3, 2, 1], action = [2, 2, 2, 0, 1, 2, 1, 1, 1], reward = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0], terminal = Bool[0, 0, 1, 0, 0, 0, 0, 0, 1])

julia> sampl = trajectory.container.sampleable_inds[1:l]
9-element BitVector:
 1
 1
 1
 0
 1
 1
 1
 1
 1

julia> hcat(traces.state, traces.terminal, traces.action, traces.next_state, sampl)
9×5 ElasticArrays.ElasticMatrix{Int64, Vector{Int64}}:
 4  0  2  5  1
 5  0  2  6  1
 6  1  2  7  1
 7  0  0  4  0
 4  0  1  3  1
 3  0  2  4  1
 4  0  1  3  1
 3  0  1  2  1
 2  1  1  1  1

Iterating over container vs iterating over trajectory:

julia> @which iterate(trajectory.container)
iterate(A::AbstractArray)
     @ Base abstractarray.jl:1214

julia> for data in trajectory.container
           @show data
       end
data = (state = 4, next_state = 5, action = 2, reward = 0.0, terminal = false)
data = (state = 5, next_state = 6, action = 2, reward = 0.0, terminal = false)
data = (state = 6, next_state = 7, action = 2, reward = 1.0, terminal = true)
data = (state = 7, next_state = 4, action = 0, reward = 0.0, terminal = false)
data = (state = 4, next_state = 3, action = 1, reward = 0.0, terminal = false)
data = (state = 3, next_state = 4, action = 2, reward = 0.0, terminal = false)
data = (state = 4, next_state = 3, action = 1, reward = 0.0, terminal = false)
data = (state = 3, next_state = 2, action = 1, reward = 0.0, terminal = false)
data = (state = 2, next_state = 1, action = 1, reward = -1.0, terminal = true)

julia> @which iterate(trajectory)
iterate(t::Trajectory, args...)
     @ ReinforcementLearningTrajectories ~/dev/jl/RLAlgorithms/dev/ReinforcementLearningTrajectories/src/trajectory.jl:132

julia> for batch in trajectory
           @show batch
       end
batch = (state = [4, 5, 5], next_state = [3, 6, 6], action = [1, 2, 2], reward = [0.0, 0.0, 0.0], terminal = Bool[0, 0, 0])
batch = (state = [2, 4, 6], next_state = [1, 3, 7], action = [1, 1, 2], reward = [-1.0, 0.0, 1.0], terminal = Bool[1, 0, 1])
batch = (state = [2, 4, 4], next_state = [1, 3, 3], action = [1, 1, 1], reward = [-1.0, 0.0, 0.0], terminal = Bool[1, 0, 0])
batch = (state = [3, 2, 3], next_state = [4, 1, 2], action = [2, 1, 1], reward = [0.0, -1.0, 0.0], terminal = Bool[0, 1, 0])
batch = (state = [4, 2, 3], next_state = [3, 1, 4], action = [1, 1, 2], reward = [0.0, -1.0, 0.0], terminal = Bool[0, 1, 0])
batch = (state = [5, 2, 5], next_state = [6, 1, 6], action = [2, 1, 2], reward = [0.0, -1.0, 0.0], terminal = Bool[0, 1, 0])
batch = (state = [4, 4, 2], next_state = [3, 3, 1], action = [1, 1, 1], reward = [0.0, 0.0, -1.0], terminal = Bool[0, 0, 1])
batch = (state = [4, 3, 6], next_state = [5, 4, 7], action = [2, 2, 2], reward = [0.0, 0.0, 1.0], terminal = Bool[0, 0, 1])
batch = (state = [3, 2, 4], next_state = [2, 1, 3], action = [1, 1, 1], reward = [0.0, -1.0, 0.0], terminal = Bool[0, 1, 0])
batch = (state = [4, 4, 3], next_state = [3, 3, 4], action = [1, 1, 2], reward = [0.0, 0.0, 0.0], terminal = Bool[0, 0, 0])

So when iterating over trajectory.container, the dummy action 0 is part of it, whereas when iterating over the trajectory object itself, action 0 is never sampled (I also tried this with a larger buffer).

So does that mean that

function RLBase.optimise!(learner::TDLearner, stage::AbstractStage, trajectory::Trajectory)

needs to iterate over trajectory instead of over trajectory.container, as you hinted above, @jeremiahpslewis?
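
A minimal sketch of that change, assuming the learner's optimise! can consume the batches produced by the sampler (this is the idea under discussion, not a merged fix):

function RLBase.optimise!(learner::TDLearner, stage::AbstractStage, trajectory::Trajectory)
    for batch in trajectory          # was: trajectory.container
        optimise!(learner, stage, batch)
    end
end

Iterating the Trajectory goes through the sampler, so the dummy steps with action 0 would never reach the learner.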


Apart from that, it seems a bit odd to me that iterating over a trajectory with container length 9 and BatchSampler(3) produces 10 batches of 3 samples each, totalling 30 examples (with repetitions). I would have expected it to produce N disjoint batches that together cover the sampleable data without repetitions. But I have not fully understood how the sampler and controller of the trajectory work yet; maybe this behavior can be adjusted through them?
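
For illustration, disjoint batching over the sampleable steps would look roughly like this (built on the sampleable_inds mask from above; not a claim about how BatchSampler is supposed to behave):

l = length(trajectory.container)
valid = findall(trajectory.container.sampleable_inds[1:l])    # indices of the valid steps
for idxs in Iterators.partition(valid, 3)                     # disjoint batches of up to 3
    batch = map(col -> col[idxs], trajectory.container[1:l])  # each valid step used once
end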


It also seems inconsistent to me that length(trajectory.container) == 9. The container contains 8 sampleable states and 10 actual states. For some reason the last dummy transition with action 0 is not considered part of the trace, but the other dummy actions are (length(trajectory.container.sampleable_inds) == 10).
