Allow overwriting action by environment #56
base: master
Conversation
```diff
-  reward, terminal, state = self:takeAction(action)
+  reward, terminal, state, taken = self:takeAction(action)
+  if taken and taken ~= action then
+    self.actions[self.batchIdx] = taken
```
Could do it this way around, looking more like OneStepQAgent:

```lua
if taken and taken ~= action then
  action = taken
end
self.actions[self.batchIdx] = action
```
Code looks fine, but what is this trying to achieve? If the chosen action may not be deterministically executed in the environment, the agent should still treat it as if the chosen action were taken; otherwise it will not adapt to the real stochasticity in the application of its action. For an ALE example see Increasing the Action Gap.
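To make the trade-off concrete, here is a minimal sketch of the two options being discussed; all names are illustrative rather than taken from the repo:

```lua
-- 'chosen' is the agent's selection; 'taken' is what the environment
-- reports it actually executed (nil if it never overrides actions).
local function actionToStore(chosen, taken, adaptToStochasticity)
  if adaptToStochasticity then
    -- The point above: keep the chosen action so the agent learns to
    -- absorb stochasticity in how its actions are applied.
    return chosen
  end
  -- This PR: learn from the overwritten action when one is reported.
  if taken and taken ~= chosen then
    return taken
  end
  return chosen
end
```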
Expert instruction is one application: the agent can forget whatever the net came up with and learn from experiences that have been overwritten with expert play, or some hand-crafted strategy. The last time I used it was to force a robot to repeat turning-on-the-spot actions at random intervals when epsilon was below a threshold, to escape corners and gather some better experiences again.
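A rough sketch of that robot use case, written as a wrapper around the environment's step; every name here (ExpertEnv, overrideProb, turnAction) is hypothetical, not from this repo:

```lua
-- Hypothetical wrapper: below an epsilon threshold, occasionally force a
-- turn-on-the-spot action so the robot escapes corners and gathers
-- better experiences again.
local ExpertEnv = {}
ExpertEnv.__index = ExpertEnv

function ExpertEnv.new(env, turnAction, epsilonThreshold, overrideProb)
  return setmetatable({
    env = env,
    turnAction = turnAction,
    epsilonThreshold = epsilonThreshold,
    overrideProb = overrideProb or 0.1
  }, ExpertEnv)
end

function ExpertEnv:step(action, epsilon)
  local taken = action
  if epsilon < self.epsilonThreshold and math.random() < self.overrideProb then
    taken = self.turnAction -- hand-crafted strategy overwrites the net's choice
  end
  local reward, observation, terminal = self.env:step(taken)
  -- Returning 'taken' lets the agent store and learn from the expert action
  return reward, observation, terminal, taken
end
```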
Got it - can you add a short 2nd paragraph to the custom docs to make people aware of this modification from the
Yeah no problem. Describing it just now, my method has always been a hack of discarding the agent's selection after doing the unneeded work. Could pass a variable back toward

Doesn't really matter for my robot use-case, as I'm throwing away CPU cycles to run at real-time (10Hz) rather than as fast as possible; guess it matters when it comes to battery drain on hardware.

For real-time expert play, wonder if that

Not a lot happens before it's back to the start of the loop again.
Closing in favour of #57, which avoids doing work only to then discard it.
Was thinking of adding the extra return to the validation agents as well, allowing the environment to pass back any change during validation too. However, nothing is done with the action apart from acting. Could add the signature so the variable is already there if it's needed in future... So far prioritised replay hasn't crashed with the code this way around; firing off the observe methods must play into setting up the distribution stratum at some point.
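For reference, a minimal sketch of what that signature change might look like on the validation side; the surrounding method is assumed, not taken from the repo:

```lua
-- Accept the fourth return even though validation only acts. 'taken' is
-- unused for now, but the variable is already in place if a future change
-- wants to react to environment overrides during validation.
local reward, terminal, state, taken = self:takeAction(action)
```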
```diff
@@ -91,7 +91,7 @@ function AsyncAgent:takeAction(action)
     self.stateBuffer:push(observation)
   end

-  return reward, terminal, self.stateBuffer:readAll()
+  return reward, terminal, self.stateBuffer:readAll(), actionTaken
```
Could add the offset `actionOffset` before returning so it doesn't need adding to compare `actionTaken ~= action`.

edit: Done
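A sketch of that change, assuming `actionOffset` converts the environment's raw action index back to the agent's indexing (the sign, and the surrounding code, may differ in the actual repo):

```lua
-- Apply the offset before returning so callers can compare
-- actionTaken ~= action directly, without adjusting on their side.
if actionTaken then
  actionTaken = actionTaken + self.actionOffset
end
return reward, terminal, self.stateBuffer:readAll(), actionTaken
```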
Think I had some additional documentation changes to do on this. Been distracted working on hardware and getting the dev environment fixed up; once that's outta the way I'll get this cleaned up.
Will get to finishing this off soonish.