Tweaked UCB calculation for uniform exploration of actions in vanilla MCTS #99
With the current UCB implementation in vanilla MCTS, the first two iterations explore the same action twice:
[from the example notebook with `n_iter=2`]

This is because the current implementation sets the UCB value of an action to the default `q` value when `sn == 1 && n(sanode) == 0` (the parent node has been visited once and the action has not yet been tried). So if the action used in a state during the first iteration returned a positive reward, it is picked again in the second iteration, since its value is higher than that of the unexplored actions.

I think the usual UCB implementation first explores all available actions in some order before focusing on a specific action.
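The usual behavior described above can be sketched as follows. This is a minimal illustration (not the code from this repository), with hypothetical names `ucb_select`, `q`, `n`, and `sn`: an action with zero visits gets an infinite score, so every action is tried once before the exploration bonus starts discriminating among them.

```python
import math

def ucb_select(actions, q, n, sn, c=1.0):
    """Pick the action maximizing the UCB1 score.

    q  -- dict: action -> mean value estimate
    n  -- dict: action -> visit count of the (state, action) node
    sn -- visit count of the parent state node
    c  -- exploration constant
    """
    def ucb(a):
        if n[a] == 0:
            # Unvisited actions always win, so all actions are
            # explored once before exploitation kicks in.
            return math.inf
        return q[a] + c * math.sqrt(math.log(sn) / n[a])

    return max(actions, key=ucb)

# Even if action "a" already earned a positive reward, the
# unvisited action "b" is selected next.
chosen = ucb_select(["a", "b"], q={"a": 1.0, "b": 0.0},
                    n={"a": 1, "b": 0}, sn=1)
```

Under this rule the second iteration cannot repeat the first action while another action is still unexplored, which matches the behavior shown below.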
With this patch, that is what seems to happen:
[same as previous, but with the patch applied]