Exercise 4.8: Avoid numerical instability in policy #95

Open
mhoehle opened this issue Mar 3, 2023 · 0 comments
mhoehle commented Mar 3, 2023

Suggested replacement:

op_a = np.argmax(np.round(v, decimals=5))

As noted in ShangtongZhang/reinforcement-learning-an-introduction#83, this yields a more deterministic action choice in situations where several actions give identical value in theory but differ slightly due to floating-point error. At least when I added the suggested rounding, the produced figure(s) resembled Fig. 4.3 more closely:

[figure: policy plot with rounding applied, closely resembling Fig. 4.3]

This removes the artefacts seen in the original figure produced by the code (Ex4.9_plotB.jpg in the repo or see below):

[figure: Ex4.9_plotB.jpg, the original policy plot showing the artefacts]

P.S. The number of decimals might have to be increased so that it also works for the $p_h=0.55$ example in the code (i.e. figure Ex4.9_plotD.jpg), e.g. decimals=12.
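To illustrate the tie-breaking behaviour the suggestion relies on, here is a minimal sketch with hypothetical action values (the array `v` below is made up for demonstration, not taken from the exercise code). Two values are equal up to floating-point noise; a plain `np.argmax` picks whichever copy happens to be marginally larger, while rounding first makes the near-tie exact so `argmax` deterministically returns the lowest-index tied action:

```python
import numpy as np

# Hypothetical action values: indices 0-2 are "equal" up to floating-point noise.
v = np.array([0.4, 0.4 + 1e-12, 0.4 - 1e-13, 0.2])

# Plain argmax is sensitive to the noise and picks index 1 here.
naive_action = np.argmax(v)

# Rounding before argmax collapses the near-tie, so argmax returns
# the first of the tied actions (index 0), independent of the noise.
stable_action = np.argmax(np.round(v, decimals=5))

print(naive_action, stable_action)  # 1 0
```

Note that `decimals` must be chosen small enough to absorb the accumulated floating-point error, but large enough not to merge genuinely distinct values, which is why a larger setting such as `decimals=12` may be needed for the $p_h=0.55$ case.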

@mhoehle mhoehle changed the title Exercise 4.8: Avoid numerical instability in value function Exercise 4.8: Avoid numerical instability in policy Mar 3, 2023