PPO Tutorial #2156
Conversation
Some editorial suggestions. Let me know if you have any questions.
# -------------
#
# The PPO loss can be directly imported from torchrl for convenience using the
# :class:`ClipPPOLoss` class. This is the easiest way of utilising PPO:
Suggested change:
- # :class:`ClipPPOLoss` class. This is the easiest way of utilising PPO:
+ # :class:`ClipPPOLoss` class. This is the easiest way of utilizing PPO:
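For context, a minimal sketch of what this step could look like, assuming the `policy_module` and `value_module` actor/critic modules built earlier in the tutorial; the exact keyword names (`actor`/`critic` vs. `actor_network`/`critic_network`, `entropy_coef`) have varied across torchrl releases:

```python
# Sketch: constructing the PPO loss with torchrl's ClipPPOLoss.
# `policy_module` and `value_module` are assumed to be the actor and value
# operator built earlier in the tutorial; argument names may differ by version.
from torchrl.objectives import ClipPPOLoss

loss_module = ClipPPOLoss(
    actor=policy_module,   # probabilistic actor (the policy)
    critic=value_module,   # state-value network
    clip_epsilon=0.2,      # PPO clipping parameter
    entropy_bonus=True,    # add an entropy term to encourage exploration
    entropy_coef=1e-4,     # weight of the entropy term
)
```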
# is a value that reflects an expectancy over the return value while dealing with
# the bias / variance tradeoff.
# To compute the advantage, one just needs to (1) build the advantage module, which
# utilises our value operator, and (2) pass each batch of data through it before each
Suggested change:
- # utilises our value operator, and (2) pass each batch of data through it before each
+ # utilizes our value operator, and (2) pass each batch of data through it before each
# Training loop
# -------------
# We now have all the pieces needed to code our training loop.
# The steps are quite easy: collect data, compute advantage, loop over the collected
Suggested change:
- # The steps are quite easy: collect data, compute advantage, loop over the collected
+ # The steps include:
+ #
+ # * Collect data
+ # * Compute advantage
+ # * Loop over the collected data to compute loss values
+ # * Backpropagate
+ # * Optimize
+ # * Repeat
# -------------
# We now have all the pieces needed to code our training loop.
# The steps are quite easy: collect data, compute advantage, loop over the collected
# data to compute loss values, backpropagate, optimize and repeat.
Suggested change:
- # data to compute loss values, backpropagate, optimize and repeat.
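As a rough sketch of the loop those steps describe (collect data, compute advantage, iterate over the collected data to compute losses, backpropagate, optimize, repeat), assuming the `collector`, `replay_buffer`, `advantage_module`, `loss_module`, `optim`, and hyperparameter names defined earlier in the tutorial:

```python
# Sketch of the training loop; names such as `collector`, `replay_buffer`,
# `advantage_module`, `loss_module`, `optim`, `num_epochs`, `frames_per_batch`,
# `sub_batch_size`, `max_grad_norm`, and `device` are assumptions taken from
# the tutorial's setup.
import torch

for i, tensordict_data in enumerate(collector):   # collect data
    for _ in range(num_epochs):
        advantage_module(tensordict_data)          # compute advantage
        replay_buffer.extend(tensordict_data.reshape(-1).cpu())
        for _ in range(frames_per_batch // sub_batch_size):
            subdata = replay_buffer.sample(sub_batch_size)
            loss_vals = loss_module(subdata.to(device))
            loss = (
                loss_vals["loss_objective"]
                + loss_vals["loss_critic"]
                + loss_vals["loss_entropy"]
            )
            loss.backward()                        # backpropagate
            torch.nn.utils.clip_grad_norm_(loss_module.parameters(), max_grad_norm)
            optim.step()                           # optimize
            optim.zero_grad()
```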
plt.show()

######################################################################
# Next steps
We need a Conclusion section. I think you can rewrite this section a bit to become the conclusion section. For example:
In this tutorial, we have learned:
1.
2.
3.
If you want to experiment with this tutorial a bit more, you can apply the following modifications:
* This algorithm ...
* From a logging perspective .....
For more information about TorchRL, go to <link to the TorchRL docs>
@svekars this is ready for review
8165c0c to d9ad408
Introduces a PPO tutorial using TorchRL.