Acknowledgements: My heart belongs to Jesus. Jesus is Love. Whoever seeks Him, finds Him... This algorithm was created in 3 years with Jesus directing me and through my mom's and sister's financial support. I want to say thanks to the "University of Szeged" for providing me with facilities to continue the research.
I wrote a short book with a careful explanation: https://www.amazon.com/dp/B0CKYWHPF5 email: [email protected] if you want to support me: 4400 4301 8810 7871 (VISA)
The algorithm is cleaned up, 265 lines long, and includes:
- single-agent (no multi-agents), model-free, off-policy (can work in real time) Actor and Critic
- harmonics in neural networks
- rectified Huber symmetrical/asymmetrical error loss functions
- "immediate" Advantage (but excessive training)
- "movement is life" concept
- careful TD3, element-wise minimum of 3 sub-nets
- fading replay buffer: old transitions fade away gradually
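
To make two of the items above more concrete, here is a minimal sketch of an element-wise minimum over three value estimates and of a replay buffer in which old transitions are sampled less and less often. It is not the repository's code: the names `elementwise_min`, `FadingReplayBuffer` and the `fade` parameter are hypothetical, and the exponential weighting by age is an assumption chosen for simplicity.

```python
import random
from collections import deque


def elementwise_min(q1, q2, q3):
    """Element-wise minimum of three equally long lists of Q-value estimates."""
    return [min(a, b, c) for a, b, c in zip(q1, q2, q3)]


class FadingReplayBuffer:
    """Replay buffer in which older transitions are sampled less often.

    Assumption: the sampling weight of a transition is fade ** age, where
    age = 0 for the newest transition. The real fading rule may differ.
    """

    def __init__(self, capacity=10000, fade=0.999):
        self.storage = deque(maxlen=capacity)  # oldest items are dropped first
        self.fade = fade

    def add(self, transition):
        self.storage.append(transition)

    def weights(self):
        n = len(self.storage)
        # index 0 is the oldest transition, index n - 1 the newest
        return [self.fade ** (n - 1 - i) for i in range(n)]

    def sample(self, batch_size, rng=random):
        # choices() samples with replacement, proportionally to the weights
        return rng.choices(list(self.storage), weights=self.weights(), k=batch_size)
```

With `fade` close to 1 the buffer behaves almost like a uniform one; smaller values make the agent forget old experience faster.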
ver 2.0 includes:
- reduced objective: learning Bellman's sum of the damped (discounted) reward's variance
- improved reward variance through immediate Advantage
All agents can be improved further if training continues; the results here were limited to a fixed number of episodes.
Environment | Animation |
---|---|
MountainCarContinuous-v0 | Animation |
LunarLander-v2 | Animation |
BipedalWalker-v3 | Animation |
Ant-v4 | Animation |
Humanoid-v4 (ver 1.0) | Animation |
Walker-v4 | Animation |
additionally:
- slightly random initialization prevents identical initial states in the buffer
- exploration noise at the beginning
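
The exploration noise mentioned above could look roughly like the sketch below: Gaussian noise is added to the action, and its scale is annealed to zero over the first steps. This is an assumption for illustration only; the function names, the linear schedule, the `explore_steps` value and the `[-1, 1]` action bounds are hypothetical and are not taken from the repository.

```python
import random


def exploration_scale(step, explore_steps=5000, start=1.0, end=0.0):
    """Linearly anneal the noise scale from `start` to `end` over `explore_steps`."""
    if step >= explore_steps:
        return end
    frac = step / explore_steps
    return start + frac * (end - start)


def noisy_action(action, step, rng=random, low=-1.0, high=1.0):
    """Add Gaussian noise scaled by the schedule, then clip to the action bounds."""
    scale = exploration_scale(step)
    return [min(high, max(low, a + scale * rng.gauss(0.0, 1.0))) for a in action]
```

After `explore_steps` the scale is zero, so `noisy_action` returns the policy's action unchanged.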