docs/common/logger.rst

Available formats are ``["stdout", "csv", "log", "tensorboard", "json"]``.

.. code-block:: python

  model.set_logger(new_logger)
  model.learn(10000)
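
For reference, here is a sketch of how a ``new_logger`` like the one above can be created with ``configure()`` (the output folder and formats below are illustrative):

.. code-block:: python

  from stable_baselines3.common.logger import configure

  # illustrative output folder; any writable path works
  tmp_path = "/tmp/sb3_log/"
  # log to stdout, a CSV file and a TensorBoard event file
  new_logger = configure(tmp_path, ["stdout", "csv", "tensorboard"])
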
Explanation of logger output
----------------------------

You can find below short explanations of the values logged in Stable-Baselines3 (SB3).
Depending on the algorithm used and on the wrappers/callbacks applied, SB3 only logs a subset of those keys during training.

Below you can find an example of the logger output when training a PPO agent:

.. code-block:: bash

  -----------------------------------------
  | eval/                   |             |
  |    mean_ep_length       | 200         |
  |    mean_reward          | -157        |
  | rollout/                |             |
  |    ep_len_mean          | 200         |
  |    ep_rew_mean          | -227        |
  | time/                   |             |
  |    fps                  | 972         |
  |    iterations           | 19          |
  |    time_elapsed         | 80          |
  |    total_timesteps      | 77824       |
  | train/                  |             |
  |    approx_kl            | 0.037781604 |
  |    clip_fraction        | 0.243       |
  |    clip_range           | 0.2         |
  |    entropy_loss         | -1.06       |
  |    explained_variance   | 0.999       |
  |    learning_rate        | 0.001       |
  |    loss                 | 0.245       |
  |    n_updates            | 180         |
  |    policy_gradient_loss | -0.00398    |
  |    std                  | 0.205       |
  |    value_loss           | 0.226       |
  -----------------------------------------
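
For illustration, output of this general shape can be produced by a short run along the following lines (a continuous-action task, so that ``train/std`` is logged; the ``eval/`` rows additionally require an ``EvalCallback``, see below). The exact numbers above are not tied to this snippet:

.. code-block:: python

  from stable_baselines3 import PPO

  # verbose=1 makes the logger print tables like the one above to stdout
  model = PPO("MlpPolicy", "Pendulum-v1", verbose=1)
  model.learn(total_timesteps=100_000)
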
eval/
^^^^^

All ``eval/`` values are computed by the ``EvalCallback`` (see the sketch after the list below).

- ``mean_ep_length``: Mean episode length
- ``mean_reward``: Mean episodic reward (during evaluation)
- ``success_rate``: Mean success rate during evaluation (1.0 means 100% success); the environment info dict must contain an ``is_success`` key for this value to be computed
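
A minimal sketch of how the ``eval/`` values are typically produced (the environment, evaluation frequency and number of evaluation episodes below are illustrative):

.. code-block:: python

  import gymnasium as gym  # use `import gym` with SB3 versions older than 2.0

  from stable_baselines3 import PPO
  from stable_baselines3.common.callbacks import EvalCallback

  # separate environment used only for evaluation
  eval_env = gym.make("Pendulum-v1")
  # every eval_freq steps, run n_eval_episodes episodes and log eval/mean_reward and eval/mean_ep_length
  eval_callback = EvalCallback(eval_env, eval_freq=5_000, n_eval_episodes=5)

  model = PPO("MlpPolicy", "Pendulum-v1", verbose=1)
  model.learn(total_timesteps=50_000, callback=eval_callback)
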
rollout/
^^^^^^^^
- ``ep_len_mean``: Mean episode length (averaged over 100 episodes)
- ``ep_rew_mean``: Mean episodic training reward (averaged over 100 episodes); a ``Monitor`` wrapper is required to compute that value (automatically added by ``make_vec_env``)
- ``exploration_rate``: Current value of the exploration rate when using DQN; it corresponds to the fraction of actions taken randomly (the epsilon of the "epsilon-greedy" exploration)
- ``success_rate``: Mean success rate during training (averaged over 100 episodes); to log that value, you must pass an extra argument to the ``Monitor`` wrapper (``info_keywords=("is_success",)``) and provide ``info["is_success"]=True/False`` on the final step of the episode, as shown in the sketch below
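
A sketch of that setup, assuming a gymnasium-based SB3 (>= 2.0); the toy environment here is purely illustrative and reports a random ``is_success`` at the end of each 10-step episode:

.. code-block:: python

  import gymnasium as gym
  import numpy as np

  from stable_baselines3 import PPO
  from stable_baselines3.common.monitor import Monitor


  class ToyGoalEnv(gym.Env):
      """Illustrative env that reports info["is_success"] on the final step of each episode."""

      observation_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
      action_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)

      def reset(self, seed=None, options=None):
          super().reset(seed=seed)
          self.t = 0
          return self.observation_space.sample(), {}

      def step(self, action):
          self.t += 1
          terminated = self.t >= 10
          # provide info["is_success"] only on the final step of the episode
          info = {"is_success": bool(self.np_random.random() > 0.5)} if terminated else {}
          return self.observation_space.sample(), 0.0, terminated, False, info


  # info_keywords tells Monitor to record info["is_success"] at the end of each episode,
  # which is what makes rollout/success_rate appear in the logs
  env = Monitor(ToyGoalEnv(), info_keywords=("is_success",))
  model = PPO("MlpPolicy", env, verbose=1)
  model.learn(total_timesteps=5_000)
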
time/
^^^^^
- ``episodes``: Total number of episodes
- ``fps``: Number of frames per second (includes the time taken by gradient updates)
- ``iterations``: Number of iterations (data collection + policy update for A2C/PPO)
- ``time_elapsed``: Time in seconds since the beginning of training
- ``total_timesteps``: Total number of timesteps (steps in the environments)
train/
^^^^^^
- ``actor_loss``: Current value for the actor loss for off-policy algorithms
- ``approx_kl``: Approximate mean KL divergence between the old and the new policy (for PPO); it is an estimate of how much the policy changed during the update
- ``clip_fraction``: Mean fraction of the surrogate loss that was clipped (above the ``clip_range`` threshold) for PPO
- ``clip_range``: Current value of the clipping factor for the surrogate loss of PPO
- ``critic_loss``: Current value of the critic function loss for off-policy algorithms, usually the error between the value function output and the TD(0) (temporal difference) estimate
- ``ent_coef``: Current value of the entropy coefficient (when using SAC)
- ``ent_coef_loss``: Current value of the entropy coefficient loss (when using SAC)
- ``entropy_loss``: Mean value of the entropy loss (negative of the average policy entropy)
- ``explained_variance``: Fraction of the return variance explained by the value function, see https://scikit-learn.org/stable/modules/model_evaluation.html#explained-variance-score
  (ev=0 => might as well have predicted zero, ev=1 => perfect prediction, ev<0 => worse than just predicting zero); see the small worked example after this list
- ``learning_rate``: Current learning rate value
- ``loss``: Current total loss value
- ``n_updates``: Number of gradient updates applied so far
- ``policy_gradient_loss``: Current value of the policy gradient loss (its value does not have much meaning)
- ``value_loss``: Current value of the value function loss for on-policy algorithms, usually the error between the value function output and the Monte Carlo estimate (or TD(lambda) estimate)
- ``std``: Current standard deviation of the noise when using generalized State-Dependent Exploration (gSDE)
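
As a small worked illustration of ``explained_variance`` (this uses the helper from ``stable_baselines3.common.utils``; the numbers are made up):

.. code-block:: python

  import numpy as np

  from stable_baselines3.common.utils import explained_variance

  returns = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)  # empirical returns ("true" values)
  values = np.array([1.1, 1.9, 3.2, 3.8], dtype=np.float32)   # value function predictions

  # explained_variance = 1 - Var(returns - values) / Var(returns)
  print(explained_variance(values, returns))                   # ~0.98: the value function explains most of the variance
  print(explained_variance(np.zeros_like(returns), returns))   # 0.0: no better than predicting zero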