Difference between `kl_ctl_value`, `kl_per_token`, etc. in TensorBoard logs?
#558
Title says it all and I can't find any formal definitions for these terms online. Can anyone define these for me?
Answered by maxreciprocate, Sep 13, 2023
Answer selected by doyled-it
Sure!

- `kl_per_token` is the KL divergence between the initial model's policy and the current one
- `kl_ctl_value` is the scalar coefficient for the KL penalty, also referred to as β (the KL coefficient in [1] Section 2.2)
- `approx_kl` is an approximate KL computed during PPO minibatch updates

[1] https://arxiv.org/abs/1909.08593 Section 2.2
[2] http://joschu.net/blog/kl-approx.html
[3] https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/train_policy.py#L151
[4] https://github.com/vwxyzjn/cleanrl/blob/f36d4a642…
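
To make the three quantities concrete, here is a minimal pure-Python sketch of how they are commonly computed in PPO-style RLHF. It is an illustration under stated assumptions, not the library's actual logging code: the function names, the toy log-probabilities, and the controller hyperparameters are all hypothetical. The controller follows the proportional update described in [1] Section 2.2, where the KL coefficient (`kl_ctl_value`) is written β, and the minibatch estimator is the `(r - 1) - log r` form discussed in [2].

```python
import math


def kl_per_token(logprobs, ref_logprobs):
    """Per-token KL estimate on sampled tokens: log pi(token) - log pi_ref(token)."""
    return [lp - ref for lp, ref in zip(logprobs, ref_logprobs)]


def penalized_rewards(score, logprobs, ref_logprobs, kl_ctl_value):
    """Per-token reward: -beta * KL at every token, plus the scalar score on the last one."""
    rewards = [-kl_ctl_value * kl for kl in kl_per_token(logprobs, ref_logprobs)]
    rewards[-1] += score
    return rewards


class AdaptiveKLController:
    """Hypothetical sketch of a proportional controller that steers beta toward a target KL."""

    def __init__(self, init_kl_coef, target, horizon):
        self.value = init_kl_coef  # the quantity one would log as kl_ctl_value
        self.target = target
        self.horizon = horizon

    def update(self, current_kl, n_steps):
        # Clip the relative error so beta changes slowly.
        error = max(-0.2, min(0.2, current_kl / self.target - 1.0))
        self.value *= 1.0 + error * n_steps / self.horizon


def approx_kl(new_logprobs, old_logprobs):
    """Mean of (r - 1) - log r with r = pi_new / pi_old, on tokens sampled from pi_old."""
    total = 0.0
    for new, old in zip(new_logprobs, old_logprobs):
        log_ratio = new - old
        total += (math.exp(log_ratio) - 1.0) - log_ratio
    return total / len(new_logprobs)


if __name__ == "__main__":
    # Toy numbers, assumed purely for illustration.
    current = [-1.0, -2.0]
    initial = [-1.5, -2.5]
    print(kl_per_token(current, initial))
    print(penalized_rewards(1.0, current, initial, kl_ctl_value=0.1))

    ctl = AdaptiveKLController(init_kl_coef=0.1, target=6.0, horizon=10000)
    ctl.update(current_kl=12.0, n_steps=100)
    print(ctl.value)

    print(approx_kl([-1.0, -2.0], [-1.2, -1.9]))
```

The first two functions compare the current policy against the frozen initial model, while `approx_kl` compares the policy before and after a gradient step within one PPO update, which is why the two KL values in the logs can differ widely.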