Difference between kl_ctl_value, kl_per_token, etc. in Tensorboard logs? #558

Answered by maxreciprocate
doyled-it asked this question in Q&A

Sure!

  • kl_per_token is the KL divergence between the current policy and the initial model's policy, $D_\text{KL}(\pi_t \parallel \pi_0)$ [1], measured per token with an unbiased estimator [2]
  • kl_ctl_value is the scalar coefficient on the KL penalty, also referred to as $\beta$ or $\lambda_{KL}$; the current name just comes from OpenAI's code [3]
  • approx_kl is the approximate KL between successive policies during PPO minibatch updates, $D_\text{KL}(\pi_{t+1} \parallel \pi_t)$ [4]

[1] https://arxiv.org/abs/1909.08593 Section 2.2
[2] http://joschu.net/blog/kl-approx.html
[3] https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/train_policy.py#L151
[4] https://github.com/vwxyzjn/cleanrl/blob/f36d4a642…
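
To make the relationship between these quantities concrete, here is a minimal sketch in plain Python. It is not trlx's code: the function names, the toy distributions and the sample count are all assumptions made for illustration. It shows two of the sample-based estimators from [2], how the coefficient scales the per-token penalty, and the adaptive coefficient rule described in [1], Section 2.2. Whether a given run uses a fixed or an adaptive coefficient depends on its configuration.

```python
import math
import random

def kl_estimates(logq, logp):
    """Per-sample estimates of KL(q || p), for samples drawn from q.

    logq, logp: log-probabilities of the sampled tokens under q and p.
    Returns the k1 and k3 estimators from [2] as two lists.
    """
    k1, k3 = [], []
    for lq, lp in zip(logq, logp):
        log_r = lp - lq                        # log r, with r = p(x) / q(x)
        k1.append(-log_r)                      # unbiased, may be negative per sample
        k3.append(math.expm1(log_r) - log_r)   # (r - 1) - log r: unbiased, never negative
    return k1, k3

def penalized_rewards(rewards, kl_per_token, kl_coef):
    """Subtract the scaled KL penalty from each per-token reward (hypothetical helper)."""
    return [r - kl_coef * kl for r, kl in zip(rewards, kl_per_token)]

def adaptive_kl_update(kl_coef, observed_kl, target_kl, k_beta=0.1):
    """One step of the proportional controller from [1], Section 2.2."""
    error = max(-0.2, min(0.2, observed_kl / target_kl - 1.0))
    return kl_coef * (1.0 + k_beta * error)

# Toy check on a 3-token vocabulary (assumed values, not taken from any real run).
q = [0.5, 0.3, 0.2]   # stands in for the current policy
p = [0.4, 0.4, 0.2]   # stands in for the initial (reference) policy
rng = random.Random(0)
tokens = rng.choices(range(3), weights=q, k=200_000)
k1, k3 = kl_estimates([math.log(q[t]) for t in tokens],
                      [math.log(p[t]) for t in tokens])
exact_kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))
k1_mean = sum(k1) / len(k1)
k3_mean = sum(k3) / len(k3)
print(exact_kl, k1_mean, k3_mean)
```

Both sample means should land close to the exact value; k3 is usually preferred for logging because each individual sample is non-negative and its variance is lower.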

Answer selected by doyled-it