-
I think this is a mistake: the code is computing the KL from the "ref_model", not the behaviour cloning model (the initial policy). The WebGPT paper says "The KL here is measured from the BC model and summed over the episode." It would be more informative (and easier to compare against the papers) if trlx logged the KL from the BC model/original policy/fine-tuned model.
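For concreteness, here is a rough sketch (plain PyTorch, not trlx's actual implementation) of the quantity the WebGPT paper describes: the KL between the current policy and a frozen copy of the initial/BC model, masked to the response tokens and summed over the episode. The tensor names (`policy_logits`, `ref_logits`, `mask`) are placeholders for whatever the training loop already has on hand.

```python
import torch
import torch.nn.functional as F

def summed_kl_from_bc_model(policy_logits: torch.Tensor,
                            ref_logits: torch.Tensor,
                            mask: torch.Tensor) -> torch.Tensor:
    """KL(policy || BC model) per token, masked and summed over the episode.

    policy_logits, ref_logits: (batch, seq_len, vocab) logits from the current
        policy and a frozen copy of the initial (BC) model.
    mask: (batch, seq_len), 1 on generated (response) tokens, 0 elsewhere.
    Returns a (batch,) tensor of per-episode KL in nats.
    """
    policy_logprobs = F.log_softmax(policy_logits, dim=-1)
    ref_logprobs = F.log_softmax(ref_logits, dim=-1)
    # Full-distribution KL at each position, summed over the vocabulary.
    per_token_kl = (policy_logprobs.exp() * (policy_logprobs - ref_logprobs)).sum(dim=-1)
    # Sum over the episode (response tokens only), as in the WebGPT description.
    return (per_token_kl * mask).sum(dim=-1)
```

Keeping the reference model frozen at the initial policy is what makes this comparable to the numbers in the papers; if `ref_model` is ever updated during training, the logged KL measures something else.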
-
In the OpenAI papers they stop training when the KL hits about 10 nats. How do I know when this is hit using trlx's W&B logs? It feels like `approx_kl` should be the thing, but clearly that's not it.
(Sorry if this is the wrong place to ask; I never know where to put these questions: Discord, issues, or discussions?)
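To be concrete about the number I mean, here is a rough sketch (plain PyTorch, hypothetical names rather than trlx's actual metrics) of the quantity the papers report: a Monte-Carlo estimate of KL(policy || initial policy) from the log-probs of the sampled tokens, summed over each response and averaged over the batch, with training stopped once it reaches roughly 10 nats. This is generally a different quantity from PPO's `approx_kl`, which measures how far one PPO update moved the policy from the sampling policy.

```python
import torch

KL_BUDGET_NATS = 10.0  # approximate stopping point used in the OpenAI papers

def mean_episode_kl_nats(policy_logprobs: torch.Tensor,
                         ref_logprobs: torch.Tensor,
                         mask: torch.Tensor) -> float:
    """Monte-Carlo estimate of KL(policy || initial policy), summed per episode.

    policy_logprobs, ref_logprobs: (batch, seq_len) log-probs of the *sampled*
        tokens under the current policy and the frozen initial (BC) model.
    mask: (batch, seq_len), 1 on response tokens, 0 elsewhere.
    """
    per_token = (policy_logprobs - ref_logprobs) * mask
    return per_token.sum(dim=-1).mean().item()

def should_stop(kl_nats: float) -> bool:
    # Stop (or tighten the KL coefficient) once the policy has drifted
    # ~10 nats from the initial model.
    return kl_nats >= KL_BUDGET_NATS
```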