@Occupying-Mars

(Disclaimer: I haven't tested this by actually running it to check the loss, because I'm GPU-poor.)
Basic PR.

Confidence-based rewards

Added varentropy to measure the confidence of the teacher LLM. If the teacher is less confident, the golden token is more important; if it is more confident, more of the top-k tokens can be treated as valid branches.
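For readers unfamiliar with the term, here is a minimal sketch of how entropy and varentropy can be computed from one position's logits. It is not the code from this PR: the function names are hypothetical, and it uses plain Python rather than the tensors the project presumably works with. Varentropy here is the variance of the surprisal `-log p` under the distribution itself.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_and_varentropy(logits):
    """Return (entropy, varentropy) in nats for one next-token distribution.

    entropy     H = sum_i p_i * (-log p_i)
    varentropy  V = sum_i p_i * (-log p_i - H)^2
    Hypothetical helper for illustration; not taken from the PR.
    """
    probs = softmax(logits)
    # Tokens whose probability underflows to 0 contribute nothing to either sum.
    surprisal = [-math.log(p) if p > 0.0 else 0.0 for p in probs]
    entropy = sum(p * s for p, s in zip(probs, surprisal))
    varentropy = sum(p * (s - entropy) ** 2 for p, s in zip(probs, surprisal))
    return entropy, varentropy
```

Note that a uniform distribution has maximal entropy but zero varentropy, so varentropy is usually read together with entropy rather than as a confidence score on its own. How this PR turns it into a confidence value is not shown in this thread.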

Later we can add a dynamic top_k based on confidence: when the teacher is less confident, we can take a larger top_k from the distribution it gives.
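A rough sketch of what a dynamic top_k could look like, assuming the confidence has already been normalized to the range [0, 1]. The linear schedule, the `k_min`/`k_max` defaults, and both function names are assumptions made for illustration, not part of this PR.

```python
def dynamic_top_k(confidence, k_min=4, k_max=32):
    """Map a confidence in [0, 1] to a top_k in [k_min, k_max].

    Lower confidence gives a larger top_k. The linear mapping and the
    default bounds are hypothetical choices, not values from the PR.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return k_min + round((1.0 - confidence) * (k_max - k_min))

def select_top_k(logits, k):
    """Return the indices of the k largest logits, highest first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]
```

For example, `select_top_k(logits, dynamic_top_k(confidence))` would pick more candidate tokens at positions where the teacher is unsure and fewer where it is sure.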

@tokenbender
Owner

This is very cool, but I have some plans for varentropy: as an ablation, avataRL can just get rid of the critic and rank-cum-reward the top-k logits based on varentropy.

I think we can teach a model to be interesting this way.
