I'm new to evals so I may be missing something, but I was surprised to find that evals/elsuite/modelgraded/classify.py seems to set up the grading model to grade a single question at a time, and that there isn't an equivalent approach for getting multiple questions graded in one call. I see that we can create our own subclasses and implement custom behavior, but before doing so, I wanted to make sure I wasn't missing something obvious.
E.g. in a QA task, the input conversation that I want evaluated might have 10 question-answer pairs, and I want the grading model to return something like an array of grades for that sample.
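To make that concrete, here's a rough sketch of what I mean, written against the plain OpenAI client rather than the evals classes. The model name, prompt wording, and the expectation that the grader returns a JSON array are all placeholder assumptions on my part, not anything the repo provides:

```python
import json
from openai import OpenAI

client = OpenAI()

def grade_sample(shared_context: str, qa_pairs: list[dict]) -> list[str]:
    """Send every QA pair from one sample to the grading model in a single call."""
    numbered = "\n".join(
        f"{i + 1}. Q: {qa['question']}\n   A: {qa['answer']}"
        for i, qa in enumerate(qa_pairs)
    )
    messages = [
        {"role": "system", "content": shared_context},
        {
            "role": "user",
            "content": (
                "Grade each answer below as Correct or Incorrect. "
                'Reply with a JSON array of strings, e.g. ["Correct", "Incorrect"].\n\n'
                + numbered
            ),
        },
    ]
    # "gpt-4" is just a placeholder grading model.
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    # Assumes the grader actually returns valid JSON; real code would need to
    # handle parse failures and length mismatches.
    return json.loads(response.choices[0].message.content)
```

The point is just that one request grades all ten pairs, instead of ten requests each carrying the full shared context.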
Tangentially related, it would be nice to more easily run past answers through an eval: instead of querying the model, I could reuse responses that I've saved from a previous invocation.
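Something like this is what I had in mind for replaying saved answers; again just a sketch, with a made-up cache file format and key scheme rather than anything evals actually supports:

```python
import hashlib
import json
from pathlib import Path

# Hypothetical file of completions saved from a previous run, one JSON object
# per line with "prompt_hash" and "completion" fields.
CACHE_PATH = Path("cached_completions.jsonl")

def load_cache(path: Path = CACHE_PATH) -> dict[str, str]:
    """Load saved completions keyed by a hash of the prompt that produced them."""
    cache: dict[str, str] = {}
    if path.exists():
        for line in path.read_text().splitlines():
            record = json.loads(line)
            cache[record["prompt_hash"]] = record["completion"]
    return cache

def cached_completion(prompt: str, cache: dict[str, str]) -> str | None:
    """Return the saved completion for this prompt, or None if it was never seen."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    return cache.get(key)
```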
For both of these, the motivation is reducing cost. My prompt has a lot of context, so it seems wasteful to pass the same context to the grading model over and over again. On the other hand, the developer time needed to implement these custom features probably outweighs the savings, but these seem like common use cases and I want to make sure I'm not overcomplicating this.
Thanks!