-
Notifications
You must be signed in to change notification settings - Fork 464
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEAT Add scale scorer #274
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not blocking the PR merge but wanted to make sure comments were read before merging! Feel free to reach out to chat, changes are pretty minor. :)
@@ -16,36 +17,39 @@ class Scorer(abc.ABC): | |||
scorer_type: ScoreType | |||
|
|||
@abstractmethod | |||
async def score_async(self, request_response: PromptRequestPiece) -> list[Score]: | |||
async def score_async(self, request_response: PromptRequestPiece, *, task: Optional[str] = None) -> list[Score]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question. Shouldn't the definition be as follow?
async def score_async(self, request_response: PromptRequestPiece, *, task: Optional[str] = None) -> list[Score]: | |
async def score_async(self, *, request_response: PromptRequestPiece, task: Optional[str] = None) -> list[Score]: |
That's the pattern we've been used throughout the code. It will force this PR to modify more files but will keep things consistent and more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't put it there for three reasons. One is what you mentioned: it will break backwards compatibility. Perhaps not a huge reason, granted!
The other is that request_response
is not a great name IMO and if I had to bet we'll change that sooner or later at which point that would mean another breaking change (or deprecation over several versions which is incredibly painful).
Lastly, as long as there's only 1 positional arg it's still clear. As soon as you have multiple then the order could be either way and things can get confusing. The task is truly "optional" in the sense that most scorers don't currently use it, so treating it differently makes sense to me.
|
||
Returns: | ||
list[Score]: A list of Score objects representing the results. | ||
""" | ||
raise NotImplementedError("score_async method not implemented") | ||
|
||
@abstractmethod | ||
def validate(self, request_response: PromptRequestPiece): | ||
def validate(self, request_response: PromptRequestPiece, *, task: Optional[str] = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same suggestion as below.
def validate(self, request_response: PromptRequestPiece, *, task: Optional[str] = None): | |
def validate(self, *, request_response: PromptRequestPiece, task: Optional[str] = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see #274 (comment)
chat_target: PromptChatTarget, | ||
scale_path: Optional[Path] = None, | ||
scale: Optional[Scale] = None, | ||
memory: MemoryInterface = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
memory: MemoryInterface = None, | |
memory: Optional[MemoryInterface] = None, |
Pre-commit hooks should've complained about this! :P
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
self._system_prompt = scoring_instructions_template.apply_custom_metaprompt_parameters(**system_prompt_kwargs) | ||
|
||
self._chat_target: PromptChatTarget = chat_target |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove extra line here.
self._system_prompt = scoring_instructions_template.apply_custom_metaprompt_parameters(**system_prompt_kwargs) | |
self._chat_target: PromptChatTarget = chat_target | |
self._system_prompt = scoring_instructions_template.apply_custom_metaprompt_parameters(**system_prompt_kwargs) | |
self._chat_target: PromptChatTarget = chat_target |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose the rule of thumb is that if flake8 / black etc. don't mind then neither do we (?)
As discussed I'm resetting this one and will check with you offline @dlmgary ! If needed I'll start a separate follow-up PR
Description
So far we don't have a scale scorer that can accommodate more flexible scales than Likert and which don't require level-specific descriptions.
TAP, PAIR, and Crescendo all uses some variation of this. It could help to have a bigger scale and not requiring detailed descriptions of each level which are hard to create and sometimes difficult to distinguish. One could argue that there's no point in distinct levels if there's no describable distinction them but that's a whole different can of worms 😄
Tests and Documentation
Added tests. Holding off on extensive documentation for now since it uses a slightly different signature on the scoring methods which includes the task. If we end up doing the same on all existing scorers we can do a larger adjustment of the docs.