Refacto and remove bloated code #709
Conversation
Pull Request Overview

This PR refactors the prompt-building and evaluation logic in lighteval by removing legacy request wrappers, unifying data structures (Doc and ModelResponse), and simplifying pipeline and registry handling.

- Introduces a single `Doc` dataclass for all task inputs and a unified `ModelResponse`
- Replaces multiple request types and response classes with `SamplingMethod` and `ModelResponse`
- Updates `Pipeline`, `Registry`, and prompt management to work with the new structures
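For intuition, here is a minimal sketch of what such unified dataclasses could look like. The field names below are illustrative guesses based on this PR's description, not lighteval's actual definitions:

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class SamplingMethod(Enum):
    # Illustrative: one enum standing in for the removed per-request types.
    GENERATIVE = auto()
    LOGPROBS = auto()


@dataclass
class Doc:
    # Illustrative single input type for all tasks.
    query: str
    choices: list[str] = field(default_factory=list)
    gold_index: int = 0
    instruction: str = ""


@dataclass
class ModelResponse:
    # Illustrative single output type covering both generative and
    # loglikelihood runs; unused fields simply stay empty.
    text: list[str] = field(default_factory=list)
    logprobs: list[float] = field(default_factory=list)
```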
Reviewed Changes

Copilot reviewed 84 out of 89 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| tests/utils.py | Update FakeModel to return ModelResponse and use Doc |
| src/lighteval/tasks/default_prompts.py | Changed default prompt construction, removed instructions |
| src/lighteval/tasks/requests.py | Replaced old request classes with a large Doc dataclass |
| src/lighteval/models/model_output.py | Consolidated response types into a single, expanded ModelResponse |
Comments suppressed due to low confidence (1)

src/lighteval/tasks/default_prompts.py:64

The `instructions` variable was removed from the default prompt, so any task-specific instructions will no longer appear. Consider restoring `instructions` (e.g. `f"{instructions}\n{question}\n{formatted_choices}"`) or explicitly handling when `instructions` is empty.

```python
prompt = f"\n{question}\n{formatted_choices}"
```
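One way to handle the empty case explicitly, as the comment suggests. This is a sketch assuming `instructions`, `question`, and `formatted_choices` are the local variables of that prompt function:

```python
# Join only the non-empty parts, so the prompt does not start with a
# stray newline when instructions is empty.
parts = [instructions, question, formatted_choices]
prompt = "\n".join(part for part in parts if part)
```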
```python
return LogprobCorpusMetricInput(golds=gold_ixs, preds=np.argmax(choices_logprob))


class TargetPerplexityPreparator:
```
Why introduce a new class instead of adding an `is_target` parameter (defaulting to `False`) to the next one? (especially when so much of the code is the same)
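A sketch of that suggestion, merging the two preparators behind a flag. The class name, constructor arguments, and `Doc` fields here are assumptions based on this thread, not the actual lighteval code:

```python
class PerplexityPreparator:
    # Hypothetical merged version of the two preparators in this thread.
    def __init__(self, units_type: str, is_target: bool = False):
        self.units_type = units_type
        self.is_target = is_target

    def prepare(self, doc, model_response):
        # The shared preparation logic lives here once; only the choice
        # of text differs between the two former classes.
        text = doc.choices[doc.gold_index] if self.is_target else doc.query
        ...
```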
```python
if num_samples > 1 and self.generation_config_dict["temperature"] == 0:
    raise ValueError(
        "You cannot generate multiple samples with temperature=0. Please set temperature > 0. Or use a non sampling metric."
    )
```
I wonder if we could put this one in the abstract class instead.
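A sketch of what that could look like, hoisting the check into a shared base class so every backend validates once. The class and method names are assumptions for illustration, not lighteval's actual API:

```python
from abc import ABC, abstractmethod


class Model(ABC):
    # Hypothetical shared base class: the check runs once for all backends.
    def _validate_sampling_args(self, num_samples: int, temperature: float) -> None:
        if num_samples > 1 and temperature == 0:
            raise ValueError(
                "You cannot generate multiple samples with temperature=0. "
                "Please set temperature > 0, or use a non-sampling metric."
            )

    @abstractmethod
    def generate(self, docs: list) -> list:
        ...
```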
```python
pad_amount = global_max_choices - cont_batch.shape[0]
padded = F.pad(cont_batch, (0, pad_amount), value=-1)
```
Shouldn't it be

```python
pad_amount = global_max_choices - cont_batch.shape[1]
padded = F.pad(cont_batch, (0, pad_amount), value=-1)
```

here?
Hmm, then I get other shape errors in `torch.stack`. Something looks wrong here.
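For context on the suggestion above: with a two-element pad tuple, `F.pad` pads the last dimension of the tensor, so for a 2-D `cont_batch` the deficit should indeed be computed from `shape[1]` (equivalently `shape[-1]`), not `shape[0]`. A minimal standalone check, with made-up shapes:

```python
import torch
import torch.nn.functional as F

cont_batch = torch.zeros(4, 7)  # e.g. (batch, num_choices)
global_max_choices = 10

# A (left, right) pad tuple applies to the *last* dim, hence shape[1] here.
pad_amount = global_max_choices - cont_batch.shape[1]
padded = F.pad(cont_batch, (0, pad_amount), value=-1)
print(padded.shape)  # torch.Size([4, 10])

# torch.stack later requires all padded tensors to share this shape,
# which is why a wrong pad dimension surfaces there as a shape error.
```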
## What does this PR do?

This PR gives the prompt-building logic in lighteval a much-needed spring cleaning. The main goal: ditch legacy bloat, make things less painful for users and contributors, and unlock support for more complex benchmarks 🔥

### Highlights

- **Prompt Manager Overhaul:** Each model now owns its own PromptManager instance, with custom params for every flavor of prompt (multimodal, API, multiturn, you name it).
- **system-prompt:** now part of the model config.
- **use-chat-template:** now part of the model config.
- **Metrics Slimdown:** Metrics now only care about `SamplingMethod` (generative or loglikelihood). Say goodbye to `use_case` and all those old request types.
- **Request Layer Gone:** Models get the raw `Doc` directly; no more unnecessary `request` wrappers bloating the code.
- **Unified ModelResponse:** All models return a single `ModelResponse` type, whether generative or loglikelihood. This means simpler logging and metric computation.
- **Consistent Metric Signatures:** Every metric now uses the same function signature: `compute(doc: Doc, model_response: ModelResponse)` (see the sketch below).
- **Standardized Details:** Each sample's details now always include three fields: doc, metric, and model_response.
- **Generative Metrics Unified:** All generative metrics now work the same way. If users want greedy generation, they need to set temperature to 0. **An exception is raised if the user tries to run a sampling metric with temperature = 0.**
- **Removed Loglikelihood Single Token:** bloated and almost never used.
- **Tests:** All tests pass, and no changes were needed to expected values.

### Why?

- Less code, fewer headaches.
- Easier to add new benchmarks (including weird and wonderful ones).
- More user-friendly inspection tools.
- A single, unified way to handle prompts, responses, and metrics.

---------

Co-authored-by: Clémentine Fourrier <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: [email protected] <[email protected]>
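To illustrate the unified signature, here is a sketch of a metric under the new API, reusing the illustrative `Doc` and `ModelResponse` fields from the overview sketch above; the metric body is an example, not copied from lighteval:

```python
def exact_match(doc: Doc, model_response: ModelResponse) -> float:
    # Every metric now takes the same two arguments, whatever it computes.
    gold = doc.choices[doc.gold_index]
    prediction = model_response.text[0] if model_response.text else ""
    return float(prediction.strip() == gold.strip())
```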
[Image: architecture of lighteval]
[Image: Example details dataset]