feat: add batch evaluation method for pipelines #2942
Conversation
Co-authored-by: Agnieszka Marzec <[email protected]>
@agnieszka-m Thank you for the detailed feedback. All the change requests are addressed now.
tstadel
left a comment
This already looks quite nice! I left some comments that I think need to be fixed before merging:
- there seems to be a small problem with `add_doc_meta_data_to_answer` within the reader implementation
- adding sas values and sorting the dataframe should not both be done in a method called `_add_sas_to_eval_result` (see the sketch after this exchange)
- there are some duplicate tests that can be deleted (if I didn't overlook something)
I think it's fine not to have them here, as they are only needed by non-standard pipelines so far.
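For illustration, the separation of concerns asked for in the second bullet could look roughly like this. This is a sketch only; the helper names, arguments, and the sort key are assumptions, not the actual implementation in the PR:

```python
from typing import Dict

import pandas as pd


def add_sas_to_eval_result(node_results: Dict[str, pd.DataFrame], sas_scores: Dict[str, pd.Series]) -> None:
    """Attach precomputed semantic answer similarity (SAS) scores and nothing else."""
    for node_name, scores in sas_scores.items():
        node_results[node_name]["sas"] = scores


def sort_eval_result(node_results: Dict[str, pd.DataFrame], sort_by: str = "rank") -> None:
    """Sort each node's dataframe in a separate, clearly named step."""
    for node_name, df in node_results.items():
        node_results[node_name] = df.sort_values(by=sort_by)
```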
vblagoje
left a comment
Left some minor comments, @julian-risch. Not sure if they warrant any changes, but I'll leave the state in "Request changes".
top_k: Optional[int] = None,
batch_size: Optional[int] = None,
labels: Optional[List[MultiLabel]] = None,
add_isolated_node_eval: bool = False,
Do we need the parameter `add_isolated_node_eval`? As a user of this API, it wasn't immediately clear to me what it is about and why we need it.
Yes, we need it. It's the same parameter as in the standard `run()`. If it is set to `True`, the evaluation is executed with labels as node inputs in addition to the integrated evaluation, where the node inputs are the outputs of the previous node in the pipeline.
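For context, a minimal usage sketch of the new method with and without isolated node evaluation. It assumes `reader`, `retriever`, and a list of `MultiLabel` objects named `eval_labels` already exist; the `eval_mode` argument of `calculate_metrics` is used here as in the existing `eval()` workflow:

```python
from haystack.pipelines import ExtractiveQAPipeline

# Assumes reader, retriever, and eval_labels (List[MultiLabel]) are already set up.
pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)

# Integrated evaluation only: every node gets the previous node's output as input.
eval_result = pipeline.eval_batch(
    labels=eval_labels,
    params={"Retriever": {"top_k": 5}},
)

# Additionally run the isolated evaluation: nodes also get the gold labels as
# inputs, e.g. the reader is evaluated on perfect retrieval results.
eval_result_isolated = pipeline.eval_batch(
    labels=eval_labels,
    params={"Retriever": {"top_k": 5}},
    add_isolated_node_eval=True,
)

integrated_metrics = eval_result_isolated.calculate_metrics(eval_mode="integrated")
isolated_metrics = eval_result_isolated.calculate_metrics(eval_mode="isolated")
```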
params: Optional[dict] = None,
sas_model_name_or_path: Optional[str] = None,
sas_batch_size: int = 32,
sas_use_gpu: bool = True,
context_matching_boost_split_overlaps=context_matching_boost_split_overlaps,
context_matching_min_length=context_matching_min_length,
context_matching_threshold=context_matching_threshold,
)
These are the same parameters for both function calls. Not sure if it makes sense to create a dict and then unpack it in the two method calls; maybe that's bad practice, but it would make the code more compact.
Interesting thought. In this case, I would leave it as is because it's part of the user-facing function `pipeline.eval_batch` that users can call directly, and it occurs only twice here. I think listing all the parameters is more intuitive and easier to understand for users than having some dictionary/custom data structure that they first need to understand. If it were a function used only internally, or if it occurred much more often, we could make it more compact with your suggestion, yes, I agree.
I agree, good points.
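For reference, the rejected compact variant would look roughly like this; the function name and the argument values below are placeholders, not the actual code in `base.py`:

```python
# Placeholder standing in for the two real calls that share these arguments.
def build_eval_dataframe(queries, labels, **context_matching_kwargs):
    print(queries, labels, context_matching_kwargs)


# Collect the shared keyword arguments once ...
shared_kwargs = {
    "context_matching_min_length": 100,
    "context_matching_boost_split_overlaps": True,
    "context_matching_threshold": 65.0,
}

# ... and unpack the same dict into both calls instead of repeating every argument.
build_eval_dataframe(queries=["q1"], labels=[], **shared_kwargs)
build_eval_dataframe(queries=["q2"], labels=[], **shared_kwargs)
```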
haystack/pipelines/base.py (outdated)
if params is None:
    params = {}
else:
    params = params.copy()
How about a one-liner: `params = {} if params is None else params.copy()`?
done 👍
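The underlying pattern is a defensive copy: starting from an empty dict or a copy lets the method add its own entries without mutating the caller's `params`. A minimal sketch, where the function name and the added key are illustrative:

```python
from typing import Optional


def eval_batch_stub(params: Optional[dict] = None) -> dict:
    # One-liner replacing the original if/else block.
    params = {} if params is None else params.copy()
    # Safe to add entries now: only the local copy is modified.
    params["add_isolated_node_eval"] = True
    return params


caller_params = {"Retriever": {"top_k": 5}}
eval_batch_stub(caller_params)
assert "add_isolated_node_eval" not in caller_params  # caller's dict is untouched
```

Note that `copy()` is shallow, so nested per-node dicts are still shared with the caller; only top-level keys can be added safely.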
vblagoje
left a comment
Rebase and it LGTM @julian-risch
Change requests are addressed. Thank you for your feedback! @tstadel
Related Issue(s):
closes #2636
Proposed changes:
- new `pipeline.eval_batch` method
- new `_build_eval_dataframe_from_batches` method that calls `_build_eval_dataframe` internally (`_build_eval_dataframe` is very complex already: the method comprises >300 lines of code and would need some refactoring to be simplified)
- `_add_sas_to_eval_result` to avoid code duplication
- eval tests copied to `test/pipelines/test_eval_batch.py` and made to use `pipeline.eval_batch`
- new `use_batch_mode` option for `execute_eval_run`, with the default set to `False` until `pipeline.eval_batch` is always faster than `pipeline.eval`

Limitations:
- `num_processes` or `max_processes` must be set to 1.
- Up for discussion: Should `standard_pipelines.eval` and `standard_pipelines.eval_batch` have a documents parameter that they pass on? We decided no, it's not needed at the moment.
- `run_batch` does not support different filters (or, more generally speaking, any params that differ per query), and thus `eval_batch` cannot support filters that differ per query and its label. Labels must therefore not have filters; for example, the test case `test_extractive_qa_labels_with_filters` won't work with `eval_batch` (see the guard sketch after the list of commented-out tests below).

Currently the following tests are commented out because they are expected to fail due to other issues:
- `test_extractive_qa_eval_translation` because of "TranslationWrapperPipeline returns wrong query format in debug mode" #2964
- `test_qa_multi_retriever_pipeline_eval` because of "Incomplete debug results for pipeline with two retrievers" #2962
- `test_multi_retriever_pipeline_eval` because of "Incomplete debug results for pipeline with two retrievers" #2962
- `test_multi_retriever_pipeline_with_asymmetric_qa_eval` because of "Incomplete debug results for pipeline with two retrievers" #2962
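Because of the filters limitation described above, a caller could guard against labels that carry filters before switching to batch evaluation. A sketch, assuming an aggregated `filters` attribute on `MultiLabel` and an existing `pipeline` and `eval_labels`:

```python
# Hypothetical guard: eval_batch cannot apply per-query filters, so fall back
# to the non-batch eval() when any label carries filters.
labels_with_filters = [label for label in eval_labels if label.filters]

if labels_with_filters:
    eval_result = pipeline.eval(labels=eval_labels, params={"Retriever": {"top_k": 5}})
else:
    eval_result = pipeline.eval_batch(labels=eval_labels, params={"Retriever": {"top_k": 5}})
```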
Pre-flight checklist
I have enabled actions on my fork