feat: add batch evaluation method for pipelines #2942
Merged
Commits
All 29 commits are by julian-risch:

9d1f2dd  add basic pipeline.eval_batch for qa without filters
4a398a3  Merge branch 'master' into batch-eval
59b64c3  black formatting
789cc2d  pydoc-markdown
9a6c681  remove batch eval tests failing due to bugs
671826a  remove comment
ad40a55  explain commented out tests
1bda44b  avoid code duplication
7bacd93  black
8cb7a51  mypy
5cf8f0a  pydoc markdown
b3b57f5  add batch option to execute_eval_run
9aad547  pydoc markdown
313a5e8  Merge branch 'master' into batch-eval
2738994  Apply documentation suggestions from code review
d184d20  Apply documentation suggestion from code review
07682eb  add documentation based on review comments
a2d4d6f  Merge branch 'batch-eval' of github.com:deepset-ai/haystack into batc…
195f8a1  black
17f750d  black
16076ea  schema updates
7339b45  remove duplicate tests
0fa1bcb  add separate method for column reordering
a1ac6b4  merge _build_eval_dataframe methods
afd03a5  pylint ignore in function
4b0b242  change type annotation of queries to list only
ab4dccb  one-liner addressing review comment on params dict
af434ee  black
16cf698  markdown files updated
@@ -217,6 +217,65 @@ def eval(

The hunk adds a new `eval_batch()` method to the standard pipeline wrapper, directly after `eval()` and before `print_eval_report()`:

def eval_batch(
    self,
    labels: List[MultiLabel],
    params: Optional[dict] = None,
    sas_model_name_or_path: Optional[str] = None,
    sas_batch_size: int = 32,
    sas_use_gpu: bool = True,
    add_isolated_node_eval: bool = False,
    custom_document_id_field: Optional[str] = None,
    context_matching_min_length: int = 100,
    context_matching_boost_split_overlaps: bool = True,
    context_matching_threshold: float = 65.0,
) -> EvaluationResult:
    """
    Evaluates the pipeline by running it once per query in debug mode
    and putting together all data that is needed for evaluation, for example, calculating metrics.

    To calculate SAS (Semantic Answer Similarity) metrics, specify `sas_model_name_or_path`.

    You can control the scope within which an Answer or a Document is considered correct afterwards
    (see the `document_scope` and `answer_scope` params in `EvaluationResult.calculate_metrics()`).
    For some of these scopes, you need to add the following information during `eval()`:
    - `custom_document_id_field` parameter to select a custom document ID from the document's metadata
      for ID matching (only affects 'document_id' scopes).
    - `context_matching_...` parameters to fine-tune the fuzzy matching mechanism that determines
      whether text contexts match each other (only affects 'context' scopes, default values should
      work most of the time).

    :param labels: The labels to evaluate on.
    :param params: Parameters for the `retriever` and `reader`. For instance,
                   params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}.
    :param sas_model_name_or_path: Sentence transformers semantic textual similarity model you want to use
                                   for the SAS value calculation. It should be a path or a string pointing
                                   to downloadable models.
    :param sas_batch_size: Number of prediction label pairs to encode at once by cross encoder or
                           sentence transformer while calculating SAS.
    :param sas_use_gpu: Whether to use a GPU or the CPU for calculating semantic answer similarity.
                        Falls back to CPU if no GPU is available.
    :param add_isolated_node_eval: Whether to additionally evaluate the reader based on labels as input,
                                   instead of the output of the previous node in the pipeline.
    :param custom_document_id_field: Custom field name within `Document`'s `meta` which identifies the
                                     document and is used as a criterion for matching documents to labels
                                     during evaluation. This is especially useful if you want to match
                                     documents on other criteria (for example, file names) than the default
                                     document IDs, as these could be heavily influenced by preprocessing.
                                     If not set, the default `Document`'s `id` is used as the criterion
                                     for matching documents to labels.
    :param context_matching_min_length: The minimum string length context and candidate need to have
                                        to be scored. Returns 0.0 otherwise.
    :param context_matching_boost_split_overlaps: Whether to boost split overlaps (for example, [AB] <-> [BC])
                                                  that result from different preprocessing parameters.
                                                  If we detect that the score is near a half match and the
                                                  matching part of the candidate is at its boundaries,
                                                  we cut the context on the same side, recalculate the score,
                                                  and take the mean of both. Thus [AB] <-> [BC] (score ~50)
                                                  gets recalculated with B <-> B (score ~100), scoring ~75 in total.
    :param context_matching_threshold: Score threshold that candidates must surpass to be included into
                                       the result list. Range: [0,100]
    """
    output = self.pipeline.eval_batch(
        labels=labels,
        params=params,
        sas_model_name_or_path=sas_model_name_or_path,
        sas_batch_size=sas_batch_size,
        sas_use_gpu=sas_use_gpu,
        add_isolated_node_eval=add_isolated_node_eval,
        custom_document_id_field=custom_document_id_field,
        context_matching_boost_split_overlaps=context_matching_boost_split_overlaps,
        context_matching_min_length=context_matching_min_length,
        context_matching_threshold=context_matching_threshold,
    )
    return output
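For orientation (not part of the diff), here is a minimal usage sketch of the new method. It assumes `pipeline` is an already-built standard QA pipeline (for example, `ExtractiveQAPipeline`) and `eval_labels` is a prepared `List[MultiLabel]`; the SAS model name and the metric keys are illustrative assumptions rather than something this PR prescribes.

```python
from haystack.schema import EvaluationResult

# Assumed to exist already: `pipeline` (e.g. an ExtractiveQAPipeline) and
# `eval_labels` (a List[MultiLabel] built from annotated data).
eval_result: EvaluationResult = pipeline.eval_batch(
    labels=eval_labels,
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}},
    sas_model_name_or_path="cross-encoder/stsb-roberta-large",  # optional; enables SAS metrics
    add_isolated_node_eval=True,  # additionally evaluate the reader on labels as input
)

# Metrics are calculated afterwards; document_scope/answer_scope control what
# counts as a correct match (see EvaluationResult.calculate_metrics()).
metrics = eval_result.calculate_metrics(document_scope="document_id")
print(metrics["Retriever"]["recall_single_hit"])  # metric keys here are illustrative
print(metrics["Reader"]["f1"])
```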
Review comment on `add_isolated_node_eval`:

Do we need this parameter `add_isolated_node_eval`? As a user of this API it wasn't immediately clear to me what it is about and why we need it.

Reply:

Yes, we need it. It's the same parameter as in the standard `run()`. If it is set to `True`, the evaluation is executed with labels as node inputs in addition to the integrated evaluation, where the node inputs are the outputs of the previous node in the pipeline.
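To make that concrete, here is a hedged sketch of how the isolated evaluation is typically consumed, reusing the `pipeline` and `eval_labels` assumed above. The `eval_mode` argument of `calculate_metrics()` and the metric keys are assumptions based on the Haystack 1.x evaluation API, not something shown in this diff.

```python
# With add_isolated_node_eval=True, the reader is evaluated both on the previous
# node's output ("integrated") and on the gold documents from the labels ("isolated").
eval_result = pipeline.eval_batch(labels=eval_labels, add_isolated_node_eval=True)

integrated = eval_result.calculate_metrics(eval_mode="integrated")  # reader sees retriever output
isolated = eval_result.calculate_metrics(eval_mode="isolated")      # reader sees gold documents

# A large gap between the two suggests the retriever, not the reader, is the bottleneck.
print(integrated["Reader"]["f1"], isolated["Reader"]["f1"])
```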