feat: Add auto_alignment_evaluation_of_llm_output.ipynb via upload #1929
Conversation
Hello @jenniferliangc, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
This pull request introduces a new Jupyter notebook, auto_alignment_evaluation_of_llm_output.ipynb, designed to evaluate the performance of Large Language Models (LLMs) against ground truth data. The notebook implements a customizable, line-by-line automated evaluator, focusing on use cases requiring high precision. It includes two main evaluation phases: a Rephraser Evaluator (semantic similarity check) and a Final Answer Evaluator (source, ingredient, and instruction sentence scoring). The notebook also handles edge cases where both the ground truth and LLM indicate an inability to answer a question.
Highlights
- New Notebook: Adds `auto_alignment_evaluation_of_llm_output.ipynb` for automated LLM output evaluation.
- Rephraser Evaluator: Implements semantic similarity scoring between LLM output and ground truth rephrased queries.
- Final Answer Evaluator: Introduces a scoring system based on source selection, ingredient sentence similarity, and instruction sentence similarity, with penalties for missed or extra information.
- Edge Case Handling: Addresses scenarios where both ground truth and LLM indicate an inability to provide an answer.
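The edge-case rule in the last highlight can be sketched as a small helper. Everything below (the marker strings, the function names, the 1.0/0.0 scoring) is illustrative, not the notebook's actual code:

```python
# Hypothetical sketch: if both the ground truth and the LLM output signal
# an inability to answer, treat the row as a full match.
NO_ANSWER_MARKERS = ("i cannot answer", "no answer available")

def is_no_answer(text: str) -> bool:
    """Return True if the text signals an inability to answer."""
    lowered = text.lower()
    return any(marker in lowered for marker in NO_ANSWER_MARKERS)

def score_row(ground_truth: str, llm_output: str) -> float:
    """Assign a perfect score when both sides decline to answer."""
    if is_no_answer(ground_truth) and is_no_answer(llm_output):
        return 1.0
    return 0.0  # Placeholder: real scoring would compare content here

print(score_row("I cannot answer this question.", "No answer available."))
```

Without a rule like this, a row where both sides correctly decline to answer would be penalized as a mismatch.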
Changelog
Click here to see the changelog
- gemini/evaluation/auto_alignment_evaluation_of_llm_output.ipynb
- Initial commit of the notebook.
- Adds a markdown cell with links to open the notebook in Colab, Colab Enterprise, Vertex AI Workbench, and GitHub.
- Includes a code cell to generate the links to open the notebook in various platforms.
- Adds an author attribution section.
- Implements Rephraser Evaluator using semantic similarity.
- Implements Final Answer Evaluator with source scoring and ingredient/instruction scoring.
- Adds functions to extract recipe numbers, ingredients, and instructions from text.
- Includes functions to calculate semantic similarity and evaluate content.
- Adds logic to handle cases where no answer is available.
- Adds code to save the final results to a CSV file.
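The extraction helpers listed in the changelog might look roughly like this. The regexes, the `Ingredients:` heading, and the function names are assumptions for illustration, not the notebook's implementation:

```python
import re

def extract_recipe_numbers(text: str) -> list[int]:
    """Find references like 'Recipe 3' in the answer text."""
    return [int(n) for n in re.findall(r"[Rr]ecipe\s+(\d+)", text)]

def extract_ingredients(text: str) -> list[str]:
    """Collect lines under an 'Ingredients:' heading until a blank line."""
    match = re.search(r"Ingredients:\n((?:.+\n?)+?)(?:\n\n|\Z)", text)
    if not match:
        return []
    return [line.strip("- ").strip() for line in match.group(1).splitlines()]

sample = "From Recipe 2:\nIngredients:\n- 2 eggs\n- 1 cup flour\n\nInstructions follow."
print(extract_recipe_numbers(sample))  # [2]
print(extract_ingredients(sample))     # ['2 eggs', '1 cup flour']
```

Splitting the text into sources, ingredients, and instructions first is what makes the line-by-line sentence scoring described above possible.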
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
A notebook ascends,
LLM's truth it now defends,
Precision's keen eye,
No falsehoods slip by,
Eval's new chapter extends.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
The notebook introduces a new method for evaluating LLM performance, focusing on automated alignment evaluation. It covers rephraser and final answer evaluators, including source, ingredient sentence, and instruction sentence scoring. The notebook is well-structured and provides a comprehensive approach to LLM evaluation. However, there are some areas that could be improved for clarity and robustness.
Summary of Findings
- TODO Comment: The notebook contains a TODO comment in a code cell that generates an HTML table. This cell should be removed before the notebook is finalized.
- Missing Model Definition: The code to calculate cosine similarity uses a variable `model` without defining it. This will cause an error when the code is executed.
- Exception Handling: The `evaluate_content` function includes broad exception handling that catches any errors during the evaluation process. While this prevents the notebook from crashing, it may mask important issues that should be addressed.
Merge Readiness
The notebook provides a valuable tool for automated LLM evaluation. However, the critical issue of the missing model definition must be addressed before merging. The high severity issue related to broad exception handling should also be addressed. I am unable to approve this pull request, and recommend that it not be merged until these issues are resolved. Users should have others review and approve this code before merging.
```
},
"outputs": [],
"source": [
"df['rephraser_semantic_similarity'] = df.apply(lambda row: calculate_similarity(row['gemini_final_answer_llm'], row['ground_truth_rephrased_query']), axis=1)"
```
The variable `model` is used here without being defined within this function or globally. This will cause a `NameError` when the code is executed. You need to initialize the `model` variable with a `SentenceTransformer` model before calling `model.encode`.[^1]
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-mpnet-base-v2')  # Or any other model
embeddings1 = model.encode(llm_rephraser, convert_to_tensor=True)
```
Style Guide References
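For reference, the similarity score this fix ultimately computes reduces to a dot product of normalized vectors. A minimal NumPy sketch of the metric itself, independent of sentence-transformers (illustrative only):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the product of norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical directions score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0])))  # 1.0
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0])))  # 0.0
```

Loading the `SentenceTransformer` model once at module level, rather than inside the scoring function, also avoids re-initializing it on every `df.apply` row.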
```
}
],
"source": [
"# TODO: REMOVE THIS CELL FROM YOUR NOTEBOOK ###\n",
```
```
" except Exception as e: # Broad exception handling to catch any errors\n",
" print(f\"Error in case {casenum}: {e}\") # Log the error for debugging\n",
```
This broad exception handling catches any errors during the evaluation process. While this prevents the notebook from crashing, it may mask important issues that should be addressed. Consider using more specific exception handling to catch and log particular errors, or re-raise the exception after logging it.[^1]
```python
except Exception as e:  # Broad exception handling to catch any errors
    print(f"Error in case {casenum}: {e}")  # Log the error for debugging
    raise  # Bare raise re-raises the exception with its traceback intact
```
Style Guide References
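A sketch of the narrower handling suggested above; the exception types and the `evaluate_case` wrapper are assumptions, since which errors actually occur depends on the notebook's evaluation step:

```python
def evaluate_case(casenum: int, payload: dict) -> float:
    """Hypothetical per-case evaluation wrapper with targeted handling."""
    try:
        return float(payload["score"])
    except KeyError as e:
        # A missing field is a data problem worth surfacing, not hiding.
        print(f"Missing key in case {casenum}: {e}")
        raise
    except (TypeError, ValueError) as e:
        # Malformed values are logged, then re-raised for visibility.
        print(f"Bad value in case {casenum}: {e}")
        raise

print(evaluate_case(1, {"score": "0.75"}))  # 0.75
```

Catching only the exceptions you expect means an unanticipated failure still stops the run instead of being silently logged and skipped.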
holtskinner
left a comment
@jenniferliangc Please resolve the lint and spelling errors. Once they are addressed, I will review.
Closing due to inactivity. Please re-open once the issues are resolved.
Description
Thank you for opening a Pull Request!
Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
- Read the `CONTRIBUTING` Guide.
- Check `CODEOWNERS` for the file(s).
- Ensure the code is formatted (run `nox -s format` from the repository root to format).

Fixes #<issue_number_goes_here> 🦕