feat: Add auto_alignment_evaluation_of_llm_output.ipynb via upload #1929
Conversation
Hello @jenniferliangc, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
This pull request introduces a new Jupyter notebook, auto_alignment_evaluation_of_llm_output.ipynb, designed to evaluate the performance of Large Language Models (LLMs) against ground truth data. The notebook implements a customizable, line-by-line automated evaluator, focusing on use cases requiring high precision. It includes two main evaluation phases: a Rephraser Evaluator (semantic similarity check) and a Final Answer Evaluator (source, ingredient, and instruction sentence scoring). The notebook also handles edge cases where both the ground truth and LLM indicate an inability to answer a question.
Highlights
- New Notebook: Adds `auto_alignment_evaluation_of_llm_output.ipynb` for automated LLM output evaluation.
- Rephraser Evaluator: Implements semantic similarity scoring between LLM output and ground truth rephrased queries.
- Final Answer Evaluator: Introduces a scoring system based on source selection, ingredient sentence similarity, and instruction sentence similarity, with penalties for missed or extra information.
- Edge Case Handling: Addresses scenarios where both ground truth and LLM indicate an inability to provide an answer.
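The edge-case rule in the last highlight can be sketched as a small helper. Everything below (the marker strings, the function names, the 1.0/0.0 scoring) is illustrative, not the notebook's actual code:

```python
# Hypothetical sketch: if both the ground truth and the LLM output signal
# an inability to answer, treat the row as a full match.
NO_ANSWER_MARKERS = ("i cannot answer", "no answer available")

def is_no_answer(text: str) -> bool:
    """Return True if the text signals an inability to answer."""
    lowered = text.lower()
    return any(marker in lowered for marker in NO_ANSWER_MARKERS)

def score_row(ground_truth: str, llm_output: str) -> float:
    """Assign a perfect score when both sides decline to answer."""
    if is_no_answer(ground_truth) and is_no_answer(llm_output):
        return 1.0
    return 0.0  # Placeholder: real scoring would compare content here

print(score_row("I cannot answer this question.", "No answer available."))
```

Without a rule like this, a row where both sides correctly decline to answer would be penalized as a mismatch.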
Changelog
Click here to see the changelog
- gemini/evaluation/auto_alignment_evaluation_of_llm_output.ipynb
- Initial commit of the notebook.
- Adds a markdown cell with links to open the notebook in Colab, Colab Enterprise, Vertex AI Workbench, and GitHub.
- Includes a code cell to generate the links to open the notebook in various platforms.
- Adds an author attribution section.
- Implements Rephraser Evaluator using semantic similarity.
- Implements Final Answer Evaluator with source scoring and ingredient/instruction scoring.
- Adds functions to extract recipe numbers, ingredients, and instructions from text.
- Includes functions to calculate semantic similarity and evaluate content.
- Adds logic to handle cases where no answer is available.
- Adds code to save the final results to a CSV file.
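The extraction helpers listed in the changelog might look roughly like this. The regexes, the `Ingredients:` heading, and the function names are assumptions for illustration, not the notebook's implementation:

```python
import re

def extract_recipe_numbers(text: str) -> list[int]:
    """Find references like 'Recipe 3' in the answer text."""
    return [int(n) for n in re.findall(r"[Rr]ecipe\s+(\d+)", text)]

def extract_ingredients(text: str) -> list[str]:
    """Collect lines under an 'Ingredients:' heading until a blank line."""
    match = re.search(r"Ingredients:\n((?:.+\n?)+?)(?:\n\n|\Z)", text)
    if not match:
        return []
    return [line.strip("- ").strip() for line in match.group(1).splitlines()]

sample = "From Recipe 2:\nIngredients:\n- 2 eggs\n- 1 cup flour\n\nInstructions follow."
print(extract_recipe_numbers(sample))  # [2]
print(extract_ingredients(sample))     # ['2 eggs', '1 cup flour']
```

Splitting the text into sources, ingredients, and instructions first is what makes the line-by-line sentence scoring described above possible.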
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
A notebook ascends,
LLM's truth it now defends,
Precision's keen eye,
No falsehoods slip by,
Eval's new chapter extends.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
The notebook introduces a new method for evaluating LLM performance, focusing on automated alignment evaluation. It covers rephraser and final answer evaluators, including source, ingredient sentence, and instruction sentence scoring. The notebook is well-structured and provides a comprehensive approach to LLM evaluation. However, there are some areas that could be improved for clarity and robustness.
Summary of Findings
- TODO Comment: The notebook contains a TODO comment in a code cell that generates an HTML table. This cell should be removed before the notebook is finalized.
- Missing Model Definition: The code to calculate cosine similarity uses a variable `model` without defining it. This will cause an error when the code is executed.
- Exception Handling: The `evaluate_content` function includes broad exception handling that catches any errors during the evaluation process. While this prevents the notebook from crashing, it may mask important issues that should be addressed.
Merge Readiness
The notebook provides a valuable tool for automated LLM evaluation. However, the critical issue of the missing model definition must be addressed before merging. The high severity issue related to broad exception handling should also be addressed. I am unable to approve this pull request, and recommend that it not be merged until these issues are resolved. Users should have others review and approve this code before merging.
```
},
"outputs": [],
"source": [
"df['rephraser_semantic_similarity'] = df.apply(lambda row: calculate_similarity(row['gemini_final_answer_llm'], row['ground_truth_rephrased_query']), axis=1)"
```
The variable `model` is used here without being defined within this function or globally. This will cause a `NameError` when the code is executed. You need to initialize the `model` variable with a `SentenceTransformer` model before calling `model.encode`.[^1]
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-mpnet-base-v2')  # Or any other model
embeddings1 = model.encode(llm_rephraser, convert_to_tensor=True)
```
Style Guide References
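For reference, the similarity score this fix ultimately computes reduces to a dot product of normalized vectors. A minimal NumPy sketch of the metric itself, independent of sentence-transformers (illustrative only):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the product of norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical directions score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0])))  # 1.0
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0])))  # 0.0
```

Loading the `SentenceTransformer` model once at module level, rather than inside the scoring function, also avoids re-initializing it on every `df.apply` row.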
```
}
],
"source": [
"# TODO: REMOVE THIS CELL FROM YOUR NOTEBOOK ###\n",
```
```
" except Exception as e: # Broad exception handling to catch any errors\n",
" print(f\"Error in case {casenum}: {e}\") # Log the error for debugging\n",
```
This broad exception handling catches any errors during the evaluation process. While this prevents the notebook from crashing, it may mask important issues that should be addressed. Consider using more specific exception handling to catch and log particular errors, or re-raise the exception after logging it.[^1]
```python
except Exception as e:  # Broad exception handling to catch any errors
    print(f"Error in case {casenum}: {e}")  # Log the error for debugging
    raise  # Bare raise re-raises the exception with its traceback intact
```
Style Guide References
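A sketch of the narrower handling suggested above; the exception types and the `evaluate_case` wrapper are assumptions, since which errors actually occur depends on the notebook's evaluation step:

```python
def evaluate_case(casenum: int, payload: dict) -> float:
    """Hypothetical per-case evaluation wrapper with targeted handling."""
    try:
        return float(payload["score"])
    except KeyError as e:
        # A missing field is a data problem worth surfacing, not hiding.
        print(f"Missing key in case {casenum}: {e}")
        raise
    except (TypeError, ValueError) as e:
        # Malformed values are logged, then re-raised for visibility.
        print(f"Bad value in case {casenum}: {e}")
        raise

print(evaluate_case(1, {"score": "0.75"}))  # 0.75
```

Catching only the exceptions you expect means an unanticipated failure still stops the run instead of being silently logged and skipped.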
holtskinner
left a comment
@jenniferliangc Please resolve the lint and spelling errors. Once they are addressed, I will review.
Closing due to inactivity. Please re-open once the issues are resolved.
Description
Thank you for opening a Pull Request!
Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
- Read the `CONTRIBUTING` Guide.
- Check `CODEOWNERS` for the file(s).
- Ensure the code is formatted (run `nox -s format` from the repository root to format).

Fixes #<issue_number_goes_here> 🦕