feat: Add auto_alignment_evaluation_of_llm_output.ipynb via upload#1929

Closed
jenniferliangc wants to merge 2 commits into GoogleCloudPlatform:main from jenniferliangc:auto_alignment_evaluation_of_llm_output_jenniferliangc

Conversation

@jenniferliangc

Description

Thank you for opening a Pull Request!
Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Follow the CONTRIBUTING Guide.
  • You are listed as the author in your notebook or README file.
    • Your account is listed in CODEOWNERS for the file(s).
  • Make your Pull Request title follow the https://www.conventionalcommits.org/ specification.
  • Ensure the tests and linter pass (Run nox -s format from the repository root to format).
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@jenniferliangc jenniferliangc requested a review from a team as a code owner April 2, 2025 21:16
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Hello @jenniferliangc, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

This pull request introduces a new Jupyter notebook, auto_alignment_evaluation_of_llm_output.ipynb, designed to evaluate the performance of Large Language Models (LLMs) against ground truth data. The notebook implements a customizable, line-by-line automated evaluator, focusing on use cases requiring high precision. It includes two main evaluation phases: a Rephraser Evaluator (semantic similarity check) and a Final Answer Evaluator (source, ingredient, and instruction sentence scoring). The notebook also handles edge cases where both the ground truth and LLM indicate an inability to answer a question.

Highlights

  • New Notebook: Adds auto_alignment_evaluation_of_llm_output.ipynb for automated LLM output evaluation.
  • Rephraser Evaluator: Implements semantic similarity scoring between LLM output and ground truth rephrased queries.
  • Final Answer Evaluator: Introduces a scoring system based on source selection, ingredient sentence similarity, and instruction sentence similarity, with penalties for missed or extra information.
  • Edge Case Handling: Addresses scenarios where both ground truth and LLM indicate an inability to provide an answer.
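
The scoring scheme described in the highlights above can be sketched as a single combining function. The weights and penalty values below are illustrative placeholders, not values taken from the notebook:

```python
def final_answer_score(source_match: bool,
                       ingredient_sim: float,
                       instruction_sim: float,
                       missed: int = 0,
                       extra: int = 0,
                       penalty: float = 0.05) -> float:
    """Combine the three evaluation signals into one score in [0, 1].

    source_match: whether the LLM picked the same source as ground truth.
    ingredient_sim / instruction_sim: sentence-level similarity in [0, 1].
    missed / extra: counts of missed or extra sentences, each penalized.
    Weights (0.4 / 0.3 / 0.3) and the penalty are hypothetical.
    """
    base = (0.4 * float(source_match)
            + 0.3 * ingredient_sim
            + 0.3 * instruction_sim)
    # Clamp at zero so heavy penalties cannot produce a negative score.
    return max(0.0, base - penalty * (missed + extra))
```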

Changelog

  • gemini/evaluation/auto_alignment_evaluation_of_llm_output.ipynb
    • Initial commit of the notebook.
    • Adds a markdown cell with links to open the notebook in Colab, Colab Enterprise, Vertex AI Workbench, and GitHub.
    • Includes a code cell to generate the links to open the notebook in various platforms.
    • Adds an author attribution section.
    • Implements Rephraser Evaluator using semantic similarity.
    • Implements Final Answer Evaluator with source scoring and ingredient/instruction scoring.
    • Adds functions to extract recipe numbers, ingredients, and instructions from text.
    • Includes functions to calculate semantic similarity and evaluate content.
    • Adds logic to handle cases where no answer is available.
    • Adds code to save the final results to a CSV file.
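
As a rough illustration of the extraction step listed in the changelog, recipe numbers could be pulled out of free text with a regular expression. The pattern below is a guess at the data format, not the notebook's actual code:

```python
import re

def extract_recipe_numbers(text: str) -> list[str]:
    """Pull recipe identifiers like 'Recipe 12' or 'recipe #7' from text.

    The pattern is hypothetical; adjust it to match the actual
    ground-truth and LLM output format used in the notebook.
    """
    return re.findall(r"[Rr]ecipe\s*#?\s*(\d+)", text)
```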
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its suggestions are incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


A notebook ascends,
LLM's truth it now defends,
Precision's keen eye,
No falsehoods slip by,
Eval's new chapter extends.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The notebook introduces a new method for evaluating LLM performance, focusing on automated alignment evaluation. It covers rephraser and final answer evaluators, including source, ingredient sentence, and instruction sentence scoring. The notebook is well-structured and provides a comprehensive approach to LLM evaluation. However, there are some areas that could be improved for clarity and robustness.

Summary of Findings

  • TODO Comment: The notebook contains a TODO comment in a code cell that generates an HTML table. This cell should be removed before the notebook is finalized.
  • Missing Model Definition: The code to calculate cosine similarity uses a variable model without defining it. This will cause an error when the code is executed.
  • Exception Handling: The evaluate_content function includes broad exception handling that catches any errors during the evaluation process. While this prevents the notebook from crashing, it may mask important issues that should be addressed.

Merge Readiness

The notebook provides a valuable tool for automated LLM evaluation. However, the critical issue of the missing model definition must be addressed before merging. The high severity issue related to broad exception handling should also be addressed. I am unable to approve this pull request, and recommend that it not be merged until these issues are resolved. Users should have others review and approve this code before merging.

    df['rephraser_semantic_similarity'] = df.apply(lambda row: calculate_similarity(row['gemini_final_answer_llm'], row['ground_truth_rephrased_query']), axis=1)

critical

The variable model is used here without being defined within this function or globally. This will cause a NameError when the code is executed. You need to initialize the model variable with a SentenceTransformer model before calling model.encode.[^1]

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-mpnet-base-v2') # Or any other model
embeddings1 = model.encode(llm_rephraser, convert_to_tensor=True)
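
For reference, model.encode produces embedding vectors and the similarity score is their cosine. sentence-transformers provides util.cos_sim for this over tensors; the plain-Python sketch below only illustrates the underlying math, it is not the notebook's implementation:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    # Guard against zero vectors, where cosine similarity is undefined.
    return dot / norm if norm else 0.0
```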

    # TODO: REMOVE THIS CELL FROM YOUR NOTEBOOK ###

high

This cell is marked as a TODO and should be removed from the final notebook. It appears to be generating an HTML table for displaying links to open the notebook in various platforms. Ensure this functionality is either moved elsewhere or deemed unnecessary.[^1]

Comment on lines +1164 to +1165
    except Exception as e:  # Broad exception handling to catch any errors
        print(f"Error in case {casenum}: {e}")  # Log the error for debugging

high

This broad exception handling catches any errors during the evaluation process. While this prevents the notebook from crashing, it may mask important issues that should be addressed. Consider using more specific exception handling to catch and log particular errors, or re-raise the exception after logging it.[^1]

    except Exception as e:  # Broad exception handling to catch any errors
        print(f"Error in case {casenum}: {e}")  # Log the error for debugging
        raise  # re-raise so unexpected failures stay visible
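
Another option is to catch only the data errors you expect and let everything else propagate. The sketch below injects the evaluator as a callable because the notebook's evaluate_content function is not shown here; the chosen exception types are illustrative:

```python
def safe_evaluate(evaluator, row, casenum):
    """Run `evaluator` on one row, catching only expected data errors.

    `evaluator` stands in for the notebook's evaluate_content; KeyError
    and ValueError are example exception types for malformed rows.
    Anything unexpected propagates, so real bugs are not masked.
    """
    try:
        return evaluator(row)
    except (KeyError, ValueError) as e:
        print(f"Skipping case {casenum}: {e}")
        return None
```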

Collaborator

@holtskinner holtskinner left a comment

@jenniferliangc Please resolve the lint and spelling errors. Once they are addressed, I will review.

@holtskinner
Collaborator

Closing due to inactivity. Please re-open once the issues are resolved.
