Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(agents-api): added mmr to chat #1013

Merged
merged 8 commits into from
Jan 4, 2025
Merged

Conversation

Vedantsahai18
Copy link
Member

@Vedantsahai18 Vedantsahai18 commented Jan 3, 2025

PR Type

Enhancement


Description

  • Introduced Maximal Marginal Relevance (MMR) to refine document selection.

  • Filtered document references to ensure valid embeddings for MMR.

  • Added conditions to enable MMR based on recall options.

  • Integrated numpy for handling embeddings in MMR implementation.


Changes walkthrough 📝

Relevant files
Enhancement
gather_messages.py
Implement MMR for document selection in chat                         

agents-api/agents_api/queries/chat/gather_messages.py

  • Added MMR implementation to refine document selection.
  • Filtered document references without embeddings.
  • Integrated numpy for embedding handling.
  • Applied MMR based on recall options and limits.
  • +27/-1   

    💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information


    Important

    Introduces Maximal Marginal Relevance (MMR) in gather_messages.py for improved document selection in chat based on recall options.

    • Behavior:
      • Adds MMR to gather_messages() in gather_messages.py for document selection based on recall options.
      • Filters doc_references to ensure embeddings are present before applying MMR.
      • Uses maximal_marginal_relevance() to select document indices based on query and document embeddings.
    • Imports:
      • Adds imports for maximal_marginal_relevance and numpy in gather_messages.py.
    • Misc:
      • Adds get_history import in utils.py.
      • Adds hasattr checks in lifespan() in app.py for state attributes.
      • Fixes return logic in aembedding() in litellm.py.

    This description was created by Ellipsis for be69c60. It will automatically update as commits are pushed.

    Copy link
    Contributor

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Potential Bug

    The MMR filtering of documents without embeddings happens after the limit check, which could lead to fewer documents than the limit being returned even when more valid documents are available

    doc_references = [
        doc for doc in doc_references if doc.snippet.embedding is not None
    ]
    Code Duplication

    The comment "Apply MMR" appears twice in succession, and the second MMR application could be combined with the indices calculation for better readability

    indices = maximal_marginal_relevance(
        np.asarray(query_embedding),
        [doc.snippet.embedding for doc in doc_references],
        k=recall_options.limit,
    )
    # Apply MMR
    doc_references = [
        doc for i, doc in enumerate(doc_references) if i in set(indices)
    ]

    Copy link
    Contributor

    @ellipsis-dev ellipsis-dev bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    👍 Looks good to me! Reviewed everything up to c00d08d in 24 seconds

    More details
    • Looked at 45 lines of code in 1 files
    • Skipped 0 files when reviewing.
    • Skipped posting 3 drafted comments based on config settings.
    1. agents-api/agents_api/queries/chat/gather_messages.py:149
    • Draft comment:
      Ensure that filtering out references without embeddings is the intended behavior, as it might result in an empty list if none have embeddings.
    • Reason this comment was not posted:
      Confidence changes required: 50%
      The code uses a list comprehension to filter doc_references based on the presence of embeddings. This is a temporary fix, but it should be noted that this could potentially remove all references if none have embeddings, which might not be the intended behavior.
    2. agents-api/agents_api/queries/chat/gather_messages.py:155
    • Draft comment:
      Ensure query_embedding is always a list or array-like to avoid potential errors with np.asarray.
    • Reason this comment was not posted:
      Confidence changes required: 50%
      The code uses np.asarray(query_embedding) which assumes query_embedding is a list or array-like. If query_embedding is not in the expected format, this could raise an error.
    3. agents-api/agents_api/queries/chat/gather_messages.py:161
    • Draft comment:
      Consider using a list comprehension with a condition check instead of set(indices) for potentially better performance.
    • Reason this comment was not posted:
      Confidence changes required: 33%
      The code uses set(indices) to filter doc_references. This is efficient but could be optimized by using a list comprehension directly with a condition check.

    Workflow ID: wflow_RJnb9YudrHKaruPp


    You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

    Copy link
    Contributor

    qodo-merge-pro-for-open-source bot commented Jan 3, 2025

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Score
    Possible issue
    ✅ Add error handling for embedding processing to prevent crashes from malformed data
    Suggestion Impact:The commit added a check for valid embeddings by ensuring there are at least 2 documents with embeddings before applying MMR, which helps prevent crashes from malformed data

    code diff:

    +            and len([doc for doc in doc_references if doc.snippet.embedding is not None]) >= 2

    Add error handling for potential numpy array conversion failures when processing
    embeddings, as some embeddings might be malformed or have unexpected formats.

    agents-api/agents_api/queries/chat/gather_messages.py [154-158]

    -indices = maximal_marginal_relevance(
    -    np.asarray(query_embedding),
    -    [doc.snippet.embedding for doc in doc_references],
    -    k=recall_options.limit,
    -)
    +try:
    +    indices = maximal_marginal_relevance(
    +        np.asarray(query_embedding),
    +        [doc.snippet.embedding for doc in doc_references],
    +        k=recall_options.limit,
    +    )
    +except (ValueError, TypeError) as e:
    +    logger.error(f"Failed to process embeddings for MMR: {e}")
    +    return doc_references[:recall_options.limit]
    • Apply this suggestion
    Suggestion importance[1-10]: 8

    Why: The suggestion adds crucial error handling for potential numpy array conversion failures, which could prevent application crashes in production. This is particularly important as the code deals with external data that might be malformed.

    8
    ✅ Validate embedding dimensions to prevent array processing errors
    Suggestion Impact:The commit adds validation for MMR processing by checking if there are at least 2 documents with non-null embeddings before applying MMR

    code diff:

    +        if (
    +            recall_options.mmr_strength > 0
    +            and len(doc_references) > recall_options.limit
    +            and recall_options.mode != "text"
    +            and len([doc for doc in doc_references if doc.snippet.embedding is not None]) >= 2

    Validate the dimensions of embeddings match between query and documents before
    processing to prevent numpy dimension mismatch errors.

    agents-api/agents_api/queries/chat/gather_messages.py [154-158]

    +query_emb = np.asarray(query_embedding)
    +doc_embeddings = [doc.snippet.embedding for doc in doc_references]
    +if not all(len(emb) == len(query_emb) for emb in doc_embeddings):
    +    logger.error("Embedding dimensions mismatch")
    +    return doc_references[:recall_options.limit]
     indices = maximal_marginal_relevance(
    -    np.asarray(query_embedding),
    -    [doc.snippet.embedding for doc in doc_references],
    +    query_emb,
    +    doc_embeddings,
         k=recall_options.limit,
     )
    • Apply this suggestion
    Suggestion importance[1-10]: 8

    Why: This suggestion adds critical validation of embedding dimensions, which is essential for preventing numpy dimension mismatch errors that could occur during MMR processing. Such errors could silently break the functionality.

    8
    ✅ Add validation to ensure sufficient valid documents exist before applying document selection algorithm

    Validate that there are enough documents with valid embeddings before applying MMR
    to avoid potential index errors or empty results.

    agents-api/agents_api/queries/chat/gather_messages.py [139-146]

     if (
         recall_options.mmr_strength > 0
         and len(doc_references) > recall_options.limit
         and recall_options.mode != "text"
    +    and len([doc for doc in doc_references if doc.snippet.embedding is not None]) >= 2
     ):

    [Suggestion has been applied]

    Suggestion importance[1-10]: 7

    Why: The suggestion adds an important validation check to ensure there are enough valid documents before applying MMR, preventing potential errors from insufficient data. This complements the existing filtering of documents without embeddings.

    7

    Copy link
    Contributor

    qodo-merge-pro-for-open-source bot commented Jan 3, 2025

    CI Failure Feedback 🧐

    (Checks updated until commit e92fb4d)

    Action: Lint-And-Format

    Failed stage: Run stefanzweifel/git-auto-commit-action@v4 [❌]

    Failure summary:

    The action failed during the git auto-commit process because:

  • The push to branch f/add-mmr-pg was rejected due to the local branch being behind its remote
    counterpart
  • Git returned a non-fast-forward error, indicating that the remote branch has commits that aren't
    present in the local branch
  • The error suggests running 'git pull' to integrate remote changes before pushing again

  • Relevant error logs:
    1:  ##[group]Operating System
    2:  Ubuntu
    ...
    
    1256:  �[36;1muv run poe lint�[0m
    1257:  shell: /usr/bin/bash -e {0}
    1258:  env:
    1259:  UV_CACHE_DIR: /home/runner/work/_temp/setup-uv-cache
    1260:  ##[endgroup]
    1261:  �[37mPoe =>�[0m �[94mruff format�[0m
    1262:  1 file reformatted, 283 files left unchanged
    1263:  �[37mPoe =>�[0m �[94mruff check�[0m
    1264:  Fixed 1 error:
    1265:  - agents_api/activities/task_steps/base_evaluate.py:
    1266:  1 × F841 (unused-variable)
    1267:  Found 1 error (1 fixed, 0 remaining).
    ...
    
    1301:  * [new branch]      f/entries-summarization -> origin/f/entries-summarization
    1302:  * [new branch]      f/grafana-traefik       -> origin/f/grafana-traefik
    1303:  * [new branch]      f/increase-test-coverage -> origin/f/increase-test-coverage
    1304:  * [new branch]      f/integrations-sentry   -> origin/f/integrations-sentry
    1305:  * [new branch]      f/prompt-step-run-tools -> origin/f/prompt-step-run-tools
    1306:  * [new branch]      f/remote-browser-args-connect-url -> origin/f/remote-browser-args-connect-url
    1307:  * [new branch]      f/scispace-poc          -> origin/f/scispace-poc
    1308:  * [new branch]      f/switch-to-pg          -> origin/f/switch-to-pg
    1309:  * [new branch]      f/tasks-validation-errors-enhancement -> origin/f/tasks-validation-errors-enhancement
    ...
    
    1326:  [f/add-mmr-pg 21d0e33] refactor: Lint agents-api (CI)
    1327:  Author: Vedantsahai18 <[email protected]>
    1328:  1 file changed, 2 insertions(+), 4 deletions(-)
    1329:  INPUT_TAGGING_MESSAGE: 
    1330:  No tagging message supplied. No tag will be added.
    1331:  INPUT_PUSH_OPTIONS: 
    1332:  To https://github.com/julep-ai/julep
    1333:  ! [rejected]        f/add-mmr-pg -> f/add-mmr-pg (non-fast-forward)
    1334:  error: failed to push some refs to 'https://github.com/julep-ai/julep'
    1335:  hint: Updates were rejected because the tip of your current branch is behind
    1336:  hint: its remote counterpart. If you want to integrate the remote changes,
    1337:  hint: use 'git pull' before pushing again.
    1338:  hint: See the 'Note about fast-forwards' in 'git push --help' for details.
    1339:  Error: Invalid status code: 1
    1340:  at ChildProcess.<anonymous> (/home/runner/work/_actions/stefanzweifel/git-auto-commit-action/v4/index.js:17:19)
    1341:  at ChildProcess.emit (node:events:519:28)
    1342:  at maybeClose (node:internal/child_process:1105:16)
    1343:  at ChildProcess._handle.onexit (node:internal/child_process:305:5) {
    1344:  code: 1
    1345:  }
    1346:  Error: Invalid status code: 1
    ...
    
    1352:  [command]/usr/bin/git version
    1353:  git version 2.47.1
    1354:  Temporarily overriding HOME='/home/runner/work/_temp/c70d4abf-f37c-477d-95f3-0ba15fe68f0a' before making global git config changes
    1355:  Adding repository directory to the temporary git global config as a safe directory
    1356:  [command]/usr/bin/git config --global --add safe.directory /home/runner/work/julep/julep
    1357:  [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
    1358:  [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
    1359:  fatal: No url found for submodule path 'drafts/cozo' in .gitmodules
    1360:  ##[warning]The process '/usr/bin/git' failed with exit code 128
    

    ✨ CI feedback usage guide:

    The CI feedback tool (/checks) automatically triggers when a PR has a failed check.
    The tool analyzes the failed checks and provides several feedbacks:

    • Failed stage
    • Failed test name
    • Failure summary
    • Relevant error logs

    In addition to being automatically triggered, the tool can also be invoked manually by commenting on a PR:

    /checks "https://github.com/{repo_name}/actions/runs/{run_number}/job/{job_number}"
    

    where {repo_name} is the name of the repository, {run_number} is the run number of the failed check, and {job_number} is the job number of the failed check.

    Configuration options

    • enable_auto_checks_feedback - if set to true, the tool will automatically provide feedback when a check is failed. Default is true.
    • excluded_checks_list - a list of checks to exclude from the feedback, for example: ["check1", "check2"]. Default is an empty list.
    • enable_help_text - if set to true, the tool will provide a help message with the feedback. Default is true.
    • persistent_comment - if set to true, the tool will overwrite a previous checks comment with the new feedback. Default is true.
    • final_update_message - if persistent_comment is true and updating a previous checks message, the tool will also create a new message: "Persistent checks updated to latest commit". Default is true.

    See more information about the checks tool in the docs.

    Vedantsahai18 and others added 2 commits January 3, 2025 12:18
    Co-authored-by: qodo-merge-pro-for-open-source[bot] <189517486+qodo-merge-pro-for-open-source[bot]@users.noreply.github.com>
    Copy link
    Contributor

    @ellipsis-dev ellipsis-dev bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    👍 Looks good to me! Incremental review on 5608d8e in 31 seconds

    More details
    • Looked at 1045 lines of code in 14 files
    • Skipped 0 files when reviewing.
    • Skipped posting 2 drafted comments based on config settings.
    1. agents-api/agents_api/queries/chat/gather_messages.py:80
    • Draft comment:
      The check for if recall: is redundant because the subsequent check if recall and recall_options: already covers the case where recall is False. Consider removing the first check to simplify the logic.
    • Reason this comment was not posted:
      Comment was not on a valid diff hunk.
    2. agents-api/agents_api/queries/chat/gather_messages.py:161
    • Draft comment:
      Using set(indices) is unnecessary since indices is already a list of unique values. You can directly use indices in the list comprehension.
    • Reason this comment was not posted:
      Confidence changes required: 50%
      In gather_messages.py, the code uses set(indices) which is unnecessary since indices is already a list of unique values. This can be simplified by directly using indices.

    Workflow ID: wflow_O9uUJyIGRVtjV8eW


    You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

    Copy link
    Contributor

    @ellipsis-dev ellipsis-dev bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    👍 Looks good to me! Incremental review on 6f1521c in 46 seconds

    More details
    • Looked at 446 lines of code in 9 files
    • Skipped 0 files when reviewing.
    • Skipped posting 1 drafted comments based on config settings.
    1. agents-api/agents_api/activities/utils.py:81
    • Draft comment:
      Consider using yaml.safe_load(s) instead of yaml.load(s) to avoid potential security risks associated with loading YAML data.
    • Reason this comment was not posted:
      Comment was on unchanged code.

    Workflow ID: wflow_TKOTdxHdYcqvWyvN


    You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

    Copy link
    Contributor

    @ellipsis-dev ellipsis-dev bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    👍 Looks good to me! Incremental review on 9c12ad3 in 1 minute and 1 seconds

    More details
    • Looked at 588 lines of code in 6 files
    • Skipped 0 files when reviewing.
    • Skipped posting 1 drafted comments based on config settings.
    1. cookbooks/04-hook-generator-trending-reels.py:1
    • Draft comment:
      The PR description mentions changes in gather_messages.py related to MMR, but the file is not visible in the diff. Please ensure all relevant files are included in the PR.
    • Reason this comment was not posted:
      Comment did not seem useful.

    Workflow ID: wflow_zdtXJKX4W4HTvAEv


    You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

    Copy link
    Contributor

    @ellipsis-dev ellipsis-dev bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    👍 Looks good to me! Incremental review on 7412ca2 in 14 seconds

    More details
    • Looked at 136 lines of code in 5 files
    • Skipped 0 files when reviewing.
    • Skipped posting 1 drafted comments based on config settings.
    1. agents-api/agents_api/queries/executions/list_execution_transitions.py:63
    • Draft comment:
      The @pg_query(debug=True) decorator was removed. If debugging is still needed, consider adding a logging mechanism or re-adding the debug flag.
    • Reason this comment was not posted:
      Confidence changes required: 50%
      The decorator @pg_query(debug=True) was removed, which might affect debugging capabilities. This change should be noted.

    Workflow ID: wflow_dSYhlVyan1bxfsMp


    You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

    Copy link
    Contributor

    @ellipsis-dev ellipsis-dev bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    👍 Looks good to me! Incremental review on be69c60 in 16 seconds

    More details
    • Looked at 16 lines of code in 1 files
    • Skipped 0 files when reviewing.
    • Skipped posting 0 drafted comments based on config settings.

    Workflow ID: wflow_AgYgsQJU0Y35L178


    You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

    @HamadaSalhab HamadaSalhab merged commit dfc9af3 into f/switch-to-pg Jan 4, 2025
    11 checks passed
    @HamadaSalhab HamadaSalhab deleted the f/add-mmr-pg branch January 4, 2025 05:51
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    2 participants