Skip to content

feat: add SDK support for search records with similarity #4023

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

jfcalvo
Copy link
Contributor

@jfcalvo jfcalvo commented Oct 23, 2023

Description

This PR suggest an implementation to give support to /api/me/datasets/{dataset_id}/records/search endpoint with vector similarity searching.

Here is a full example working with similarity search from python SDK using no real 100% fake data:

rg.init(api_key=owner.api_key)
workspace = Workspace.create(name="test")

feedback_dataset.add_vector_settings(VectorSettings(name="vector", dimensions=2))

feedback_dataset.add_records(
    [
        FeedbackRecord(fields={"text": "hello"}, vectors={"vector": [1, 1]}),
        FeedbackRecord(
            fields={"text": "hello"},
            vectors={"vector": [2, 2]},
            responses=[ResponseSchema(status="discarded")],
        ),
        FeedbackRecord(
            fields={"text": "hello"},
            vectors={"vector": [4, 4]},
            responses=[ResponseSchema(status="submitted", values={"question": {"value": "answer"}})],
        ),
    ]
)

remote = feedback_dataset.push_to_argilla("test_find_similar_records", workspace=workspace)
only_submitted_and_discarded_records = remote.filter_by(response_status=["submitted", "discarded"])

records_with_scores = only_submitted_and_discarded_records.find_similar_records(
    vector_name="vector",
    value=[1, 1],
    max_results=3,
)
# Even if the used vector is the same that contains the record 0, the search will skip it since it has no submitted or discarded responses.

Closes #4020

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested

Tested locally

Checklist

  • I added relevant documentation
  • follows the style guidelines of this project
  • I did a self-review of my code
  • I made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I filled out the contributor form (see text above)
  • I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

@jfcalvo jfcalvo linked an issue Oct 23, 2023 that may be closed by this pull request
Copy link
Contributor

@alvarobartt alvarobartt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As already mentioned here I think we should use a client schema indeed, not the SDK one

Copy link
Contributor

@alvarobartt alvarobartt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT about using pydantic schemas in the SDK instead of just the kwargs? @frascuchon @gabrielmbmb

Copy link
Contributor

@alvarobartt alvarobartt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT about using pydantic schemas in the SDK instead of kwargs? @frascuchon @gabrielmbmb

@frascuchon frascuchon self-assigned this Nov 2, 2023
@frascuchon frascuchon marked this pull request as ready for review November 2, 2023 16:17
Copy link

github-actions bot commented Nov 2, 2023

The URL of the deployed environment for this PR is https://argilla-quickstart-pr-4023-ki24f765kq-no.a.run.app

@frascuchon frascuchon mentioned this pull request Nov 2, 2023
11 tasks
Copy link

codecov bot commented Nov 3, 2023

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Files Coverage Δ
.../argilla/client/feedback/schemas/remote/records.py 88.88% <100.00%> (+0.11%) ⬆️
src/argilla/client/sdk/commons/errors.py 90.90% <100.00%> (+0.16%) ⬆️
src/argilla/client/sdk/v1/datasets/api.py 92.23% <100.00%> (+0.78%) ⬆️
src/argilla/client/sdk/v1/datasets/models.py 100.00% <100.00%> (ø)
src/argilla/server/search_engine/elasticsearch.py 93.10% <ø> (+20.68%) ⬆️
src/argilla/server/search_engine/opensearch.py 48.21% <ø> (-42.86%) ⬇️
src/argilla/client/feedback/dataset/base.py 86.20% <75.00%> (-0.35%) ⬇️
...c/argilla/client/feedback/dataset/local/dataset.py 85.97% <60.00%> (-0.90%) ⬇️
.../argilla/client/feedback/dataset/remote/dataset.py 92.14% <85.71%> (-0.31%) ⬇️

... and 10 files with indirect coverage changes

📢 Thoughts on this report? Let us know!.

<!-- Thanks for your contribution! As part of our Community Growers
initiative 🌱, we're donating Justdiggit bunds in your name to reforest
sub-Saharan Africa. To claim your Community Growers certificate, please
contact David Berenstein in our Slack community or fill in this form
https://tally.so/r/n9XrxK once your PR has been merged. -->

# Description

This PR changes the `l2_norm` distance to the cosine similarity for
vector search. This change can improve results on similarity searches
and also for least similarity searches.
 
This [PR](#4023) must be
reviewed first

Closes #4123

**Type of change**

(Please delete options that are not relevant. Remember to title the PR
according to the type of change)

- [ ] New feature (non-breaking change which adds functionality)
- [ ] Refactor (change restructuring the codebase without changing
functionality)
- [X] Improvement (change adding some improvement to an existing
functionality)

**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes. And
ideally, reference `tests`)

The base dataset has been used with boh ElasticSearch and OpenSearch to
verify this change.

**Checklist**

- [ ] I added relevant documentation
- [ ] I followed the style guidelines of this project
- [ ] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my
feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK)
(see text above)
- [ ] I have added relevant notes to the `CHANGELOG.md` file (See
https://keepachangelog.com/)

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Copy link
Contributor

@alvarobartt alvarobartt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some minor comments, I'll complete the review once those are addressed, nice job!

@frascuchon frascuchon merged commit 237f488 into feature/feedback-dataset-semantic-similarity Nov 3, 2023
@frascuchon frascuchon deleted the feature/add-sdk-search-records branch November 3, 2023 16:47
leiyre pushed a commit that referenced this pull request Nov 6, 2023
…/new-filter-area-ui

* feature/feedback-dataset-semantic-similarity:
  feat: add sdk update records with vectors (#4128)
  ⚡️ Remove unused args
  ✨ Remove unused requests
  feat: add `delete_vectors_settings` method (#4130)
  feat: add SDK support for search records with similarity (#4023)
  ✨ No capitalized name for Fields, Questions, Metadata and Vectors tab.
  feat: add `update_vectors_settings` function (#4122)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Add find_similar_records in the Python SDK
4 participants