feat: add SDK support for search records with similarity #4023

jfcalvo · 2023-10-23T11:23:49Z

Description

This PR proposes an implementation that adds SDK support for the /api/me/datasets/{dataset_id}/records/search endpoint with vector similarity search.

Here is a full working example of similarity search from the Python SDK, using entirely fake data:

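import argilla as rg

# `owner` and `feedback_dataset` are assumed to be defined already (for example, by test fixtures);
# the dataset needs a "text" field and a "question" question for the records below.
rg.init(api_key=owner.api_key)
workspace = Workspace.create(name="test")

feedback_dataset.add_vector_settings(VectorSettings(name="vector", dimensions=2))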

feedback_dataset.add_records(
    [
        FeedbackRecord(fields={"text": "hello"}, vectors={"vector": [1, 1]}),
        FeedbackRecord(
            fields={"text": "hello"},
            vectors={"vector": [2, 2]},
            responses=[ResponseSchema(status="discarded")],
        ),
        FeedbackRecord(
            fields={"text": "hello"},
            vectors={"vector": [4, 4]},
            responses=[ResponseSchema(status="submitted", values={"question": {"value": "answer"}})],
        ),
    ]
)

remote = feedback_dataset.push_to_argilla("test_find_similar_records", workspace=workspace)
only_submitted_and_discarded_records = remote.filter_by(response_status=["submitted", "discarded"])

records_with_scores = only_submitted_and_discarded_records.find_similar_records(
    vector_name="vector",
    value=[1, 1],
    max_results=3,
)
# Although the query vector is identical to the vector of record 0, the search skips that record because it has no submitted or discarded responses.
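
The filter-then-rank behaviour described in the last comment can be sketched locally in plain Python. This is only an illustration under stated assumptions, not the SDK or server implementation: `Record` and `find_similar` are hypothetical names, and Euclidean distance stands in for whatever scoring the search engine applies.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Record:
    """Hypothetical stand-in for a feedback record (not an Argilla class)."""
    text: str
    vector: List[float]
    response_statuses: List[str] = field(default_factory=list)


def find_similar(records, query, statuses, max_results):
    """Keep records with a matching response status, then rank by distance to `query`."""
    # Step 1: the response-status filter runs first, so unanswered records never reach the ranking.
    candidates = [r for r in records if any(s in statuses for s in r.response_statuses)]
    # Step 2: score the remaining records (smaller distance means more similar).
    scored = [(r, math.dist(r.vector, query)) for r in candidates]
    scored.sort(key=lambda pair: pair[1])
    return scored[:max_results]


records = [
    Record("hello", [1, 1]),                 # no responses: filtered out
    Record("hello", [2, 2], ["discarded"]),
    Record("hello", [4, 4], ["submitted"]),
]
results = find_similar(records, query=[1, 1], statuses={"submitted", "discarded"}, max_results=3)
for record, distance in results:
    print(record.vector, round(distance, 3))
```

Only two records come back even though `max_results=3`, because the exact match on `[1, 1]` is excluded by the status filter before any distance is computed.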

Closes #4020

Type of change

New feature (non-breaking change which adds functionality)

How Has This Been Tested

Tested locally

Checklist

I added relevant documentation
I followed the style guidelines of this project
I did a self-review of my code
I made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
I filled out the contributor form (see text above)
I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

src/argilla/client/sdk/v1/datasets/api.py

src/argilla/client/feedback/dataset/remote/dataset.py

src/argilla/client/feedback/dataset/remote/dataset.py

tests/unit/client/feedback/dataset/remote/test_dataset.py

alvarobartt

As already mentioned here, I think we should use a client schema, not the SDK one.

alvarobartt

WDYT about using pydantic schemas in the SDK instead of just the kwargs? @frascuchon @gabrielmbmb

alvarobartt

WDYT about using pydantic schemas in the SDK instead of kwargs? @frascuchon @gabrielmbmb
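
To make the suggestion concrete, here is a rough sketch of what a typed request schema could look like in place of loose kwargs. Everything in it is an assumption for illustration: `VectorQuery`, `SearchRecordsRequest`, `to_body` and the field names are hypothetical, and standard-library dataclasses stand in for pydantic models so the example has no dependencies.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class VectorQuery:
    """Hypothetical schema for the vector part of a search request."""
    name: str
    value: List[float]


@dataclass
class SearchRecordsRequest:
    """Hypothetical schema grouping what would otherwise be separate kwargs."""
    vector: VectorQuery
    max_results: int = 50
    response_statuses: Optional[List[str]] = None

    def to_body(self) -> dict:
        # asdict() converts nested dataclasses recursively; unset optional fields are dropped.
        body = asdict(self)
        return {key: value for key, value in body.items() if value is not None}


request = SearchRecordsRequest(vector=VectorQuery(name="vector", value=[1, 1]), max_results=3)
print(request.to_body())
```

The trade-off is that a schema object gives one place for defaults and validation, while kwargs keep the call site shorter.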

src/argilla/client/feedback/dataset/remote/dataset.py

CHANGELOG.md

for more information, see https://pre-commit.ci

github-actions · 2023-11-02T16:57:05Z

The URL of the deployed environment for this PR is https://argilla-quickstart-pr-4023-ki24f765kq-no.a.run.app

tests/integration/client/sdk/v1/test_datasets.py

codecov · 2023-11-03T11:06:38Z

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Files	Coverage Δ
.../argilla/client/feedback/schemas/remote/records.py	`88.88% <100.00%> (+0.11%)`	⬆️
src/argilla/client/sdk/commons/errors.py	`90.90% <100.00%> (+0.16%)`	⬆️
src/argilla/client/sdk/v1/datasets/api.py	`92.23% <100.00%> (+0.78%)`	⬆️
src/argilla/client/sdk/v1/datasets/models.py	`100.00% <100.00%> (ø)`
src/argilla/server/search_engine/elasticsearch.py	`93.10% <ø> (+20.68%)`	⬆️
src/argilla/server/search_engine/opensearch.py	`48.21% <ø> (-42.86%)`	⬇️
src/argilla/client/feedback/dataset/base.py	`86.20% <75.00%> (-0.35%)`	⬇️
...c/argilla/client/feedback/dataset/local/dataset.py	`85.97% <60.00%> (-0.90%)`	⬇️
.../argilla/client/feedback/dataset/remote/dataset.py	`92.14% <85.71%> (-0.31%)`	⬇️

... and 10 files with indirect coverage changes

# Description

This PR changes the vector search metric from `l2_norm` distance to cosine similarity. This change can improve results for both most-similar and least-similar searches.

This [PR](#4023) must be reviewed first.

Closes #4123

**Type of change**

(Please delete options that are not relevant. Remember to title the PR according to the type of change)

- [ ] New feature (non-breaking change which adds functionality)
- [ ] Refactor (change restructuring the codebase without changing functionality)
- [X] Improvement (change adding some improvement to an existing functionality)

**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes. And ideally, reference `tests`)

The base dataset has been used with both ElasticSearch and OpenSearch to verify this change.

**Checklist**

- [ ] I added relevant documentation
- [ ] I followed the style guidelines of this project
- [ ] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK) (see text above)
- [ ] I have added relevant notes to the `CHANGELOG.md` file (See https://keepachangelog.com/)

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
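
As a rough illustration of why switching from `l2_norm` to cosine similarity can change rankings, the sketch below compares the two metrics on small 2-dimensional vectors. It is plain Python written for this note, not the search engine's implementation; the helper names and the sample vectors are assumptions.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance: sensitive to vector magnitude."""
    return math.dist(a, b)


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: depends on direction only."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 1.0]
same_direction = [2.0, 2.0]   # points the same way as the query, but is longer
other_direction = [1.0, 0.0]  # shorter L2 distance to the query, different direction

for name, vector in [("same_direction", same_direction), ("other_direction", other_direction)]:
    print(name, "l2 =", l2_distance(query, vector), "cosine =", cosine_similarity(query, vector))
```

Under L2 distance `other_direction` ranks closer to the query, while under cosine similarity `same_direction` ranks first because it differs from the query only in length.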

alvarobartt

Left some minor comments, I'll complete the review once those are addressed, nice job!

src/argilla/client/feedback/dataset/remote/dataset.py

src/argilla/client/sdk/v1/datasets/api.py

src/argilla/client/sdk/v1/datasets/models.py

src/argilla/client/sdk/v1/datasets/api.py

src/argilla/client/feedback/dataset/remote/dataset.py

tests/unit/client/sdk/models/v1/test_search_records.py

Co-authored-by: Alvaro Bartolome <[email protected]>

…/new-filter-area-ui

* feature/feedback-dataset-semantic-similarity:
  feat: add sdk update records with vectors (#4128)
  ⚡️ Remove unused args
  ✨ Remove unused requests
  feat: add `delete_vectors_settings` method (#4130)
  feat: add SDK support for search records with similarity (#4023)
  ✨ No capitalized name for Fields, Questions, Metadata and Vectors tab.
  feat: add `update_vectors_settings` function (#4122)

jfcalvo added 2 commits October 23, 2023 13:09

feat: add functionality draft to add new similarity search records

1521073

feat: add correct URL to search_records

a7a7bf4

jfcalvo requested review from frascuchon, gabrielmbmb and alvarobartt October 23, 2023 11:23

feat: add some missing logic

3aa6755

jfcalvo linked an issue Oct 23, 2023 that may be closed by this pull request

[FEATURE] Add find_similar_records in the Python SDK #4020

Closed

frascuchon reviewed Oct 24, 2023

src/argilla/client/sdk/v1/datasets/api.py (outdated)

frascuchon reviewed Oct 24, 2023

src/argilla/client/feedback/dataset/remote/dataset.py (outdated)

jfcalvo added 4 commits October 24, 2023 12:48

Merge branch 'feature/feedback-dataset-semantic-similarity' into feat…

30e3471

…ure/add-sdk-search-records

feat: add some improvements and tests

7a00966

Merge branch 'feature/feedback-dataset-semantic-similarity' into feat…

1b630e2

…ure/add-sdk-search-records

feat: align implementation with current product design

84bcecd

jfcalvo commented Oct 25, 2023

src/argilla/client/feedback/dataset/remote/dataset.py

jfcalvo commented Oct 25, 2023

src/argilla/client/feedback/dataset/remote/dataset.py

jfcalvo commented Oct 25, 2023

tests/unit/client/feedback/dataset/remote/test_dataset.py (outdated)

alvarobartt reviewed Oct 25, 2023

frascuchon reviewed Oct 30, 2023

src/argilla/client/feedback/dataset/remote/dataset.py (outdated)

frascuchon added 7 commits November 2, 2023 12:51

Merge branch 'feature/feedback-dataset-semantic-similarity' into feat…

c42d4a1

…ure/add-sdk-search-records

refactor: Define find_similar_records as an abstract method and provi…

752fdfd

…de an implementation for both local and remote

refactor: clean API request models and prepare body code block

f38bdd8

chore: add missing method for test base class

7309c25

tests: Adding more tests

d4aa9d6

chore: Review type hints mismatch

9c95392

chore: Update changelog

de86af8

frascuchon self-assigned this Nov 2, 2023

frascuchon requested review from frascuchon November 2, 2023 16:17

frascuchon requested a review from alvarobartt November 2, 2023 16:17

frascuchon marked this pull request as ready for review November 2, 2023 16:17

frascuchon reviewed Nov 2, 2023

CHANGELOG.md (outdated)

frascuchon and others added 2 commits November 2, 2023 17:18

Apply suggestions from code review

959402c

[pre-commit.ci] auto fixes from pre-commit.com hooks

23783d0

for more information, see https://pre-commit.ci

frascuchon mentioned this pull request Nov 2, 2023

feat: Using cosine similarity #4124

Merged

11 tasks

jfcalvo commented Nov 3, 2023

tests/integration/client/sdk/v1/test_datasets.py (outdated)

frascuchon and others added 3 commits November 3, 2023 11:32

chore: Review client Api error exceptions

d6659f0

tests: create unit tests checking search_records calls

db921e3

[pre-commit.ci] auto fixes from pre-commit.com hooks

bb6104e

for more information, see https://pre-commit.ci

alvarobartt reviewed Nov 3, 2023

gabrielmbmb approved these changes Nov 3, 2023

src/argilla/client/sdk/v1/datasets/api.py

src/argilla/client/feedback/dataset/remote/dataset.py

tests/unit/client/sdk/models/v1/test_search_records.py (outdated)

frascuchon and others added 6 commits November 3, 2023 17:35

chore: move file

fb2fb0f

Update src/argilla/client/feedback/dataset/remote/dataset.py

2b05c6d

Co-authored-by: Alvaro Bartolome <[email protected]>

Update src/argilla/client/sdk/v1/datasets/api.py

463e720

Co-authored-by: Alvaro Bartolome <[email protected]>

Merge branch 'feature/feedback-dataset-semantic-similarity' into feat…

2811fe5

…ure/add-sdk-search-records

chore: Review signature

0193177

Merge branch 'feature/add-sdk-search-records' of github.com:argilla-i…

33a8ddf

…o/argilla into feature/add-sdk-search-records

frascuchon merged commit 237f488 into feature/feedback-dataset-semantic-similarity Nov 3, 2023

frascuchon deleted the feature/add-sdk-search-records branch November 3, 2023 16:47
