
feat: add PATCH /api/v1/datasets/{dataset_id}/records endpoint #3934


Conversation


@gabrielmbmb gabrielmbmb commented Oct 11, 2023

Description

This PR adds the following:

  • Updates the PATCH /api/v1/records/{record_id} endpoint added in feat: add PATCH /api/v1/records/{record_id} endpoint #3920 so that it can also update the suggestions of a record. The suggestions in the input payload replace the old suggestions.
  • Adds a new PATCH /api/v1/datasets/{dataset_id}/records endpoint that allows batch/bulk updating the records of a dataset. The endpoint updates the same record attributes as the PATCH /api/v1/records/{record_id} endpoint (see the payload sketch after this list).
  • Slightly modifies the SearchDocument getter dict so that it does not try to populate the SearchDocument.responses attribute if the relationship has not been loaded (this avoids having to load Record.responses when updating the record document in the SearchEngine using add_records).
  • Removes the SearchEngine.update_record_metadata method, as the same logic is covered by the SearchEngine.add_records method, which can also be used to update the fields of an existing document.
  • Renames the SearchEngine.add_records method to SearchEngine.index_records, as it can be used both to add and to update records.
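
For illustration, a request to the new bulk endpoint could look roughly like the sketch below. This is a minimal example following the payload structure used in the benchmark further down; the host, dataset, record, and question IDs are placeholders, and authentication headers are omitted.

import httpx

# Placeholder IDs, for illustration only
DATASET_ID = "00000000-0000-0000-0000-000000000000"
RECORD_ID = "11111111-1111-1111-1111-111111111111"
QUESTION_ID = "22222222-2222-2222-2222-222222222222"

# Each item mirrors the body accepted by the single-record PATCH endpoint
payload = {
    "items": [
        {
            "id": RECORD_ID,
            "metadata": {"label": "a"},
            "suggestions": [
                {"question_id": QUESTION_ID, "value": "hello world"},
            ],
        },
    ],
}

# API key header omitted for brevity
response = httpx.patch(f"http://localhost:6900/api/v1/datasets/{DATASET_ID}/records", json=payload)
print(response.status_code)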

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested

I ran a small benchmark to test the latency of the new endpoint. I created a dataset with 100,000 records and all the possible question and metadata property types. Then I built batches of 1,000 records, updating all the suggestions and metadata fields, and sent them to the API. The average response time of the bulk PATCH endpoint was approximately 0.8 seconds.

Code used for benchmark
import uuid
import random
import argilla as rg

LABELS = ["a", "b", "c"]
RANKS = ["top-1", "top-2", "top-3"]

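# One text field, one question of each type and one metadata property of each type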
dataset = rg.FeedbackDataset(
  fields=[rg.TextField(name="text")],
  questions=[
      rg.TextQuestion(name="text"),
      rg.RatingQuestion(name="rating", values=[1, 2, 3, 4, 5]),
      rg.LabelQuestion(name="label", labels=LABELS),
      rg.MultiLabelQuestion(name="multi-label", labels=LABELS),
      rg.RankingQuestion(name="ranking", values=RANKS),
  ],
  metadata_properties=[
      rg.TermsMetadataProperty(name="label", values=LABELS),
      rg.IntegerMetadataProperty(name="integer", min=0, max=10),
      rg.FloatMetadataProperty(name="float", min=0, max=10),
  ],
)

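# Create 100,000 records locally and push the dataset to the Argilla server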
dataset.add_records([rg.FeedbackRecord(fields={"text": "Hello"}, metadata={"extra": "yes"}) for _ in range(100000)])

remote = dataset.push_to_argilla(name=f"benchmark-{uuid.uuid4()}", workspace="gabriel")


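# Shuffle the ranking values to build a random ranking suggestion value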
def random_rank_order():
  ranks = RANKS.copy()
  ranks.sort(key=lambda x: random.random())
  return [{"value": rank, "rank": i + 1} for i, rank in enumerate(ranks)]


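# Build the update payload for a single record: a new external_id plus updated metadata and suggestions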
def build_update_payload(record):
  return {
      "id": str(record.id),
      "external_id": str(uuid.uuid4()),
      "metadata": {
          "label": random.choice(["a", "b", "c"]),
          "integer": random.randint(0, 10),
          "float": random.uniform(0, 10),
      },
      "suggestions": [
          {"question_id": str(remote.questions[0].id), "value": "hello world" * random.randint(1, 15)},
          {"question_id": str(remote.questions[1].id), "value": random.randint(1, 5)},
          {"question_id": str(remote.questions[2].id), "value": random.choice(["a", "b", "c"])},
          {"question_id": str(remote.questions[3].id), "value": [random.choice(["a", "b", "c"])]},
          {"question_id": str(remote.questions[4].id), "value": random_rank_order()},
      ],
  }

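# Reuse the authenticated HTTP client from the active Argilla client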
http_client = rg.active_client().http_client

elapseds = []

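# Send the updates in batches of 1,000 records and record the elapsed time of each request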
batch = []
for record in remote.records:
  batch.append(build_update_payload(record))

  if len(batch) == 1000:
      response = http_client.httpx.patch(f"/api/v1/datasets/{remote.id}/records", json={"items": batch})
      elapseds.append(response.elapsed.total_seconds())
      batch = []

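# Average response time across all batches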
average_elapsed_time = sum(elapseds) / len(elapseds)

print("Average elapsed time", average_elapsed_time)

Checklist

  • I added relevant documentation
  • I followed the style guidelines of this project
  • I did a self-review of my code
  • I made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I filled out the contributor form (see text above)
  • I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

@gabrielmbmb gabrielmbmb changed the base branch from develop to feature/support-for-metadata-filtering-and-sorting October 11, 2023 16:57
@gabrielmbmb gabrielmbmb force-pushed the feature/patch-record-bulk-endpoint branch from 1e0f957 to a80cd01 on October 12, 2023 16:46
@gabrielmbmb gabrielmbmb self-assigned this Oct 16, 2023
@gabrielmbmb gabrielmbmb added the type: enhancement and area: api labels Oct 16, 2023
@gabrielmbmb gabrielmbmb added this to the v1.17.0 milestone Oct 16, 2023
@gabrielmbmb gabrielmbmb marked this pull request as ready for review October 16, 2023 13:41
@gabrielmbmb gabrielmbmb force-pushed the feature/patch-record-bulk-endpoint branch from 5ebb32c to 62cac52 on October 16, 2023 13:42
@gabrielmbmb gabrielmbmb force-pushed the feature/patch-record-bulk-endpoint branch from 62cac52 to 34bbc94 on October 16, 2023 13:42
@github-actions

The URL of the deployed environment for this PR is https://argilla-quickstart-pr-3934-ki24f765kq-no.a.run.app

@gabrielmbmb gabrielmbmb force-pushed the feature/patch-record-bulk-endpoint branch from edd1a37 to abc8596 on October 17, 2023 10:26
@gabrielmbmb gabrielmbmb merged commit f6e7766 into feature/support-for-metadata-filtering-and-sorting Oct 17, 2023
@gabrielmbmb gabrielmbmb deleted the feature/patch-record-bulk-endpoint branch October 17, 2023 12:03
@frascuchon frascuchon modified the milestones: v1.17.0, v1.18.0 Oct 19, 2023