Releases: argilla-io/argilla
v1.26.0
🔆 Release highlights
Spans question
We've added a new type of question to Feedback Datasets: the SpanQuestion. This type of question allows you to highlight portions of text in a specific field and apply a label. It is especially useful for token classification (like NER or POS tagging) and information extraction tasks.
With this type of question you can:
✨ Provide suggested spans with a confidence score, so your team doesn't need to start from scratch.
⌨️ Choose a label using your mouse or with the keyboard shortcut provided next to the label.
🖱️ Draw a span by dragging your mouse over the parts of the text you want to select or, if it's a single token, just double-click on it.
🪄 Forget about mistakes with token boundaries. The UI will snap your spans to token boundaries for you.
🔎 Annotate at character-level when you need more fine-grained spans. Hold the Shift key while drawing the span and the resulting span will start and end at the exact boundaries of your selection.
✔️ Quickly change the label of a span by clicking on the label name and selecting the correct one from the dropdown.
🖍️ Correct a span at the speed of light by simply drawing the correct span over it. The new span will overwrite the old one.
🧼 Remove labels by hovering over the label name in the span and then clicking the 𐢫 on the left-hand side.
Here's an example of what your dataset would look like from the SDK:
import argilla as rg
from argilla.client.feedback.schemas import SpanValueSchema

# connect to your Argilla instance
rg.init(...)

# create a dataset with a span question
dataset = rg.FeedbackDataset(
    fields=[rg.TextField(name="text")],
    questions=[
        rg.SpanQuestion(
            name="entities",
            title="Highlight the entities in the text:",
            labels={"PER": "Person", "ORG": "Organization", "EVE": "Event"},  # or ["PER", "ORG", "EVE"]
            field="text",  # the field where you want to do the span annotation
            required=True
        )
    ]
)

# create a record with suggested spans
record = rg.FeedbackRecord(
    fields={"text": "This is the text of the record"},
    suggestions=[
        {
            "question_name": "entities",
            "value": [
                SpanValueSchema(
                    start=0,  # position of the first character of the span
                    end=10,  # position of the character right after the end of the span
                    label="ORG",
                    score=1.0
                )
            ],
            "agent": "my_model",
        }
    ]
)

# add records to the dataset and push to Argilla
dataset.add_records([record])
dataset.push_to_argilla(...)
To learn more about this and all the other questions available in Feedback Datasets, check out our documentation.
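Since span suggestions are expressed as character offsets with an exclusive end, it can help to compute them programmatically rather than by hand. The following is a minimal sketch in plain Python; `find_spans` is a hypothetical helper written for illustration and is not part of the Argilla SDK.

```python
from typing import List, Tuple


def find_spans(text: str, phrase: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets for every occurrence of phrase.

    `end` is exclusive: it is the position right after the last character,
    which matches the start/end convention described for span values.
    """
    spans: List[Tuple[int, int]] = []
    if not phrase:
        return spans
    cursor = text.find(phrase)
    while cursor != -1:
        spans.append((cursor, cursor + len(phrase)))
        # Continue searching after the match to avoid overlapping spans.
        cursor = text.find(phrase, cursor + len(phrase))
    return spans


text = "Argilla joined Hugging Face. Argilla is open source."
spans = find_spans(text, "Argilla")
```

Each `(start, end)` pair could then be passed as the `start` and `end` of a span value, and `text[start:end]` recovers the highlighted string.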
Changelog 1.26.0
Added
- If you expand the labels of a single or multi label Question, the state is maintained during the entire annotation process. (#4630)
- Added support for span questions in the Python SDK. (#4617)
- Added support for span values in suggestions and responses. (#4623)
- Added span questions for FeedbackDataset. (#4622)
- Added ARGILLA_CACHE_DIR environment variable to configure the client cache directory. (#4509)
Fixed
- Fixed contextualized workspaces. (#4665)
- Fixed prepare for training when passing RankingValueSchema instances to suggestions. (#4628)
- Fixed parsing ranking values in suggestions from HF datasets. (#4629)
- Fixed reading description from API response payload. (#4632)
- Fixed pulling (n*chunk_size)+1 records when using ds.pull or iterating over the dataset. (#4662)
- Fixed client's resolution of enum values when calling the Search and Metrics api, to support Python >=3.11 enum handling. (#4672)
New Contributors
- @davidefiocco made their first contribution in #4639
Full Changelog: v1.25.0...v1.26.0
v1.25.0
🔆 Release highlights
Reorder labels
admin and owner users can now change the order in which labels appear in the question form. To do this, go to the Questions tab inside Dataset Settings and move the labels until they are in the desired order.
Aligned SDK status filter
The missing status has been removed from the SDK filters. To filter records that don't have responses you will now need to use the pending status like so:
filtered_dataset = dataset.filter_by(response_status="pending")
Learn more about how to use this filter in our docs
Pandas 2.0 support
We’ve removed the limitation to use pandas <2.0.0 so you can now use Argilla with pandas v1 or v2 safely.
Changelog 1.25.0
Note
For changes in the argilla-server module, visit the argilla-server release notes
Added
- Reorder labels in dataset settings page for single/multi label questions (#4598)
- Added pandas v2 support using the python SDK. (#4600)
Removed
- Removed missing response for status filter. Use pending instead. (#4533)
Fixed
- Fixed FloatMetadataProperty: value is not a valid float (#4570)
- Fixed redirect to user-settings instead of 404 user_settings (#4609)
New Contributors
Full Changelog: v1.24.0...v1.25.0
v1.24.0
Note
This release does not contain any new features, but it includes a major change in the argilla server.
The package is using the argilla-server dependency defined here.
Full Changelog: v1.23.1...v1.24.0
v1.23.1
1.23.1
Fixed
- Fixed Responsive view for Feedback Datasets. (#4579)
New Contributors
- @CpHaddock made their first contribution at #4484
- @julien-c made their first contribution in #4582
Full Changelog: v1.23.0...v1.23.1
v1.23.0
🔆 Release highlights
Hugging Face OAuth
You can now set up OAuth in your Argilla Hugging Face Spaces. This is a simple way to have your team members or collaborators in crowdsourced projects sign up and log in to your Space using their Hugging Face accounts.
To learn how to set up Hugging Face OAuth for your Argilla Space, go to our docs.
Bulk actions for filter results
We’ve added an improvement for our bulk view so you can perform actions on all results from a filter (or a combination of them!).
To use this, go to the bulk view and apply some filter(s) of your choice. If the results are more than the records seen in the current page, when you click the checkbox you will see the option to select all of the results. Then, you can give responses, discard, save a draft and even submit all of the records at once!
Embed PDFs in a TextField
We’ve added the pdf_to_html function in our utilities so you can easily embed a PDF reader within a TextField using markdown.
This function accepts either the file path, a URL or the file's byte data and returns the corresponding HTML to render the PDF within the Argilla user interface.
Learn more about how to use this feature here.
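To make the idea of embedding a file as HTML more concrete, here is a minimal sketch of the general data URL technique using only the standard library. It is an illustration under assumptions, not Argilla's implementation: `pdf_bytes_to_html` is a hypothetical name, and the real pdf_to_html may produce different markup.

```python
import base64


def pdf_bytes_to_html(pdf_bytes: bytes, width: str = "1000px", height: str = "1000px") -> str:
    """Embed raw PDF bytes in an HTML <object> tag via a base64 data URL."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    data_url = f"data:application/pdf;base64,{encoded}"
    return (
        f'<object data="{data_url}" type="application/pdf" '
        f'width="{width}" height="{height}"></object>'
    )


# Placeholder bytes stand in for the contents of a real PDF file.
html = pdf_bytes_to_html(b"%PDF-1.4 minimal placeholder")
```

The resulting string is self-contained, so it can be stored as the value of a markdown-enabled text field without hosting the file elsewhere. The trade-off is that base64 inflates the payload by roughly a third.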
Changelog 1.23.0
Added
- Added bulk annotation by filter criteria. (#4516)
- Automatically fetch new datasets on focus tab. (#4514)
- API v1 responses returning Record schema now always include dataset_id as attribute. (#4482)
- API v1 responses returning Response schema now always include record_id as attribute. (#4482)
- API v1 responses returning Question schema now always include dataset_id attribute. (#4487)
- API v1 responses returning Field schema now always include dataset_id attribute. (#4488)
- API v1 responses returning MetadataProperty schema now always include dataset_id attribute. (#4489)
- API v1 responses returning VectorSettings schema now always include dataset_id attribute. (#4490)
- Added pdf_to_html function to .html_utils module that converts PDFs to dataURL to be able to render them in the Argilla UI. (#4481)
- Added ARGILLA_AUTH_SECRET_KEY environment variable. (#4539)
- Added ARGILLA_AUTH_ALGORITHM environment variable. (#4539)
- Added ARGILLA_AUTH_TOKEN_EXPIRATION environment variable. (#4539)
- Added ARGILLA_AUTH_OAUTH_CFG environment variable. (#4546)
- Added OAuth2 support for HuggingFace Hub. (#4546)
Deprecated
- Deprecated ARGILLA_LOCAL_AUTH_* environment variables. Will be removed in the release v1.25.0. (#4539)
Changed
- Changed regex pattern for username attribute in UserCreate. Now uppercase letters are allowed. (#4544)
Removed
- Remove sending Authorization header from python SDK requests. (#4535)
Fixed
- Fixed keyboard shortcut for label questions. (#4530)
New Contributors
Full Changelog: v1.22.0...v1.23.0
v1.22.0
🔆 Release Highlights
Bulk actions in Feedback Task datasets
Our signature bulk actions are now available for Feedback datasets!
Switch between Focus and Bulk depending on your needs:
- In the Focus view, you can navigate and respond to records individually. This is ideal for closely examining and giving responses to each record.
- The Bulk view allows you to see multiple records on the same page. You can select all or some of them and perform actions in bulk, such as applying a label, saving responses, submitting, or discarding. You can use this feature along with filters and similarity search to process a list of records in bulk.
For now, this is only available in the Pending queue, but rest assured, bulk actions will be improved and extended to other queues in upcoming releases.
Read more about our Focus and Bulk views here.
Sorting rating values
We now support sorting records in the Argilla UI based on the values of Rating questions (both suggestions and responses):
Learn about this and other filters in our docs.
Out-of-the-box embedding support
It’s now easier than ever to add vector embeddings to your records with the new Sentence Transformers integration.
Just choose a model from the Hugging Face hub and use our SentenceTransformersExtractor to add vectors to your dataset:
import argilla as rg
from argilla.client.feedback.integrations.sentencetransformers import SentenceTransformersExtractor

# Connect to Argilla
rg.init(
    api_url="http://localhost:6900",
    api_key="owner.apikey",
    workspace="my_workspace"
)

# Initialize the SentenceTransformersExtractor
ste = SentenceTransformersExtractor(
    model="TaylorAI/bge-micro-v2",  # Use a model from https://huggingface.co/models?library=sentence-transformers
    show_progress=False,
)

# Load a dataset from your Argilla instance
ds_remote = rg.FeedbackDataset.from_argilla("my_dataset")

# Update the dataset
ste.update_dataset(
    dataset=ds_remote,
    fields=["context"],  # Only update the context field
    update_records=True,  # Update the records in the dataset
    overwrite=False,  # Do not overwrite existing values
)
Learn more about this functionality in this tutorial.
Changelog 1.22.0
Added
- Added Bulk annotation support. (#4333)
- Restore filters from feedback dataset settings. (#4461)
- Warning on feedback dataset settings when leaving page with unsaved changes. (#4461)
- Added pydantic v2 support using the python SDK. (#4459)
- Added vector_settings to the __repr__ method of the FeedbackDataset and RemoteFeedbackDataset. (#4454)
- Added integration for sentence-transformers using SentenceTransformersExtractor to configure vector_settings in FeedbackDataset and FeedbackRecord. (#4454)
Changed
- Module argilla.cli.server definitions have been moved to argilla.server.cli module. (#4472)
- [breaking] Changed vector_settings_by_name for generic property_by_name usage, which will return None instead of raising an error. (#4454)
- The constant definition ES_INDEX_REGEX_PATTERN in module argilla._constants is now private. (#4472)
- nan values in metadata properties will raise a 422 error when creating/updating records. (#4300)
- None values are now allowed in metadata properties. (#4300)
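Because nan metadata values are now rejected while None is accepted, one option is to clean metadata before creating records. The sketch below is an assumption about how you might do that on the client side; `replace_nan_with_none` is a hypothetical helper, not an Argilla function.

```python
import math
from typing import Any, Dict


def replace_nan_with_none(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of metadata where float NaN values are replaced by None."""
    return {
        key: None if isinstance(value, float) and math.isnan(value) else value
        for key, value in metadata.items()
    }


cleaned = replace_nan_with_none(
    {"source": "wikipedia", "score": float("nan"), "length": 105}
)
```

This is mostly relevant when metadata comes from a pandas DataFrame, where missing numeric cells typically surface as NaN.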
Fixed
- Paginating to a new record, automatically scrolls down to selected form area. (#4333)
Deprecated
- The missing response status for filtering records is deprecated and will be removed in the release v1.24.0. Use pending instead. (#4433)
Removed
- The deprecated python -m argilla database command has been removed. (#4472)
New Contributors
- @Piyush-Kumar-Ghosh made their first contribution in #4463
Full Changelog: v1.21.0...v1.22.0
v1.21.0
🔆 Release highlights
Draft queue
We’ve added a new queue in the Feedback Task UI so that you can save your drafts and have them all together in a separate view. This allows you to save your responses and come back to them before submission.
Note that responses won’t be autosaved now and to save your changes you will need to click on “Save as draft” or use the shortcut command ⌘ + S (macOS), Ctrl + S (other).
Improved shortcuts
We’ve been working to improve the keyboard shortcuts within the Feedback Task UI to make them more productive and user-friendly.
You can now select labels in Label and Multi-label questions using the numerical keys on your keyboard. To know which number corresponds with each label you can simply show or hide helpers by pressing command ⌘ (macOS) or Ctrl (other) for 2 seconds. You will then see the numbers next to the corresponding labels.
We’ve also simplified shortcuts for navigation and actions, so that they use as few keys as possible.
Check all available shortcuts here.
New metrics module
We've added a new module to analyze the annotations, both in terms of agreement between the annotators and in terms of data and model drift monitoring.
Agreement metrics
Easily measure the inter-annotator agreement to explore the quality of the annotation guidelines and consistency between annotators:
import argilla as rg
from argilla.client.feedback.metrics import AgreementMetric
feedback_dataset = rg.FeedbackDataset.from_argilla("...", workspace="...")
metric = AgreementMetric(dataset=feedback_dataset, question_name="question_name")
agreement_metrics = metric.compute("alpha")
#>>> agreement_metrics
#[AgreementMetricResult(metric_name='alpha', count=1000, result=0.467889)]
Read more here.
Model metrics
You can use ModelMetric to monitor model performance for data and model drift:
import argilla as rg
from argilla.client.feedback.metrics import ModelMetric
feedback_dataset = rg.FeedbackDataset.from_argilla("...", workspace="...")
metric = ModelMetric(dataset=feedback_dataset, question_name="question_name")
annotator_metrics = metric.compute("accuracy")
#>>> annotator_metrics
#{'00000000-0000-0000-0000-000000000001': [ModelMetricResult(metric_name='accuracy', count=3, result=0.5)], '00000000-0000-0000-0000-000000000002': [ModelMetricResult(metric_name='accuracy', count=3, result=0.25)], '00000000-0000-0000-0000-000000000003': [ModelMetricResult(metric_name='accuracy', count=3, result=0.5)]}
Read more here.
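To give an intuition for a result keyed by annotator, as in the output above, here is a minimal sketch that computes accuracy by comparing a suggested label with each annotator's response. This is an assumption about the general idea rather than the library's implementation; the function name and the row layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def accuracy_per_annotator(rows: List[Tuple[str, str, str]]) -> Dict[str, float]:
    """Compute accuracy per annotator.

    Each row is (annotator_id, suggested_label, response_label).
    """
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for annotator_id, suggested, responded in rows:
        totals[annotator_id] += 1
        if suggested == responded:
            hits[annotator_id] += 1
    return {annotator_id: hits[annotator_id] / totals[annotator_id] for annotator_id in totals}


rows = [
    ("user_1", "positive", "positive"),
    ("user_1", "negative", "positive"),
    ("user_2", "positive", "positive"),
    ("user_2", "negative", "negative"),
]
scores = accuracy_per_annotator(rows)
```

A score that drops over time for all annotators at once may point to model or data drift rather than to a problem with any single annotator.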
List aggregation support for TermsMetadataProperty
You can now pass a list of terms within a record’s metadata that will be aggregated and filterable as part of a TermsMetadataProperty.
Here is an example:
import argilla as rg
dataset = rg.FeedbackDataset(
fields = ...,
questions = ...,
metadata_properties = [rg.TermsMetadataProperty(name="annotators")]
)
record = rg.FeedbackRecord(
fields = ...,
metadata = {"annotators": ["user_1", "user_2"]}
)
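As a rough illustration of what aggregating list values means, the sketch below counts terms across records, treating a list as several terms and a plain string as one. It is a plain-Python approximation under assumptions; `count_terms` is a hypothetical helper and the server-side aggregation may differ.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def count_terms(records_metadata: Iterable[Dict[str, Any]], name: str) -> Counter:
    """Count how many records mention each term of a metadata property."""
    counts: Counter = Counter()
    for metadata in records_metadata:
        value = metadata.get(name)
        if value is None:
            continue
        # A list contributes one count per term; a single value contributes one.
        terms = value if isinstance(value, list) else [value]
        counts.update(terms)
    return counts


all_metadata = [
    {"annotators": ["user_1", "user_2"]},
    {"annotators": "user_1"},
    {"annotators": ["user_3"]},
    {"other": "ignored"},
]
counts = count_terms(all_metadata, "annotators")
```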
Reindex from CLI
Reindex all entities in your Argilla instance (datasets, records, responses, etc.) with a simple CLI command.
argilla server reindex
This is useful when you are working with existing feedback datasets and you want to update the search engine info.
Changelog 1.21.0
Added
- Added new draft queue for annotation view (#4334)
- Added annotation metrics module for the FeedbackDataset (argilla.client.feedback.metrics). (#4175).
- Added strategy to handle and translate errors from the server for 401 HTTP status code (#4362)
- Added integration for textdescriptives using TextDescriptivesExtractor to configure metadata_properties in FeedbackDataset and FeedbackRecord. (#4400). Contributed by @m-newhauser
- Added POST /api/v1/me/responses/bulk endpoint to create responses in bulk for current user. (#4380)
- Added list support for term metadata properties. (Closes #4359)
- Added new CLI task to reindex datasets and records into the search engine. (#4404)
- Added httpx_extra_kwargs argument to rg.init and Argilla to allow passing extra arguments to httpx.Client used by Argilla. (#4440)
Changed
- More productive and simpler shortcuts system (#4215)
- Move ArgillaSingleton, init and active_client to a new module singleton. (#4347)
- Updated argilla.load functions to also work with FeedbackDatasets. (#4347)
- [breaking] Updated argilla.delete functions to also work with FeedbackDatasets. It now raises an error if the dataset does not exist. (#4347)
- Updated argilla.list_datasets functions to also work with FeedbackDatasets. (#4347)
Fixed
- Fixed error in TextClassificationSettings.from_dict method in which the label_schema created was a list of dict instead of a list of str. (#4347)
- Fixed total records on pagination component (#4424)
Removed
- Removed draft auto save for annotation view (#4334)
v1.20.0
🔆 Release highlights
Responses and suggestions filters
We’ve added new filters in the Argilla UI to filter records within Feedback datasets based on response values and suggestions information. It is also possible to sort records based on suggestion scores. This is available for questions of the type: LabelQuestion, MultiLabelQuestion and RatingQuestion.
Utils module
Assign records
We added several methods to assign records to annotators via controlled overlap: assign_records and assign_workspaces.
from argilla.client.feedback.utils import assign_records

assignments = assign_records(
    users=users,
    records=records,
    overlap=1,
    shuffle=True
)

from argilla.client.feedback.utils import assign_workspaces

assignments = assign_workspaces(
    assignments=assignments,
    workspace_type="individual"
)

for username, records in assignments.items():
    dataset = rg.FeedbackDataset(
        fields=fields, questions=questions, metadata=metadata,
        vector_settings=vector_settings, guidelines=guidelines
    )
    dataset.add_records(records)
    remote_dataset = dataset.push_to_argilla(name="my_dataset", workspace=username)
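To clarify what controlled overlap means, here is a minimal round-robin sketch in which every record is given to exactly `overlap` distinct users. This is an illustration under assumptions and not the algorithm used by assign_records; `assign_with_overlap` and its `seed` parameter are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def assign_with_overlap(
    users: Sequence[str],
    records: Sequence[str],
    overlap: int,
    shuffle: bool = False,
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Assign each record to `overlap` distinct users in round-robin order."""
    if not 1 <= overlap <= len(users):
        raise ValueError("overlap must be between 1 and the number of users")
    pool = list(records)
    if shuffle:
        random.Random(seed).shuffle(pool)
    assignments: Dict[str, List[str]] = {user: [] for user in users}
    for index, record in enumerate(pool):
        # Consecutive positions modulo len(users) are distinct because
        # overlap never exceeds the number of users.
        for offset in range(overlap):
            user = users[(index * overlap + offset) % len(users)]
            assignments[user].append(record)
    return assignments


assignments = assign_with_overlap(["a", "b", "c"], ["r1", "r2", "r3", "r4"], overlap=2)
```

With an overlap of 2, each record is seen by two annotators, which is the minimum needed to measure agreement between them.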
Multi-Modal DataURLs for images, video and audio
Argilla supports basic handling of video, audio, and images within markdown fields, provided they are formatted in HTML. To facilitate this, we offer three functions: video_to_html, audio_to_html, and image_to_html. Note that performance differs per browser and database configuration.
from argilla.client.feedback.utils import audio_to_html, image_to_html, video_to_html
# Configure the FeedbackDataset
ds_multi_modal = rg.FeedbackDataset(
fields=[rg.TextField(name="content", use_markdown=True, required=True)],
questions=[rg.TextQuestion(name="description", title="Describe the content of the media:", use_markdown=True, required=True)],
)
# Add the records
records = [
rg.FeedbackRecord(fields={"content": video_to_html("/content/snapshot.mp4")}),
rg.FeedbackRecord(fields={"content": audio_to_html("/content/sea.wav")}),
rg.FeedbackRecord(fields={"content": image_to_html("/content/peacock.jpg")}),
]
ds_multi_modal.add_records(records)
# Push the dataset to Argilla
ds_multi_modal = ds_multi_modal.push_to_argilla("multi-modal-basic", workspace="admin")
Token Highlights
You can also add custom highlights to the text by using create_token_highlights and a custom color map.
from argilla.client.feedback.utils import create_token_highlights
tokens = ["This", "is", "a", "test"]
weights = [0.1, 0.2, 0.3, 0.4]
html = create_token_highlights(tokens, weights, c_map=custom_RGB) # 'viridis' by default
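For readers curious what weighted token highlighting could look like as HTML, here is a minimal sketch that maps each weight to the opacity of a background color. It is not the output format of create_token_highlights; the function name, the default color and the markup are assumptions made for illustration.

```python
from html import escape
from typing import Sequence, Tuple


def highlight_tokens(
    tokens: Sequence[str],
    weights: Sequence[float],
    rgb: Tuple[int, int, int] = (255, 200, 0),
) -> str:
    """Wrap each token in a <span> whose background opacity follows its weight."""
    if len(tokens) != len(weights):
        raise ValueError("tokens and weights must have the same length")
    red, green, blue = rgb
    parts = []
    for token, weight in zip(tokens, weights):
        alpha = min(max(weight, 0.0), 1.0)  # clamp the weight to [0, 1]
        style = f"background-color: rgba({red}, {green}, {blue}, {alpha:.2f})"
        parts.append(f'<span style="{style}">{escape(token)}</span>')
    return " ".join(parts)


html = highlight_tokens(["This", "is", "a", "test"], [0.1, 0.2, 0.3, 0.4])
```

Escaping each token keeps characters such as `<` or `&` from breaking the markup when the result is rendered in a markdown-enabled field.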
1.20.0 Changelog
Added
- Added GET /api/v1/datasets/:dataset_id/records/search/suggestions/options endpoint to return suggestion available options for searching. (#4260)
- Added metadata_properties to the __repr__ method of the FeedbackDataset and RemoteFeedbackDataset. (#4192).
- Added get_model_kwargs, get_trainer_kwargs, get_trainer_model, get_trainer_tokenizer and get_trainer methods to the ArgillaTrainer to improve interoperability across frameworks. (#4214).
- Added additional formatting checks to the ArgillaTrainer to allow for better interoperability of defaults and formatting_func usage. (#4214).
- Added a warning to the update_config method of ArgillaTrainer to emphasize if the kwargs were updated correctly. (#4214).
- Added argilla.client.feedback.utils module with html_utils (this mainly includes video/audio/image_to_html that convert media to dataURL to be able to render them in the Argilla UI and create_token_highlights to highlight tokens in a custom way. Both work on TextQuestion and TextField with use_markdown=True) and assignments (this mainly includes assign_records to assign records according to a number of annotators and records, an overlap and the shuffle option; and assign_workspace to assign and create if needed a workspace according to the record assignment). (#4121)
Fixed
- Fixed error in ArgillaTrainer, with numerical labels, using RatingQuestion instead of RankingQuestion (#4171)
- Fixed error in ArgillaTrainer, now we can train for extractive_question_answering using a validation sample (#4204)
- Fixed error in ArgillaTrainer, when training for sentence-similarity it didn't work with a list of values per record (#4211)
- Fixed error in the unification strategy for RankingQuestion (#4295)
- Fixed TextClassificationSettings.labels_schema order was not being preserved. Closes #3828 (#4332)
- Fixed error when requesting non-existing API endpoints. Closes #4073 (#4325)
- Fixed error when passing draft responses to create records endpoint. (#4354)
Changed
- [breaking] Suggestions agent field only accepts now some specific characters and a limited length. (#4265)
- [breaking] Suggestions score field only accepts now float values in the range 0 to 1. (#4266)
- Updated POST /api/v1/dataset/:dataset_id/records/search endpoint to support optional query attribute. (#4327)
- Updated POST /api/v1/dataset/:dataset_id/records/search endpoint to support filter and sort attributes. (#4327)
- Updated POST /api/v1/me/datasets/:dataset_id/records/search endpoint to support optional query attribute. (#4270)
- Updated POST /api/v1/me/datasets/:dataset_id/records/search endpoint to support filter and sort attributes. (#4270)
- Changed the logging style while pulling and pushing FeedbackDataset to Argilla from tqdm style to rich. (#4267). Contributed by @zucchini-nlp.
- Updated push_to_argilla to print repr of the pushed RemoteFeedbackDataset after push and changed show_progress to True by default. (#4223)
- Changed models and tokenizer for the ArgillaTrainer to explicitly allow for changing them when needed. (#4214).
v1.19.0
🔆 Release highlights
🚨 Breaking changes
We have chosen to disable raising a ValueError in the FeedbackDataset.*_by_name() methods: FeedbackDataset.question_by_name(), FeedbackDataset.field_by_name() and FeedbackDataset.metadata_property_by_name(). Instead, these methods will now return None when no match is found. This change is backwards compatible with previous versions of Argilla but might break your code if you are relying on the ValueError to be raised.
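If your code relied on the exception, one way to adapt is to wrap the lookup and raise explicitly when None comes back. The sketch below is a generic pattern under assumptions: `get_or_raise` is a hypothetical helper, and a plain dictionary stands in for a dataset so the example runs without an Argilla instance.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def get_or_raise(lookup: Callable[[str], Optional[T]], name: str) -> T:
    """Call a *_by_name style lookup and raise if it returns None."""
    result = lookup(name)
    if result is None:
        raise ValueError(f"No item named {name!r} was found")
    return result


# Stand-in for something like dataset.question_by_name:
# dict.get also returns None when the key is missing.
questions = {"sentiment": "<question object>"}
found = get_or_raise(questions.get, "sentiment")
```

With a real dataset, the first argument would be the bound method itself, for example the dataset's question_by_name.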
Add vectors to your FeedbackDataset
You can now add vectors to your Feedback dataset and records to enable similarity search.
To do that, first, you need to add vector settings to your dataset:
dataset = rg.FeedbackDataset(
    fields=[...],
    questions=[...],
    vector_settings=[
        rg.VectorSettings(
            name="my_vectors",
            dimensions=768,
            title="My Vectors"  # optional
        )
    ]
)
Then, you can add vectors to your records where the key matches the name of your vector settings and the value is a List[float]:
record = rg.FeedbackRecord(
fields={...},
vectors={"my_vectors": [...]}
)
post_filter
step, since there is a bug that makes queries fail using filtering + KNN from Argilla.
See opensearch-project/k-NN#1286
Similarity search
If you have included vectors and vector settings in your dataset, you can use the similarity search features within that dataset.
In the Argilla UI, you can find records that are similar to each other using the Find similar button at the top right corner of the record card. Here's how to do it:
With a collapsed reference record:
With an expanded reference record:
In the SDK, you can do the same like this:
ds = rg.FeedbackDataset.from_argilla("my_dataset", workspace="my_workspace")

# using another record
similar_records = ds.find_similar_records(
    vector_name="my_vector",
    record=ds.records[0],
    max_results=5
)

# work with the resulting tuples
for record, score in similar_records:
    ...
You can also find records that are similar to a given text, but bear in mind that the dimensions of the resulting vector should be equal to that of the vector used in the dataset records:
similar_records = ds.find_similar_records(
vector_name="my_vector",
value=embedder_model.embeddings("My text is here")
# value=embedder_model.embeddings("My text is here").tolist() # for numpy arrays
)
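The dimension requirement is easier to see with a small stand-alone example. The sketch below ranks stored vectors by cosine similarity and rejects a query whose length does not match. It is only an illustration: in Argilla the search is delegated to the search engine, whose similarity measure may differ, and both function names here are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} != {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    query: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    max_results: int = 5,
) -> List[Tuple[str, float]]:
    """Return (record_id, score) pairs sorted from most to least similar."""
    scored = [(record_id, cosine_similarity(query, vector)) for record_id, vector in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:max_results]


stored = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
results = most_similar([1.0, 0.0], stored, max_results=2)
```

Passing a three-dimensional query against these two-dimensional vectors raises a ValueError, which mirrors why the embedding model used for the query text must match the one used for the dataset vectors.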
FeedbackDataset
We added a show_progress argument to from_huggingface() method to make the progress bar for the parsing records process optional.
RemoteFeedbackDataset
We have added additional support for the pull() method of RemoteFeedbackDataset. It is now possible to pull a RemoteFeedbackDataset with a specific max_records argument. In combination with the earlier introduced filter_by and sort_by, this allows for more fine-grained control over the records that are pulled from Argilla.
ArgillaTrainer
The ArgillaTrainer class has been updated to support additional features. Hugging Face models can now be shared to the Hugging Face Hub directly from the ArgillaTrainer.push_to_huggingface method. Additionally, we have included filter_by, sort_by, and max_records arguments in the ArgillaTrainer initialisation method to allow for more fine-grained control over the records used for training.
from argilla import SortBy
from argilla.feedback import ArgillaTrainer

trainer = ArgillaTrainer(
    dataset=dataset,
    task=task,
    framework="setfit",
    filter_by={"response_status": ["submitted"]},
    sort_by=[SortBy(field="metadata.my-metadata", order="asc")],
    max_records=1000
)
🎨 UI improvements
- We have changed the layout of the filters for a slimmer and more flexible component that will host more filter types in the future without being disruptive.
- We have fixed a small UI bug where larger svg-images were pushed out of the visible screen, leading to a bad user experience.
- There is sorting support based on inserted_at and updated_at datetime fields.
1.19.0 Changelog
Added
- Added POST /api/v1/datasets/:dataset_id/records/search endpoint to search for records without user context, including responses by all users. (#4143)
- Added POST /api/v1/datasets/:dataset_id/vectors-settings endpoint for creating vector settings for a dataset. (#3776)
- Added GET /api/v1/datasets/:dataset_id/vectors-settings endpoint for listing the vectors settings for a dataset. (#3776)
- Added DELETE /api/v1/vectors-settings/:vector_settings_id endpoint for deleting a vector settings. (#3776)
- Added PATCH /api/v1/vectors-settings/:vector_settings_id endpoint for updating a vector settings. (#4092)
- Added GET /api/v1/records/:record_id endpoint to get a specific record. (#4039)
- Added support to include vectors for GET /api/v1/datasets/:dataset_id/records endpoint response using include query param. (#4063)
- Added support to include vectors for GET /api/v1/me/datasets/:dataset_id/records endpoint response using include query param. (#4063)
- Added support to include vectors for POST /api/v1/me/datasets/:dataset_id/records/search endpoint response using include query param. (#4063)
- Added show_progress argument to from_huggingface() method to make the progress bar for parsing records process optional. (#4132).
- Added a progress bar for parsing records process to from_huggingface() method with trange in tqdm. (#4132).
- Added to sort by inserted_at or updated_at for datasets with no metadata. (#4147)
- Added max_records argument to pull() method for RemoteFeedbackDataset. (#4074)
- Added functionality to push your models to the Hugging Face hub with ArgillaTrainer.push_to_huggingface (#3976). Contributed by @Racso-3141.
- Added filter_by argument to ArgillaTrainer to filter by response_status (#4120).
- Added sort_by argument to ArgillaTrainer to sort by metadata (#4120).
- Added max_records argument to ArgillaTrainer to limit record used for training (#4120).
- Added add_vector_settings method to local and remote FeedbackDataset. (#4055)
- Added update_vectors_settings method to local and remote FeedbackDataset. (#4122)
- Added delete_vectors_settings method to local and remote FeedbackDataset. (#4130)
- Added vector_settings_by_name method to local and remote FeedbackDataset. (#4055)
- Added find_similar_records method to local and remote FeedbackDataset. (#4023)
- Added ARGILLA_SEARCH_ENGINE environment variable to configure the search engine to use. (#4019)
Changed
- [breaking] Remove support for Elasticsearch < 8.5 and OpenSearch < 2.4. (#4173)
- [breaking] Users working with OpenSearch engines must use version >=2.4 and set ARGILLA_SEARCH_ENGINE=opensearch. (#4019 and #4111)
- [breaking] Changed FeedbackDataset.*_by_name() methods to return None when no match is found (#4101).
- [breaking] limit query parameter for GET /api/v1/datasets/:dataset_id/records endpoint is now only accepting values greater or equal than 1 and less or equal than 1000. (#4143)
- [breaking] limit query parameter for GET /api/v1/me/datasets/:dataset_id/records endpoint is now only accepting values greater or equal than 1 and less or equal than 1000. (#4143)
- Update GET /api/v1/datasets/:dataset_id/records endpoint to fetch record using the search engine. (#4142)
- Update GET /api/v1/me/datasets/:dataset_id/records endpoint to fetch record using the search engine. (#4142)
- Update `POST /api/v1/datasets/:dataset_id/reco...
v1.18.0
🔆 Release highlights
💾 Add metadata properties to Feedback Datasets
You can now filter and sort records in Feedback Datasets in the UI and Python SDK using the metadata included in the records. To do that, you will first need to set up a MetadataProperty in your dataset:
# set up a dataset including metadata properties
dataset = rg.FeedbackDataset(
fields=[
rg.TextField(name="prompt"),
rg.TextField(name="response"),
],
questions=[
rg.TextQuestion(name="question")
],
metadata_properties=[
rg.TermsMetadataProperty(name="source"),
rg.IntegerMetadataProperty(name="response_length", title="Response length")
]
)
Learn more about how to define metadata properties or adding or deleting metadata properties in existing datasets.
This will read the metadata in the records that match the name of the metadata property. Any other metadata present in the record not matching a metadata property will be saved but not available to use in the filtering and sorting features in the UI or SDK.
# create a record with metadata
record = rg.FeedbackRecord(
fields={
"prompt": "Why can camels survive long without water?",
"response": "Camels use the fat in their humps to keep them filled with energy and hydration for long periods of time."
},
metadata={"source": "wikipedia", "response_length": 105, "my_hidden_metadata": "hidden metadata"}
)
Learn more about how to create records with metadata and how to add, modify or delete metadata from existing records.
🗃️ Filter and sort records using metadata in Feedback Datasets
In the Python SDK, you can filter and sort records based on the Metadata Properties that you set up for your dataset. You can combine multiple filters and sorts. Here is an example of how you could use them:
filtered_records = remote.filter_by(
    metadata_filters=[
        rg.IntegerMetadataFilter(
            name="response_length",
            ge=500,  # optional: greater or equal to
            le=1000  # optional: lower or equal to
        ),
        rg.TermsMetadataFilter(
            name="source",
            values=["wikipedia", "wikihow"]
        )
    ]
).sort_by(
    [
        rg.SortBy(
            field="response_length",
            order="desc"  # for descending or "asc" for ascending
        )
    ]
)
In the UI, simply use the Metadata and Sort components to filter and sort records like this:
Read more about filtering and sorting in Feedback Datasets.
⚠️ Breaking change using SQLite as backend in a docker deployment
From version 1.17.0 a new argilla OS user is configured for the provided Docker images. If you are using the Docker deployment and you want to upgrade to this version from a version older than v1.17.0, you should change the permissions of the SQLite db file before upgrading. (If you already upgraded from v1.17.0, this step was already applied; see the Release Notes.) You can do it with the following command:
docker exec --user root <argilla_server_container_id> /bin/bash -c 'chmod -R 777 "$ARGILLA_HOME_PATH"'
Note: You can find the docker container id by running:
docker ps | grep -i argilla-server
713973693fb7 argilla/argilla-server:v1.16.0 "/bin/bash start_arg…" 11 hours ago Up 7 minutes 0.0.0.0:6900->6900/tcp docker-argilla-1
Once the version is upgraded, we recommend providing proper security access to this folder by setting the user and group to the new argilla user:
docker exec --user root <argilla_server_container_id> /bin/bash -c 'chown -R argilla:argilla "$ARGILLA_HOME_PATH"'
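If you prefer to script the two steps, the sketch below builds the same docker exec commands as argument lists. The helper names are hypothetical, and the container id is the example one from the docker ps output above. Nothing is executed here; the commands are only printed.

```python
import shlex


# Hypothetical helper names; the commands themselves are the ones shown above.
def docker_exec_as_root(container_id: str, shell_command: str) -> list:
    """Build the argv for running a shell command as root in a container."""
    return ["docker", "exec", "--user", "root", container_id,
            "/bin/bash", "-c", shell_command]


def pre_upgrade_command(container_id: str) -> list:
    # Run before upgrading: open up permissions on the Argilla home folder.
    return docker_exec_as_root(container_id, 'chmod -R 777 "$ARGILLA_HOME_PATH"')


def post_upgrade_command(container_id: str) -> list:
    # Run after upgrading: hand ownership to the new argilla user and group.
    return docker_exec_as_root(
        container_id, 'chown -R argilla:argilla "$ARGILLA_HOME_PATH"'
    )


# Example container id taken from the `docker ps` output above.
print(shlex.join(pre_upgrade_command("713973693fb7")))
print(shlex.join(post_upgrade_command("713973693fb7")))
```

Each list can be passed to `subprocess.run(..., check=True)` to actually run the step: the first before the upgrade, the second after it.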
1.18.0 Changelog
Added
- New GET /api/v1/datasets/:dataset_id/metadata-properties endpoint for listing dataset metadata properties. (#3813)
- New POST /api/v1/datasets/:dataset_id/metadata-properties endpoint for creating dataset metadata properties. (#3813)
- New PATCH /api/v1/metadata-properties/:metadata_property_id endpoint allowing the update of a specific metadata property. (#3952)
- New DELETE /api/v1/metadata-properties/:metadata_property_id endpoint for deletion of a specific metadata property. (#3911)
- New GET /api/v1/metadata-properties/:metadata_property_id/metrics endpoint to compute metrics for a specific metadata property. (#3856)
- New PATCH /api/v1/records/:record_id endpoint to update a record. (#3920)
- New PATCH /api/v1/dataset/:dataset_id/records endpoint to bulk update the records of a dataset. (#3934)
- Missing validations to PATCH /api/v1/questions/:question_id. Now title and description are using the same validations used to create questions. (#3967)
- Added TermsMetadataProperty, IntegerMetadataProperty and FloatMetadataProperty classes allowing to define metadata properties for a FeedbackDataset. (#3818)
- Added metadata_filters to filter_by method in RemoteFeedbackDataset to filter based on metadata i.e. TermsMetadataFilter, IntegerMetadataFilter, and FloatMetadataFilter. (#3834)
- Added a validation layer for both metadata_properties and metadata_filters in their schemas and as part of the add_records and filter_by methods, respectively. (#3860)
- Added sort_by query parameter to listing records endpoints that allows to sort the records by inserted_at, updated_at or metadata property. (#3843)
- Added add_metadata_property method to both FeedbackDataset and RemoteFeedbackDataset (i.e. FeedbackDataset in Argilla). (#3900)
- Added fields inserted_at and updated_at in RemoteResponseSchema. (#3822)
- Added support for sort_by for RemoteFeedbackDataset i.e. a FeedbackDataset uploaded to Argilla. (#3925)
- Added metadata_properties support for both push_to_huggingface and from_huggingface. (#3947)
- Add support for update records (metadata) from Python SDK. (#3946)
- Added delete_metadata_properties method to delete metadata properties. (#3932)
- Added update_metadata_properties method to update metadata_properties. (#3961)
- Added automatic model card generation through ArgillaTrainer.save (#3857)
- Added FeedbackDataset TaskTemplateMixin for pre-defined task templates. (#3969)
- A maximum limit of 50 on the number of options a ranking question can accept. (#3975)
- New last_activity_at field to FeedbackDataset exposing when the last activity for the associated dataset occurs. (#3992)
Changed
- GET /api/v1/datasets/{dataset_id}/records, GET /api/v1/me/datasets/{dataset_id}/records and POST /api/v1/me/datasets/{dataset_id}/records/search endpoints to return the total number of records. (#3848, #3903)
- Implemented __len__ method for filtered datasets to return the number of records matching the provided filters. (#3916)
- Increase the default max result window for Elasticsearch created for Feedback datasets. (#3929)
- Force elastic index refresh after records creation. (#3929)
- Validate metadata fields for filtering and sorting in the Python SDK. (#3993)
- Using metadata property name instead of id for indexing data in search engine index. (#3994)
Fixed
- Fixed response schemas to allow values to be None i.e. when a record is discarded the response.values are set to None. (#3926)
New Contributors
Full Changelog: v1.17.0...v1.18.0