docs: add documentation for with and without records on from_hub (#5515)

Co-authored-by: Natalia Elvira <[email protected]>
Co-authored-by: nataliaElv <[email protected]>
3 people authored Sep 19, 2024
1 parent da95b38 commit e1b2e6e
Showing 3 changed files with 107 additions and 10 deletions.
3 changes: 3 additions & 0 deletions argilla/docs/how_to_guides/dataset.md
@@ -140,6 +140,9 @@ new_dataset.create()

## Define dataset settings

!!! tip
Instead of defining your own custom settings, you can use some of our pre-built templates for text classification, ranking and rating. Learn more [here](../reference/argilla/settings/settings.md#creating-settings-using-built-in-templates).

### Fields

The fields in a dataset consist of one or more data items requiring annotation. Currently, Argilla supports plain text and markdown through the `TextField` and images through the `ImageField`, though we plan to introduce additional field types in future updates.
34 changes: 25 additions & 9 deletions argilla/docs/how_to_guides/import_export.md
@@ -120,28 +120,44 @@ dataset = rg.Dataset.from_hub(repo_id="<my_org>/<my_dataset>")

The `rg.Dataset.from_hub` method loads the configuration and records from the dataset repo. If you only want to load records, you can pass a `datasets.Dataset` object to the `dataset.records.log` method instead. This lets you configure your own dataset and reuse existing Hub datasets. See the [guide on records](record.md) for more information.


!!! note "With or without records"

The example above pulls the dataset's `Settings` and records from the Hub. If you only want to pull the dataset's configuration, set the `with_records` parameter to `False`. This is useful if you're only interested in a specific dataset template or you want to make changes to the dataset settings and/or records.

```python
dataset = rg.Dataset.from_hub(repo_id="<my_org>/<my_dataset>", with_records=False)
```

With the dataset's configuration, you could then make changes to the dataset. For example, you could adapt the dataset's settings for a different task:

```python
dataset.settings.questions = [rg.TextQuestion(name="answer")]
dataset.update()
```

You could then load the dataset's records using the `load_dataset` function from the `datasets` package and pass the result to the `dataset.records.log` method.

```python
from datasets import load_dataset

hf_dataset = load_dataset("<my_org>/<my_dataset>")
dataset.records.log(hf_dataset) # (1)
```

1. You could also use the `mapping` parameter to map record field names to Argilla field and question names.
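
To illustrate what such a mapping does conceptually, the sketch below renames the keys of a plain dictionary. It is not Argilla's implementation: `apply_mapping`, the sample row, and its column names are hypothetical and only show the idea of translating source column names into field and question names.

```python
from typing import Any, Dict


def apply_mapping(row: Dict[str, Any], mapping: Dict[str, str]) -> Dict[str, Any]:
    """Rename source keys to their target names; unmapped keys are kept as-is."""
    return {mapping.get(key, key): value for key, value in row.items()}


# Hypothetical source row and mapping (source column -> target name).
row = {"prompt": "What is Argilla?", "completion": "A data annotation tool.", "id": 7}
mapping = {"prompt": "text", "completion": "answer"}

remapped = apply_mapping(row, mapping)
print(remapped)
```

Keeping unmapped keys untouched is a design choice of this sketch, so columns that already match a target name need no entry in the mapping.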


#### Import settings from Hub

When importing datasets from the Hub, Argilla determines the settings in one of three ways:

1. If the dataset was pushed to the Hub by Argilla, the settings are loaded from the configuration file stored in the dataset repo.
2. If the dataset was pushed to the Hub by another source, Argilla defines the settings based on the dataset's features in `datasets.Features`. For example, it creates a `TextField` for a text feature or a `LabelQuestion` for a label class.
3. You can pass a custom `rg.Settings` object to the `rg.Dataset.from_hub` method via the `settings` parameter. This overrides the settings loaded from the Hub.

```python
settings = rg.Settings(
    fields=[rg.TextField(name="text")],
    questions=[rg.TextQuestion(name="answer")],
) # (1)

dataset = rg.Dataset.from_hub(repo_id="<my_org>/<my_dataset>", settings=settings)
```

1. The settings that you pass to the `rg.Dataset.from_hub` method will override the settings loaded from the hub, and need to align with the dataset being loaded.
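
The order of precedence described above can be sketched in plain Python. This is a simplified model under stated assumptions, not Argilla's code: `resolve_settings`, `infer_from_features`, and the dictionary-based representation of settings and features are all hypothetical.

```python
from typing import Any, Dict, Optional


def infer_from_features(features: Dict[str, Any]) -> Dict[str, list]:
    """Hypothetical inference: a list of class names becomes a label question,
    anything else becomes a text field."""
    settings: Dict[str, list] = {"fields": [], "questions": []}
    for name, feature in features.items():
        if isinstance(feature, list):
            settings["questions"].append(
                {"type": "LabelQuestion", "name": name, "labels": feature}
            )
        else:
            settings["fields"].append({"type": "TextField", "name": name})
    return settings


def resolve_settings(
    custom: Optional[dict] = None,
    hub_config: Optional[dict] = None,
    features: Optional[Dict[str, Any]] = None,
) -> dict:
    """Pick settings: custom settings first, then a stored config, then inference."""
    if custom is not None:
        return custom
    if hub_config is not None:
        return hub_config
    return infer_from_features(features or {})


# Hypothetical features of a dataset that was not pushed by Argilla.
inferred = resolve_settings(features={"text": "string", "label": ["pos", "neg"]})
print(inferred)
```

Custom settings win over everything else in this sketch, which mirrors the statement that the `settings` parameter overrides what is loaded from the Hub.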

### Local Disk

#### Export to Disk
80 changes: 79 additions & 1 deletion argilla/docs/reference/argilla/settings/settings.md
@@ -30,7 +30,85 @@ dataset.create()

```

To define the settings for fields, questions, metadata, vectors, or distribution, refer to the [`rg.TextField`](fields.md), [`rg.LabelQuestion`](questions.md), [`rg.TermsMetadataProperty`](metadata_property.md), [`rg.VectorField`](vectors.md), and [`rg.TaskDistribution`](task_distribution.md) class documentation.

### Creating settings using built-in templates

Argilla provides built-in templates for creating settings for common dataset types. To use a template, use the class methods of the `Settings` class. There are three built-in templates available for classification, ranking, and rating tasks. Template settings also include default guidelines and mappings.

#### Classification Task

You can define a classification task using the `rg.Settings.for_classification` class method. This creates settings with a single field and a label question. The `field_type` parameter selects the type of that field and accepts `image` or `text`.

```python
settings = rg.Settings.for_classification(labels=["positive", "negative"]) # (1)
```

This will return a `Settings` object with the following settings:

```python
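settings = rg.Settings(
    guidelines="Select a label for the document.",
    # The field class depends on `field_type` ("text" or "image"); a text field is shown here.
    fields=[rg.TextField(name="text")],
    # `labels` are the values passed to `for_classification`.
    questions=[rg.LabelQuestion(name="label", labels=["positive", "negative"])],
    mapping={"input": "text", "output": "label", "document": "text"},
)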
```

#### Ranking Task

You can define a ranking task using the `rg.Settings.for_ranking` class method. This creates settings with three text fields (an instruction and two responses) and a ranking question.

```python
settings = rg.Settings.for_ranking()
```

This will return a `Settings` object with the following settings:

```python
settings = rg.Settings(
    guidelines="Rank the responses.",
    fields=[
        rg.TextField(name="instruction"),
        rg.TextField(name="response1"),
        rg.TextField(name="response2"),
    ],
    questions=[rg.RankingQuestion(name="ranking", values=["response1", "response2"])],
    mapping={
        "input": "instruction",
        "prompt": "instruction",
        "chosen": "response1",
        "rejected": "response2",
    },
)
```
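
One way to read the template mapping is to group its keys by the name they point to. The sketch below does this in plain Python, assuming that the mapping keys are source column names and the values are field names; `sources_by_target` is a hypothetical helper, not part of Argilla.

```python
from collections import defaultdict
from typing import Dict, List

# Mapping copied from the ranking template above.
ranking_mapping = {
    "input": "instruction",
    "prompt": "instruction",
    "chosen": "response1",
    "rejected": "response2",
}


def sources_by_target(mapping: Dict[str, str]) -> Dict[str, List[str]]:
    """Group source column names by the target name they map to."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for source, target in mapping.items():
        grouped[target].append(source)
    return dict(grouped)


grouped = sources_by_target(ranking_mapping)
print(grouped)
```

Under that assumption, a source dataset could name its instruction column either `input` or `prompt` and still populate the `instruction` field.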

#### Rating Task

You can define a rating task using the `rg.Settings.for_rating` class method. This creates settings with two text fields (an instruction and a response) and a rating question.

```python
settings = rg.Settings.for_rating()
```

This will return a `Settings` object with the following settings:

```python
settings = rg.Settings(
    guidelines="Rate the response.",
    fields=[
        rg.TextField(name="instruction"),
        rg.TextField(name="response"),
    ],
    questions=[rg.RatingQuestion(name="rating", values=[1, 2, 3, 4, 5])],
    mapping={
        "input": "instruction",
        "prompt": "instruction",
        "output": "response",
        "score": "rating",
    },
)
```

---
