[FEATURE-BRANCH] Argilla direct import from Hub #5572

Merged
merged 28 commits into from
Oct 29, 2024

Conversation

jfcalvo

@jfcalvo jfcalvo commented Oct 7, 2024

Description

Changes to support importing datasets from HF Hub.

Refs argilla-io/roadmap#21

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested

  • Modifying test suite.

Checklist

  • I added relevant documentation
  • I followed the style guidelines of this project
  • I did a self-review of my code
  • I made corresponding changes to the documentation
  • I confirm My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

…eld to at least one field (required or not) (#5569)

# Description

This PR changes how dataset publish validation works, moving
from:
* At least one required field for a dataset to be publishable.

To:

* At least one field (required or not) for a dataset to be
publishable.
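The rule change above can be illustrated with a minimal sketch. The `Field` class and both function names are hypothetical stand-ins invented for this example; they are not the server's models, and only the field rule described above is covered.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Field:
    """Hypothetical stand-in for a dataset field (not the server's model)."""

    name: str
    required: bool = False


def passes_old_field_rule(fields: List[Field]) -> bool:
    # Previous rule: at least one *required* field.
    return any(field.required for field in fields)


def passes_new_field_rule(fields: List[Field]) -> bool:
    # New rule: at least one field, required or not.
    return len(fields) > 0


optional_only = [Field(name="text", required=False)]
print(passes_old_field_rule(optional_only))  # False
print(passes_new_field_rule(optional_only))  # True
```

A dataset whose fields are all optional fails the old rule and passes the new one, which is the case this change is meant to allow.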

Refs argilla-io/roadmap#21

**Type of change**

- New feature (non-breaking change which adds functionality)


**How Has This Been Tested**

- [x] Modifying test suite. 

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm My changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature
works
- I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)

codecov bot commented Oct 7, 2024

Codecov Report

Attention: Patch coverage is 90.83665% with 23 lines in your changes missing coverage. Please review.

Project coverage is 91.24%. Comparing base (32bd370) to head (b2a1590).
Report is 6 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...-server/src/argilla_server/api/handlers/v1/jobs.py | 66.66% | 7 Missing ⚠️ |
| argilla-server/src/argilla_server/jobs/hub_jobs.py | 75.00% | 5 Missing ⚠️ |
| ...rgilla_server/api/handlers/v1/datasets/datasets.py | 50.00% | 4 Missing ⚠️ |
| ...c/argilla_server/api/policies/v1/dataset_policy.py | 40.00% | 3 Missing ⚠️ |
| argilla-server/src/argilla_server/contexts/hub.py | 97.43% | 3 Missing ⚠️ |
| ...r/src/argilla_server/api/policies/v1/job_policy.py | 80.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #5572      +/-   ##
===========================================
- Coverage    91.28%   91.24%   -0.04%     
===========================================
  Files          145      150       +5     
  Lines         6036     6250     +214     
===========================================
+ Hits          5510     5703     +193     
- Misses         526      547      +21     

| Flag | Coverage Δ |
|---|---|
| argilla-server | 91.24% <90.83%> (-0.04%) ⬇️ |


jfcalvo and others added 10 commits October 7, 2024 09:37
# Description

This PR removes name pattern validation for fields, questions,
metadata properties, and vector settings.

**Type of change**

- Refactor (change restructuring the codebase without changing
functionality)
- Improvement (change adding some improvement to an existing
functionality)

**How Has This Been Tested**

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm My changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature
works
- I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)
# Description

This PR adds a new `metadata` column to the `datasets` table, along with
the following changes:
* `metadata` values can now be specified when creating a dataset.
* `metadata` values can now be specified when updating a dataset.
* The `metadata` attribute is now included in the `Dataset` schema, so a
dataset's `metadata` is exposed in our API.

Refs argilla-io/roadmap#21

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [x] Adding new tests.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm My changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature
works
- I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)

---------

Co-authored-by: Paco Aranda <[email protected]>
# Description

This PR adds changes to support importing datasets from the HF Hub into
Argilla, with the following features:
* Add a new `POST /api/v1/datasets/:dataset_id/import` endpoint that
enqueues a background job to import records from a HF Hub dataset and
returns information about the enqueued job. It expects the following
parameters:
  * `name`: the name of the dataset (e.g. `lhoestq/demo1`). @burtenshaw
  suggested renaming it to `repo_id`.
  * `subset`: the dataset subset (e.g. `default`). @burtenshaw suggested
  making this parameter optional.
  * `split`: the dataset split (e.g. `train`). @burtenshaw suggested
  making this parameter optional.
* Add a new background job so the import process runs outside of
request time.
* Add a new `HubDataset` class encapsulating all the logic to import a
dataset from the Hub.
* Add a new `/api/v1/jobs/:job_id` endpoint to get information about the
status of a specific job. This is useful if the UI or the SDK needs to
know whether the import process has finished. (@frascuchon we can use
this to give information about other processes, for example when a
dataset's distribution settings are changed).
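
The enqueue-then-poll flow described above could look roughly like the sketch below. It is not the Argilla client. The payload keys come from the parameter list above, but the `status` key, the status strings, and the helper names `build_import_payload` and `wait_for_job` are assumptions made for illustration. `subset` and `split` are treated as optional here, following the suggestion in the list, and the HTTP call is injected as a function so the sketch runs without a server.

```python
import time
from typing import Any, Callable, Dict, Optional

# Assumed terminal status values; the real job schema is not shown in this PR.
TERMINAL_STATUSES = {"finished", "failed"}


def build_import_payload(
    name: str, subset: Optional[str] = None, split: Optional[str] = None
) -> Dict[str, str]:
    """Build a JSON body for POST /api/v1/datasets/:dataset_id/import."""
    payload = {"name": name}
    if subset is not None:
        payload["subset"] = subset
    if split is not None:
        payload["split"] = split
    return payload


def wait_for_job(
    fetch_job: Callable[[str], Dict[str, Any]],
    job_id: str,
    interval: float = 1.0,
    max_polls: int = 60,
) -> str:
    """Poll `fetch_job` (a caller-supplied wrapper around
    /api/v1/jobs/:job_id) until the job reaches a terminal status."""
    for _ in range(max_polls):
        status = fetch_job(job_id)["status"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

In practice `fetch_job` would wrap an authenticated HTTP request to the jobs endpoint; injecting it keeps the polling logic testable in isolation.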

Refs argilla-io/roadmap#21

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [x] Manually testing and adding more automatic tests to our suite.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm My changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature
works
- I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)

---------

Co-authored-by: Paco Aranda <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@@ -21,7 +21,7 @@ jobs:
build:
services:
argilla-server:
- image: argilladev/argilla-hf-spaces:develop
+ image: argilladev/argilla-hf-spaces:pr-5572

TODO: Apply this change before merging the PR.

Suggested change
- image: argilladev/argilla-hf-spaces:pr-5572
+ image: argilladev/argilla-hf-spaces:develop

jfcalvo and others added 15 commits October 18, 2024 14:34
# Description

Using the dataset
https://huggingface.co/datasets/mlabonne/ultrachat_200k_sft we
found that the import feature was not correctly mapping the `message`
feature.

To fix this, this PR improves how feature values are cast: it checks
whether the features are instances of certain feature classes instead of
using the `_type` method.

I have also added a new test importing the `mlabonne/ultrachat_200k_sft`
dataset and using chat fields.

Refs argilla-io/roadmap#21

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [x] Adding new tests to the suite.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm My changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature
works
- I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)
…ass labels (#5613)

# Description

This PR adds the following changes:
* Add casting for features that use sequences of class labels (casting
them with the `int2str` function).
* Cast suggestion values mapped to multi-label questions to strings
(iterating over the values).
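
As a rough illustration of the casting described above, the sketch below dispatches on feature class with `isinstance` and converts class-label ids to names with `int2str`. The `ClassLabel` and `Sequence` classes are simplified stand-ins defined here so the example runs on its own; they only loosely mirror the Hub feature classes, and `cast_value` is a hypothetical helper, not the code in `HubDataset`.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class ClassLabel:
    """Simplified stand-in for a class-label feature: ids map to names."""

    names: List[str]

    def int2str(self, value: int) -> str:
        return self.names[value]


@dataclass
class Sequence:
    """Simplified stand-in for a sequence feature wrapping an inner feature."""

    feature: Any


def cast_value(feature: Any, value: Any) -> Any:
    # Single class label: convert the integer id to its label name.
    if isinstance(feature, ClassLabel):
        return feature.int2str(value)
    # Sequence of class labels: convert each id, iterating over the values.
    if isinstance(feature, Sequence) and isinstance(feature.feature, ClassLabel):
        return [feature.feature.int2str(item) for item in value]
    # Anything else is passed through unchanged in this sketch.
    return value


labels = ClassLabel(names=["negative", "positive"])
print(cast_value(labels, 1))                    # positive
print(cast_value(Sequence(labels), [0, 1]))     # ['negative', 'positive']
```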

Refs argilla-io/roadmap#21

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [x] Adding additional tests to our suite.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm My changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature
works
- I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)
…id` is not provided on mapping (#5616)

# Description

This PR uses the imported dataset split to build the `external_id` when
no value for `external_id` is specified in the import mapping.

When importing the `train` split of a dataset without providing an
`external_id`, the `external_id` is calculated as follows:
- `train_0`: first row of `train` split.
- `train_1`: second row of `train` split.
- ...

This avoids duplicated rows when another split is imported into the
same dataset. If we later import the `test` split into the same
dataset, the `external_id` values will be:
- `train_0`: first row of `train` split.
- `train_1`: second row of `train` split.
- ...
- `test_0`: first row of `test` split.
- `test_1`: second row of `test` split.
- ...
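
A minimal sketch of this fallback, assuming a hypothetical helper name `resolve_external_id` and a `mapped_value` argument standing in for whatever the import mapping provides:

```python
from typing import Optional


def resolve_external_id(
    split: str, row_idx: int, mapped_value: Optional[str] = None
) -> str:
    """Use the mapped external_id when present, else `<split>_<row index>`."""
    if mapped_value is not None:
        return mapped_value
    return f"{split}_{row_idx}"


train_ids = [resolve_external_id("train", idx) for idx in range(2)]
test_ids = [resolve_external_id("test", idx) for idx in range(2)]
print(train_ids)  # ['train_0', 'train_1']
print(test_ids)   # ['test_0', 'test_1']
```

Because the split name is part of the id, rows at the same index in different splits never collide.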

Refs argilla-io/roadmap#21

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [x] Adding additional tests.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm My changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature
works
- I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)
# Description

Use the row index and the split name to generate the default record id.


**Type of change**
- Refactor (change restructuring the codebase without changing
functionality)
- Improvement (change adding some improvement to an existing
functionality)

**How Has This Been Tested**

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm My changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature
works
- I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)
# Description

This PR fixes problems when a class label sequence is mapped as
suggestions.

NOTE: This PR does not infer a sequence of class labels as a multi-label
question. It only prevents errors when a column holding a sequence of
class labels is used as a suggestion (or as another kind of property in
Argilla).

**Type of change**

- Bug fix (non-breaking change which fixes an issue)

**How Has This Been Tested**

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm My changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature
works
- I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)
# Description

This PR changes the `Dataset.from_hub` method to open/return the
Argilla URL for creating datasets from the UI.

**Type of change**

- Improvement (change adding some improvement to an existing
functionality)
- Documentation update

**How Has This Been Tested**

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm My changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature
works
- I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)

---------

Co-authored-by: burtenshaw <[email protected]>
Co-authored-by: Natalia Elvira <[email protected]>
…5639)

# Description

This PR adds a new validation that prevents the creation of records with
an empty `fields` attribute. It also fixes a bug in a chat field
validation method.

Refs argilla-io/roadmap#21

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [x] Adding new tests.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm My changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature
works
- I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)
Co-authored-by: Leire Aguirre <[email protected]>
Co-authored-by: Paco Aranda <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Francisco Aranda <[email protected]>
@jfcalvo jfcalvo added this to the v2.4.0 milestone Oct 29, 2024
@frascuchon frascuchon merged commit a64e15b into develop Oct 29, 2024
6 of 7 checks passed
@frascuchon frascuchon deleted the feat/argilla-direct-feature-branch branch October 29, 2024 09:47