[FEATURE-BRANCH] Argilla direct import from Hub #5572
…eld to at least one field (required or not) (#5569)

# Description

This PR changes how dataset publish validation works, moving from:

* At least one required field for a dataset to be publishable.

To:

* At least one field (required or not) for a dataset to be publishable.

Refs argilla-io/roadmap#21

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [x] Modifying test suite.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)
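The change in publish validation described above can be illustrated with a minimal sketch. The `DatasetField` and `DraftDataset` classes and both function names are hypothetical stand-ins for illustration only; they are not Argilla's actual models or validators.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetField:
    # Hypothetical stand-in for a dataset field definition.
    name: str
    required: bool = False


@dataclass
class DraftDataset:
    # Hypothetical stand-in for a dataset that has not been published yet.
    name: str
    fields: List[DatasetField] = field(default_factory=list)


def is_publishable_before(dataset: DraftDataset) -> bool:
    # Old rule: at least one *required* field must exist.
    return any(f.required for f in dataset.fields)


def is_publishable_after(dataset: DraftDataset) -> bool:
    # New rule: at least one field must exist, required or not.
    return len(dataset.fields) > 0


optional_only = DraftDataset("demo", [DatasetField("text", required=False)])
print(is_publishable_before(optional_only))  # False
print(is_publishable_after(optional_only))   # True
```

A dataset with only optional fields is the case whose outcome changes: it was rejected under the old rule and is accepted under the new one.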
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #5572 +/- ##
===========================================
- Coverage 91.28% 91.24% -0.04%
===========================================
Files 145 150 +5
Lines 6036 6250 +214
===========================================
+ Hits 5510 5703 +193
- Misses 526 547 +21
# Description

This PR removes name pattern validation for fields, questions, metadata properties and vector settings.

**Type of change**

- Refactor (change restructuring the codebase without changing functionality)
- Improvement (change adding some improvement to an existing functionality)

**How Has This Been Tested**

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)
for more information, see https://pre-commit.ci
# Description

This PR adds a new `metadata` column to the `datasets` table, along with the following changes:

* When creating a dataset, it is now possible to specify `metadata` values.
* When updating a dataset, it is now possible to specify `metadata` values.
* The `metadata` attribute is now included in the `Dataset` schema, so the `metadata` of a dataset is exposed in our API.

Refs argilla-io/roadmap#21

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [x] Adding new tests.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

---------

Co-authored-by: Paco Aranda <[email protected]>
# Description

This PR adds changes to support importing datasets from the HF Hub into Argilla, with the following features:

* Add a new `POST /api/v1/datasets/:dataset_id/import` endpoint that enqueues a background job to import records from a HF Hub dataset and returns information about the enqueued job. It expects the following parameters:
  * `name`: the name of the dataset (e.g. `lhoestq/demo1`). @burtenshaw suggested renaming it to `repo_id`.
  * `subset`: the dataset subset (e.g. `default`). @burtenshaw suggested making this parameter optional.
  * `split`: the dataset split (e.g. `train`). @burtenshaw suggested making this parameter optional.
* Add a new background job so the import process runs outside of request time.
* Add a new `HubDataset` class encapsulating all the logic to import a dataset from the Hub.
* Add a new `/api/v1/jobs/:job_id` endpoint to get information about the status of a specific job. This is useful if the UI or the SDK needs to know whether the import process has finished. (@frascuchon we can use this to report on other processes, for example when a dataset's distribution settings are changed.)

Refs argilla-io/roadmap#21

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [x] Manually testing and adding more automatic tests to our suite.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

---------

Co-authored-by: Paco Aranda <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
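As a rough illustration of how a client might use the two endpoints described above, here is a sketch that builds the import request and polls a job until it reaches a terminal state. Only the URL paths and the `name`, `subset` and `split` parameters come from the description. The helper names, the `status` key in the job response, and the status values `"finished"` and `"failed"` are assumptions made for this example. The HTTP call itself is injected as a callable, so the sketch does not depend on any particular client library.

```python
import time
from typing import Callable, Dict, Sequence


def import_url(base_url: str, dataset_id: str) -> str:
    # Path taken from the PR description: POST /api/v1/datasets/:dataset_id/import
    return f"{base_url.rstrip('/')}/api/v1/datasets/{dataset_id}/import"


def job_url(base_url: str, job_id: str) -> str:
    # Path taken from the PR description: /api/v1/jobs/:job_id
    return f"{base_url.rstrip('/')}/api/v1/jobs/{job_id}"


def build_import_payload(name: str, subset: str = "default", split: str = "train") -> Dict[str, str]:
    # The three parameters listed in the description; the defaults are the
    # example values from the description, not documented server defaults.
    return {"name": name, "subset": subset, "split": split}


def wait_for_job(
    fetch_job: Callable[[str], dict],
    job_id: str,
    terminal_statuses: Sequence[str] = ("finished", "failed"),  # assumed values
    interval: float = 1.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    # Poll the job endpoint until the (assumed) "status" key is terminal.
    for _ in range(max_attempts):
        job = fetch_job(job_id)
        if job.get("status") in terminal_statuses:
            return job
        sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_attempts} attempts")
```

In practice `fetch_job` would wrap an authenticated GET request to `job_url(...)` and return the decoded JSON body. Passing it in as a callable keeps the polling logic easy to test with a fake.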
@@ -21,7 +21,7 @@ jobs:
   build:
     services:
       argilla-server:
-        image: argilladev/argilla-hf-spaces:develop
+        image: argilladev/argilla-hf-spaces:pr-5572
TODO: Apply this change before merging the PR.
-        image: argilladev/argilla-hf-spaces:pr-5572
+        image: argilladev/argilla-hf-spaces:develop
# Description

Using the dataset https://huggingface.co/datasets/mlabonne/ultrachat_200k_sft we found that the import feature was not correctly mapping the `message` feature.

To fix this, this PR improves how feature values are cast, checking whether the features are instances of certain feature classes instead of relying on `_type`.

I have also added a new test that imports the `mlabonne/ultrachat_200k_sft` dataset and uses chat fields.

Refs argilla-io/roadmap#21

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [x] Adding new tests to the suite.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)
…ass labels (#5613)

# Description

This PR adds the following changes:

* Add casting for features that are sequences of class labels (casting them with the `int2str` function).
* Cast values to strings for suggestions mapped to multi-label questions (iterating over the values).

Refs argilla-io/roadmap#21

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [x] Adding additional tests to our suite.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)
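The class-label casting described above could look roughly like the sketch below. To stay self-contained it defines tiny stand-in `ClassLabel`, `Sequence` and `Value` classes that only mimic the shape of the Hugging Face `datasets` feature classes. `cast_value` is a hypothetical helper and not the code from this PR. It dispatches with `isinstance` checks and uses `int2str` to turn label ids into label names.

```python
from typing import Any, List


class ClassLabel:
    # Stand-in for a class-label feature: maps integer ids to label names.
    def __init__(self, names: List[str]) -> None:
        self.names = names

    def int2str(self, value: int) -> str:
        return self.names[value]


class Sequence:
    # Stand-in for a sequence feature wrapping an inner feature.
    def __init__(self, feature: Any) -> None:
        self.feature = feature


class Value:
    # Stand-in for a plain scalar feature.
    def __init__(self, dtype: str) -> None:
        self.dtype = dtype


def cast_value(feature: Any, value: Any) -> Any:
    """Cast a raw row value according to its feature (hypothetical helper)."""
    if value is None:
        return None
    if isinstance(feature, ClassLabel):
        # Single class label: integer id -> label name.
        return feature.int2str(value)
    if isinstance(feature, Sequence) and isinstance(feature.feature, ClassLabel):
        # Sequence of class labels: cast each id, e.g. for multi-label suggestions.
        return [feature.feature.int2str(v) for v in value]
    # Any other feature is passed through unchanged in this sketch.
    return value


sentiment = ClassLabel(names=["negative", "positive"])
print(cast_value(sentiment, 1))                 # positive
print(cast_value(Sequence(sentiment), [0, 1]))  # ['negative', 'positive']
```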
…id` is not provided on mapping (#5616)

# Description

This PR uses the imported dataset split to build the `external_id` when no value for `external_id` is specified in the import mapping.

When importing the `train` split of a dataset with no `external_id` provided, the `external_id` is calculated as follows:

- `train_0`: first row of `train` split.
- `train_1`: second row of `train` split.
- ...

This avoids row duplication when another split is imported into the same dataset. If we later import the `test` split into the same dataset, the `external_id` values will be:

- `train_0`: first row of `train` split.
- `train_1`: second row of `train` split.
- ...
- `test_0`: first row of `test` split.
- `test_1`: second row of `test` split.
- ...

Refs argilla-io/roadmap#21

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [x] Adding additional tests.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)
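The default `external_id` scheme described above can be sketched as follows. The function name and the `external_id_column` parameter are hypothetical and used only for illustration. The `<split>_<row index>` format is the one given in the description.

```python
from typing import Any, Dict, Iterable, Iterator, Optional


def external_ids(
    rows: Iterable[Dict[str, Any]],
    split: str,
    external_id_column: Optional[str] = None,
) -> Iterator[str]:
    """Yield one external_id per row (hypothetical helper)."""
    for idx, row in enumerate(rows):
        if external_id_column is not None:
            # The mapping names a column: use its value as the external_id.
            yield str(row[external_id_column])
        else:
            # No mapping: fall back to "<split>_<row index>".
            yield f"{split}_{idx}"


train_rows = [{"text": "a"}, {"text": "b"}]
test_rows = [{"text": "c"}]
print(list(external_ids(train_rows, "train")))  # ['train_0', 'train_1']
print(list(external_ids(test_rows, "test")))    # ['test_0']
```

Because the split name is part of the id, the first row of `train` and the first row of `test` get different ids and do not collide when both splits are imported into the same dataset.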
# Description

Use the row index plus the split name to generate the default record id.

**Type of change**

- Refactor (change restructuring the codebase without changing functionality)
- Improvement (change adding some improvement to an existing functionality)

**How Has This Been Tested**

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)
# Description

This PR fixes problems that occur when a class label sequence is mapped as suggestions.

NOTE: This PR does not infer a sequence of class labels as a multi-label question. It only prevents errors when a column containing a sequence of class labels is used as a suggestion (or as another kind of property in Argilla).

**Type of change**

- Bug fix (non-breaking change which fixes an issue)

**How Has This Been Tested**

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)
# Description

This PR changes the `Dataset.from_hub` method to open/return the Argilla URL for creating datasets from the UI.

**Type of change**

- Improvement (change adding some improvement to an existing functionality)
- Documentation update

**How Has This Been Tested**

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

---------

Co-authored-by: burtenshaw <[email protected]>
Co-authored-by: Natalia Elvira <[email protected]>
…5639)

# Description

This PR adds a new validation that prevents creating records with an empty `fields` attribute. It also fixes a bug in a chat field validation method.

Refs argilla-io/roadmap#21

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [x] Adding new tests.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)
Co-authored-by: Leire Aguirre <[email protected]> Co-authored-by: Paco Aranda <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Francisco Aranda <[email protected]>
Description
Changes to support importing datasets from HF Hub.
Refs argilla-io/roadmap#21
Type of change
How Has This Been Tested
Checklist