[BUGFIX] argilla: improve from_hub robustness #5524

Merged
frascuchon merged 30 commits into develop from feat/argilla/improve-from-hub-capabilities
Sep 25, 2024

Conversation

frascuchon

@frascuchon frascuchon commented Sep 20, 2024

Description

This PR changes how settings are built from the dataset features:

  • All questions are optional except the first one
  • Avoid generating extra questions/metadata duplicated from the same feature
  • A default question will be created if there is no question at all
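
The three rules above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `Question`, `build_questions`, the default name `"comment"`, and treating the default question as required are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Question:
    # Hypothetical stand-in for a question setting; not the Argilla class.
    name: str
    required: bool


def build_questions(feature_names: Iterable[str], default_name: str = "comment") -> List[Question]:
    """Build at most one question per feature; only the first one is required."""
    seen = set()
    questions: List[Question] = []
    for name in feature_names:
        if name in seen:
            # Same feature seen again: do not generate a duplicate setting.
            continue
        seen.add(name)
        # `not questions` is True only while the list is still empty,
        # so the first question is required and the rest are optional.
        questions.append(Question(name=name, required=not questions))
    if not questions:
        # No feature mapped to a question: fall back to a default one.
        # Assumption: the default is required because it is also the first.
        questions.append(Question(name=default_name, required=True))
    return questions
```

For instance, `build_questions(["label", "label", "score"])` yields a required `label` question and an optional `score` question, while an empty input yields only the default question.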

It also makes the process more robust with the following changes:

  • Sanitize settings and dataset name according to the server validation
  • Skip features with unsupported names (after sanitizing)
  • Allow passing subset dataset configuration for HF datasets.

As a result, datasets created in Argilla have a data structure closer to that of the original dataset.
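
A minimal sketch of the "sanitize, then skip what is still unsupported" idea from the robustness list. The PR description does not state the server's validation rules, so the `VALID_NAME` pattern and the helpers `sanitize_name` and `usable_features` are hypothetical and only illustrate the flow.

```python
import re
from typing import Dict, Iterable

# Assumed validation rule for illustration only: lowercase letters, digits,
# underscores and hyphens, starting with a letter or digit.
VALID_NAME = re.compile(r"^[a-z0-9][a-z0-9_-]*$")


def sanitize_name(name: str) -> str:
    """Lowercase the name and replace unsupported characters with underscores."""
    return re.sub(r"[^a-z0-9_-]", "_", name.strip().lower())


def usable_features(feature_names: Iterable[str]) -> Dict[str, str]:
    """Map original feature names to sanitized ones, skipping invalid results."""
    result: Dict[str, str] = {}
    for original in feature_names:
        cleaned = sanitize_name(original)
        if VALID_NAME.match(cleaned):
            result[original] = cleaned
        # Names that are still invalid after sanitizing are skipped
        # rather than raising, so the import can continue.
    return result
```

Under the assumed pattern, `"Text Input"` becomes `text_input`, while `"__index__"` is skipped because it still starts with an underscore after sanitizing.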

Type of change

  • Improvement (change adding some improvement to an existing functionality)

How Has This Been Tested

Checklist

  • I added relevant documentation
  • I followed the style guidelines of this project
  • I did a self-review of my code
  • I made corresponding changes to the documentation
  • I confirm my changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

- A default question will be created if there is no question at all
- All questions are optional except the first one
- Avoid generating extra questions/metadata duplicated from the same feature

@davidberenstein1957 davidberenstein1957 left a comment

It makes sense to me to only define these things explicitly.

@burtenshaw burtenshaw requested a review from dvsrepo September 24, 2024 11:03
burtenshaw and others added 9 commits September 25, 2024 11:15
Co-authored-by: Paco Aranda <[email protected]>
@frascuchon frascuchon changed the title [ENHANCEMENT] argilla: improve from hub capabilities [BUGFIX] argilla: improve from_hub robustness Sep 25, 2024
@frascuchon frascuchon merged commit 350474a into develop Sep 25, 2024
7 checks passed
@frascuchon frascuchon deleted the feat/argilla/improve-from-hub-capabilities branch September 25, 2024 13:58
frascuchon added a commit that referenced this pull request Sep 25, 2024
Co-authored-by: Ben Burtenshaw <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: burtenshaw <[email protected]>
4 participants