Skip to content

fix: correct bug on sentence-transformers trainer with a list of values in the records #4211

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Nov 14, 2023

Conversation

plaguss
Copy link
Contributor

@plaguss plaguss commented Nov 13, 2023

Description

This PR solves an error in the ArgillaTrainer for sentence-similarity when the records include nested lists.

Closes #4088

Type of change

(Please delete options that are not relevant. Remember to title the PR according to the type of change)

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

How Has This Been Tested

(Please describe the tests that you ran to verify your changes. And ideally, reference tests)

Tested locally

Checklist

  • I followed the style guidelines of this project
  • I did a self-review of my code
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I filled out the contributor form (see text above)
  • I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

@davidberenstein1957
Copy link
Contributor

@plaguss, could you add some small tests for this too?

sample_keys = set(data[0].keys())
else:
raise ValueError(f"The type is not supported: {type(data[0])}.")

if sample_keys == {"label", "sentence-1", "sentence-2"}:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should review these check too in order to align with a set? also, there seems to be a faulty line in 1618 checking for sample_keys == sample_keys

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should review these check too in order to align with a set? also, there seems to be a faulty line in 1618 checking for sample_keys == sample_keys

I didn't noticed the error in line 1618.. thanks

@davidberenstein1957
Copy link
Contributor

@plaguss could you also check if the other implementations allow for passing generators too?

@plaguss
Copy link
Contributor Author

plaguss commented Nov 14, 2023

@plaguss could you also check if the other implementations allow for passing generators too?

We only have tests for generators in the text-classification frameworks. I'm not sure if trl/openai work, let me check

@plaguss
Copy link
Contributor Author

plaguss commented Nov 14, 2023

@plaguss could you also check if the other implementations allow for passing generators too?

We only have tests for generators in the text-classification frameworks. I'm not sure if trl/openai work, let me check

I was speaking with @sdiazlor and it seems that the generators are not working properly with the other frameworks (she mentioned the chat-completion task I think). Maybe we can close this one and tackle that in a different issue?

@plaguss plaguss marked this pull request as ready for review November 14, 2023 11:25
@davidberenstein1957
Copy link
Contributor

davidberenstein1957 commented Nov 14, 2023

@plaguss, I agree. Feel free to close this and create a separate issue.

Copy link

codecov bot commented Nov 14, 2023

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (967e9e7) 65.07% compared to head (ebe19fd) 91.27%.

Files Patch % Lines
src/argilla/client/feedback/training/schemas.py 83.33% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           develop    #4211       +/-   ##
============================================
+ Coverage    65.07%   91.27%   +26.19%     
============================================
  Files          319      319               
  Lines        18464    18468        +4     
============================================
+ Hits         12016    16857     +4841     
+ Misses        6448     1611     -4837     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link

The URL of the deployed environment for this PR is https://argilla-quickstart-pr-4211-ki24f765kq-no.a.run.app

@davidberenstein1957 davidberenstein1957 merged commit 9347598 into develop Nov 14, 2023
@davidberenstein1957 davidberenstein1957 deleted the fix/trainer-rag branch November 14, 2023 13:35
davidberenstein1957 pushed a commit that referenced this pull request Nov 14, 2023
…es in the records (#4211)

<!-- Thanks for your contribution! As part of our Community Growers
initiative 🌱, we're donating Justdiggit bunds in your name to reforest
sub-Saharan Africa. To claim your Community Growers certificate, please
contact David Berenstein in our Slack community or fill in this form
https://tally.so/r/n9XrxK once your PR has been merged. -->

# Description

This PR solves an error in the `ArgillaTrainer` for
`sentence-similarity` when the records include nested lists.

Closes #4088

**Type of change**

(Please delete options that are not relevant. Remember to title the PR
according to the type of change)

- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] Breaking change (fix or feature that would cause existing
functionality to not work as expected)

**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes. And
ideally, reference `tests`)

Tested locally

**Checklist**

- [x] I followed the style guidelines of this project
- [x] I did a self-review of my code
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my
feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK)
(see text above)
- [ ] I have added relevant notes to the `CHANGELOG.md` file (See
https://keepachangelog.com/)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] formatting_func for sentence similarity return list
3 participants