@OskarLiew
Contributor

@OskarLiew OskarLiew commented Mar 1, 2023

Adds a new class for multilabel pair generation. It is an iterable dataset, so the pairs do not need to be generated up front, which allows for arbitrarily sized datasets. It also uses a sparse matrix representation, which significantly reduces memory requirements for large datasets with sparse target matrices.
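To illustrate the idea described above, here is a minimal sketch of lazy pair generation over sparsely stored labels. It is not the code from this PR: the names `build_label_index` and `iter_pairs` are hypothetical, labels are stored as per-example lists of active class ids as a stand-in for a sparse matrix, and the pairing rule (positive = shares at least one label, negative = shares none) is an assumption.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def build_label_index(labels: Sequence[Sequence[int]], num_classes: int) -> List[List[int]]:
    """Column view of the sparse labels: for each class, the ids of examples that have it."""
    index: List[List[int]] = [[] for _ in range(num_classes)]
    for example_id, active in enumerate(labels):
        for class_id in active:
            index[class_id].append(example_id)
    return index


def iter_pairs(
    texts: Sequence[str],
    labels: Sequence[Sequence[int]],  # active class ids per example (sparse rows)
    num_classes: int,
    seed: int = 0,
    max_tries: int = 10,
) -> Iterator[Tuple[str, str, float]]:
    """Yield (text_a, text_b, similarity) pairs one at a time instead of building a full list."""
    rng = random.Random(seed)
    label_index = build_label_index(labels, num_classes)
    label_sets = [set(active) for active in labels]

    for anchor_id, active in enumerate(labels):
        if not active:
            continue  # an example without labels cannot form a positive pair

        # Positive pair (assumed rule): another example sharing one of the anchor's labels.
        class_id = rng.choice(list(active))
        candidates = [i for i in label_index[class_id] if i != anchor_id]
        if candidates:
            yield texts[anchor_id], texts[rng.choice(candidates)], 1.0

        # Negative pair (assumed rule): an example with no label in common.
        # Rejection sampling with a bounded number of tries; cheap when labels are sparse.
        for _ in range(max_tries):
            other_id = rng.randrange(len(texts))
            if other_id != anchor_id and label_sets[anchor_id].isdisjoint(label_sets[other_id]):
                yield texts[anchor_id], texts[other_id], 0.0
                break
```

Because `iter_pairs` is a generator, only the label index is held in memory, never the full set of pairs. A generator like this could be called from the `__iter__` method of a `torch.utils.data.IterableDataset` subclass.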

On datasets where most examples belong to most classes (a dense target matrix), this change will make pair generation slower and more memory-hungry than before. However, I believe datasets that are both dense and have enough classes for this to be a problem are very rare.
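The trade-off above can be made concrete with a rough count of stored values. This sketch assumes a CSR-style layout (one value and one column index per nonzero, plus one row pointer per row and a final one); the PR does not state which sparse format it uses, and the dataset sizes and label counts below are illustrative round numbers, not measurements.

```python
def dense_cells(num_examples: int, num_classes: int) -> int:
    """A dense target matrix stores one value per (example, class) cell."""
    return num_examples * num_classes


def csr_cells(num_examples: int, num_nonzero: int) -> int:
    """CSR stores a value and a column index per nonzero, plus num_examples + 1 row pointers."""
    return 2 * num_nonzero + num_examples + 1


n, c = 100_000, 300  # illustrative sizes

# Sparse case (assumed): about 3 labels per example.
sparse_nnz = 3 * n
print(dense_cells(n, c), csr_cells(n, sparse_nnz))  # 30000000 700001

# Dense case (assumed): 90% of all cells are set.
dense_nnz = (n * c * 9) // 10
print(dense_cells(n, c), csr_cells(n, dense_nnz))  # 30000000 54100001
```

With few labels per example the sparse layout stores far fewer values, while at high density it stores more than the dense matrix, because each nonzero costs two stored values.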

On my dataset with 300+ classes and 100k+ examples this function is multiple orders of magnitude faster than the old version, and crucially, does not crash due to memory overflow.

@OskarLiew OskarLiew closed this Mar 1, 2023
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@OskarLiew OskarLiew reopened this Mar 1, 2023
@OskarLiew OskarLiew changed the title from "Feature/quick pairgen" to "Improved speed and memory of multilabel pair generation" Mar 1, 2023
@tomaarsen tomaarsen added the "enhancement" (New feature or request) label Mar 3, 2023
@tomaarsen
Member

I was able to try this out on go_emotions and it helped a lot with generating the pairs. I'll try to take a closer look at this next week and I'll try to include it in v1.0.0. Thank you for your work on this!

@sachin-patel-qp

@tomaarsen Any estimated date for v1.0.0?

@tomaarsen
Member

tomaarsen commented Dec 5, 2023

The upcoming v1.0.0 release in #439 has already significantly refactored the sampling. I intend to move forward with that sampler now, though I do recognize that perhaps this one will be more efficient. My apologies for this.
I will close this PR as it will become one big merge conflict once #439 merges this week, but I want to thank you for taking the time to prepare this PR.

@sachin-patel-qp v1.0.0 will release this week.

  • Tom Aarsen

@tomaarsen tomaarsen closed this Dec 5, 2023