Improved speed and memory of multilabel pair generation #322

OskarLiew · 2023-03-01T14:27:44Z

Adds a new class for multilabel pair generation. It is an iterable dataset, so the pairs do not need to be generated up-front, which allows for arbitrarily sized datasets. It also uses a sparse matrix representation, which significantly reduces memory requirements for large datasets with sparse target matrices.
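
To illustrate the idea (this is not the code in this PR), here is a minimal, dependency-free sketch of lazy pair generation over a sparse label representation. The names `build_label_index` and `iter_pairs` are hypothetical, and the pair definition is an assumption: a positive pair shares at least one class, a negative pair shares none. The actual class in the PR may define pairs differently and builds on a sparse matrix type rather than plain Python sets.

```python
import random
from itertools import islice
from typing import Dict, Iterator, List, Sequence, Set, Tuple


def build_label_index(label_sets: Sequence[Set[int]]) -> Dict[int, List[int]]:
    """Map each class id to the indices of the examples that carry it.

    Only the non-zero entries of the target matrix are stored, so memory
    scales with the number of assigned labels, not examples x classes.
    """
    index: Dict[int, List[int]] = {}
    for example_idx, labels in enumerate(label_sets):
        for label in labels:
            index.setdefault(label, []).append(example_idx)
    return index


def iter_pairs(
    texts: Sequence[str],
    label_sets: Sequence[Set[int]],
    seed: int = 0,
    max_tries: int = 20,
) -> Iterator[Tuple[str, str, float]]:
    """Lazily yield (text_a, text_b, target) pairs, one anchor at a time."""
    rng = random.Random(seed)
    index = build_label_index(label_sets)
    n = len(texts)
    for anchor in range(n):
        anchor_labels = label_sets[anchor]
        # Positive pair (assumption: shares at least one class with the anchor).
        if anchor_labels:
            label = rng.choice(sorted(anchor_labels))
            candidates = [i for i in index[label] if i != anchor]
            if candidates:
                yield texts[anchor], texts[rng.choice(candidates)], 1.0
        # Negative pair (assumption: shares no class). Rejection sampling is
        # cheap when targets are sparse and gets slow when they are dense.
        for _ in range(max_tries):
            other = rng.randrange(n)
            if other != anchor and anchor_labels.isdisjoint(label_sets[other]):
                yield texts[anchor], texts[other], 0.0
                break


# Toy data: four examples, three classes, the last example has no labels.
texts = ["a", "b", "c", "d"]
label_sets = [{0, 1}, {1}, {2}, set()]

# Pairs are produced on demand, so only as many as requested are computed.
first_pairs = list(islice(iter_pairs(texts, label_sets), 3))
```

Because `iter_pairs` is a generator, nothing is materialized until a consumer asks for the next pair, which is the property that removes the up-front generation cost.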

On datasets where most examples belong to most classes (a dense target matrix), this change will make pair generation slower and more memory-hungry than before. However, I believe that such datasets, with enough classes for this to be a problem, are very rare.
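
As a rough, back-of-the-envelope illustration of that trade-off, the sketch below counts stored integers for a dense binary target matrix versus a coordinate-style sparse layout that keeps one (row, column) pair per assigned label. The helper names and the example sizes are assumptions chosen for illustration, not measurements from this PR, and real sparse formats have different constant factors.

```python
def dense_cells(n_examples: int, n_classes: int) -> int:
    """Entries stored by a dense examples x classes target matrix."""
    return n_examples * n_classes


def coo_cells(n_examples: int, labels_per_example: int) -> int:
    """Integers stored by a coordinate layout: one (row, col) pair per label."""
    return 2 * n_examples * labels_per_example


# Illustrative sizes (assumed): 100,000 examples and 300 classes.
n_examples, n_classes = 100_000, 300

dense = dense_cells(n_examples, n_classes)   # 30,000,000 entries
sparse_few = coo_cells(n_examples, 3)        # 3 labels each: 600,000 integers
sparse_many = coo_cells(n_examples, 270)     # 270 labels each: 54,000,000 integers

print(dense, sparse_few, sparse_many)
```

With few labels per example the sparse layout is far smaller, but once more than half of the entries are set, it stores more than the dense matrix does.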

On my dataset with 300+ classes and 100k+ examples, this function is multiple orders of magnitude faster than the old version and, crucially, does not crash from running out of memory.

HuggingFaceDocBuilderDev · 2023-03-01T14:31:26Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

tomaarsen · 2023-03-03T15:48:01Z

I was able to try this out on go_emotions and it helped a lot with generating the pairs. I'll try to have a better look at this next week and I'll try to include this in v1.0.0. Thank you for your work on this!

sachin-patel-qp · 2023-07-12T13:45:08Z

@tomaarsen Any estimated date for v1.0.0?

tomaarsen · 2023-12-05T15:27:25Z

The upcoming v1.0.0 release in #439 has already significantly refactored the sampling. I intend to move forward with that sampler now, though I do recognize that perhaps this one will be more efficient. My apologies for this.
I will close this PR as it will become one big merge conflict once #439 merges this week, but I want to thank you for taking the time to prepare this PR.

@sachin-patel-qp v1.0.0 will release this week.

Tom Aarsen

OskarLiew added 7 commits March 1, 2023 15:02

Add multi-target support to SetFitHead

46612a6

Improved type hints and setfit head predict_proba

5762485

Tests for multi-target setfithead

4030dac

Improved multi-target pairs generation

5c901fb

Add dataloader for SentencePairDataset

eb76266

Fixing rebase shenanigans

8a2ff42

small fixes

e9881b0

OskarLiew closed this Mar 1, 2023

Add max cache-size

9df3a3d

OskarLiew reopened this Mar 1, 2023

Fix code style

489185e

OskarLiew changed the title ~~Feature/quick pairgen~~ Improved speed and memory of multilabel pair generation Mar 1, 2023

Remove iteration progress bar

f38938b

tomaarsen added the enhancement label Mar 3, 2023

tomaarsen closed this Dec 5, 2023
