Merged
218 commits
1acdd5c
Implement Trainer & TrainingArguments w. tests
tomaarsen Jan 11, 2023
89f4435
Readded support for hyperparameter tuning
tomaarsen Jan 11, 2023
5f2a6b3
Remove unused imports and reformat
tomaarsen Jan 11, 2023
622f33b
Preserve desired behaviour despite deprecation of keep_body_frozen pa…
tomaarsen Jan 11, 2023
ff59154
Ensure that DeprecationWarnings are displayed
tomaarsen Jan 11, 2023
3b4ef58
Set Trainer.freeze and Trainer.unfreeze methods normally
tomaarsen Jan 11, 2023
fd68274
Add TrainingArgument tests for num_epochs, batch_sizes, lr
tomaarsen Jan 11, 2023
14602ea
Convert trainer.train arguments into a softer deprecation
tomaarsen Jan 11, 2023
94106cc
Merge branch 'main' of https://github.com/huggingface/setfit into ref…
tomaarsen Jan 22, 2023
a39e772
Merge branch 'refactor_v2' of https://github.com/tomaarsen/setfit; br…
tomaarsen Jan 23, 2023
9fc55a6
Use body/head_learning_rate instead of classifier/embedding_learning_…
tomaarsen Jan 23, 2023
7d4ad00
Merge branch 'main' of https://github.com/huggingface/setfit into ref…
tomaarsen Jan 23, 2023
aab2377
Merge branch 'main' of https://github.com/huggingface/setfit into ref…
tomaarsen Feb 6, 2023
dee70b1
Reformat according to the newest black version
tomaarsen Feb 6, 2023
fb6547d
Merge branch 'main' of https://github.com/huggingface/setfit into ref…
tomaarsen Feb 6, 2023
abbbb03
Remove "classifier" from var names in SetFitHead
tomaarsen Feb 6, 2023
12d326e
Update DeprecationWarnings to include timeline
tomaarsen Feb 6, 2023
70c0295
Merge branch 'main' of https://github.com/huggingface/setfit into ref…
tomaarsen Feb 6, 2023
fc246cc
Convert training_argument imports to relative imports
tomaarsen Feb 6, 2023
57aa54f
Make conditional explicit
tomaarsen Feb 6, 2023
7ebdf93
Make conditional explicit
tomaarsen Feb 6, 2023
4695293
Use assertEqual rather than assert
tomaarsen Feb 6, 2023
4c6d0fd
Remove training_arguments from test func names
tomaarsen Feb 6, 2023
5937ec2
Replace loss_class on Trainer with loss on TrainArgs
tomaarsen Feb 6, 2023
f1e3de9
Removed dead class argument
tomaarsen Feb 6, 2023
6051095
Move SupConLoss to losses.py
tomaarsen Feb 6, 2023
bddd46a
Add deprecation to Trainer.(un)freeze
tomaarsen Feb 7, 2023
fa8a077
Prevent warning from always triggering
tomaarsen Feb 7, 2023
85a3684
Export TrainingArguments in __init__
tomaarsen Feb 7, 2023
ca625a2
Update & add important missing docstrings
tomaarsen Feb 7, 2023
868d7b7
Merge branch 'main' of https://github.com/huggingface/setfit into ref…
tomaarsen Feb 7, 2023
68e9094
Use standard dataclass initialization for SetFitModel
tomaarsen Feb 8, 2023
19a6fc8
Merge branch 'main' of https://github.com/huggingface/setfit into ref…
tomaarsen Feb 15, 2023
0b2efa1
Merge branch 'main' of https://github.com/huggingface/setfit into ref…
tomaarsen Feb 15, 2023
ca87c42
Remove duplicate space in DeprecationWarning
tomaarsen Feb 16, 2023
cc5282f
No longer require labeled data for DistillationTrainer
tomaarsen Mar 3, 2023
c6f5782
Merge branch 'main' of https://github.com/huggingface/setfit into ref…
tomaarsen Mar 3, 2023
36cbbfe
Update docs for v1.0.0
tomaarsen Mar 6, 2023
deb57ff
Remove references of SetFitTrainer
tomaarsen Mar 6, 2023
46922d5
Update expected test output
tomaarsen Mar 6, 2023
f43d5b2
Merge branch 'main' of https://github.com/huggingface/setfit into ref…
tomaarsen Apr 19, 2023
b0f9f58
Remove unused pipeline
tomaarsen Apr 19, 2023
339f332
Execute deprecations
tomaarsen Apr 19, 2023
9e0bf78
Stop importing now-removed function
tomaarsen Apr 19, 2023
ecabbcf
Initial setup for logging & callbacks
tomaarsen Jul 6, 2023
6e6720b
Move sentence-transformer training into trainer.py
tomaarsen Jul 6, 2023
826eb53
Add checkpointing, support EarlyStoppingCallback
tomaarsen Jul 28, 2023
019a971
Merge branch 'main' of https://github.com/huggingface/setfit into ref…
tomaarsen Jul 29, 2023
1930973
Run formatting
tomaarsen Jul 29, 2023
e4f3f76
Merge branch 'refactor_v2' of https://github.com/tomaarsen/setfit int…
tomaarsen Jul 29, 2023
0f66109
Merge pull request #4 from tomaarsen/feat/logging_callbacks
tomaarsen Jul 29, 2023
a87cdc0
Add additional trainer tests
tomaarsen Jul 29, 2023
d418759
Use isinstance, required by flake8 release from 1hr ago
tomaarsen Jul 29, 2023
08892f6
sampler for refactor WIP
danstan5 Sep 14, 2023
0a2b664
Merge branch 'main' of https://github.com/huggingface/setfit into ref…
tomaarsen Oct 17, 2023
429de0f
Merge branch 'refactor_v2' of https://github.com/tomaarsen/setfit int…
tomaarsen Oct 17, 2023
173f084
Run formatters
tomaarsen Oct 17, 2023
c23959a
Remove tests from modeling.py
tomaarsen Oct 17, 2023
0fa3870
Add missing type hint
tomaarsen Oct 17, 2023
3969f38
Adjust test to still pass if W&B/Tensorboard are installed
tomaarsen Oct 17, 2023
567f1c9
Merge branch 'refactor_v2' of https://github.com/tomaarsen/setfit int…
tomaarsen Oct 17, 2023
851f0bb
The log/eval/save steps should be saved on the state instead
tomaarsen Oct 17, 2023
67ddedc
Merge branch 'refactor_v2' of https://github.com/tomaarsen/setfit int…
tomaarsen Oct 17, 2023
d37ee09
sampler logic fix "unique" strategy
danstan5 Oct 19, 2023
0ef8837
add sampler tests (not complete)
danstan5 Oct 19, 2023
131aa26
add sampling_strategy into TrainingArguments
danstan5 Oct 19, 2023
c6c6228
Merge branch 'refactor-sampling' of https://github.com/danstan5/setfi…
danstan5 Oct 19, 2023
7431005
num_iterations removed from TrainingArguments
danstan5 Oct 19, 2023
3bd2acc
run_fewshot compatible with <v.1.0.0
danstan5 Oct 20, 2023
3d07e6c
Run make style
tomaarsen Oct 25, 2023
978daee
Use "no" as the default evaluation_strategy
tomaarsen Oct 25, 2023
2802a3f
Move num_iterations back to TrainingArguments
tomaarsen Oct 25, 2023
391f991
Fix broken trainer tests due to new default sampling
tomaarsen Oct 25, 2023
f8b7253
Use the Contrastive Dataset for Distillation
tomaarsen Oct 25, 2023
38e9607
Set the default logging steps at 50
tomaarsen Oct 25, 2023
4ead15d
Add max_steps argument to TrainingArguments
tomaarsen Oct 25, 2023
eb70336
Change max_steps conditional
tomaarsen Oct 25, 2023
3478799
Merge pull request #5 from danstan5/refactor-sampling
tomaarsen Oct 27, 2023
d9c4a05
Merge branch 'main' of https://github.com/huggingface/setfit into ref…
tomaarsen Nov 9, 2023
5b39f06
Seeds are now correctly applied for reproducibility
tomaarsen Nov 9, 2023
d8177db
Add files via upload
MosheWasserb Nov 9, 2023
7c3feed
Don't scale gradients during evaluation
tomaarsen Nov 9, 2023
cdc8979
Use evaluation_strategy="steps" if eval_steps is set
tomaarsen Nov 9, 2023
e040167
Run formatting
tomaarsen Nov 9, 2023
d2f2489
Implement SetFit for ABSA from Intel Labs (#6)
tomaarsen Nov 9, 2023
5c4569d
Import optuna under TYPE_CHECKING
tomaarsen Nov 9, 2023
ceeb725
Remove unused import, reformat
tomaarsen Nov 9, 2023
5c669b5
Add MANIFEST.in with model_card_template
tomaarsen Nov 9, 2023
8e201e5
Don't require transformers TrainingArgs in tests
tomaarsen Nov 9, 2023
6ae5045
Update URLs in setup.py
tomaarsen Nov 9, 2023
ecaabb4
Increase min hf_hub version to 0.12.0 for SoftTemporaryDirectory
tomaarsen Nov 9, 2023
4e79397
Include MANIFEST.in data via `include_package_data=True`
tomaarsen Nov 9, 2023
65aff32
Use kwargs instead of args in super call
tomaarsen Nov 9, 2023
eeeac55
Use v0.13.0 as min. version as huggingface/huggingface_hub#1315
tomaarsen Nov 9, 2023
3214f1b
Use en_core_web_sm for tests
tomaarsen Nov 10, 2023
2b78bb0
Remove incorrect spacy_model from AspectModel/PolarityModel
tomaarsen Nov 10, 2023
b68f655
Rerun formatting
tomaarsen Nov 10, 2023
d85f0d9
Run CI on pre branch & workflow dispatch
tomaarsen Nov 10, 2023
b636cd7
Merge pull request #265 from tomaarsen/refactor_v2
tomaarsen Nov 10, 2023
81952bf
Set development version to 1.0.0.dev0
tomaarsen Nov 10, 2023
5b76361
Extend training argument tests
tomaarsen Nov 10, 2023
54b5d55
Only create evaluation dataloader if eval_strat is set
tomaarsen Nov 14, 2023
4788713
Run formatting
tomaarsen Nov 14, 2023
74a5b7c
max_steps isn't optional
tomaarsen Nov 14, 2023
7ef5bbc
Fix indentation of docstring
tomaarsen Nov 15, 2023
ca3030f
Apply fixes for HPO
tomaarsen Nov 15, 2023
f114572
Remove outdated tests
tomaarsen Nov 15, 2023
8d118d5
Use SetFitModel as the model in CallbackHandler
tomaarsen Nov 21, 2023
b964238
Correctly set the total training steps based on args.max_steps
tomaarsen Nov 21, 2023
2f06847
Add missing comma
tomaarsen Nov 21, 2023
fcb38fc
Capitalize first letter of sentence
tomaarsen Nov 21, 2023
9fe6f0d
Run formatting
tomaarsen Nov 21, 2023
da338ad
Remove unused arguments in tests
tomaarsen Nov 21, 2023
be4c900
Initial documentation for SetFit v1.0.0
tomaarsen Nov 21, 2023
fb42dd7
Update the documentation related workflows
tomaarsen Nov 21, 2023
04c45d7
Merge branch 'main' of https://github.com/huggingface/setfit into v1.…
tomaarsen Nov 21, 2023
bfe6ef6
Add figure to zero-shot how-to guide
tomaarsen Nov 21, 2023
773b860
Add docs notebook building support
tomaarsen Nov 21, 2023
883889c
Update broken, redirecting links
tomaarsen Nov 21, 2023
b4e5db0
polarity -> label
tomaarsen Nov 22, 2023
dbd707b
Mention extra download requirements for ABSA
tomaarsen Nov 22, 2023
552cecc
Merge branch 'main' of https://github.com/huggingface/setfit into v1.…
tomaarsen Nov 24, 2023
0d32dd1
Implement 'batch_size' on model.predict
tomaarsen Nov 24, 2023
392cf0d
Add batch sizes to toctree
tomaarsen Nov 24, 2023
ee00c40
Merge pull request #443 from tomaarsen/feat/expose_batch_size
tomaarsen Nov 24, 2023
17d6513
Save model head on CPU
tomaarsen Nov 24, 2023
dca6fd0
torch.Module -> torch.nn.Module
tomaarsen Nov 24, 2023
4123609
Merge pull request #444 from tomaarsen/feat/cpu_load_diff_head
tomaarsen Nov 24, 2023
b5a6361
Add new top-level header to docs reference
tomaarsen Nov 24, 2023
6ca989e
Update docs about return value of metric function
tomaarsen Nov 24, 2023
93c52dd
Add "use_auth_token" to migration guide
tomaarsen Nov 24, 2023
44daad4
Allow 'device' on SetFitModel.from_pretrained()
tomaarsen Nov 24, 2023
6f06204
Add tests for SetFitABSA as well
tomaarsen Nov 24, 2023
c41b7c3
Merge pull request #445 from tomaarsen/feat/load_on_device
tomaarsen Nov 24, 2023
b8da4a3
Update which trainer methods are documented
tomaarsen Nov 24, 2023
639750f
Link to the Hub in docstring
tomaarsen Nov 27, 2023
9ffc262
Add scikit-learn API version of SetFit to related work
tomaarsen Nov 27, 2023
2ef61bb
Batch Sizes + "for Inference"
tomaarsen Nov 27, 2023
b8b8417
Make first column bold in Sampling Strategy table
tomaarsen Nov 27, 2023
a2fa84f
Remove comment about Google Colab with Python 3.7
tomaarsen Nov 27, 2023
e2cf782
Rename file, remove distilBERT, fix typos
tomaarsen Nov 27, 2023
c5ea28d
Merge branch 'v1.0.0-pre' of https://github.com/huggingface/setfit in…
tomaarsen Nov 27, 2023
c7f49ad
Add ONNX tutorial to docs
tomaarsen Nov 27, 2023
193f83f
Merge pull request #435 from huggingface/moshe
tomaarsen Nov 27, 2023
8e0c55c
Update docstring of from_pretrained!
tomaarsen Nov 28, 2023
19d6d9d
Revert "Update docstring of from_pretrained!"
tomaarsen Nov 28, 2023
5058e31
Update docstring of from_pretrained!
tomaarsen Nov 28, 2023
3e829ba
Update docstring edits of from_pretrained
tomaarsen Nov 28, 2023
d476ce0
Correctly format docstrings for API reference
tomaarsen Nov 28, 2023
dac5221
Also maybe log, evaluate & save at epoch end
tomaarsen Nov 28, 2023
5edf540
Update README in preparation for documentation
tomaarsen Nov 28, 2023
c1b2f20
Link to scripts rather than scripts/setfit
tomaarsen Nov 28, 2023
70bd935
Ensure correct device of "best model at the end"
tomaarsen Nov 28, 2023
c93b55a
Add "labels" in a configuration file
tomaarsen Nov 28, 2023
4c0f152
Resolve flake issues
tomaarsen Nov 28, 2023
1af337f
Add labels to migration guide
tomaarsen Nov 29, 2023
3876d62
Update returns docstring for predict & __call__
tomaarsen Nov 29, 2023
71be7a5
Use ndim rather than "multi_target_strategy is None"
tomaarsen Nov 29, 2023
298fe39
Merge pull request #447 from tomaarsen/feat/configuration
tomaarsen Nov 29, 2023
cc97d10
Allow passing strings to model.predict
tomaarsen Nov 29, 2023
d85d537
Merge pull request #448 from tomaarsen/feat/predict_singular
tomaarsen Nov 29, 2023
62f7eea
Allow partial column mappings
tomaarsen Nov 29, 2023
6f226e5
Allow normalize_embeddings with diff head
tomaarsen Nov 29, 2023
f04e997
Merge pull request #449 from tomaarsen/feat/partial_col_mapping
tomaarsen Nov 29, 2023
f021e13
Merge pull request #450 from tomaarsen/fix/normalize_with_diff_head
tomaarsen Nov 29, 2023
313bffc
Update phrasing in SetFit intro
tomaarsen Dec 1, 2023
9976bb5
Heavily improve automatic model card generation
tomaarsen Nov 29, 2023
bbad20d
Rewrite first paragraph somewhat
tomaarsen Nov 29, 2023
6cd51ed
Resolve issue with multi-label
tomaarsen Nov 30, 2023
4a6852b
Set inference=False for multilabel models
tomaarsen Nov 30, 2023
671611e
Add model card tests
tomaarsen Nov 30, 2023
5f36d0e
Reformat
tomaarsen Nov 30, 2023
086ee02
Satisfy flake8
tomaarsen Nov 30, 2023
4990b09
Make model card generation more robust
tomaarsen Dec 1, 2023
58d5815
Allow compute_metric to return a non-dict
tomaarsen Dec 1, 2023
61cf947
Update tests as datasets are now column-mapped at init
tomaarsen Dec 1, 2023
751ba80
Avoid bare except
tomaarsen Dec 1, 2023
4a7255d
Avoid walrus operator for now for Python 3.7 compat
tomaarsen Dec 1, 2023
8032131
Increase minimal datasets version for dataset inferring
tomaarsen Dec 1, 2023
54d7127
Keep datasets version low, but skip test if < 2.14
tomaarsen Dec 1, 2023
87420b3
Add reason to skipif
tomaarsen Dec 1, 2023
a73cb69
Always return dicts in id2label/label2id
tomaarsen Dec 4, 2023
859691b
Introduce "no aspect", "aspect" labels for AspectModel
tomaarsen Dec 4, 2023
f9e6acb
Extend model card generation to ABSA + Tests
tomaarsen Dec 4, 2023
4c4a9aa
Correctly use create_model_card in ABSA test
tomaarsen Dec 4, 2023
0beedf2
Speed up model card tests for ABSA
tomaarsen Dec 4, 2023
55e9380
Set default W&B project as "setfit" if not set via ENV var yet
tomaarsen Dec 4, 2023
8ad41a8
Run formatting
tomaarsen Dec 4, 2023
a65a4e7
Remove the old ABSA model card template
tomaarsen Dec 4, 2023
d0cda23
Set fsspec<2023.12.0 due to breakages with older datasets
tomaarsen Dec 4, 2023
7dcc35e
Make some model_card_data modifications for ABSA only once
tomaarsen Dec 4, 2023
2cd004f
Reorder arguments
tomaarsen Dec 4, 2023
3a5356e
Update absa models in docs
tomaarsen Dec 4, 2023
d513064
Move import
tomaarsen Dec 4, 2023
1a60d09
Add model_card_data to from_pretrained
tomaarsen Dec 4, 2023
681f8db
Remove useless brackets
tomaarsen Dec 4, 2023
d61ec69
Correct model_card docstring
tomaarsen Dec 4, 2023
77aff7c
Only use the gold aspects/labels for training the polarity model
tomaarsen Dec 4, 2023
2c09cfb
Merge branch 'v1.0.0-pre' of https://github.com/huggingface/setfit in…
tomaarsen Dec 4, 2023
9dbca0b
Use text classification dataset examples
tomaarsen Dec 4, 2023
368155c
Add model card generation documentation
tomaarsen Dec 4, 2023
9c87685
Add spaCy version to ABSA model card
tomaarsen Dec 5, 2023
b257e82
Map to int to avoid potential warning
tomaarsen Dec 5, 2023
3a6e23a
Store used spaCy model configuration in aspect/polarity model
tomaarsen Dec 5, 2023
3bc0125
Correctly test against log
tomaarsen Dec 5, 2023
35c7461
Reformat test imports
tomaarsen Dec 5, 2023
5d04965
Try to resolve failing test on CI
tomaarsen Dec 5, 2023
815e45a
debugging: test against trainer dataset size
tomaarsen Dec 5, 2023
267e21d
Ignore log tests
tomaarsen Dec 5, 2023
243fcb2
Add 'eval_max_steps', reduce load time before train
tomaarsen Dec 5, 2023
8dd930c
Merge branch 'main' of https://github.com/huggingface/setfit into v1.…
tomaarsen Dec 5, 2023
c039e17
Also pass metric_kwargs to custom metric callable
tomaarsen Dec 5, 2023
8d5fc46
Merge pull request #456 from tomaarsen/feat/use_metric_kwargs_with_cu…
tomaarsen Dec 5, 2023
37592eb
Use gold aspects as True, and non-overlapping pred aspects as False
tomaarsen Dec 6, 2023
522a420
Add missing +1 on edge case in Aspect Extractor
tomaarsen Dec 6, 2023
ebaf5a2
Update ABSA documentation slightly
tomaarsen Dec 6, 2023
937c491
Specify AbsaTrainer methods
tomaarsen Dec 6, 2023
3152e49
Update v1.0.0 migration; expand changelog
tomaarsen Dec 6, 2023
1 change: 1 addition & 0 deletions .github/workflows/build_documentation.yml
Expand Up @@ -13,6 +13,7 @@ jobs:
with:
commit_sha: ${{ github.sha }}
package: setfit
notebook_folder: setfit_doc
languages: en
secrets:
token: ${{ secrets.HUGGINGFACE_PUSH }}
Expand Down
3 changes: 3 additions & 0 deletions .github/workflows/quality.yml
Expand Up @@ -5,9 +5,12 @@ on:
branches:
- main
- v*-release
- v*-pre
pull_request:
branches:
- main
- v*-pre
workflow_dispatch:

jobs:

Expand Down
6 changes: 6 additions & 0 deletions .github/workflows/tests.yml
Expand Up @@ -5,9 +5,12 @@ on:
branches:
- main
- v*-release
- v*-pre
pull_request:
branches:
- main
- v*-pre
workflow_dispatch:

jobs:

Expand Down Expand Up @@ -40,6 +43,9 @@ jobs:
run: |
python -m pip install --no-cache-dir --upgrade pip
python -m pip install --no-cache-dir ${{ matrix.requirements }}
python -m pip install '.[codecarbon]'
python -m spacy download en_core_web_lg
python -m spacy download en_core_web_sm
if: steps.restore-cache.outputs.cache-hit != 'true'

- name: Install the checked-out setfit
Expand Down
4 changes: 4 additions & 0 deletions .gitignore
Expand Up @@ -149,3 +149,7 @@ scripts/tfew/run_tmux.sh
# macOS
.DS_Store
.vscode/settings.json

# Common SetFit Trainer logging folders
wandb
runs/
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -0,0 +1 @@
include src/setfit/model_card_template.md
337 changes: 43 additions & 294 deletions README.md

Large diffs are not rendered by default.

34 changes: 5 additions & 29 deletions docs/README.md
Expand Up @@ -5,7 +5,7 @@ Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0
https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
Expand Down Expand Up @@ -78,7 +78,7 @@ The `preview` command only works with existing doc files. When you add a complet
Accepted files are Markdown (.md or .mdx).

Create a file with its extension and put it in the source directory. You can then link it to the toc-tree by putting
the filename without the extension in the [`_toctree.yml`](https://github.com/huggingface/setfit/blob/main/docs/source/_toctree.yml) file.
the filename without the extension in the [`_toctree.yml`](https://github.com/huggingface/setfit/blob/main/docs/source/en/_toctree.yml) file.

## Renaming section headers and moving sections

Expand All @@ -103,7 +103,7 @@ Sections that were moved:

Use the relative style to link to the new file so that the versioned docs continue to work.

For an example of a rich moved section set please see the very end of [the Trainer doc](https://github.com/huggingface/transformers/blob/main/docs/source/en/main_classes/trainer.mdx).
For an example of a rich moved section set please see the very end of [the Trainer doc](https://github.com/huggingface/transformers/blob/main/docs/source/en/main_classes/trainer.md).


## Writing Documentation - Specification
Expand All @@ -123,34 +123,10 @@ Make sure to put your new file under the proper section. It's unlikely to go in
depending on the intended targets (beginners, more advanced users, or researchers) it should go in sections two, three, or
four.

### Translating

When translating, refer to the guide at [./TRANSLATING.md](https://github.com/huggingface/setfit/blob/main/docs/TRANSLATING.md).
### Autodoc


### Adding a new model

When adding a new model:

- Create a file `xxx.mdx` or under `./source/model_doc` (don't hesitate to copy an existing file as template).
- Link that file in `./source/_toctree.yml`.
- Write a short overview of the model:
- Overview with paper & authors
- Paper abstract
- Tips and tricks and how to use it best
- Add the classes that should be linked in the model. This generally includes the configuration, the tokenizer, and
every model of that class (the base model, alongside models with additional heads), both in PyTorch and TensorFlow.
The order is generally:
- Configuration,
- Tokenizer
- PyTorch base model
- PyTorch head models
- TensorFlow base model
- TensorFlow head models
- Flax base model
- Flax head models

These classes should be added using our Markdown syntax. Usually as follows:
The following are some examples of `[[autodoc]]` for documentation building.

```
## XXXConfig
Expand Down
9 changes: 9 additions & 0 deletions docs/source/_config.py
@@ -0,0 +1,9 @@
# docstyle-ignore
INSTALL_CONTENT = """
# SetFit installation
! pip install setfit
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/setfit.git
"""

notebook_first_cells = [{"type": "code", "content": INSTALL_CONTENT}]
50 changes: 41 additions & 9 deletions docs/source/en/_toctree.yml
Expand Up @@ -6,21 +6,53 @@
- local: installation
title: Installation
title: Get started

- sections:
- local: tutorials/placeholder
title: Placeholder
- local: tutorials/overview
title: Overview
- local: tutorials/zero_shot
title: Zero-shot Text Classification
- local: tutorials/onnx
title: Efficiently run SetFit with ONNX
title: Tutorials

- sections:
- local: how_to/placeholder
title: Placeholder
- local: how_to/overview
title: Overview
- local: how_to/callbacks
title: Callbacks
- local: how_to/model_cards
title: Model Cards
- local: how_to/classification_heads
title: Classification Heads
- local: how_to/multilabel
title: Multilabel Text Classification
- local: how_to/zero_shot
title: Zero-shot Text Classification
- local: how_to/hyperparameter_optimization
title: Hyperparameter Optimization
- local: how_to/knowledge_distillation
title: Knowledge Distillation
- local: how_to/batch_sizes
title: Batch Sizes for Inference
- local: how_to/absa
title: Aspect Based Sentiment Analysis
- local: how_to/v1.0.0_migration_guide
title: v1.0.0 Migration Guide
title: How-to Guides

- sections:
- local: conceptual_guides/placeholder
title: Placeholder
- local: conceptual_guides/setfit
title: SetFit
- local: conceptual_guides/sampling_strategies
title: Sampling Strategies
title: Conceptual Guides

- sections:
- local: api/main
- local: reference/main
title: Main classes
- local: api/trainer
- local: reference/trainer
title: Trainer classes
title: API
- local: reference/utility
title: Utility
title: Reference
8 changes: 0 additions & 8 deletions docs/source/en/api/main.mdx

This file was deleted.

8 changes: 0 additions & 8 deletions docs/source/en/api/trainer.mdx

This file was deleted.

3 changes: 0 additions & 3 deletions docs/source/en/conceptual_guides/placeholder.mdx

This file was deleted.

87 changes: 87 additions & 0 deletions docs/source/en/conceptual_guides/sampling_strategies.mdx
@@ -0,0 +1,87 @@

# SetFit Sampling Strategies

SetFit supports various contrastive pair sampling strategies in [`TrainingArguments`]. In this conceptual guide, we will learn about the following four sampling strategies:

1. `"oversampling"` (the default)
2. `"undersampling"`
3. `"unique"`
4. `"num_iterations"`

Consider first reading the [SetFit conceptual guide](../setfit) for a background on contrastive learning and positive & negative pairs.

## Running example

Throughout this conceptual guide, we will use the following example scenario:

* 3 classes: "happy", "content", and "sad".
* 20 total samples: 8 "happy", 4 "content", and 8 "sad" samples.

Considering that the sentence pairs `(X, Y)` and `(Y, X)` result in the same embedding distance/loss, we only want to consider one of those two cases. Furthermore, we don't want pairs where both sentences are the same, e.g. no `(X, X)`.

The resulting positive and negative pairs can be visualized in a table like below. The `+` and `-` represent positive and negative pairs, respectively. Furthermore, `h-n` represents the n-th "happy" sentence, `c-n` the n-th "content" sentence, and `s-n` the n-th "sad" sentence. Note that the area below the diagonal is not used as `(X, Y)` and `(Y, X)` result in the same embedding distances, and that the diagonal is not used as we are not interested in pairs where both sentences are identical.

| |h-1|h-2|h-3|h-4|h-5|h-6|h-7|h-8|c-1|c-2|c-3|c-4|s-1|s-2|s-3|s-4|s-5|s-6|s-7|s-8|
|-------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|**h-1**| | + | + | + | + | + | + | + | - | - | - | - | - | - | - | - | - | - | - | - |
|**h-2**| | | + | + | + | + | + | + | - | - | - | - | - | - | - | - | - | - | - | - |
|**h-3**| | | | + | + | + | + | + | - | - | - | - | - | - | - | - | - | - | - | - |
|**h-4**| | | | | + | + | + | + | - | - | - | - | - | - | - | - | - | - | - | - |
|**h-5**| | | | | | + | + | + | - | - | - | - | - | - | - | - | - | - | - | - |
|**h-6**| | | | | | | + | + | - | - | - | - | - | - | - | - | - | - | - | - |
|**h-7**| | | | | | | | + | - | - | - | - | - | - | - | - | - | - | - | - |
|**h-8**| | | | | | | | | - | - | - | - | - | - | - | - | - | - | - | - |
|**c-1**| | | | | | | | | | + | + | + | - | - | - | - | - | - | - | - |
|**c-2**| | | | | | | | | | | + | + | - | - | - | - | - | - | - | - |
|**c-3**| | | | | | | | | | | | + | - | - | - | - | - | - | - | - |
|**c-4**| | | | | | | | | | | | | - | - | - | - | - | - | - | - |
|**s-1**| | | | | | | | | | | | | | + | + | + | + | + | + | + |
|**s-2**| | | | | | | | | | | | | | | + | + | + | + | + | + |
|**s-3**| | | | | | | | | | | | | | | | + | + | + | + | + |
|**s-4**| | | | | | | | | | | | | | | | | + | + | + | + |
|**s-5**| | | | | | | | | | | | | | | | | | + | + | + |
|**s-6**| | | | | | | | | | | | | | | | | | | + | + |
|**s-7**| | | | | | | | | | | | | | | | | | | | + |
|**s-8**| | | | | | | | | | | | | | | | | | | | |

As shown in the prior table, we have 28 positive pairs for "happy", 6 positive pairs for "content", and another 28 positive pairs for "sad". In total, this is 62 positive pairs. Also, we have 32 negative pairs between "happy" and "content", 64 negative pairs between "happy" and "sad", and 32 negative pairs between "content" and "sad". In total, this is 128 negative pairs.
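
The counts above can be checked with a few lines of standard-library Python. This is only an illustrative sketch: the variable names are made up for this example and none of it is part of the SetFit API.

```python
from itertools import combinations

# Running example: 8 "happy", 4 "content", and 8 "sad" samples.
class_sizes = {"happy": 8, "content": 4, "sad": 8}
labels = [label for label, size in class_sizes.items() for _ in range(size)]

# combinations() yields every unordered pair exactly once and never pairs a
# sample with itself, which matches the area above the diagonal in the table.
pairs = list(combinations(range(len(labels)), 2))
positive_pairs = [(i, j) for i, j in pairs if labels[i] == labels[j]]
negative_pairs = [(i, j) for i, j in pairs if labels[i] != labels[j]]

print(len(positive_pairs), len(negative_pairs), len(pairs))  # 62 128 190
```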

## Oversampling

By default, SetFit applies the oversampling strategy for its contrastive pairs. This strategy samples an equal amount of positive and negative training pairs, oversampling the minority pair type to match that of the majority pair type. As the number of negative pairs is generally larger than the number of positive pairs, this usually involves oversampling the positive pairs.

In our running example, this would involve oversampling the 62 positive pairs up to 128, resulting in one epoch of 128 + 128 = 256 pairs. In summary:

* ✅ An equal amount of positive and negative pairs are sampled.
* ✅ Every possible pair is used.
* ❌ There is some data duplication.

## Undersampling

Like oversampling, this strategy samples an equal amount of positive and negative training pairs. However, it undersamples the majority pair type to match that of the minority pair type. This usually involves undersampling the negative pairs to match the positive pairs.

In our running example, this would involve undersampling the 128 negative pairs down to 62, resulting in one epoch of 62 + 62 = 124 pairs. In summary:

* ✅ An equal amount of positive and negative pairs are sampled.
* ❌ **Not** every possible pair is used.
* ✅ There is **no** data duplication.

## Unique

Thirdly, the unique strategy does not sample an equal amount of positive and negative training pairs. Instead, it simply samples all possible pairs exactly once. No form of oversampling or undersampling is used here.

In our running example, this would involve sampling all negative and positive pairs, resulting in one epoch of 62 + 128 = 190 pairs. In summary:

* ❌ **Not** an equal amount of positive and negative pairs are sampled.
* ✅ Every possible pair is used.
* ✅ There is **no** data duplication.
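
The epoch sizes of the three strategies described so far can be compared with a small sketch. The `sample_epoch` helper below is hypothetical and is not SetFit's sampler: it repeats or truncates the pair lists deterministically, whereas a real sampler would likely shuffle and draw at random. It only illustrates how many pairs each strategy yields.

```python
from itertools import cycle, islice


def sample_epoch(positive_pairs, negative_pairs, strategy):
    """Return the (positive, negative) pairs used in one epoch (illustrative only)."""
    n_pos, n_neg = len(positive_pairs), len(negative_pairs)
    if strategy == "unique":
        # Every pair exactly once, no balancing.
        return list(positive_pairs), list(negative_pairs)
    if strategy == "oversampling":
        target = max(n_pos, n_neg)
    elif strategy == "undersampling":
        target = min(n_pos, n_neg)
    else:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    # cycle() repeats the shorter list when needed; islice() cuts at `target`.
    return (
        list(islice(cycle(positive_pairs), target)),
        list(islice(cycle(negative_pairs), target)),
    )


# Placeholder pairs with the counts from the running example.
positives = [("+", i) for i in range(62)]
negatives = [("-", i) for i in range(128)]

epoch_sizes = {}
for strategy in ("oversampling", "undersampling", "unique"):
    pos, neg = sample_epoch(positives, negatives, strategy)
    epoch_sizes[strategy] = len(pos) + len(neg)
    print(strategy, len(pos), len(neg), epoch_sizes[strategy])
# oversampling 128 128 256
# undersampling 62 62 124
# unique 62 128 190
```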

## `num_iterations`

Lastly, SetFit can still be used with a deprecated sampling strategy involving the `num_iterations` training argument. Unlike the other sampling strategies, this strategy does not involve the number of possible pairs. Instead, it samples `num_iterations` positive pairs and `num_iterations` negative pairs for each training sample.

In our running example, if we assume `num_iterations=20`, then we would sample 20 positive pairs and 20 negative pairs per training sample. Because there are 20 samples, this involves (20 + 20) * 20 = 800 pairs. Because there are only 190 unique pairs, this certainly involves some data duplication. However, it does not guarantee that every possible pair is used. In summary:

* ✅ An equal amount of positive and negative pairs are sampled.
* ❌ Not necessarily every possible pair is used.
* ❌ There is some data duplication.
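
A minimal sketch of this per-sample scheme, assuming partners are drawn uniformly at random with replacement. The `sample_num_iterations` function is hypothetical and written for this guide only; SetFit's actual implementation may differ. It also assumes every class has at least two samples, otherwise no positive partner exists.

```python
import random


def sample_num_iterations(labels, num_iterations, seed=0):
    """Draw `num_iterations` positive and negative pairs per sample (illustrative)."""
    rng = random.Random(seed)
    pairs = []  # (index_a, index_b, target), where target 1.0 marks a positive pair
    for i, label in enumerate(labels):
        same = [j for j, other in enumerate(labels) if other == label and j != i]
        different = [j for j, other in enumerate(labels) if other != label]
        for _ in range(num_iterations):
            pairs.append((i, rng.choice(same), 1.0))
            pairs.append((i, rng.choice(different), 0.0))
    return pairs


labels = ["happy"] * 8 + ["content"] * 4 + ["sad"] * 8
pairs = sample_num_iterations(labels, num_iterations=20)
print(len(pairs))  # 800

# Treat (a, b) and (b, a) as the same pair when counting distinct pairs.
unique_pairs = {(min(a, b), max(a, b)) for a, b, _ in pairs}
print(len(unique_pairs) <= 190)  # True
```

Since 800 draws are spread over at most 190 distinct pairs, duplicates are unavoidable, while nothing forces every distinct pair to be drawn.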
28 changes: 28 additions & 0 deletions docs/source/en/conceptual_guides/setfit.mdx
@@ -0,0 +1,28 @@

# Sentence Transformers Finetuning (SetFit)

SetFit is a model framework to efficiently train text classification models with surprisingly little training data. For example, with only 8 labeled examples per class on the Customer Reviews (CR) sentiment dataset, SetFit is competitive with fine-tuning RoBERTa Large on the full training set of 3k examples. Furthermore, SetFit is fast to train and run inference with, and can easily support multilingual tasks.

Every SetFit model consists of two parts: a **sentence transformer** embedding model (the body) and a **classifier** (the head). These two parts are trained in two separate phases: the **embedding finetuning phase** and the **classifier training phase**. This conceptual guide will elaborate on the intuition behind these phases, and why SetFit works so well.

## Embedding finetuning phase

The first phase has one primary goal: finetune a sentence transformer embedding model to produce useful embeddings for *our* classification task. The [Hugging Face Hub](https://huggingface.co/models?library=sentence-transformers) already has thousands of sentence transformers available, many of which have been trained to very accurately group the embeddings of texts with similar semantic meaning.

However, models that are good at Semantic Textual Similarity (STS) are not necessarily immediately good at *our* classification task. For example, according to an embedding model, sentence 1) `"He biked to work."` will be much more similar to 2) `"He drove his car to work."` than to 3) `"Peter decided to take the bicycle to the beach party!"`. But if our classification task involves classifying texts into transportation modes, then we want our embedding model to place sentences 1 and 3 closely together, and 2 further away.

To do so, we can finetune the chosen sentence transformer embedding model. The goal here is to nudge the model to use its pretrained knowledge in a different way that better aligns with our classification task, rather than making it completely forget what it has learned.

For finetuning, SetFit uses **contrastive learning**. This training approach involves creating **positive and negative pairs** of sentences. A sentence pair will be positive if both of the sentences are of the same class, and negative otherwise. For example, in the case of binary "positive"-"negative" sentiment analysis, `("The movie was awesome", "I loved it")` is a positive pair, and `("The movie was awesome", "It was quite disappointing")` is a negative pair.

During training, the embedding model receives these pairs, and will convert the sentences to embeddings. If the pair is positive, then it will pull on the model weights such that the text embeddings will be more similar, and vice versa for a negative pair. Through this approach, sentences with the same label will be embedded more similarly, and sentences with different labels less similarly.
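
To make this concrete, here is a minimal sketch of one possible pair objective, assuming a cosine-similarity loss with a target of 1.0 for positive pairs and 0.0 for negative pairs. The loss used in practice is configurable, and the function names and toy 2-dimensional "embeddings" below are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_loss(u, v, target):
    """Squared error between the cosine similarity and the pair target."""
    return (cosine_similarity(u, v) - target) ** 2


emb_a = [1.0, 0.0]
emb_b = [0.8, 0.6]  # cosine similarity with emb_a is about 0.8
emb_c = [0.0, 1.0]  # cosine similarity with emb_a is 0.0

# A positive pair that is not yet similar enough is penalized (about 0.04),
# so training pulls the two embeddings closer together.
positive_loss = pair_loss(emb_a, emb_b, target=1.0)

# A negative pair that is already dissimilar incurs no loss.
negative_loss = pair_loss(emb_a, emb_c, target=0.0)

print(positive_loss, negative_loss)
```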

Conveniently, contrastive learning works with pairs rather than individual samples, and we can create plenty of unique pairs from just a few samples. For example, given 8 positive sentences and 8 negative sentences, we can create 56 positive pairs (28 within each class) and 64 negative pairs, for 120 unique training pairs. The number of unique pairs grows quadratically with the number of sentences (`n` sentences yield `n * (n - 1) / 2` pairs), and that is why SetFit can train with just a few examples and still correctly finetune the sentence transformer embedding model. However, we should still be wary of overfitting.

## Classifier training phase

Once the sentence transformer embedding model has been finetuned for our task at hand, we can start training the classifier. This phase has one primary goal: create a good mapping from the sentence transformer embeddings to the classes.

Unlike with the first phase, training the classifier is done from scratch and using the labeled samples directly, rather than using pairs. By default, the classifier is a simple **logistic regression** classifier from scikit-learn. First, all training sentences are fed through the now-finetuned sentence transformer embedding model, and then the sentence embeddings and labels are used to fit the logistic regression classifier. The result is a strong and efficient classifier.
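
The idea of fitting a head on fixed embeddings can be sketched without any dependencies. The code below is a stand-in and not SetFit's implementation: it fits a binary logistic regression with plain gradient descent instead of calling scikit-learn, uses no regularization, and operates on made-up 2-dimensional "embeddings" rather than real sentence transformer outputs.

```python
import math


def fit_logistic_head(embeddings, labels, lr=0.5, epochs=500):
    """Fit a binary logistic regression on fixed embeddings via gradient descent."""
    dim = len(embeddings[0])
    n = len(embeddings)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(embeddings, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            error = p - y
            for k in range(dim):
                grad_w[k] += error * x[k]
            grad_b += error
        # Average the gradients over the dataset and take one step.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return 1 if the decision function is positive, else 0."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z > 0 else 0


# Toy embeddings, standing in for the output of the finetuned body.
train_embeddings = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.7]]
train_labels = [1, 1, 0, 0]

weights, bias = fit_logistic_head(train_embeddings, train_labels)
predictions = [predict(weights, bias, x) for x in train_embeddings]
print(predictions)  # [1, 1, 0, 0]
```

The body is not updated in this step: only the head's weights and bias are learned, which is why this phase is cheap compared to the embedding finetuning phase.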

Using these two parts, SetFit models are efficient, performant and easy to train, even on CPU-only devices.