Prepare v1.0.0 release - Trainer, TrainingArguments, SetFitABSA, logging, evaluation during training, callbacks, docs
#439
Merged
Changes from all commits · 218 commits
1acdd5c
Implement Trainer & TrainingArguments w. tests
tomaarsen 89f4435
Readded support for hyperparameter tuning
tomaarsen 5f2a6b3
Remove unused imports and reformat
tomaarsen 622f33b
Preserve desired behaviour despite deprecation of keep_body_frozen pa…
tomaarsen ff59154
Ensure that DeprecationWarnings are displayed
tomaarsen 3b4ef58
Set Trainer.freeze and Trainer.unfreeze methods normally
tomaarsen fd68274
Add TrainingArgument tests for num_epochs, batch_sizes, lr
tomaarsen 14602ea
Convert trainer.train arguments into a softer deprecation
tomaarsen 94106cc
Merge branch 'main' of https://github.com/huggingface/setfit into ref…
tomaarsen a39e772
Merge branch 'refactor_v2' of https://github.com/tomaarsen/setfit; br…
tomaarsen 9fc55a6
Use body/head_learning_rate instead of classifier/embedding_learning_…
tomaarsen 7d4ad00
Merge branch 'main' of https://github.com/huggingface/setfit into ref…
tomaarsen aab2377
Merge branch 'main' of https://github.com/huggingface/setfit into ref…
tomaarsen dee70b1
Reformat according to the newest black version
tomaarsen fb6547d
Merge branch 'main' of https://github.com/huggingface/setfit into ref…
tomaarsen abbbb03
Remove "classifier" from var names in SetFitHead
tomaarsen 12d326e
Update DeprecationWarnings to include timeline
tomaarsen 70c0295
Merge branch 'main' of https://github.com/huggingface/setfit into ref…
tomaarsen fc246cc
Convert training_argument imports to relative imports
tomaarsen 57aa54f
Make conditional explicit
tomaarsen 7ebdf93
Make conditional explicit
tomaarsen 4695293
Use assertEqual rather than assert
tomaarsen 4c6d0fd
Remove training_arguments from test func names
tomaarsen 5937ec2
Replace loss_class on Trainer with loss on TrainArgs
tomaarsen f1e3de9
Removed dead class argument
tomaarsen 6051095
Move SupConLoss to losses.py
tomaarsen bddd46a
Add deprecation to Trainer.(un)freeze
tomaarsen fa8a077
Prevent warning from always triggering
tomaarsen 85a3684
Export TrainingArguments in __init__
tomaarsen ca625a2
Update & add important missing docstrings
tomaarsen 868d7b7
Merge branch 'main' of https://github.com/huggingface/setfit into ref…
tomaarsen 68e9094
Use standard dataclass initialization for SetFitModel
tomaarsen 19a6fc8
Merge branch 'main' of https://github.com/huggingface/setfit into ref…
tomaarsen 0b2efa1
Merge branch 'main' of https://github.com/huggingface/setfit into ref…
tomaarsen ca87c42
Remove duplicate space in DeprecationWarning
tomaarsen cc5282f
No longer require labeled data for DistillationTrainer
tomaarsen c6f5782
Merge branch 'main' of https://github.com/huggingface/setfit into ref…
tomaarsen 36cbbfe
Update docs for v1.0.0
tomaarsen deb57ff
Remove references of SetFitTrainer
tomaarsen 46922d5
Update expected test output
tomaarsen f43d5b2
Merge branch 'main' of https://github.com/huggingface/setfit into ref…
tomaarsen b0f9f58
Remove unused pipeline
tomaarsen 339f332
Execute deprecations
tomaarsen 9e0bf78
Stop importing now-removed function
tomaarsen ecabbcf
Initial setup for logging & callbacks
tomaarsen 6e6720b
Move sentence-transformer training into trainer.py
tomaarsen 826eb53
Add checkpointing, support EarlyStoppingCallback
tomaarsen 019a971
Merge branch 'main' of https://github.com/huggingface/setfit into ref…
tomaarsen 1930973
Run formatting
tomaarsen e4f3f76
Merge branch 'refactor_v2' of https://github.com/tomaarsen/setfit int…
tomaarsen 0f66109
Merge pull request #4 from tomaarsen/feat/logging_callbacks
tomaarsen a87cdc0
Add additional trainer tests
tomaarsen d418759
Use isinstance, required by flake8 release from 1hr ago
tomaarsen 08892f6
sampler for refactor WIP
danstan5 0a2b664
Merge branch 'main' of https://github.com/huggingface/setfit into ref…
tomaarsen 429de0f
Merge branch 'refactor_v2' of https://github.com/tomaarsen/setfit int…
tomaarsen 173f084
Run formatters
tomaarsen c23959a
Remove tests from modeling.py
tomaarsen 0fa3870
Add missing type hint
tomaarsen 3969f38
Adjust test to still pass if W&B/Tensorboard are installed
tomaarsen 567f1c9
Merge branch 'refactor_v2' of https://github.com/tomaarsen/setfit int…
tomaarsen 851f0bb
The log/eval/save steps should be saved on the state instead
tomaarsen 67ddedc
Merge branch 'refactor_v2' of https://github.com/tomaarsen/setfit int…
tomaarsen d37ee09
sampler logic fix "unique" strategy
danstan5 0ef8837
add sampler tests (not complete)
danstan5 131aa26
add sampling_strategy into TrainingArguments
danstan5 c6c6228
Merge branch 'refactor-sampling' of https://github.com/danstan5/setfi…
danstan5 7431005
num_iterations removed from TrainingArguments
danstan5 3bd2acc
run_fewshot compatible with <v.1.0.0
danstan5 3d07e6c
Run make style
tomaarsen 978daee
Use "no" as the default evaluation_strategy
tomaarsen 2802a3f
Move num_iterations back to TrainingArguments
tomaarsen 391f991
Fix broken trainer tests due to new default sampling
tomaarsen f8b7253
Use the Contrastive Dataset for Distillation
tomaarsen 38e9607
Set the default logging steps at 50
tomaarsen 4ead15d
Add max_steps argument to TrainingArguments
tomaarsen eb70336
Change max_steps conditional
tomaarsen 3478799
Merge pull request #5 from danstan5/refactor-sampling
tomaarsen d9c4a05
Merge branch 'main' of https://github.com/huggingface/setfit into ref…
tomaarsen 5b39f06
Seeds are now correctly applied for reproducibility
tomaarsen d8177db
Add files via upload
MosheWasserb 7c3feed
Don't scale gradients during evaluation
tomaarsen cdc8979
Use evaluation_strategy="steps" if eval_steps is set
tomaarsen e040167
Run formatting
tomaarsen d2f2489
Implement SetFit for ABSA from Intel Labs (#6)
tomaarsen 5c4569d
Import optuna under TYPE_CHECKING
tomaarsen ceeb725
Remove unused import, reformat
tomaarsen 5c669b5
Add MANIFEST.in with model_card_template
tomaarsen 8e201e5
Don't require transformers TrainingArgs in tests
tomaarsen 6ae5045
Update URLs in setup.py
tomaarsen ecaabb4
Increase min hf_hub version to 0.12.0 for SoftTemporaryDirectory
tomaarsen 4e79397
Include MANIFEST.in data via `include_package_data=True`
tomaarsen 65aff32
Use kwargs instead of args in super call
tomaarsen eeeac55
Use v0.13.0 as min. version as huggingface/huggingface_hub#1315
tomaarsen 3214f1b
Use en_core_web_sm for tests
tomaarsen 2b78bb0
Remove incorrect spacy_model from AspectModel/PolarityModel
tomaarsen b68f655
Rerun formatting
tomaarsen d85f0d9
Run CI on pre branch & workflow dispatch
tomaarsen b636cd7
Merge pull request #265 from tomaarsen/refactor_v2
tomaarsen 81952bf
Set development version to 1.0.0.dev0
tomaarsen 5b76361
Extend training argument tests
tomaarsen 54b5d55
Only create evaluation dataloader if eval_strat is set
tomaarsen 4788713
Run formatting
tomaarsen 74a5b7c
max_steps isn't optional
tomaarsen 7ef5bbc
Fix indentation of docstring
tomaarsen ca3030f
Apply fixes for HPO
tomaarsen f114572
Remove outdated tests
tomaarsen 8d118d5
Use SetFitModel as the model in CallbackHandler
tomaarsen b964238
Correctly set the total training steps based on args.max_steps
tomaarsen 2f06847
Add missing comma
tomaarsen fcb38fc
Capitalize first letter of sentence
tomaarsen 9fe6f0d
Run formatting
tomaarsen da338ad
Remove unused arguments in tests
tomaarsen be4c900
Initial documentation for SetFit v1.0.0
tomaarsen fb42dd7
Update the documentation related workflows
tomaarsen 04c45d7
Merge branch 'main' of https://github.com/huggingface/setfit into v1.…
tomaarsen bfe6ef6
Add figure to zero-shot how-to guide
tomaarsen 773b860
Add docs notebook building support
tomaarsen 883889c
Update broken, redirecting links
tomaarsen b4e5db0
polarity -> label
tomaarsen dbd707b
Mention extra download requirements for ABSA
tomaarsen 552cecc
Merge branch 'main' of https://github.com/huggingface/setfit into v1.…
tomaarsen 0d32dd1
Implement 'batch_size' on model.predict
tomaarsen 392cf0d
Add batch sizes to toctree
tomaarsen ee00c40
Merge pull request #443 from tomaarsen/feat/expose_batch_size
tomaarsen 17d6513
Save model head on CPU
tomaarsen dca6fd0
torch.Module -> torch.nn.Module
tomaarsen 4123609
Merge pull request #444 from tomaarsen/feat/cpu_load_diff_head
tomaarsen b5a6361
Add new top-level header to docs reference
tomaarsen 6ca989e
Update docs about return value of metric function
tomaarsen 93c52dd
Add "use_auth_token" to migration guide
tomaarsen 44daad4
Allow 'device' on SetFitModel.from_pretrained()
tomaarsen 6f06204
Add tests for SetFitABSA as well
tomaarsen c41b7c3
Merge pull request #445 from tomaarsen/feat/load_on_device
tomaarsen b8da4a3
Update which trainer methods are documented
tomaarsen 639750f
Link to the Hub in docstring
tomaarsen 9ffc262
Add scikit-learn API version of SetFit to related work
tomaarsen 2ef61bb
Batch Sizes + "for Inference"
tomaarsen b8b8417
Make first column bold in Sampling Strategy table
tomaarsen a2fa84f
Remove comment about Google Colab with Python 3.7
tomaarsen e2cf782
Rename file, remove distilBERT, fix typos
tomaarsen c5ea28d
Merge branch 'v1.0.0-pre' of https://github.com/huggingface/setfit in…
tomaarsen c7f49ad
Add ONNX tutorial to docs
tomaarsen 193f83f
Merge pull request #435 from huggingface/moshe
tomaarsen 8e0c55c
Update docstring of from_pretrained!
tomaarsen 19d6d9d
Revert "Update docstring of from_pretrained!"
tomaarsen 5058e31
Update docstring of from_pretrained!
tomaarsen 3e829ba
Update docstring edits of from_pretrained
tomaarsen d476ce0
Correctly format docstrings for API reference
tomaarsen dac5221
Also maybe log, evaluate & save at epoch end
tomaarsen 5edf540
Update README in preparation for documentation
tomaarsen c1b2f20
Link to scripts rather than scripts/setfit
tomaarsen 70bd935
Ensure correct device of "best model at the end"
tomaarsen c93b55a
Add "labels" in a configuration file
tomaarsen 4c0f152
Resolve flake issues
tomaarsen 1af337f
Add labels to migration guide
tomaarsen 3876d62
Update returns docstring for predict & __call__
tomaarsen 71be7a5
Use ndim rather than "multi_target_strategy is None"
tomaarsen 298fe39
Merge pull request #447 from tomaarsen/feat/configuration
tomaarsen cc97d10
Allow passing strings to model.predict
tomaarsen d85d537
Merge pull request #448 from tomaarsen/feat/predict_singular
tomaarsen 62f7eea
Allow partial column mappings
tomaarsen 6f226e5
Allow normalize_embeddings with diff head
tomaarsen f04e997
Merge pull request #449 from tomaarsen/feat/partial_col_mapping
tomaarsen f021e13
Merge pull request #450 from tomaarsen/fix/normalize_with_diff_head
tomaarsen 313bffc
Update phrasing in SetFit intro
tomaarsen 9976bb5
Heavily improve automatic model card generation
tomaarsen bbad20d
Rewrite first paragraph somewhat
tomaarsen 6cd51ed
Resolve issue with multi-label
tomaarsen 4a6852b
Set inference=False for multilabel models
tomaarsen 671611e
Add model card tests
tomaarsen 5f36d0e
Reformat
tomaarsen 086ee02
Satisfy flake8
tomaarsen 4990b09
Make model card generation more robust
tomaarsen 58d5815
Allow compute_metric to return a non-dict
tomaarsen 61cf947
Update tests as datasets are now column-mapped at init
tomaarsen 751ba80
Avoid bare except
tomaarsen 4a7255d
Avoid walrus operator for now for Python 3.7 compat
tomaarsen 8032131
Increase minimal datasets version for dataset inferring
tomaarsen 54d7127
Keep datasets version low, but skip test if < 2.14
tomaarsen 87420b3
Add reason to skipif
tomaarsen a73cb69
Always return dicts in id2label/label2id
tomaarsen 859691b
Introduce "no aspect", "aspect" labels for AspectModel
tomaarsen f9e6acb
Extend model card generation to ABSA + Tests
tomaarsen 4c4a9aa
Correctly use create_model_card in ABSA test
tomaarsen 0beedf2
Speed up model card tests for ABSA
tomaarsen 55e9380
Set default W&B project as "setfit" if not set via ENV var yet
tomaarsen 8ad41a8
Run formatting
tomaarsen a65a4e7
Remove the old ABSA model card template
tomaarsen d0cda23
Set fsspec<2023.12.0 due to breakages with older datasets
tomaarsen 7dcc35e
Make some model_card_data modifications for ABSA only once
tomaarsen 2cd004f
Reorder arguments
tomaarsen 3a5356e
Update absa models in docs
tomaarsen d513064
Move import
tomaarsen 1a60d09
Add model_card_data to from_pretrained
tomaarsen 681f8db
Remove useless brackets
tomaarsen d61ec69
Correct model_card docstring
tomaarsen 77aff7c
Only use the gold aspects/labels for training the polarity model
tomaarsen 2c09cfb
Merge branch 'v1.0.0-pre' of https://github.com/huggingface/setfit in…
tomaarsen 9dbca0b
Use text classification dataset examples
tomaarsen 368155c
Add model card generation documentation
tomaarsen 9c87685
Add spaCy version to ABSA model card
tomaarsen b257e82
Map to int to avoid potential warning
tomaarsen 3a6e23a
Store used spaCy model configuration in aspect/polarity model
tomaarsen 3bc0125
Correctly test against log
tomaarsen 35c7461
Reformat test imports
tomaarsen 5d04965
Try to resolve failing test on CI
tomaarsen 815e45a
debugging: test against trainer dataset size
tomaarsen 267e21d
Ignore log tests
tomaarsen 243fcb2
Add 'eval_max_steps', reduce load time before train
tomaarsen 8dd930c
Merge branch 'main' of https://github.com/huggingface/setfit into v1.…
tomaarsen c039e17
Also pass metric_kwargs to custom metric callable
tomaarsen 8d5fc46
Merge pull request #456 from tomaarsen/feat/use_metric_kwargs_with_cu…
tomaarsen 37592eb
Use gold aspects as True, and non-overlapping pred aspects as False
tomaarsen 522a420
Add missing +1 on edge case in Aspect Extractor
tomaarsen ebaf5a2
Update ABSA documentation slightly
tomaarsen 937c491
Specify AbsaTrainer methods
tomaarsen 3152e49
Update v1.0.0 migration; expand changelog
tomaarsen File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
@@ -0,0 +1 @@

    include src/setfit/model_card_template.md
Large diffs are not rendered by default.
@@ -0,0 +1,9 @@

    # docstyle-ignore
    INSTALL_CONTENT = """
    # SetFit installation
    ! pip install setfit
    # To install from source instead of the last release, comment the command above and uncomment the following one.
    # ! pip install git+https://github.com/huggingface/setfit.git
    """

    notebook_first_cells = [{"type": "code", "content": INSTALL_CONTENT}]
Three files were deleted in this pull request (their diffs are not shown).
@@ -0,0 +1,87 @@

# SetFit Sampling Strategies

SetFit supports various contrastive pair sampling strategies in [`TrainingArguments`]. In this conceptual guide, we will learn about the following four sampling strategies:

1. `"oversampling"` (the default)
2. `"undersampling"`
3. `"unique"`
4. `"num_iterations"`

Consider first reading the [SetFit conceptual guide](../setfit) for a background on contrastive learning and positive & negative pairs.
## Running example

Throughout this conceptual guide, we will use the following example scenario:

* 3 classes: "happy", "content", and "sad".
* 20 total samples: 8 "happy", 4 "content", and 8 "sad" samples.

Considering that the sentence pairs `(X, Y)` and `(Y, X)` result in the same embedding distance/loss, we only want to consider one of those two cases. Furthermore, we don't want pairs where both sentences are the same, e.g. no `(X, X)`.

The resulting positive and negative pairs can be visualized in a table like the one below. The `+` and `-` represent positive and negative pairs, respectively. Furthermore, `h-n` represents the n-th "happy" sentence, `c-n` the n-th "content" sentence, and `s-n` the n-th "sad" sentence. Note that the area below the diagonal is not used, as `(X, Y)` and `(Y, X)` result in the same embedding distances, and that the diagonal is not used because we are not interested in pairs where both sentences are identical.

| |h-1|h-2|h-3|h-4|h-5|h-6|h-7|h-8|c-1|c-2|c-3|c-4|s-1|s-2|s-3|s-4|s-5|s-6|s-7|s-8|
|-------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|**h-1**| | + | + | + | + | + | + | + | - | - | - | - | - | - | - | - | - | - | - | - |
|**h-2**| | | + | + | + | + | + | + | - | - | - | - | - | - | - | - | - | - | - | - |
|**h-3**| | | | + | + | + | + | + | - | - | - | - | - | - | - | - | - | - | - | - |
|**h-4**| | | | | + | + | + | + | - | - | - | - | - | - | - | - | - | - | - | - |
|**h-5**| | | | | | + | + | + | - | - | - | - | - | - | - | - | - | - | - | - |
|**h-6**| | | | | | | + | + | - | - | - | - | - | - | - | - | - | - | - | - |
|**h-7**| | | | | | | | + | - | - | - | - | - | - | - | - | - | - | - | - |
|**h-8**| | | | | | | | | - | - | - | - | - | - | - | - | - | - | - | - |
|**c-1**| | | | | | | | | | + | + | + | - | - | - | - | - | - | - | - |
|**c-2**| | | | | | | | | | | + | + | - | - | - | - | - | - | - | - |
|**c-3**| | | | | | | | | | | | + | - | - | - | - | - | - | - | - |
|**c-4**| | | | | | | | | | | | | - | - | - | - | - | - | - | - |
|**s-1**| | | | | | | | | | | | | | + | + | + | + | + | + | + |
|**s-2**| | | | | | | | | | | | | | | + | + | + | + | + | + |
|**s-3**| | | | | | | | | | | | | | | | + | + | + | + | + |
|**s-4**| | | | | | | | | | | | | | | | | + | + | + | + |
|**s-5**| | | | | | | | | | | | | | | | | | + | + | + |
|**s-6**| | | | | | | | | | | | | | | | | | | + | + |
|**s-7**| | | | | | | | | | | | | | | | | | | | + |
|**s-8**| | | | | | | | | | | | | | | | | | | | |

As shown in the table above, we have 28 positive pairs for "happy", 6 positive pairs for "content", and another 28 positive pairs for "sad": 62 positive pairs in total. We also have 32 negative pairs between "happy" and "content", 64 negative pairs between "happy" and "sad", and 32 negative pairs between "content" and "sad": 128 negative pairs in total.
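For the curious, these counts follow directly from the class sizes; a minimal sketch in plain Python:

```python
from math import comb

class_sizes = {"happy": 8, "content": 4, "sad": 8}

# Positive pairs: unordered same-class pairs, excluding (X, X)
positive = sum(comb(n, 2) for n in class_sizes.values())

# Negative pairs: one sentence from each of two different classes
sizes = list(class_sizes.values())
negative = sum(
    sizes[i] * sizes[j]
    for i in range(len(sizes))
    for j in range(i + 1, len(sizes))
)

print(positive, negative)  # 62 128
```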
## Oversampling

By default, SetFit applies the oversampling strategy for its contrastive pairs. This strategy samples an equal number of positive and negative training pairs, oversampling the minority pair type to match the majority pair type. As the number of negative pairs is generally larger than the number of positive pairs, this usually involves oversampling the positive pairs.

In our running example, this would involve oversampling the 62 positive pairs up to 128, resulting in one epoch of 128 + 128 = 256 pairs. In summary:

* ✅ An equal number of positive and negative pairs is sampled.
* ✅ Every possible pair is used.
* ❌ There is some data duplication.

## Undersampling

Like oversampling, this strategy samples an equal number of positive and negative training pairs. However, it undersamples the majority pair type to match the minority pair type. This usually involves undersampling the negative pairs to match the positive pairs.

In our running example, this would involve undersampling the 128 negative pairs down to 62, resulting in one epoch of 62 + 62 = 124 pairs. In summary:

* ✅ An equal number of positive and negative pairs is sampled.
* ❌ **Not** every possible pair is used.
* ✅ There is **no** data duplication.

## Unique

Thirdly, the unique strategy does not sample an equal number of positive and negative training pairs. Instead, it simply samples every possible pair exactly once; no oversampling or undersampling is applied.

In our running example, this would involve sampling all negative and positive pairs, resulting in one epoch of 62 + 128 = 190 pairs. In summary:

* ❌ **Not** an equal number of positive and negative pairs is sampled.
* ✅ Every possible pair is used.
* ✅ There is **no** data duplication.

## `num_iterations`

Lastly, SetFit can still be used with a deprecated sampling strategy based on the `num_iterations` training argument. Unlike the other strategies, it does not depend on the number of possible pairs. Instead, it samples `num_iterations` positive pairs and `num_iterations` negative pairs for each training sample.

In our running example, with `num_iterations=20` we would sample 20 positive pairs and 20 negative pairs per training sample. Because there are 20 samples, this amounts to (20 + 20) * 20 = 800 pairs. Since there are only 190 unique pairs, this certainly involves some data duplication, yet it does not guarantee that every possible pair is used. In summary:

* ✅ An equal number of positive and negative pairs is sampled.
* ❌ Not every possible pair is necessarily used.
* ❌ There is some data duplication.
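As a reference, here is a minimal sketch of selecting a strategy, assuming the `sampling_strategy` and `num_iterations` arguments on the v1.0.0 `TrainingArguments` described in the commits above:

```python
from setfit import TrainingArguments

# Default: oversample the minority pair type so positives and negatives match
args = TrainingArguments(sampling_strategy="oversampling")

# Alternatives
args = TrainingArguments(sampling_strategy="undersampling")
args = TrainingArguments(sampling_strategy="unique")

# Deprecated-style sampling: 20 positive + 20 negative pairs per training sample
args = TrainingArguments(num_iterations=20)
```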
@@ -0,0 +1,28 @@

# Sentence Transformers Finetuning (SetFit)

SetFit is a model framework for efficiently training text classification models with surprisingly little training data. For example, with only 8 labeled examples per class on the Customer Reviews (CR) sentiment dataset, SetFit is competitive with fine-tuning RoBERTa Large on the full training set of 3k examples. Furthermore, SetFit is fast to train and run inference with, and it easily supports multilingual tasks.

Every SetFit model consists of two parts: a **sentence transformer** embedding model (the body) and a **classifier** (the head). These two parts are trained in two separate phases: the **embedding finetuning phase** and the **classifier training phase**. This conceptual guide elaborates on the intuition behind these phases and explains why SetFit works so well.
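To make the two phases concrete, here is a rough end-to-end training sketch, assuming the v1.0.0 `Trainer`/`TrainingArguments` API that this PR introduces; the dataset and checkpoint names are only illustrative:

```python
from datasets import load_dataset
from setfit import SetFitModel, Trainer, TrainingArguments

# A small few-shot subset of a sentiment dataset (illustrative choice)
dataset = load_dataset("sst2")
train_dataset = dataset["train"].shuffle(seed=42).select(range(16))
eval_dataset = dataset["validation"]

# The body is a pretrained sentence transformer; the head defaults to logistic regression
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

args = TrainingArguments(batch_size=16, num_epochs=1)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    column_mapping={"sentence": "text", "label": "label"},  # map dataset columns to "text"/"label"
)
trainer.train()
print(trainer.evaluate())
```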
## Embedding finetuning phase

The first phase has one primary goal: finetune a sentence transformer embedding model to produce useful embeddings for *our* classification task. The [Hugging Face Hub](https://huggingface.co/models?library=sentence-transformers) already has thousands of sentence transformer models available, many of which have been trained to group the embeddings of texts with similar semantic meaning very accurately.

However, models that are good at Semantic Textual Similarity (STS) are not necessarily immediately good at *our* classification task. For example, according to an embedding model, sentence 1) `"He biked to work."` will be much more similar to 2) `"He drove his car to work."` than to 3) `"Peter decided to take the bicycle to the beach party!"`. But if our classification task involves classifying texts into transportation modes, then we want our embedding model to place sentences 1 and 3 close together, and sentence 2 further away.

To do so, we can finetune the chosen sentence transformer embedding model. The goal here is to nudge the model to use its pretrained knowledge in a way that better aligns with our classification task, rather than making it completely forget what it has learned.

For finetuning, SetFit uses **contrastive learning**. This training approach involves creating **positive and negative pairs** of sentences. A sentence pair is positive if both sentences are of the same class, and negative otherwise. For example, in the case of binary "positive"-"negative" sentiment analysis, `("The movie was awesome", "I loved it")` is a positive pair, and `("The movie was awesome", "It was quite disappointing")` is a negative pair.

During training, the embedding model receives these pairs and converts the sentences to embeddings. If the pair is positive, the loss pulls the model weights so that the text embeddings become more similar; for a negative pair, it pushes them apart. Through this approach, sentences with the same label are embedded more similarly, and sentences with different labels less similarly.

Conveniently, this contrastive learning works with pairs rather than individual samples, and we can create plenty of unique pairs from just a few samples. For example, given 8 positive sentences and 8 negative sentences, we can create 28 + 28 = 56 positive pairs and 8 × 8 = 64 negative pairs, for 120 unique training pairs. The number of unique pairs grows quadratically with the number of sentences, which is why SetFit can train with just a few examples and still effectively finetune the sentence transformer embedding model. However, we should still be wary of overfitting.
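A minimal sketch of this pairing idea (an illustration only, not necessarily how SetFit's internal sampler is implemented):

```python
from itertools import combinations

sentences = [
    ("The movie was awesome", "positive"),
    ("I loved it", "positive"),
    ("It was quite disappointing", "negative"),
    ("I want my money back", "negative"),
]

positive_pairs, negative_pairs = [], []
for (text_a, label_a), (text_b, label_b) in combinations(sentences, 2):
    if label_a == label_b:
        positive_pairs.append((text_a, text_b))   # same class -> pull embeddings together
    else:
        negative_pairs.append((text_a, text_b))   # different class -> push embeddings apart

print(len(positive_pairs), len(negative_pairs))  # 2 4
```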
## Classifier training phase

Once the sentence transformer embedding model has been finetuned for our task at hand, we can start training the classifier. This phase has one primary goal: create a good mapping from the sentence transformer embeddings to the classes.

Unlike the first phase, the classifier is trained from scratch and uses the labeled samples directly, rather than pairs. By default, the classifier is a simple **logistic regression** classifier from scikit-learn. First, all training sentences are fed through the now-finetuned sentence transformer embedding model, and then the sentence embeddings and labels are used to fit the logistic regression classifier. The result is a strong and efficient classifier.

Using these two parts, SetFit models are efficient, performant, and easy to train, even on CPU-only devices.
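A rough sketch of this second phase in isolation, using `sentence-transformers` and scikit-learn directly; the checkpoint name is only an example, and in SetFit the body would already have been finetuned in phase one:

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

train_texts = ["The movie was awesome", "I loved it", "It was quite disappointing", "I want my money back"]
train_labels = [1, 1, 0, 0]

# In SetFit, this body would already be finetuned with contrastive pairs (phase 1)
body = SentenceTransformer("sentence-transformers/paraphrase-mpnet-base-v2")
embeddings = body.encode(train_texts)

# Phase 2: fit a simple head on the embeddings
head = LogisticRegression()
head.fit(embeddings, train_labels)

print(head.predict(body.encode(["What a fantastic film"])))
```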