Commit 63f948b: distilabel 1.3.0

2 parents: cf119ad + 3690bd6

162 files changed: +9030 −1129 lines


.github/workflows/codspeed.yml (+1)

```diff
@@ -4,6 +4,7 @@ on:
   push:
     branches:
       - "main"
+      - "develop"
   pull_request:
 
 concurrency:
```

.github/workflows/docs-pr-close.yml (new file, +35)

```diff
@@ -0,0 +1,35 @@
+name: Clean up PR documentation
+
+on:
+  pull_request:
+    types: [closed]
+
+concurrency: distilabel-docs
+
+jobs:
+  cleanup:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout merged branch
+        uses: actions/checkout@v4
+        with:
+          ref: ${{ github.event.pull_request.base.ref }}
+          fetch-depth: 0
+
+      - name: Setup Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Install dependencies
+        run: pip install -e .[docs]
+
+      - name: Set git credentials
+        run: |
+          git config --global user.name "${{ github.actor }}"
+          git config --global user.email "${{ github.actor }}@users.noreply.github.com"
+
+      - name: Remove PR documentation
+        run: |
+          PR_NUMBER=${{ github.event.pull_request.number }}
+          mike delete pr-$PR_NUMBER --push
```

.github/workflows/docs-pr.yml (new file, +81)

```diff
@@ -0,0 +1,81 @@
+name: Publish PR documentation
+
+on:
+  pull_request:
+    types:
+      - opened
+      - synchronize
+
+concurrency: distilabel-docs
+
+jobs:
+  publish:
+    runs-on: ubuntu-latest
+    steps:
+      - name: checkout docs-site
+        uses: actions/checkout@v4
+        with:
+          ref: gh-pages
+
+      - uses: actions/checkout@v4
+
+      - name: Setup Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ matrix.python-version }}
+        # Looks like it's not working very well for other people:
+        # https://github.com/actions/setup-python/issues/436
+        # cache: "pip"
+        # cache-dependency-path: pyproject.toml
+
+      - uses: actions/cache@v3
+        id: cache
+        with:
+          path: ${{ env.pythonLocation }}
+          key: ${{ runner.os }}-python-${{ env.pythonLocation }}-${{ hashFiles('pyproject.toml') }}-docs-pr-v00
+
+      - name: Install dependencies
+        if: steps.cache.outputs.cache-hit != 'true'
+        run: pip install -e .[docs]
+
+      - name: Set git credentials
+        run: |
+          git config --global user.name "${{ github.actor }}"
+          git config --global user.email "${{ github.actor }}@users.noreply.github.com"
+
+      - name: Deploy hidden docs for PR
+        run: |
+          PR_NUMBER=$(echo $GITHUB_REF | awk 'BEGIN { FS = "/" } ; { print $3 }')
+          mike deploy pr-$PR_NUMBER --prop-set hidden=true --push
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+
+      - name: Comment PR with docs link
+        uses: actions/github-script@v7
+        with:
+          script: |
+            const pr_number = context.payload.pull_request.number;
+            const owner = context.repo.owner;
+            const repo = context.repo.repo;
+
+            // Check if a comment already exists
+            const comments = await github.rest.issues.listComments({
+              issue_number: pr_number,
+              owner: owner,
+              repo: repo
+            });
+
+            const botComment = comments.data.find(comment =>
+              comment.user.type === 'Bot' &&
+              comment.body.includes('Documentation for this PR has been built')
+            );
+
+            if (!botComment) {
+              // Post new comment only if it doesn't exist
+              await github.rest.issues.createComment({
+                issue_number: pr_number,
+                owner: owner,
+                repo: repo,
+                body: `Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-${pr_number}/`
+              });
+            }
```
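The `awk` one-liner in the deploy step extracts the PR number from `GITHUB_REF` (for pull requests the ref looks like `refs/pull/<number>/merge`). The same split, sketched in Python for illustration (`pr_number_from_ref` is a hypothetical helper, not part of the workflow):

```python
def pr_number_from_ref(github_ref: str) -> str:
    # GITHUB_REF for a pull request looks like "refs/pull/<number>/merge";
    # awk with FS="/" printing $3 takes the third slash-separated field,
    # which is index 2 after a Python split.
    return github_ref.split("/")[2]


print(pr_number_from_ref("refs/pull/123/merge"))  # -> 123
```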

.github/workflows/docs.yml (+5 −3)

```diff
@@ -8,6 +8,8 @@ on:
     tags:
       - "**"
 
+concurrency: distilabel-docs
+
 jobs:
   publish:
     runs-on: ubuntu-latest
@@ -32,7 +34,7 @@ jobs:
         id: cache
         with:
           path: ${{ env.pythonLocation }}
-          key: ${{ runner.os }}-python-${{ env.pythonLocation }}-${{ hashFiles('pyproject.toml') }}-docs
+          key: ${{ runner.os }}-python-${{ env.pythonLocation }}-${{ hashFiles('pyproject.toml') }}-docs-v00
 
       - name: Install dependencies
         if: steps.cache.outputs.cache-hit != 'true'
@@ -46,9 +48,9 @@ jobs:
       - run: mike deploy dev --push
         if: github.ref == 'refs/heads/develop'
         env:
-          GH_ACCESS_TOKEN: ${{ secrets.GH_ACCESS_TOKEN }}
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
 
       - run: mike deploy ${{ github.ref_name }} latest --update-aliases --push
         if: startsWith(github.ref, 'refs/tags/')
         env:
-          GH_ACCESS_TOKEN: ${{ secrets.GH_ACCESS_TOKEN }}
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```
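The `-v00` suffix appended to the cache key is a manual cache-buster: `actions/cache` looks up entries by exact key, so changing any component of the key forces a miss and a fresh dependency install. A plain-Python sketch of how such a key composes (`cache_key` and the sha256 stand-in for `hashFiles` are illustrative, not the Actions implementation):

```python
import hashlib


def cache_key(os_name, python_location, pyproject_bytes, suffix="docs-v00"):
    # Stand-in for hashFiles('pyproject.toml'): any change to the file
    # contents, or to the suffix, produces a different key, i.e. a cache miss.
    digest = hashlib.sha256(pyproject_bytes).hexdigest()
    return f"{os_name}-python-{python_location}-{digest}-{suffix}"


old = cache_key("Linux", "/opt/python/3.11", b"[project]", suffix="docs")
new = cache_key("Linux", "/opt/python/3.11", b"[project]", suffix="docs-v00")
print(old != new)  # -> True: bumping the suffix invalidates the old cache
```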

.github/workflows/test.yml (+1 −1)

```diff
@@ -25,7 +25,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
+        python-version: ["3.9", "3.10", "3.11", "3.12"]
       fail-fast: false
 
     steps:
```
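The matrix change drops Python 3.8 from CI, making 3.9 the supported floor. A minimal sketch of the corresponding runtime guard a package might apply (the helper name and message are illustrative, not distilabel's code):

```python
import sys


def supports_distilabel(version_info=sys.version_info):
    # The CI matrix floor after this change is 3.9 (3.8 removed).
    return tuple(version_info[:2]) >= (3, 9)


if not supports_distilabel():
    raise RuntimeError("Python 3.9 or newer is required")
print("Python version OK")
```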

README.md (+8 −8)

````diff
@@ -29,18 +29,18 @@
 </p>
 
 
-Distilabel is the **framework for synthetic data and AI feedback for AI engineers** that require **high-quality outputs, full data ownership, and overall efficiency**.
+Distilabel is the framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
 
 If you just want to get started, we recommend you check the [documentation](http://distilabel.argilla.io/). Curious, and want to know more? Keep reading!
 <!-- ![overview](https://github.com/argilla-io/distilabel/assets/36760800/360110da-809d-4e24-a29b-1a1a8bc4f9b7) -->
 
-## Why use Distilabel?
+## Why use distilabel?
 
-Whether you are working on **a predictive model** that computes semantic similarity or the next **generative model** that is going to beat the LLM benchmarks. Our framework ensures that the **hard data work pays off**. Distilabel is the missing piece that helps you **synthesize data** and provide **AI feedback**.
+Distilabel can be used for generating synthetic data and AI feedback for a wide variety of projects including traditional predictive NLP (classification, extraction, etc.), or generative and large language model scenarios (instruction following, dialogue generation, judging etc.). Distilabel's programmatic approach allows you to build scalable pipelines for data generation and AI feedback. The goal of distilabel is to accelerate your AI development by quickly generating high-quality, diverse datasets based on verified research methodologies for generating and judging with AI feedback.
 
 ### Improve your AI output quality through data quality
 
-Compute is expensive and output quality is important. We help you **focus on data quality**, which tackles the root cause of both of these problems at once. Distilabel helps you to synthesize and judge data to let you spend your valuable time on **achieveing and keeping high-quality standards for your data**.
+Compute is expensive and output quality is important. We help you **focus on data quality**, which tackles the root cause of both of these problems at once. Distilabel helps you to synthesize and judge data to let you spend your valuable time **achieving and keeping high-quality standards for your data**.
 
 ### Take control of your data and models
 
@@ -62,7 +62,7 @@ We are an open-source community-driven project and we love to hear from you. Her
 
 ## What do people build with Distilabel?
 
-Distilabel is a tool that can be used to **synthesize data and provide AI feedback**. Our community uses Distilabel to create amazing [datasets](https://huggingface.co/datasets?other=distilabel) and [models](https://huggingface.co/models?other=distilabel), and **we love contributions to open-source** ourselves too.
+The Argilla community uses distilabel to create amazing [datasets](https://huggingface.co/datasets?other=distilabel) and [models](https://huggingface.co/models?other=distilabel).
 
 - The [1M OpenHermesPreference](https://huggingface.co/datasets/argilla/OpenHermesPreferences) is a dataset of ~1 million AI preferences derived from teknium/OpenHermes-2.5. It shows how we can use Distilabel to **synthesize data on an immense scale**.
 - Our [distilabeled Intel Orca DPO dataset](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs) and the [improved OpenHermes model](https://huggingface.co/argilla/distilabeled-OpenHermes-2.5-Mistral-7B), show how we **improve model performance by filtering out 50%** of the original dataset through **AI feedback**.
@@ -74,7 +74,7 @@ Distilabel is a tool that can be used to **synthesize data and provide AI feedba
 pip install distilabel --upgrade
 ```
 
-Requires Python 3.8+
+Requires Python 3.9+
 
 In addition, the following extras are available:
 
@@ -105,14 +105,14 @@ Then run:
 ```python
 from distilabel.llms import OpenAILLM
 from distilabel.pipeline import Pipeline
-from distilabel.steps import LoadHubDataset
+from distilabel.steps import LoadDataFromHub
 from distilabel.steps.tasks import TextGeneration
 
 with Pipeline(
     name="simple-text-generation-pipeline",
     description="A simple text generation pipeline",
 ) as pipeline:
-    load_dataset = LoadHubDataset(output_mappings={"prompt": "instruction"})
+    load_dataset = LoadDataFromHub(output_mappings={"prompt": "instruction"})
 
     generate_with_openai = TextGeneration(llm=OpenAILLM(model="gpt-3.5-turbo"))
 
````
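The snippet's `output_mappings={"prompt": "instruction"}` renames a column as rows leave the step, so downstream tasks can rely on a standard `instruction` field. Conceptually it is a key rename (a plain-dict sketch, not distilabel's actual implementation; `apply_output_mappings` is a hypothetical helper):

```python
def apply_output_mappings(row, mappings):
    # Rename keys according to the mapping; keys not in the mapping pass through.
    return {mappings.get(key, key): value for key, value in row.items()}


row = {"prompt": "What is synthetic data?", "id": 7}
print(apply_output_mappings(row, {"prompt": "instruction"}))
# -> {'instruction': 'What is synthetic data?', 'id': 7}
```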

docs/api/mixins/requirements.md (new file, +1)

```diff
@@ -0,0 +1 @@
+::: distilabel.mixins.requirements.RequirementsMixin
```

docs/api/mixins/runtime_parameters.md (new file, +1)

```diff
@@ -0,0 +1 @@
+::: distilabel.mixins.runtime_parameters.RuntimeParametersMixin
```

docs/api/step/generator_step.md (+2)

```diff
@@ -5,3 +5,5 @@ This section contains the API reference for the [`GeneratorStep`][distilabel.ste
 For more information and examples on how to use existing generator steps or create custom ones, please refer to [Tutorial - Step - GeneratorStep](../../sections/how_to_guides/basic/step/generator_step.md).
 
 ::: distilabel.steps.base.GeneratorStep
+
+::: distilabel.steps.generators.utils.make_generator_step
```

docs/api/step/resources.md (new file, +3)

```diff
@@ -0,0 +1,3 @@
+# StepResources
+
+::: distilabel.steps.base.StepResources
```

docs/api/step_gallery/columns.md (+4 −3)

```diff
@@ -2,6 +2,7 @@
 
 This section contains the existing steps intended to be used for common column operations to apply to the batches.
 
-::: distilabel.steps.combine
-::: distilabel.steps.expand
-::: distilabel.steps.keep
+::: distilabel.steps.columns.expand
+::: distilabel.steps.columns.keep
+::: distilabel.steps.columns.merge
+::: distilabel.steps.columns.group
```
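The relocated modules cover batch-level column operations (expand, keep, merge, group). In plain-Python terms, a "keep" operation reduces each row to a whitelist of columns (an illustrative sketch, not the `distilabel.steps.columns.keep` implementation):

```python
def keep_columns(rows, columns):
    # Keep only the listed keys in each row, mirroring what a "keep" step
    # does to a batch before passing it downstream.
    return [{key: row[key] for key in columns} for row in rows]


batch = [{"instruction": "hi", "generation": "hello", "model_name": "gpt"}]
print(keep_columns(batch, ["instruction", "generation"]))
# -> [{'instruction': 'hi', 'generation': 'hello'}]
```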

docs/index.md (+7 −7)

```diff
@@ -36,29 +36,29 @@ hide:
   </a>
 </p>
 
-Distilabel is the **framework for synthetic data and AI feedback for AI engineers** that require **high-quality outputs, full data ownership, and overall efficiency**.
+Distilabel is the framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
 
 If you just want to get started, we recommend you check the [documentation](http://distilabel.argilla.io/). Curious, and want to know more? Keep reading!
 
-## Why use Distilabel?
+## Why use distilabel?
 
-Whether you are working on **a predictive model** that computes semantic similarity or the next **generative model** that is going to beat the LLM benchmarks. Our framework ensures that the **hard data work pays off**. Distilabel is the missing piece that helps you **synthesize data** and provide **AI feedback**.
+Distilabel can be used for generating synthetic data and AI feedback for a wide variety of projects including traditional predictive NLP (classification, extraction, etc.), or generative and large language model scenarios (instruction following, dialogue generation, judging etc.). Distilabel's programmatic approach allows you to build scalable pipelines for data generation and AI feedback. The goal of distilabel is to accelerate your AI development by quickly generating high-quality, diverse datasets based on verified research methodologies for generating and judging with AI feedback.
 
 ### Improve your AI output quality through data quality
 
-Compute is expensive and output quality is important. We help you **focus on data quality**, which tackles the root cause of both of these problems at once. Distilabel helps you to synthesize and judge data to let you spend your valuable time on **achieveing and keeping high-quality standards for your data**.
+Compute is expensive and output quality is important. We help you **focus on data quality**, which tackles the root cause of both of these problems at once. Distilabel helps you to synthesize and judge data to let you spend your valuable time **achieving and keeping high-quality standards for your synthetic data**.
 
 ### Take control of your data and models
 
-**Ownership of data for fine-tuning your own LLMs** is not easy but Distilabel can help you to get started. We integrate **AI feedback from any LLM provider out there** using one unified API.
+**Ownership of data for fine-tuning your own LLMs** is not easy but distilabel can help you to get started. We integrate **AI feedback from any LLM provider out there** using one unified API.
 
 ### Improve efficiency by quickly iterating on the right research and LLMs
 
 Synthesize and judge data with **latest research papers** while ensuring **flexibility, scalability and fault tolerance**. So you can focus on improving your data and training your models.
 
-## What do people build with Distilabel?
+## What do people build with distilabel?
 
-Distilabel is a tool that can be used to **synthesize data and provide AI feedback**. Our community uses Distilabel to create amazing [datasets](https://huggingface.co/datasets?other=distilabel) and [models](https://huggingface.co/models?other=distilabel), and **we love contributions to open-source** ourselves too.
+The Argilla community uses distilabel to create amazing [datasets](https://huggingface.co/datasets?other=distilabel) and [models](https://huggingface.co/models?other=distilabel).
 
 - The [1M OpenHermesPreference](https://huggingface.co/datasets/argilla/OpenHermesPreferences) is a dataset of ~1 million AI preferences derived from teknium/OpenHermes-2.5. It shows how we can use Distilabel to **synthesize data on an immense scale**.
 - Our [distilabeled Intel Orca DPO dataset](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs) and the [improved OpenHermes model](https://huggingface.co/argilla/distilabeled-OpenHermes-2.5-Mistral-7B), show how we **improve model performance by filtering out 50%** of the original dataset through **AI feedback**.
```

docs/scripts/gen_popular_issues.py (+3 −2)

```diff
@@ -24,7 +24,8 @@
 REPOSITORY = "argilla-io/distilabel"
 DATA_PATH = "sections/community/popular_issues.md"
 
-GITHUB_ACCESS_TOKEN = os.getenv("GH_ACCESS_TOKEN")  # public_repo and read:org scopes are required
+# public_repo and read:org scopes are required
+GITHUB_TOKEN = os.getenv("GITHUB_TOKEN")
 
 
 def fetch_data_from_github(repository, auth_token):
@@ -79,7 +80,7 @@ def fetch_data_from_github(repository, auth_token):
 
 
 with mkdocs_gen_files.open(DATA_PATH, "w") as f:
-    df = fetch_data_from_github(REPOSITORY, GITHUB_ACCESS_TOKEN)
+    df = fetch_data_from_github(REPOSITORY, GITHUB_TOKEN)
 
     open_issues = df.loc[df["State"] == "open"]
     engagement_df = (
```
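With the rename, the script reads the standard `GITHUB_TOKEN` variable instead of the custom `GH_ACCESS_TOKEN`. Building request headers from it might look like this (a sketch: `github_headers` is a hypothetical helper, and the header format follows GitHub's REST conventions):

```python
import os


def github_headers(token):
    # Attach an Authorization header only when a token is available, so the
    # script can still run (rate-limited) without one.
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    return headers


print(github_headers(os.getenv("GITHUB_TOKEN")))
```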

docs/sections/getting_started/installation.md (+2 −2)

```diff
@@ -9,7 +9,7 @@ hide:
 !!! NOTE
     Since `distilabel` v1.0.0 was recently released, we refactored most of the stuff, so the installation below only applies to `distilabel` v1.0.0 and above.
 
-You will need to have at least Python 3.8 or higher, up to Python 3.12, since support for the latter is still a work in progress.
+You will need to have at least Python 3.9 or higher, up to Python 3.12, since support for the latter is still a work in progress.
 
 To install the latest release of the package from PyPI you can use the following command:
 
@@ -46,7 +46,7 @@ Additionally, as part of `distilabel` some extra dependencies are available, mai
 
 - `llama-cpp`: for using [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) Python bindings for `llama.cpp` via the `LlamaCppLLM` integration.
 
-- `mistralai`: for using models available in [Mistral AI API](https://mistral.ai/news/la-plateforme/) via the `MistralAILLM` integration. Note that the [`mistralai` Python client](https://github.com/mistralai/client-python) can only be installed from Python 3.9 onwards, so this is the only `distilabel` dependency that's not supported in Python 3.8.
+- `mistralai`: for using models available in [Mistral AI API](https://mistral.ai/news/la-plateforme/) via the `MistralAILLM` integration.
 
 - `ollama`: for using [Ollama](https://ollama.com/) and their available models via `OllamaLLM` integration.
 
```

docs/sections/getting_started/quickstart.md (+28)

````diff
@@ -67,3 +67,31 @@
 7. We run the pipeline with the parameters for the `load_dataset` and `text_generation` steps. The `load_dataset` step will use the repository `distilabel-internal-testing/instruction-dataset-mini` and the `test` split, and the `text_generation` task will use the `generation_kwargs` with the `temperature` set to `0.7` and the `max_new_tokens` set to `512`.
 
 8. Optionally, we can push the generated [`Distiset`][distilabel.distiset.Distiset] to the Hugging Face Hub repository `distilabel-example`. This will allow you to share the generated dataset with others and use it in other pipelines.
+
+## Minimal example
+
+`distilabel` gives a lot of flexibility to create your pipelines, but to start right away, you can omit a lot of the details and let default values:
+
+```python
+from distilabel.llms import InferenceEndpointsLLM
+from distilabel.pipeline import Pipeline
+from distilabel.steps.tasks import TextGeneration
+from datasets import load_dataset
+
+
+dataset = load_dataset("distilabel-internal-testing/instruction-dataset-mini", split="test")
+
+with Pipeline() as pipeline:  # (1)
+    TextGeneration(llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"))  # (2)
+
+
+if __name__ == "__main__":
+    distiset = pipeline.run(dataset=dataset)  # (3)
+    distiset.push_to_hub(repo_id="distilabel-example")
+```
+
+1. The [`Pipeline`][distilabel.pipeline.Pipeline] can take no arguments and generate a default name on it's own that will be tracked internally.
+
+2. Just as with the [`Pipeline`][distilabel.pipeline.Pipeline], the [`Step`][distilabel.steps.base.Step]s don't explicitly need a name.
+
+3. You can generate the dataset as you would normally do with Hugging Face and pass the dataset to the run method.
````
