Commit 3910aca

distilabel v1.2.0
2 parents f9057f0 + 63ee8c5 commit 3910aca

224 files changed: +18861 -5809 lines changed

.github/workflows/codspeed.yml

+42
@@ -0,0 +1,42 @@
+name: Benchmarks
+
+on:
+  push:
+    branches:
+      - "main"
+  pull_request:
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
+  cancel-in-progress: true
+
+jobs:
+  benchmarks:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Setup Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: "3.12"
+          # Looks like it's not working very well for other people:
+          # https://github.com/actions/setup-python/issues/436
+          # cache: "pip"
+          # cache-dependency-path: pyproject.toml
+
+      - uses: actions/cache@v3
+        id: cache
+        with:
+          path: ${{ env.pythonLocation }}
+          key: ${{ runner.os }}-python-${{ env.pythonLocation }}-${{ hashFiles('pyproject.toml') }}-benchmarks-v00
+
+      - name: Install dependencies
+        if: steps.cache.outputs.cache-hit != 'true'
+        run: ./scripts/install_dependencies.sh
+
+      - name: Run benchmarks
+        uses: CodSpeedHQ/action@v2
+        with:
+          token: ${{ secrets.CODSPEED_TOKEN }}
+          run: pytest tests/ --codspeed
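
Reviewer note: the `concurrency.group` expression in this workflow relies on `||` returning its first truthy operand, so pull request runs are grouped by PR number and push runs by ref. A minimal Python sketch of that resolution follows; the function name and arguments are hypothetical and only mimic the expression, they are not part of the workflow or of any GitHub API.

```python
from typing import Optional


def concurrency_group(workflow: str, pr_number: Optional[int], ref: str) -> str:
    """Mimic `${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}`.

    Hypothetical helper for illustration only.
    """
    # `||` in a GitHub Actions expression yields the first truthy operand:
    # the PR number when the event is a pull request, otherwise the ref.
    suffix = str(pr_number) if pr_number else ref
    return f"{workflow}-{suffix}"


if __name__ == "__main__":
    print(concurrency_group("Benchmarks", 42, "refs/pull/42/merge"))
    print(concurrency_group("Benchmarks", None, "refs/heads/main"))
```

With `cancel-in-progress: true`, a new run whose group matches a run already in progress cancels the older one, so repeated pushes to the same PR keep only the latest benchmark run.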

.github/workflows/docs.yml

+4
@@ -45,6 +45,10 @@ jobs:
 
       - run: mike deploy dev --push
         if: github.ref == 'refs/heads/develop'
+        env:
+          GH_ACCESS_TOKEN: ${{ secrets.GH_ACCESS_TOKEN }}
 
       - run: mike deploy ${{ github.ref_name }} latest --update-aliases --push
         if: startsWith(github.ref, 'refs/tags/')
+        env:
+          GH_ACCESS_TOKEN: ${{ secrets.GH_ACCESS_TOKEN }}

.github/workflows/test.yml

+8-10
@@ -9,6 +9,12 @@ on:
     types:
       - opened
       - synchronize
+  workflow_dispatch:
+    inputs:
+      tmate_session:
+        description: Starts the workflow with tmate enabled.
+        required: false
+        default: "false"
 
 concurrency:
   group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
@@ -19,7 +25,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: ["3.8", "3.9", "3.10", "3.11"]
+        python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
       fail-fast: false
 
     steps:
@@ -42,14 +48,7 @@ jobs:
 
       - name: Install dependencies
         if: steps.cache.outputs.cache-hit != 'true'
-        run: |
-          python_version=$(python -c "import sys; print(sys.version_info[:2])")
-
-          pip install -e .[dev,tests,anthropic,argilla,cohere,groq,hf-inference-endpoints,hf-transformers,litellm,llama-cpp,ollama,openai,outlines,vertexai,vllm]
-          if [ "${python_version}" != "(3, 8)" ]; then
-            pip install -e .[mistralai]
-          fi;
-          pip install git+https://github.com/argilla-io/LLM-Blender.git
+        run: ./scripts/install_dependencies.sh
 
       - name: Lint
         run: make lint
@@ -59,4 +58,3 @@ jobs:
 
       - name: Integration Tests
         run: make integration-tests
-        timeout-minutes: 5
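
Reviewer note: the inline install step removed above branched on the Python version before installing the `mistralai` extra. The sketch below restates that removed logic in Python to make the branching explicit. It is an illustration only: the contents of `scripts/install_dependencies.sh` are not shown in this diff and may differ, and `install_commands` is a hypothetical name.

```python
from typing import List, Tuple

# Extras listed in the removed inline step of test.yml.
BASE_EXTRAS = [
    "dev", "tests", "anthropic", "argilla", "cohere", "groq",
    "hf-inference-endpoints", "hf-transformers", "litellm", "llama-cpp",
    "ollama", "openai", "outlines", "vertexai", "vllm",
]


def install_commands(version: Tuple[int, int]) -> List[str]:
    """Return the pip commands the removed step would run (hypothetical helper)."""
    commands = [f"pip install -e .[{','.join(BASE_EXTRAS)}]"]
    # The removed shell compared the printed tuple against "(3, 8)";
    # comparing the tuple directly is equivalent here.
    if version != (3, 8):
        commands.append("pip install -e .[mistralai]")
    commands.append("pip install git+https://github.com/argilla-io/LLM-Blender.git")
    return commands


if __name__ == "__main__":
    for command in install_commands((3, 12)):
        print(command)
```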

.pre-commit-config.yaml

+2-3
@@ -11,11 +11,10 @@ repos:
           - --fuzzy-match-generates-todo
 
   - repo: https://github.com/charliermarsh/ruff-pre-commit
-    rev: v0.1.4
+    rev: v0.4.5
     hooks:
       - id: ruff
-        args:
-          - --fix
+        args: [--fix]
       - id: ruff-format
 
 ci:

Makefile

+2-2
@@ -2,12 +2,12 @@ sources = src/distilabel tests
 
 .PHONY: format
 format:
-	ruff --fix $(sources)
+	ruff check --fix $(sources)
 	ruff format $(sources)
 
 .PHONY: lint
 lint:
-	ruff $(sources)
+	ruff check $(sources)
 	ruff format --check $(sources)
 
 .PHONY: unit-tests

README.md

+15-3
@@ -50,7 +50,7 @@ Compute is expensive and output quality is important. We help you **focus on dat
 
 Synthesize and judge data with **latest research papers** while ensuring **flexibility, scalability and fault tolerance**. So you can focus on improving your data and training your models.
 
-## 🏘️ Community
+## Community
 
 We are an open-source community-driven project and we love to hear from you. Here are some ways to get involved:
 
@@ -68,7 +68,7 @@ Distilabel is a tool that can be used to **synthesize data and provide AI feedba
 - Our [distilabeled Intel Orca DPO dataset](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs) and the [improved OpenHermes model](https://huggingface.co/argilla/distilabeled-OpenHermes-2.5-Mistral-7B),, show how we **improve model performance by filtering out 50%** of the original dataset through **AI feedback**.
 - The [haiku DPO data](https://github.com/davanstrien/haiku-dpo) outlines how anyone can create a **dataset for a specific task** and **the latest research papers** to improve the quality of the dataset.
 
-## 👨🏽‍💻 Installation
+## Installation
 
 ```sh
 pip install distilabel --upgrade
@@ -116,7 +116,7 @@ with Pipeline(
 
     generate_with_openai = TextGeneration(llm=OpenAILLM(model="gpt-3.5-turbo"))
 
-    load_dataset.connect(generate_with_openai)
+    load_dataset >> generate_with_openai
 
 if __name__ == "__main__":
     distiset = pipeline.run(
@@ -153,3 +153,15 @@ If you build something cool with `distilabel` consider adding one of these badge
 
 To directly contribute with `distilabel`, check our [good first issues](https://github.com/argilla-io/distilabel/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) or [open a new one](https://github.com/argilla-io/distilabel/issues/new/choose).
 
+## Citation
+
+```bibtex
+@misc{distilabel-argilla-2024,
+  author = {Álvaro Bartolomé Del Canto and Gabriel Martín Blázquez and Agustín Piqueres Lajarín and Daniel Vila Suero},
+  title = {Distilabel: An AI Feedback (AIF) framework for building datasets with and for LLMs},
+  year = {2024},
+  publisher = {GitHub},
+  journal = {GitHub repository},
+  howpublished = {\url{https://github.com/argilla-io/distilabel}}
+}
+```
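
Reviewer note: the README hunk above replaces `load_dataset.connect(generate_with_openai)` with `load_dataset >> generate_with_openai`. In Python, that syntax is typically enabled by defining `__rshift__`. The sketch below shows one plausible way to do it with a hypothetical `SketchStep` class. It is an assumption for illustration, not distilabel's actual implementation, which is not part of the hunks shown here.

```python
from typing import List


class SketchStep:
    """Hypothetical stand-in for a pipeline step; not distilabel's Step class."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.downstream: List["SketchStep"] = []

    def connect(self, other: "SketchStep") -> None:
        # Record `other` as a step that consumes this step's output.
        self.downstream.append(other)

    def __rshift__(self, other: "SketchStep") -> "SketchStep":
        # `a >> b` delegates to `a.connect(b)` and returns `b`,
        # so that chains such as `a >> b >> c` read left to right.
        self.connect(other)
        return other


if __name__ == "__main__":
    load_dataset = SketchStep("load_dataset")
    generate = SketchStep("generate_with_openai")
    load_dataset >> generate
    print([step.name for step in load_dataset.downstream])
```

Returning the right-hand operand is a design choice in this sketch: it lets each `>>` in a chain connect the previous step to the next one.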

docs/api/cli.md

+1-1
@@ -1,6 +1,6 @@
 # Command Line Interface (CLI)
 
-This section contains the API reference for the CLI. For more information on how to use the CLI, see [Tutorial - CLI](../sections/learn/tutorial/cli/index.md).
+This section contains the API reference for the CLI. For more information on how to use the CLI, see [Tutorial - CLI](../sections/how_to_guides/advanced/cli/index.md).
 
 ## Utility functions for the `distilabel pipeline` sub-commands
 

docs/api/distiset.md

+6
@@ -0,0 +1,6 @@
+# Distiset
+
+This section contains the API reference for the Distiset. For more information on how to use the CLI, see [Tutorial - CLI](../sections/how_to_guides/advanced/distiset.md).
+
+:::distilabel.distiset.Distiset
+:::distilabel.distiset.create_distiset

docs/api/llm/cohere.md

+3
@@ -0,0 +1,3 @@
+# CohereLLM
+
+::: distilabel.llms.cohere

docs/api/llm/index.md

+1-1
@@ -2,6 +2,6 @@
 
 This section contains the API reference for the `distilabel` LLMs, both for the [`LLM`][distilabel.llms.LLM] synchronous implementation, and for the [`AsyncLLM`][distilabel.llms.AsyncLLM] asynchronous one.
 
-For more information and examples on how to use existing LLMs or create custom ones, please refer to [Tutorial - LLM](../../sections/learn/tutorial/llm/index.md).
+For more information and examples on how to use existing LLMs or create custom ones, please refer to [Tutorial - LLM](../../sections/how_to_guides/basic/llm/index.md).
 
 ::: distilabel.llms.base

docs/api/pipeline/index.md

+1-1
@@ -1,6 +1,6 @@
 # Pipeline
 
-This section contains the API reference for the `distilabel` pipelines. For an example on how to use the pipelines, see the [Tutorial - Pipeline](../../sections/learn/tutorial/pipeline/index.md).
+This section contains the API reference for the `distilabel` pipelines. For an example on how to use the pipelines, see the [Tutorial - Pipeline](../../sections/how_to_guides/basic/pipeline/index.md).
 
 ::: distilabel.pipeline.base
 ::: distilabel.pipeline.local

docs/api/step/decorator.md

+1-1
@@ -2,6 +2,6 @@
 
 This section contains the reference for the `@step` decorator, used to create new [`Step`][distilabel.steps.Step] subclasses without having to manually define the class.
 
-For more information check the [Tutorial - Step](../../sections/learn/tutorial/step/index.md) page.
+For more information check the [Tutorial - Step](../../sections/how_to_guides/basic/step/index.md) page.
 
 ::: distilabel.steps.decorator

docs/api/step/generator_step.md

+1-1
@@ -2,6 +2,6 @@
 
 This section contains the API reference for the [`GeneratorStep`][distilabel.steps.base.GeneratorStep] class.
 
-For more information and examples on how to use existing generator steps or create custom ones, please refer to [Tutorial - Step - GeneratorStep](../../sections/learn/tutorial/step/generator_step.md).
+For more information and examples on how to use existing generator steps or create custom ones, please refer to [Tutorial - Step - GeneratorStep](../../sections/how_to_guides/basic/step/generator_step.md).
 
 ::: distilabel.steps.base.GeneratorStep

docs/api/step/global_step.md

+1-1
@@ -2,6 +2,6 @@
 
 This section contains the API reference for the [`GlobalStep`][distilabel.steps.base.GlobalStep] class.
 
-For more information and examples on how to use existing global steps or create custom ones, please refer to [Tutorial - Step - GlobalStep](../../sections/learn/tutorial/step/global_step.md).
+For more information and examples on how to use existing global steps or create custom ones, please refer to [Tutorial - Step - GlobalStep](../../sections/how_to_guides/basic/step/global_step.md).
 
 ::: distilabel.steps.base.GlobalStep

docs/api/step/index.md

+1-1
@@ -2,7 +2,7 @@
 
 This section contains the API reference for the `distilabel` step, both for the [`_Step`][distilabel.steps.base._Step] base class and the [`Step`][distilabel.steps.Step] class.
 
-For more information and examples on how to use existing steps or create custom ones, please refer to [Tutorial - Step](../../sections/learn/tutorial/step/index.md).
+For more information and examples on how to use existing steps or create custom ones, please refer to [Tutorial - Step](../../sections/how_to_guides/basic/step/index.md).
 
 ::: distilabel.steps.base
     options:

docs/api/step_gallery/columns.md

+1-1
@@ -1,6 +1,6 @@
 # Columns
 
-This section contains the existing steps intended to be used for commong column operations to apply to the batches.
+This section contains the existing steps intended to be used for common column operations to apply to the batches.
 
 ::: distilabel.steps.combine
 ::: distilabel.steps.expand

docs/api/step_gallery/extra.md

+1
@@ -1,5 +1,6 @@
 # Extra
 
+::: distilabel.steps.generators.data
 ::: distilabel.steps.deita
 ::: distilabel.steps.formatting
 ::: distilabel.steps.typing

docs/api/step_gallery/hugging_face.md

+7
@@ -0,0 +1,7 @@
+# Hugging Face
+
+This section contains the existing steps integrated with `Hugging Face` so as to easily push the generated datasets to Hugging Face.
+
+::: distilabel.steps.LoadDataFromDisk
+::: distilabel.steps.LoadDataFromFileSystem
+::: distilabel.steps.LoadDataFromHub

docs/api/task/generator_task.md

+1-1
@@ -2,6 +2,6 @@
 
 This section contains the API reference for the `distilabel` generator tasks.
 
-For more information on how the [`GeneratorTask`][distilabel.steps.tasks.GeneratorTask] works and see some examples, check the [Tutorial - Task - GeneratorTask](../../sections/learn/tutorial/task/generator_task.md) page.
+For more information on how the [`GeneratorTask`][distilabel.steps.tasks.GeneratorTask] works and see some examples, check the [Tutorial - Task - GeneratorTask](../../sections/how_to_guides/basic/task/generator_task.md) page.
 
 ::: distilabel.steps.tasks.base.GeneratorTask

docs/api/task/index.md

+1-1
@@ -2,7 +2,7 @@
 
 This section contains the API reference for the `distilabel` tasks.
 
-For more information on how the [`Task`][distilabel.steps.tasks.Task] works and see some examples, check the [Tutorial - Task](../../sections/learn/tutorial/task/index.md) page.
+For more information on how the [`Task`][distilabel.steps.tasks.Task] works and see some examples, check the [Tutorial - Task](../../sections/how_to_guides/basic/task/index.md) page.
 
 ::: distilabel.steps.tasks.base
     options:
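
Reviewer note: the documentation hunks above all apply the same path migration, from `sections/learn/tutorial/<topic>/` to `sections/how_to_guides/basic/<topic>/`, with `cli` moving to `advanced/` instead. A minimal sketch of that rewrite rule follows, inferred only from the links visible in this diff. `migrate_link` and `ADVANCED_TOPICS` are hypothetical names, and other pages in the commit may follow different rules.

```python
import re

# Topics that moved under `advanced/` in the hunks shown; all others moved to `basic/`.
ADVANCED_TOPICS = {"cli"}

_OLD_PREFIX = re.compile(r"sections/learn/tutorial/(?P<topic>[^/]+)/")


def migrate_link(path: str) -> str:
    """Rewrite an old tutorial link to its how-to guide location (hypothetical helper)."""

    def _swap(match: "re.Match[str]") -> str:
        topic = match.group("topic")
        level = "advanced" if topic in ADVANCED_TOPICS else "basic"
        return f"sections/how_to_guides/{level}/{topic}/"

    # Paths that do not contain the old prefix are returned unchanged.
    return _OLD_PREFIX.sub(_swap, path)


if __name__ == "__main__":
    print(migrate_link("../sections/learn/tutorial/cli/index.md"))
    print(migrate_link("../../sections/learn/tutorial/step/generator_step.md"))
```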

docs/api/task/typing.md

+3
@@ -0,0 +1,3 @@
+# Task Typing
+
+::: distilabel.steps.tasks.typing

docs/api/task_gallery/index.md

+1
@@ -9,3 +9,4 @@ This section contains the existing [`Task`][distilabel.steps.tasks.Task] subclas
         - "!_Task"
         - "!GeneratorTask"
         - "!ChatType"
+        - "!typing"