Commit 63b6759

Documentation for training using torchtitan (#179)

1 parent 053d8ef commit 63b6759

File tree

4 files changed
+287 −3 lines changed

README.md (+5 −3)
@@ -11,7 +11,7 @@ Mixtera is an open-source data-centric training data plane built for modern LLM/
 
 ## ⚡️ Quickstart
 
-Mixtera can run as a server (as presented in the paper) or, for single-GPU training, in-process. In both cases, you will need to install the necessary dependencies and install Mixtera in your environment, for example as follows:
+Mixtera can run as a server, or, for single-GPU training, in-process. In both cases, you will need to install the necessary dependencies and install Mixtera in your environment, for example as follows:
 
 ```bash
 # In case you don't have micromamba yet
@@ -38,13 +38,15 @@ Mixtera is a centralized sample management layer, building upon DuckDB. It abstr
 
 ## 🚀 Usage
 
-Using Mixtera typically consists of (1) registering your data and (2) running queries/trainings on top of it. We maintain several [examples](https://github.com/eth-easl/mixtera/blob/main/examples/) of how to use Mixtera and will build up more documentation over the next weeks. A good first read is the [local-only example](https://github.com/eth-easl/mixtera/blob/main/examples/client_local_example.py). That script walks you through the basics of registering data in Mixtera and running a query on that. Afterwards, the [server example](https://github.com/eth-easl/mixtera/blob/main/examples/client_server_example.py) shows you how to run a server with the `mixtera-server` command, and how to register data and query it via client-server interaction.
+Using Mixtera typically consists of (1) registering your data and (2) running queries/trainings on top of it. We maintain several [examples](https://github.com/eth-easl/mixtera/blob/main/examples/) of how to use Mixtera. A good first read is the [local-only example](https://github.com/eth-easl/mixtera/blob/main/examples/client_local_example.py). That script walks you through the basics of registering data in Mixtera and running a query on that. Afterwards, the [server example](https://github.com/eth-easl/mixtera/blob/main/examples/client_server_example.py) shows you how to run a server with the `mixtera-server` command, and how to register data and query it via client-server interaction.
 
-Coming soon: A guide on how to train a model in torchtitan with Mixtera, with and without ADO, on the SlimPajama dataset.
+We provide a [full guide](examples/torchtitan.md) on how to run a training with Mixtera and torchtitan, in particular on how to run the server, register the dataset, and then start training jobs, for both bare-metal and slurm (e.g., SwissAI/CSCS/Alps/Clariden) deployments.
 
 ## ✨ Mixtera’s System Overview
 
+<div align="center">
 <img src="img/system.png" height=300 alt="Mixtera system design"/>
+</div>
 
 Mixtera follows a server-client model. During training, the server runs on a node and each training node runs client instances. The query is executed at the server in two phases. First, Mixtera applies static filters from the query (e.g., English-only) to obtain all samples we could train on. This gives us a [QueryResult](https://github.com/eth-easl/mixtera/blob/main/mixtera/core/query/query_result.py). Second, during training, the server distributes [chunks](https://github.com/eth-easl/mixtera/blob/main/mixtera/core/query/result_chunk.py) of that query result to the client(s). A chunk is a collection of pointers to samples in files. These pointers tell the receiving client which samples in the file to load (e.g., sample 10 in file `wikipedia.jsonl.zst`).
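The chunk-as-pointers idea from the README can be pictured with a small sketch. This is a hypothetical illustration only: the `SamplePointer` dataclass and `resolve_chunk` helper are invented for this example and do not reflect Mixtera's actual `ResultChunk` API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplePointer:
    """Hypothetical pointer: which sample in which file to load."""
    file: str          # e.g. "wikipedia.jsonl.zst"
    sample_index: int  # e.g. sample 10 within that file


def resolve_chunk(chunk: list) -> list:
    """A client would resolve each pointer by reading the referenced
    sample from the referenced file; here we only format the pointers."""
    return [f"{p.file}:{p.sample_index}" for p in chunk]


chunk = [SamplePointer("wikipedia.jsonl.zst", 10), SamplePointer("c4.jsonl.zst", 3)]
print(resolve_chunk(chunk))  # ['wikipedia.jsonl.zst:10', 'c4.jsonl.zst:3']
```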

examples/clariden/Dockerfile (+29)
```dockerfile
FROM nvcr.io/nvidia/pytorch:25.01-py3

RUN apt-get update && apt-get upgrade -y && apt-get install ca-certificates lsb-release wget python3-pip neovim autoconf build-essential gdb software-properties-common curl unzip cmake gzip protobuf-compiler libtool zstd liblz4-dev lz4 -y

RUN wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
RUN apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
RUN apt update
RUN apt install -y -V libparquet-glib-dev libparquet-dev libarrow-dataset-glib-dev libarrow-dataset-dev libarrow-glib-dev libarrow-dev

RUN pip install pip==24.*

# If you encounter pyarrow issues, ensure the version here matches the version downloaded above!
RUN pip install tqdm loguru psutil numpy==1.26.4 dill datasets transformers pyarrow==19.* xxhash xopen scipy tenacity
# Note: version specifiers with ">=" must be quoted, otherwise the shell treats ">" as a redirection.
RUN pip install duckdb polars==1.15 pillow pybind11 pytest flake8 mypy pylint autopep8 isort black tensorboard tiktoken blobfile tabulate wandb "torchdata>=0.8.0" "tomli>=1.1.0" dacite pyyaml packaging safetensors sentencepiece jupyter seaborn webdataset lz4 git+https://github.com/tmbdev/[email protected] mosaicml-streaming grain
RUN pip install lm_eval typer # for evaluation

# Test torch nightly
RUN pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124

RUN git clone --recurse-submodules -b v1.64.3 --depth 1 --shallow-submodules https://github.com/grpc/grpc && \
    cd grpc && mkdir -p cmake/build && cd cmake/build && \
    cmake -DgRPC_PROTOBUF_PROVIDER=module -DABSL_ENABLE_INSTALL=On -DgRPC_BUILD_CSHARP_EXT=Off -DABSL_BUILD_TESTING=Off -DgRPC_INSTALL=ON -DgRPC_BUILD_TESTS=OFF -DCMAKE_BUILD_TYPE=Release ../.. && \
    make -j64 && make install && cd ../../

RUN bash -c "cp /usr/local/lib/libutf8* /usr/lib"

## For nanotron
RUN pip uninstall -y ninja && pip install ninja
RUN MAX_JOBS=12 numactl --membind=0-3 pip install flash-attn --no-build-isolation
```
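The Arrow apt-source URL in the Dockerfile is assembled from the distro id and codename reported by `lsb_release`. The lowercasing step can be checked in isolation (a sketch; the distro id `Ubuntu` is hard-coded here instead of calling `lsb_release`, which may not exist on your host):

```shell
# The Dockerfile lowercases the distro id (e.g. "Ubuntu" -> "ubuntu")
# to build the Arrow repository URL.
distro=$(echo "Ubuntu" | tr 'A-Z' 'a-z')
echo "https://apache.jfrog.io/artifactory/arrow/${distro}/"
# prints: https://apache.jfrog.io/artifactory/arrow/ubuntu/
```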

examples/download_slim_pajama.py (+77)
```python
#!/usr/bin/env python3
import os
import argparse
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_file(url, target_path):
    """Attempt to download a file from 'url' to 'target_path', up to 3 tries."""
    tries = 3
    for attempt in range(tries):
        try:
            response = requests.get(url, stream=True)
            if response.status_code == 404:
                # Signal to the caller that this file does not exist.
                return None
            response.raise_for_status()
            with open(target_path, "wb") as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
            return True
        except requests.RequestException as e:
            if attempt < tries - 1:
                continue
            raise Exception(f"Failed to download {url} after {tries} attempts") from e


def download_chunk_files(chunk_id, base_url, target_dir):
    """Download all files for a given chunk in batches until a 404 is encountered."""
    os.makedirs(target_dir, exist_ok=True)
    batch_size = 500
    file_index = 0

    while True:
        with ThreadPoolExecutor(max_workers=16) as executor:
            futures = {}
            for _ in range(batch_size):
                file_url = f"{base_url}/chunk{chunk_id}/example_train_{file_index}.jsonl.zst?download=true"
                target_path = os.path.join(target_dir, f"ch{chunk_id}_example_train_{file_index}.jsonl.zst")
                futures[executor.submit(download_file, file_url, target_path)] = file_index
                file_index += 1

            # A None result means we hit a 404, i.e., the chunk has no more files.
            break_after_loop = False
            for future in as_completed(futures):
                if future.result() is None:
                    break_after_loop = True

        if break_after_loop:
            return


def main():
    parser = argparse.ArgumentParser(
        description="Download files for specified chunks from a base URL."
    )
    parser.add_argument(
        "--target-dir",
        type=str,
        required=True,
        help="The base directory where the datasets will be saved."
    )
    parser.add_argument(
        "--chunks",
        type=int,
        nargs="+",
        default=list(range(1, 11)),
        help="List of chunk IDs to download (default: 1 2 ... 10)."
    )
    args = parser.parse_args()

    base_url = "https://huggingface.co/datasets/cerebras/SlimPajama-627B/resolve/main/train"
    target_dir_base = args.target_dir

    for chunk_id in args.chunks:
        target_dir = os.path.join(target_dir_base, f"chunk{chunk_id}")
        print(f"Downloading chunk {chunk_id} to {target_dir}...")
        download_chunk_files(chunk_id, base_url, target_dir)


if __name__ == "__main__":
    main()
```
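The script derives every download URL from one fixed pattern. A quick sketch of that construction (the `file_url` helper is introduced here for illustration; it mirrors the f-string used in `download_chunk_files`):

```python
BASE_URL = "https://huggingface.co/datasets/cerebras/SlimPajama-627B/resolve/main/train"


def file_url(chunk_id, file_index):
    # Same pattern as download_chunk_files: chunk directory + running file index.
    return f"{BASE_URL}/chunk{chunk_id}/example_train_{file_index}.jsonl.zst?download=true"


print(file_url(1, 0))
# https://huggingface.co/datasets/cerebras/SlimPajama-627B/resolve/main/train/chunk1/example_train_0.jsonl.zst?download=true
```

The first index that returns a 404 for this URL marks the end of a chunk, which is what the batch loop above relies on.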
