From 87dc6bf7692eed2b92ed3de51bcabc0ab0fc89c4 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Thu, 26 Jun 2025 15:57:17 +0100 Subject: [PATCH 01/16] Add UV scripts guide introduction and example - Explain what UV is and how it works with hfjobs - Add concrete hello world example with cowsay - Document key benefits for ML workflows - Set up document structure for future sections --- docs/uv_scripts.md | 143 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 143 insertions(+) create mode 100644 docs/uv_scripts.md diff --git a/docs/uv_scripts.md b/docs/uv_scripts.md new file mode 100644 index 0000000..3c13344 --- /dev/null +++ b/docs/uv_scripts.md @@ -0,0 +1,143 @@ +# Using UV to Run Scripts with hfjobs + +This guide explains how to use uv to run scripts with hfjobs. + +## What is UV? + +UV is a Python package manager that can run Python scripts directly. The simplest way to use UV with hfjobs is to run any Python script: + +```bash +# Run a script from a URL +hfjobs run ghcr.io/astral-sh/uv:debian-slim uv run https://example.com/script.py +``` + +This works with any Python script - no special setup required! + +On its own, this isn't very exciting, you can also run a python script directly with Python! One of the things that makes uv more powerful is the ability to declare dependencies directly in your Python scripts, which allows you to run them without needing to install anything manually. + +### UV Scripts: Adding Dependencies + +Let's look at a simple example of a Python script with dependencies. This script relies on the `cowsay` library to print a message: + +```python +# /// script +# requires-python = ">=3.8" +# dependencies = [ +# "cowsay", +# ] +# /// +"""A simple UV script example for hfjobs. +This script demonstrates how UV scripts can specify their dependencies +inline, making them perfect for running with hfjobs. 
+""" + +import cowsay +import sys + + +def main(): + message = " ".join(sys.argv[1:]) if len(sys.argv) > 1 else "Hello from hfjobs!" + cowsay.cow(message) + + +if __name__ == "__main__": + main() +``` + +If we have the script saved as `hello_world_uv.py`, you can run it locally (assuming you have uv installed) like this: + +```bash +uv run hello_world_uv.py "Hello from my CLI!" +``` + +We can also run uv scripts via a URL: + +```bash +uv run https://raw.githubusercontent.com/davanstrien/hfjobs/refs/heads/quickstart-only/docs/examples/hello_world_uv.py "Hello from my CLI, I arrived from the internet via a URL!" +``` + +Now, to run it on Hugging Face infrastructure using hfjobs we would simply need to instead run: + + + +```bash +hfjobs run ghcr.io/astral-sh/uv:debian-slim uv run https://raw.githubusercontent.com/davanstrien/hfjobs/refs/heads/quickstart-only/docs/examples/hello_world_uv.py "Hello from hfjobs!" +``` + +This command runs your script on Hugging Face's infrastructure, automatically installing cowsay in an isolated environment. We'll explain how this works in detail later, but the key point is that you can run any Python script with dependencies on Hugging Face infrastructure using a single command! + +### Why UV Scripts + hfjobs? + +UV scripts solve a fundamental challenge when running code on remote infrastructure: dependency management. Instead of building Docker images or manually installing packages, UV scripts let you declare dependencies right in your Python file. + +**Key benefits for hfjobs users:** + +- **Zero setup**: Your script runs anywhere with just a URL - no Docker knowledge needed +- **Self-contained**: Dependencies travel with your code, ensuring reproducibility +- **Instant iteration**: Change dependencies without rebuilding containers +- **Perfect for sharing**: Send colleagues a single command that just works + +**Ideal for ML workflows:** + +UV scripts are particularly powerful for machine learning tasks. 
That `train.py` script you've been working on? Add a UV header with your dependencies, and it's ready to run on GPUs with hfjobs. When your script includes a CLI (using argparse or click), you get a flexible tool that can handle different datasets, models, and hyperparameters - we'll show examples of this pattern throughout the guide. + +You can think of UV scripts as "portable cloud functions" - your Python script becomes a complete, runnable unit that hfjobs can execute on any hardware with one command. + +## Understanding UV Scripts + +### Script Header Format + +### Dependency Declaration + +### Python Version Requirements + +## Running Scripts with UV and hfjobs + +### Making the script available (TODO better name) + +- Running a public script +- uploading script to HF + +### Basic Command Pattern + +### Choosing Docker Images + +### Environment Setup + +## Examples + +### Example 1: Simple Script (CPU) + +### Example 2: Data Processing with Dependencies + +### Example 3: GPU Workload with ML Libraries + +### Example 4: Production vLLM Example + +## Best Practices + +### Script Design for Cloud + +### Error Handling + +### Resource Management + +## Common Patterns + +### Data Input/Output + +### Authentication + +### Monitoring Progress + +## Debugging and Troubleshooting + +### Common Issues + +### Testing Locally vs Cloud + +## Reference + +### Quick Command Templates + +### Links to More Examples From bbf764389221e4b6372d7daadc0ae420492e4e90 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Thu, 26 Jun 2025 17:19:26 +0100 Subject: [PATCH 02/16] Complete Understanding UV Scripts section - Add uv init --script command for creating templates - Show example output of generated script - Focus on uv add --script as recommended approach - Document alternative package indexes with vLLM example - Simplify Python version requirements section - Add links to official UV documentation --- docs/uv_scripts.md | 124 +++++++++++++++++++++++++++++++++++++++++++++ 1 file 
changed, 124 insertions(+) diff --git a/docs/uv_scripts.md b/docs/uv_scripts.md index 3c13344..7842285 100644 --- a/docs/uv_scripts.md +++ b/docs/uv_scripts.md @@ -85,12 +85,132 @@ You can think of UV scripts as "portable cloud functions" - your Python script b ## Understanding UV Scripts +In this section, we'll cover the basics of uv scripts. To avoid duplicating the official [uv documentation for scripts](https://docs.astral.sh/uv/guides/scripts/#declaring-script-dependencies), we'll focus on the key aspects that are relevant for running scripts with hfjobs. + +UV scripts are Python files that include a special header to declare dependencies and metadata. We can create a template UV script using the `uv init` command with the `--script` flag. This command initializes a new Python script with the necessary UV header: + +```bash +uv init --script example.py +``` + +This creates a file named `example.py` with the following header: + +```python +# /// script +# requires-python = ">=3.12" +# dependencies = [] +# /// +``` + ### Script Header Format +UV scripts use a special comment block at the top of your Python file to declare metadata. 
This header follows a specific format: + +```python +# /// script +# dependencies = [ +# "package1", +# "package2", +# ] +# /// +``` + +Key points: + +- The header starts with `# /// script` and ends with `# ///` +- Everything between these markers uses TOML format +- The `dependencies` field is required (even if empty) +- All lines must be prefixed with `#` and a space + +A minimal UV script looks like this: + +```python +# /// script +# dependencies = [] +# /// + +print("Hello, world!") +``` + ### Dependency Declaration +The easiest way to add dependencies to your UV script is using the `uv add` command: + +```bash +# Add a single package +uv add --script script.py numpy + +# Add multiple packages +uv add --script script.py pandas polars requests + +# Add packages with version constraints +uv add --script script.py "torch>=2.0" "transformers<5.0" + +# Add from a requirements file +uv add --script script.py --requirements requirements.txt +``` + +This automatically updates your script header with the dependencies: + +```python +# /// script +# dependencies = [ +# "numpy", +# "pandas", +# "polars", +# "requests", +# "torch>=2.0", +# "transformers<5.0", +# ] +# /// +``` + +**Understanding the syntax:** + +Dependencies work like `requirements.txt` entries: + +- `"numpy"` - Latest version +- `"pandas>=2.0.0"` - Minimum version +- `"torch==2.1.0"` - Exact version +- `"transformers>=4.30,<5.0"` - Version range + +### Using alternative package indexes + +Quite often in an ML context, you may want to use a package index other than PyPI, such as the vLLM wheels index. 
You can specify an alternative index using the `--index` flag with `uv add`: + +```bash +uv add --index "https://wheels.vllm.ai/nightly" --script example.py vllm +``` + +This will result in adding the following to your script header: + +```python +# /// script +# requires-python = ">=3.12" +# dependencies = [ +# "vllm", +# ] +# +# [[tool.uv.index]] +# url = "https://wheels.vllm.ai/nightly" +# /// +``` + +This will let uv know to use the specified index when installing dependencies for this script. + +See [uv docs](https://docs.astral.sh/uv/guides/scripts/#using-alternative-package-indexes) for more details on using alternative package indexes. + ### Python Version Requirements +You can specify which Python version your script requires using the `requires-python` field: + +```python +# /// script +# requires-python = ">=3.8" +# dependencies = ["numpy", "pandas"] +# /// +``` + ## Running Scripts with UV and hfjobs ### Making the script available (TODO better name) @@ -141,3 +261,7 @@ You can think of UV scripts as "portable cloud functions" - your Python script b ### Quick Command Templates ### Links to More Examples + +``` + +``` From 415ff7b3886778baace5c3cd777c0e8c53d5e70f Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Thu, 26 Jun 2025 18:12:25 +0100 Subject: [PATCH 03/16] draft simple uv usage guide --- docs/uv_scripts.md | 251 ++++++++++++++++++++++++++------------------- 1 file changed, 144 insertions(+), 107 deletions(-) diff --git a/docs/uv_scripts.md b/docs/uv_scripts.md index 7842285..f524566 100644 --- a/docs/uv_scripts.md +++ b/docs/uv_scripts.md @@ -83,185 +83,222 @@ UV scripts are particularly powerful for machine learning tasks. That `train.py` You can think of UV scripts as "portable cloud functions" - your Python script becomes a complete, runnable unit that hfjobs can execute on any hardware with one command. -## Understanding UV Scripts +## Getting Started -In this section, we'll cover the basics of uv scripts. 
To avoid duplicating the official [uv documentation for scripts](https://docs.astral.sh/uv/guides/scripts/#declaring-script-dependencies), we'll focus on the key aspects that are relevant for running scripts with hfjobs. +Let's create and run your first UV script on Hugging Face's infrastructure. -UV scripts are Python files that include a special header to declare dependencies and metadata. We can create a template UV script using the `uv init` command with the `--script` flag. This command initializes a new Python script with the necessary UV header: +### 1. Create a UV Script + +First, create a new UV script using the `uv init` command: ```bash -uv init --script example.py +uv init --script process_data.py ``` -This creates a file named `example.py` with the following header: +This creates a template script: ```python # /// script # requires-python = ">=3.12" # dependencies = [] # /// -``` - -### Script Header Format -UV scripts use a special comment block at the top of your Python file to declare metadata. This header follows a specific format: - -```python -# /// script -# dependencies = [ -# "package1", -# "package2", -# ] -# /// -``` - -Key points: - -- The header starts with `# /// script` and ends with `# ///` -- Everything between these markers uses TOML format -- The `dependencies` field is required (even if empty) -- All lines must be prefixed with `#` and a space - -A minimal UV script looks like this: - -```python -# /// script -# dependencies = [] -# /// +def main(): + print("Hello from UV!") -print("Hello, world!") +if __name__ == "__main__": + main() ``` -### Dependency Declaration +### 2. 
Add Dependencies -The easiest way to add dependencies to your UV script is using the `uv add` command: +Add the packages your script needs: ```bash -# Add a single package -uv add --script script.py numpy - -# Add multiple packages -uv add --script script.py pandas polars requests +# For data processing +uv add --script process_data.py pandas pyarrow requests -# Add packages with version constraints -uv add --script script.py "torch>=2.0" "transformers<5.0" - -# Add from a requirements file -uv add --script script.py --requirements requirements.txt +# For machine learning +uv add --script process_data.py torch transformers datasets ``` -This automatically updates your script header with the dependencies: +Your script header now includes the dependencies: ```python # /// script +# requires-python = ">=3.12" # dependencies = [ -# "numpy", # "pandas", -# "polars", +# "pyarrow", # "requests", -# "torch>=2.0", -# "transformers<5.0", +# "torch", +# "transformers", +# "datasets", # ] # /// ``` -**Understanding the syntax:** +### 3. Test Locally -Dependencies work like `requirements.txt` entries: +Make sure your script works: -- `"numpy"` - Latest version -- `"pandas>=2.0.0"` - Minimum version -- `"torch==2.1.0"` - Exact version -- `"transformers>=4.30,<5.0"` - Version range +```bash +uv run process_data.py +``` -### Using alternative package indexes +### 4. Upload to Hugging Face Hub -Quite often in an ML context, you may want to use a package index other than PyPI, such as the vLLM wheels index. 
You can specify an alternative index using the `--index` flag with `uv add`: +Create a dataset repository for your scripts: ```bash -uv add --index "https://wheels.vllm.ai/nightly" --script example.py vllm +# Create a dataset repo (only needed once) +huggingface-cli repo create my-uv-scripts --type dataset + +# Upload your script +huggingface-cli upload my-uv-scripts process_data.py scripts/process_data.py --repo-type dataset ``` -This will result in adding the following to your script header: +### 5. Run with hfjobs -```python -# /// script -# requires-python = ">=3.12" -# dependencies = [ -# "vllm", -# ] -# -# [[tool.uv.index]] -# url = "https://wheels.vllm.ai/nightly" -# /// +Now run your script on HF infrastructure: + +```bash +# CPU execution +hfjobs run ghcr.io/astral-sh/uv:debian-slim /bin/bash -c \ + "uv run https://huggingface.co/datasets/{your-username}/my-uv-scripts/raw/main/scripts/process_data.py" + +# GPU execution +hfjobs run --flavor gpu-nvidia-small ghcr.io/astral-sh/uv:debian-slim /bin/bash -c \ + "uv run https://huggingface.co/datasets/{your-username}/my-uv-scripts/raw/main/scripts/process_data.py" ``` -This will let uv know to use the specified index when installing dependencies for this script. +That's it! Your script is running on Hugging Face's infrastructure with all dependencies automatically installed. -See [uv docs](https://docs.astral.sh/uv/guides/scripts/#using-alternative-package-indexes) for more details on using alternative package indexes. 
+## Key Concepts for Running UV Scripts -### Python Version Requirements +### Basic Command Pattern -You can specify which Python version your script requires using the `requires-python` field: +The pattern for running UV scripts with hfjobs is: -```python -# /// script -# requires-python = ">=3.8" -# dependencies = ["numpy", "pandas"] -# /// +```bash +hfjobs run /bin/bash -c "uv run " ``` -## Running Scripts with UV and hfjobs +For most cases, use the lightweight UV image: +- **`ghcr.io/astral-sh/uv:debian-slim`** - Fast startup, includes UV and Python -### Making the script available (TODO better name) +### Common Options -- Running a public script -- uploading script to HF +**Running on GPU:** +```bash +hfjobs run --flavor gpu-nvidia-small ghcr.io/astral-sh/uv:debian-slim /bin/bash -c \ + "uv run https://huggingface.co/datasets/{username}/my-scripts/raw/main/train.py" +``` -### Basic Command Pattern +**Passing secrets (like HF token):** +```bash +hfjobs run --secret HF_TOKEN=$HF_TOKEN ghcr.io/astral-sh/uv:debian-slim /bin/bash -c \ + "uv run https://huggingface.co/datasets/{username}/my-scripts/raw/main/upload.py" +``` + +**Setting environment variables:** +```bash +hfjobs run ghcr.io/astral-sh/uv:debian-slim /bin/bash -c \ + "export HOME=/tmp && uv run your_script.py" +``` + +For advanced topics like Docker image selection, environment setup, and system dependencies, see the [advanced guide](./uv_scripts_advanced.md). 
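Inside the script itself, it can help to check early that a secret passed with `--secret HF_TOKEN=$HF_TOKEN` is actually present. The sketch below assumes the secret reaches the job as an ordinary environment variable of the same name; `require_env` is a hypothetical helper written for illustration, not part of hfjobs or UV.

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable, or exit with a clear message.

    Hypothetical helper: assumes secrets passed via `--secret NAME=value`
    show up inside the job as environment variables.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        # Fail before any expensive work starts.
        raise SystemExit(f"Missing required environment variable: {name}")
    return value
```

Calling `require_env("HF_TOKEN")` at the top of `main()` makes a missing secret fail fast with a readable message instead of surfacing later as an authentication error during upload.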
-### Choosing Docker Images +## Example: Process a Hugging Face Dataset -### Environment Setup +Here's a complete example that downloads and analyzes a dataset: -## Examples +```python +# /// script +# requires-python = ">=3.12" +# dependencies = [ +# "datasets", +# "pandas", +# ] +# /// -### Example 1: Simple Script (CPU) +import argparse +from datasets import load_dataset +import pandas as pd -### Example 2: Data Processing with Dependencies +def main(): + parser = argparse.ArgumentParser() + parser.add_argument("dataset", help="Dataset name (e.g., 'imdb')") + parser.add_argument("--max-samples", type=int, default=100) + args = parser.parse_args() + + # Load dataset + print(f"Loading {args.dataset}...") + ds = load_dataset(args.dataset, split=f"train[:{args.max_samples}]") + + # Basic analysis + df = pd.DataFrame(ds) + print(f"\nDataset shape: {df.shape}") + print(f"Columns: {list(df.columns)}") + print(f"\nFirst example:") + print(df.iloc[0].to_dict()) -### Example 3: GPU Workload with ML Libraries +if __name__ == "__main__": + main() +``` -### Example 4: Production vLLM Example +Run this example: +```bash +# Upload to HF Hub +huggingface-cli upload my-uv-scripts analyze.py scripts/analyze.py --repo-type dataset -## Best Practices +# Run on CPU +hfjobs run ghcr.io/astral-sh/uv:debian-slim /bin/bash -c \ + "uv run https://huggingface.co/datasets/{username}/my-uv-scripts/raw/main/scripts/analyze.py imdb --max-samples 1000" +``` -### Script Design for Cloud +## Quick Reference -### Error Handling +### Essential Commands -### Resource Management +```bash +# Create UV script +uv init --script myscript.py -## Common Patterns +# Add dependencies +uv add --script myscript.py pandas torch -### Data Input/Output +# Test locally +uv run myscript.py -### Authentication +# Upload to HF +huggingface-cli upload my-uv-scripts myscript.py scripts/myscript.py --repo-type dataset -### Monitoring Progress +# Run on CPU +hfjobs run ghcr.io/astral-sh/uv:debian-slim /bin/bash -c \ + 
"uv run https://huggingface.co/datasets/{username}/my-uv-scripts/raw/main/scripts/myscript.py" -## Debugging and Troubleshooting +# Run on GPU +hfjobs run --flavor gpu-nvidia-small ghcr.io/astral-sh/uv:debian-slim /bin/bash -c \ + "uv run https://huggingface.co/datasets/{username}/my-uv-scripts/raw/main/scripts/myscript.py" -### Common Issues +# With secrets +hfjobs run --secret HF_TOKEN=$HF_TOKEN ghcr.io/astral-sh/uv:debian-slim /bin/bash -c \ + "uv run your_script.py" +``` -### Testing Locally vs Cloud +### Getting Help -## Reference +- **UV documentation**: https://docs.astral.sh/uv/ +- **hfjobs documentation**: https://github.com/huggingface/hfjobs +- **This guide (advanced topics)**: [uv_scripts_advanced.md](./uv_scripts_advanced.md) -### Quick Command Templates +## Next Steps -### Links to More Examples +You now have everything you need to run UV scripts on Hugging Face's infrastructure! Try: -``` +1. Modifying the example for your use case +2. Exploring GPU options for ML workloads +3. Building a collection of reusable scripts -``` +Happy scripting! šŸš€ From ac6f0c630cef3b25cd5f59a1910303537ef53c07 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Fri, 27 Jun 2025 09:11:43 +0100 Subject: [PATCH 04/16] grammar --- docs/uv_scripts.md | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/docs/uv_scripts.md b/docs/uv_scripts.md index f524566..74d9ef0 100644 --- a/docs/uv_scripts.md +++ b/docs/uv_scripts.md @@ -4,7 +4,7 @@ This guide explains how to use uv to run scripts with hfjobs. ## What is UV? -UV is a Python package manager that can run Python scripts directly. The simplest way to use UV with hfjobs is to run any Python script: +UV is a Python package manager that can run Python scripts. 
The simplest way to use UV with hfjobs is to run any Python script: ```bash # Run a script from a URL @@ -13,7 +13,7 @@ hfjobs run ghcr.io/astral-sh/uv:debian-slim uv run https://example.com/script.py This works with any Python script - no special setup required! -On its own, this isn't very exciting, you can also run a python script directly with Python! One of the things that makes uv more powerful is the ability to declare dependencies directly in your Python scripts, which allows you to run them without needing to install anything manually. +On its own, this isn't very exciting; you can also run a Python script directly with Python! One of the features that makes UV more powerful is the ability to declare dependencies directly in your Python scripts, which allows you to run them without needing to install any dependencies manually. ### UV Scripts: Adding Dependencies @@ -185,23 +185,27 @@ hfjobs run /bin/bash -c "uv run " ``` For most cases, use the lightweight UV image: + - **`ghcr.io/astral-sh/uv:debian-slim`** - Fast startup, includes UV and Python ### Common Options **Running on GPU:** + ```bash hfjobs run --flavor gpu-nvidia-small ghcr.io/astral-sh/uv:debian-slim /bin/bash -c \ "uv run https://huggingface.co/datasets/{username}/my-scripts/raw/main/train.py" ``` **Passing secrets (like HF token):** + ```bash hfjobs run --secret HF_TOKEN=$HF_TOKEN ghcr.io/astral-sh/uv:debian-slim /bin/bash -c \ "uv run https://huggingface.co/datasets/{username}/my-scripts/raw/main/upload.py" ``` **Setting environment variables:** + ```bash hfjobs run ghcr.io/astral-sh/uv:debian-slim /bin/bash -c \ "export HOME=/tmp && uv run your_script.py" @@ -231,11 +235,11 @@ def main(): parser.add_argument("dataset", help="Dataset name (e.g., 'imdb')") parser.add_argument("--max-samples", type=int, default=100) args = parser.parse_args() - + # Load dataset print(f"Loading {args.dataset}...") ds = load_dataset(args.dataset, split=f"train[:{args.max_samples}]") - + # Basic analysis df = 
pd.DataFrame(ds) print(f"\nDataset shape: {df.shape}") @@ -248,6 +252,7 @@ if __name__ == "__main__": ``` Run this example: + ```bash # Upload to HF Hub huggingface-cli upload my-uv-scripts analyze.py scripts/analyze.py --repo-type dataset @@ -265,7 +270,7 @@ hfjobs run ghcr.io/astral-sh/uv:debian-slim /bin/bash -c \ # Create UV script uv init --script myscript.py -# Add dependencies +# Add dependencies uv add --script myscript.py pandas torch # Test locally @@ -298,7 +303,7 @@ hfjobs run --secret HF_TOKEN=$HF_TOKEN ghcr.io/astral-sh/uv:debian-slim /bin/bas You now have everything you need to run UV scripts on Hugging Face's infrastructure! Try: 1. Modifying the example for your use case -2. Exploring GPU options for ML workloads +2. Exploring GPU options for ML workloads 3. Building a collection of reusable scripts Happy scripting! 🚀 From 2240f1770e79a11a25ac439730efd94cc95fbe04 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Fri, 27 Jun 2025 09:12:13 +0100 Subject: [PATCH 05/16] grammar --- docs/uv_scripts.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/uv_scripts.md b/docs/uv_scripts.md index 74d9ef0..58cab1e 100644 --- a/docs/uv_scripts.md +++ b/docs/uv_scripts.md @@ -56,7 +56,7 @@ We can also run uv scripts via a URL: uv run https://raw.githubusercontent.com/davanstrien/hfjobs/refs/heads/quickstart-only/docs/examples/hello_world_uv.py "Hello from my CLI, I arrived from the internet via a URL!" 
``` -Now, to run it on Hugging Face infrastructure using hfjobs we would simply need to instead run: +Now, to run it on Hugging Face infrastructure using hfjobs we would simply need to run instead: From e1d8e931927efe60b20efbca727fb60fcedde9e9 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Fri, 27 Jun 2025 09:14:02 +0100 Subject: [PATCH 06/16] add link --- docs/uv_scripts.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/uv_scripts.md b/docs/uv_scripts.md index 58cab1e..508502a 100644 --- a/docs/uv_scripts.md +++ b/docs/uv_scripts.md @@ -17,7 +17,7 @@ On its own, this isn't very exciting; you can also run a Python script directly ### UV Scripts: Adding Dependencies -Let's look at a simple example of a Python script with dependencies. This script relies on the `cowsay` library to print a message: +Let's look at a simple example of a Python script with dependencies. This script relies on the [`cowsay`](https://pypi.org/project/cowsay/) library to print a message: ```python # /// script From cc153b9c4732a3325d0148969958c48d5c33344f Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Fri, 27 Jun 2025 09:15:04 +0100 Subject: [PATCH 07/16] install link --- docs/uv_scripts.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/docs/uv_scripts.md b/docs/uv_scripts.md index 508502a..7dfcdf0 100644 --- a/docs/uv_scripts.md +++ b/docs/uv_scripts.md @@ -4,7 +4,7 @@ This guide explains how to use uv to run scripts with hfjobs. ## What is UV? -UV is a Python package manager that can run Python scripts. The simplest way to use UV with hfjobs is to run any Python script: +[UV](https://docs.astral.sh/uv) is a Python package manager that can run Python scripts. The simplest way to use UV with hfjobs is to run any Python script: ```bash # Run a script from a URL @@ -15,6 +15,10 @@ This works with any Python script - no special setup required! 
On its own, this isn't very exciting; you can also run a Python script directly with Python! One of the features that makes UV more powerful is the ability to declare dependencies directly in your Python scripts, which allows you to run them without needing to install any dependencies manually. +### Install UV + +See [the UV documentation](https://docs.astral.sh/uv/installation/) for up to date installation instructions. + ### UV Scripts: Adding Dependencies Let's look at a simple example of a Python script with dependencies. This script relies on the [`cowsay`](https://pypi.org/project/cowsay/) library to print a message: From 78a2f185a79bff8212f464f51b58067e9935b846 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Fri, 27 Jun 2025 10:03:53 +0100 Subject: [PATCH 08/16] better example --- docs/uv_scripts.md | 63 +++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 60 insertions(+), 3 deletions(-) diff --git a/docs/uv_scripts.md b/docs/uv_scripts.md index 7dfcdf0..3a55285 100644 --- a/docs/uv_scripts.md +++ b/docs/uv_scripts.md @@ -62,7 +62,7 @@ uv run https://raw.githubusercontent.com/davanstrien/hfjobs/refs/heads/quickstar Now, to run it on Hugging Face infrastructure using hfjobs we would simply need to run instead: - + ```bash hfjobs run ghcr.io/astral-sh/uv:debian-slim uv run https://raw.githubusercontent.com/davanstrien/hfjobs/refs/heads/quickstart-only/docs/examples/hello_world_uv.py "Hello from hfjobs!" @@ -188,6 +188,8 @@ The pattern for running UV scripts with hfjobs is: hfjobs run /bin/bash -c "uv run " ``` +The `/bin/bash -c` wrapper allows us to run shell commands (like setting environment variables) before executing the UV script. 
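To make the role of the wrapper concrete, here is a minimal sketch that assembles the same command as an argument list: optional `export` steps are joined with `&&` in front of `uv run`, and the whole string becomes the single argument to `/bin/bash -c`. The function name `build_hfjobs_command` and its parameters are hypothetical; the sketch only builds strings and never invokes hfjobs.

```python
import shlex

UV_IMAGE = "ghcr.io/astral-sh/uv:debian-slim"


def build_hfjobs_command(script_url, script_args=(), flavor=None, env=None):
    """Build the argv list for `hfjobs run <image> /bin/bash -c "<shell command>"`.

    Hypothetical helper for illustration only; it assembles strings and
    does not call hfjobs.
    """
    # Shell steps that run before the script, e.g. `export HOME=/tmp`.
    exports = [
        f"export {key}={shlex.quote(value)}" for key, value in (env or {}).items()
    ]
    uv_run = shlex.join(["uv", "run", script_url, *script_args])
    shell_command = " && ".join([*exports, uv_run])

    command = ["hfjobs", "run"]
    if flavor:
        command += ["--flavor", flavor]
    command += [UV_IMAGE, "/bin/bash", "-c", shell_command]
    return command


cmd = build_hfjobs_command(
    "https://example.com/script.py",
    script_args=["imdb", "--max-samples", "1000"],
    env={"HOME": "/tmp"},
)
print(shlex.join(cmd))
```

Because the result is a plain list, it could be handed to `subprocess.run(cmd, check=True)` on a machine where hfjobs is installed; quoting of the inner shell string is then handled by `shlex` rather than by hand.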
+ For most cases, use the lightweight UV image: - **`ghcr.io/astral-sh/uv:debian-slim`** - Fast startup, includes UV and Python @@ -223,7 +225,7 @@ Here's a complete example that downloads and analyzes a dataset: ```python # /// script -# requires-python = ">=3.12" +# requires-python = ">=3.11" # dependencies = [ # "datasets", # "pandas", @@ -266,6 +268,62 @@ hfjobs run ghcr.io/astral-sh/uv:debian-slim /bin/bash -c \ "uv run https://huggingface.co/datasets/{username}/my-uv-scripts/raw/main/scripts/analyze.py imdb --max-samples 1000" ``` +You should see output like: + +```python +Loading imdb... +train-00000-of-00001.parquet: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21.0M/21.0M [00:00<00:00, 64.4MB/s] +... +Dataset shape: (100, 2) +Columns: ['text', 'label'] +First example: +... +``` + +## Saving Your Results + +When your script runs on Hugging Face infrastructure, any output to stdout is displayed in your terminal. Often though, you don't just want to print results; you want to save them to a file or upload them somewhere. + +You can do this in a few ways: + +### Option 1: Use existing push_to_hub functionality + +The Transformers, TRL, datasets libraries (and many more!) can push results to the Hugging Face Hub directly using their built-in `push_to_hub` functionality. This is the recommended way to save models, datasets, and other artifacts. 
This means you can use the same code you would use locally to save your results, and it will work seamlessly on Hugging Face's infrastructure. + +### Option 2: Upload results using the `huggingface-hub` library + +Add the `huggingface-hub` library to your script and upload results directly: + +```python +# /// script +# requires-python = ">=3.12" +# dependencies = [ +# "datasets", +# "pandas", +# "huggingface-hub", +# ] +# /// + +from huggingface_hub import HfApi +import os + +# Your processing code here... + +# Upload results +api = HfApi() +api.upload_file( + path_or_fileobj="results.csv", + path_in_repo="outputs/results.csv", + repo_id="username/my-results", + repo_type="dataset", + token=os.environ.get("HF_TOKEN") +) +``` + +### Option 3: Use a directory to store results + +You can also write results to a directory and then upload that directory as a dataset. For example if you were saving multiple checkpoints or filtered version of a dataset to a `output` directory you could use `upload_folder` to upload to the hub (or use `upload_large_folder` if you are uploading a large amount of data). 
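The local half of this option can be sketched with the standard library alone: collect everything the job produces under one directory so that a single folder upload picks it all up. The file names, the `save_results` helper, and the sample rows are assumptions made for illustration.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_results(rows, output_dir="output"):
    """Write rows (dicts sharing the same keys) as CSV plus a small JSON summary.

    Hypothetical helper: the file names are illustrative, not a convention
    required by hfjobs or the Hugging Face Hub.
    """
    if not rows:
        raise ValueError("rows must contain at least one record")

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    csv_path = out / "results.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

    summary_path = out / "summary.json"
    summary_path.write_text(
        json.dumps({"num_rows": len(rows)}, indent=2), encoding="utf-8"
    )
    return [csv_path, summary_path]


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    written = save_results(
        [{"text": "example", "label": 0}, {"text": "another", "label": 1}],
        Path(tmp) / "output",
    )
    print([path.name for path in written])
```

In a real job the directory would be a fixed path such as `output`, and the folder upload mentioned above would run once all files are written.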
+ ## Quick Reference ### Essential Commands @@ -300,7 +358,6 @@ hfjobs run --secret HF_TOKEN=$HF_TOKEN ghcr.io/astral-sh/uv:debian-slim /bin/bas - **UV documentation**: https://docs.astral.sh/uv/ - **hfjobs documentation**: https://github.com/huggingface/hfjobs -- **This guide (advanced topics)**: [uv_scripts_advanced.md](./uv_scripts_advanced.md) ## Next Steps From 59db8fbcda5ebfb008ad80b7f12ea8229e050baf Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Fri, 27 Jun 2025 10:17:13 +0100 Subject: [PATCH 09/16] make consistent --- docs/uv_scripts.md | 42 ++++++++++++++++++++++++++++++++---------- 1 file changed, 32 insertions(+), 10 deletions(-) diff --git a/docs/uv_scripts.md b/docs/uv_scripts.md index 3a55285..ac17d25 100644 --- a/docs/uv_scripts.md +++ b/docs/uv_scripts.md @@ -103,7 +103,7 @@ This creates a template script: ```python # /// script -# requires-python = ">=3.12" +# requires-python = ">=3.8" # dependencies = [] # /// @@ -130,7 +130,7 @@ Your script header now includes the dependencies: ```python # /// script -# requires-python = ">=3.12" +# requires-python = ">=3.8" # dependencies = [ # "pandas", # "pyarrow", @@ -169,11 +169,11 @@ Now run your script on HF infrastructure: ```bash # CPU execution hfjobs run ghcr.io/astral-sh/uv:debian-slim /bin/bash -c \ - "uv run https://huggingface.co/datasets/{your-username}/my-uv-scripts/raw/main/scripts/process_data.py" + "uv run https://huggingface.co/datasets/{username}/my-uv-scripts/raw/main/scripts/process_data.py" # GPU execution hfjobs run --flavor gpu-nvidia-small ghcr.io/astral-sh/uv:debian-slim /bin/bash -c \ - "uv run https://huggingface.co/datasets/{your-username}/my-uv-scripts/raw/main/scripts/process_data.py" + "uv run https://huggingface.co/datasets/{username}/my-uv-scripts/raw/main/scripts/process_data.py" ``` That's it! Your script is running on Hugging Face's infrastructure with all dependencies automatically installed. 
@@ -225,7 +225,7 @@ Here's a complete example that downloads and analyzes a dataset: ```python # /// script -# requires-python = ">=3.11" +# requires-python = ">=3.8" # dependencies = [ # "datasets", # "pandas", @@ -270,14 +270,17 @@ hfjobs run ghcr.io/astral-sh/uv:debian-slim /bin/bash -c \ You should see output like: -```python +``` Loading imdb... -train-00000-of-00001.parquet: 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 21.0M/21.0M [00:00<00:00, 64.4MB/s] -... +Downloading readme: 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 7.83k/7.83k [00:00<00:00, 3.91MB/s] +Downloading data: 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 21.0M/21.0M [00:00<00:00, 64.4MB/s] +Generating train split: 25000 examples [00:00, 68054.94 examples/s] + Dataset shape: (100, 2) Columns: ['text', 'label'] + First example: -... +{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967...', 'label': 0} ``` ## Saving Your Results @@ -296,7 +299,7 @@ Add the `huggingface-hub` library to your script and upload results directly: ```python # /// script -# requires-python = ">=3.12" +# requires-python = ">=3.8" # dependencies = [ # "datasets", # "pandas", @@ -324,6 +327,24 @@ api.upload_file( You can also write results to a directory and then upload that directory as a dataset. 
For example, if you were saving multiple checkpoints or filtered versions of a dataset to an `output` directory, you could use `upload_folder` to upload them to the Hub (or `upload_large_folder` if you are uploading a large amount of data). +```python +from huggingface_hub import HfApi +import os + +# Your processing that creates multiple files... +# e.g., saving to output/checkpoint1.pt, output/checkpoint2.pt, etc. + +# Upload the entire directory +api = HfApi() +api.upload_folder( + folder_path="./output", + path_in_repo="experiment_results", + repo_id="username/my-experiments", + repo_type="dataset", + token=os.environ.get("HF_TOKEN") +) +``` + ## Quick Reference ### Essential Commands @@ -358,6 +379,7 @@ hfjobs run --secret HF_TOKEN=$HF_TOKEN ghcr.io/astral-sh/uv:debian-slim /bin/bas - **UV documentation**: https://docs.astral.sh/uv/ - **hfjobs documentation**: https://github.com/huggingface/hfjobs +- **This guide (advanced topics)**: [uv_scripts_advanced.md](./uv_scripts_advanced.md) ## Next Steps From 6227756d8aa5df7b48be1adafbdf016bea69c059 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Fri, 27 Jun 2025 10:18:24 +0100 Subject: [PATCH 10/16] remove link to advanced guide for now --- docs/uv_scripts.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/uv_scripts.md b/docs/uv_scripts.md index ac17d25..0090516 100644 --- a/docs/uv_scripts.md +++ b/docs/uv_scripts.md @@ -379,7 +379,6 @@ hfjobs run --secret HF_TOKEN=$HF_TOKEN ghcr.io/astral-sh/uv:debian-slim /bin/bas - **UV documentation**: https://docs.astral.sh/uv/ - **hfjobs documentation**: https://github.com/huggingface/hfjobs -- **This guide (advanced topics)**: [uv_scripts_advanced.md](./uv_scripts_advanced.md) ## Next Steps From ca5ce6ec78e49357ba20acf636a18a2e8a371413 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Fri, 27 Jun 2025 14:25:28 +0100 Subject: [PATCH 11/16] first example scripts --- examples/README.md | 47 ++++ examples/dataset-deduplication/README.md | 104 +++++++
.../dataset-deduplication/semantic-dedupe.py | 263 ++++++++++++++++++ 3 files changed, 414 insertions(+) create mode 100644 examples/README.md create mode 100644 examples/dataset-deduplication/README.md create mode 100644 examples/dataset-deduplication/semantic-dedupe.py diff --git a/examples/README.md b/examples/README.md new file mode 100644 index 0000000..6e897e4 --- /dev/null +++ b/examples/README.md @@ -0,0 +1,47 @@ +# hfjobs Examples + +Production-ready examples for running workloads on Hugging Face infrastructure. + +## Available Examples + +### [Dataset Deduplication](./dataset-deduplication/) + +Remove duplicate samples from datasets using semantic similarity. Includes examples for cleaning training data and preventing train/test leakage. + +### Coming Soon + +- **Training** - Multi-node training examples +- **vLLM Inference** - Run optimized inference at scale +- **Synthetic Data Generation** - Generate high-quality synthetic datasets +- **Data Processing Pipelines** - ETL workflows for ML data + +## Quick Start + +1. **Install hfjobs**: + + ```bash + pip install hfjobs + ``` + +2. **Set your HF token**: + + ```bash + export HF_TOKEN=$(python -c "from huggingface_hub import HfFolder; print(HfFolder.get_token())") + ``` + +3. **Browse the examples** above for your use case + +## Simple Examples + +Looking for basic hfjobs usage? Check out [docs/examples/](../docs/examples/) for pedagogical examples focused on learning the basics. + +## Contributing + +To add a new example: + +1. Create a task-focused directory (e.g., `model-quantization/`) +2. Include a comprehensive README with use cases and benchmarks +3. Provide runnable scripts with clear documentation +4. Add performance metrics and cost estimates + +Each example should solve a real problem users face when scaling ML workloads. 
diff --git a/examples/dataset-deduplication/README.md b/examples/dataset-deduplication/README.md new file mode 100644 index 0000000..07cf9cc --- /dev/null +++ b/examples/dataset-deduplication/README.md @@ -0,0 +1,104 @@ +# Dataset Deduplication with hfjobs + +Remove duplicate samples from datasets at scale using Hugging Face infrastructure. + +## Overview + +This example demonstrates how to deduplicate datasets using semantic similarity. Unlike exact matching, semantic deduplication identifies samples that have the same meaning even if worded differently. + +## Use Cases + +- **Clean training data**: Remove redundant samples that can lead to overfitting +- **Prevent train/test leakage**: Ensure no semantic overlap between splits +- **Improve data quality**: Remove near-duplicates while preserving diversity + +## Available Scripts + +### semantic-dedupe.py + +Uses [SemHash](https://github.com/MinishLab/semhash) for semantic deduplication. Supports multiple methods: + +- `deduplicate`: Remove semantic duplicates (default) +- `filter_outliers`: Remove anomalous samples +- `find_representative`: Select diverse representative samples + +## Running on HF Infrastructure + +### Prerequisites + + + +```bash +export HF_TOKEN=$(python -c "from huggingface_hub import HfFolder; print(HfFolder.get_token())") +``` + +### Basic Usage + + + +```bash +hfjobs run --secret HF_TOKEN=$HF_TOKEN \ + ghcr.io/astral-sh/uv:debian-slim /bin/bash -c \ + "uv run https://huggingface.co/datasets/{username}/my-scripts/raw/main/semantic-dedupe.py \ + " +``` + +### Examples + +**Small dataset (<100k samples)**: + +```bash +hfjobs run --secret HF_TOKEN=$HF_TOKEN \ + ghcr.io/astral-sh/uv:debian-slim /bin/bash -c \ + "uv run https://huggingface.co/datasets/davanstrien/hfjobs-examples/raw/main/semantic-dedupe.py \ + imdb text davanstrien/imdb-deduplicated" +``` + +**Large dataset (use cpu-upgrade)**: + +```bash +hfjobs run --flavor cpu-upgrade --secret HF_TOKEN=$HF_TOKEN \ + 
ghcr.io/astral-sh/uv:debian-slim /bin/bash -c \ + "uv run https://huggingface.co/datasets/davanstrien/hfjobs-examples/raw/main/semantic-dedupe.py \ + nvidia/Nemotron-Personas persona davanstrien/Personas-deduplicated" +``` + +**With custom threshold**: + +```bash +hfjobs run --secret HF_TOKEN=$HF_TOKEN \ + ghcr.io/astral-sh/uv:debian-slim /bin/bash -c \ + "uv run https://huggingface.co/datasets/davanstrien/hfjobs-examples/raw/main/semantic-dedupe.py \ + squad question davanstrien/squad-dedup --threshold 0.9" +``` + +**Filter outliers instead**: + +```bash +hfjobs run --secret HF_TOKEN=$HF_TOKEN \ + ghcr.io/astral-sh/uv:debian-slim /bin/bash -c \ + "uv run https://huggingface.co/datasets/davanstrien/hfjobs-examples/raw/main/semantic-dedupe.py \ + ag_news text davanstrien/ag-news-filtered --method filter_outliers" +``` + +## Performance Tips + +1. **Test with small samples first**: Use `--max-samples 1000` to verify your setup +2. **Choose appropriate thresholds**: Lower = more aggressive deduplication +3. 
**Monitor progress**: Use `hfjobs logs ` to track progress + +## Output + +The script creates a new dataset repository with: + +- Deduplicated dataset in parquet format +- Dataset card with deduplication statistics +- Metadata about the deduplication process + +Example output repository: [davanstrien/imdb-deduplicated](https://huggingface.co/datasets/davanstrien/imdb-deduplicated) + +## Cost Optimization + +- Semantic deduplication is CPU-bound (embedding generation) +- GPU not required unless using custom embedding models +- For very large datasets (>10M), consider chunking the process diff --git a/examples/dataset-deduplication/semantic-dedupe.py b/examples/dataset-deduplication/semantic-dedupe.py new file mode 100644 index 0000000..aff4ef3 --- /dev/null +++ b/examples/dataset-deduplication/semantic-dedupe.py @@ -0,0 +1,263 @@ +# /// script +# requires-python = ">=3.9" +# dependencies = [ +# "semhash", +# "datasets", +# "huggingface-hub", +# "hf-transfer", +# "hf-xet", +# ] +# /// +"""Deduplicate a Hugging Face dataset using SemHash. + +This script uses semantic deduplication to remove duplicate entries from a dataset +based on a specified text column, then pushes the results to a new dataset repository. 
+""" + +import argparse +import os +import sys +from datetime import datetime +from typing import Optional + +from datasets import Dataset, load_dataset +from huggingface_hub import DatasetCard +from semhash import SemHash +from huggingface_hub import login + +os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = ( + "1" # Enable HF transfer to speed up transfers +) +HF_TOKEN = os.environ.get("HF_TOKEN", None) # Get Hugging Face token from environment +assert HF_TOKEN, "HF_TOKEN environment variable must be set for authentication" +login(HF_TOKEN) + + +def parse_args(): + """Parse command line arguments.""" + parser = argparse.ArgumentParser( + description="Deduplicate a Hugging Face dataset using semantic similarity" + ) + parser.add_argument( + "dataset_id", + type=str, + help="Source dataset ID (e.g., 'imdb', 'squad', 'username/dataset-name')", + ) + parser.add_argument( + "column", + type=str, + help="Column name to deduplicate on (e.g., 'text', 'question', 'context')", + ) + parser.add_argument( + "repo_id", + type=str, + help="Target repository ID for deduplicated dataset (e.g., 'username/my-deduplicated-dataset')", + ) + parser.add_argument( + "--split", + type=str, + default="train", + help="Dataset split to process (default: train)", + ) + parser.add_argument( + "--threshold", + type=float, + default=None, + help="Similarity threshold for deduplication (0-1, default: auto)", + ) + parser.add_argument( + "--method", + type=str, + choices=["deduplicate", "filter_outliers", "find_representative"], + default="deduplicate", + help="Deduplication method to use (default: deduplicate)", + ) + parser.add_argument( + "--private", + action="store_true", + help="Make the output dataset private", + ) + parser.add_argument( + "--max-samples", + type=int, + default=None, + help="Maximum number of samples to process (for testing)", + ) + + return parser.parse_args() + + +def create_dataset_card( + original_dataset_id: str, + column: str, + method: str, + duplicate_ratio: float, + 
original_size: int, + deduplicated_size: int, + threshold: Optional[float] = None, +) -> str: + """Create a dataset card with deduplication information.""" + card_content = f"""--- +tags: +- deduplicated +- semhash +- semantic-deduplication +- hfjobs +--- + +# Deduplicated {original_dataset_id} + +This dataset is a deduplicated version of [{original_dataset_id}](https://huggingface.co/datasets/{original_dataset_id}) +using semantic deduplication with [SemHash](https://github.com/MinishLab/semhash). + +## Deduplication Details + +- **Method**: {method} +- **Column**: `{column}` +- **Original size**: {original_size:,} samples +- **Deduplicated size**: {deduplicated_size:,} samples +- **Duplicate ratio**: {duplicate_ratio:.2%} +- **Reduction**: {(1 - deduplicated_size / original_size):.2%} +""" + + if threshold is not None: + card_content += f"- **Similarity threshold**: {threshold}\n" + + card_content += f""" +- **Date processed**: {datetime.now().strftime("%Y-%m-%d")} + +## How to use + +```python +from datasets import load_dataset + +dataset = load_dataset("{original_dataset_id.split("/")[-1]}-deduplicated") +``` + +## Processing script + +This dataset was created using the following script: + +```bash +uv run dedupe-dataset.py {original_dataset_id} {column} --method {method} +``` + +## About semantic deduplication + +Unlike exact deduplication, semantic deduplication identifies and removes samples that are +semantically similar even if they use different words. This helps create cleaner training +datasets and prevents data leakage between train/test splits. +""" + + return card_content + + +def main(): + """Main function to run deduplication.""" + args = parse_args() + + # Check for HF token + token = os.environ.get("HF_TOKEN") + if not token: + print( + "Warning: HF_TOKEN not found in environment. You may not be able to push to private repos." 
+ ) + + # Load dataset + print(f"Loading dataset '{args.dataset_id}' (split: {args.split})...") + try: + if args.max_samples: + dataset = load_dataset( + args.dataset_id, split=f"{args.split}[:{args.max_samples}]", token=token + ) + else: + dataset = load_dataset(args.dataset_id, split=args.split, token=token) + except Exception as e: + print(f"Error loading dataset: {e}") + sys.exit(1) + + # Validate column exists + if args.column not in dataset.column_names: + print(f"Error: Column '{args.column}' not found in dataset.") + print(f"Available columns: {', '.join(dataset.column_names)}") + sys.exit(1) + + # Convert dataset to records for semhash + print(f"Preparing dataset for deduplication on column '{args.column}'...") + records = [dict(row) for row in dataset] + original_size = len(records) + print(f"Found {original_size:,} samples") + + # Initialize SemHash with the specific column + print("Initializing SemHash with default model...") + semhash = SemHash.from_records(records=records, columns=[args.column]) + + # Apply selected method + print(f"Applying {args.method} method...") + if args.method == "deduplicate": + if args.threshold: + result = semhash.self_deduplicate(threshold=args.threshold) + else: + result = semhash.self_deduplicate() + elif args.method == "filter_outliers": + result = semhash.self_filter_outliers() + elif args.method == "find_representative": + result = semhash.self_find_representative() + + # Get deduplicated records + deduplicated_records = result.selected + deduplicated_size = len(deduplicated_records) + + # Print statistics + print("\nDeduplication complete!") + print(f"Original size: {original_size:,}") + print(f"Deduplicated size: {deduplicated_size:,}") + print( + f"Removed: {original_size - deduplicated_size:,} ({result.duplicate_ratio:.2%})" + ) + + # Create new dataset from deduplicated records + print("\nCreating deduplicated dataset...") + deduplicated_dataset = Dataset.from_list(deduplicated_records) + + # Push dataset to hub 
first (this creates the repo) + print(f"\nPushing deduplicated dataset to '{args.repo_id}'...") + try: + deduplicated_dataset.push_to_hub( + args.repo_id, + private=args.private, + token=token, + commit_message=f"Add deduplicated version of {args.dataset_id}", + ) + print("Dataset pushed successfully!") + + # Create and push dataset card + print("Creating and pushing dataset card...") + card_content = create_dataset_card( + original_dataset_id=args.dataset_id, + column=args.column, + method=args.method, + duplicate_ratio=result.duplicate_ratio, + original_size=original_size, + deduplicated_size=deduplicated_size, + threshold=args.threshold, + ) + + card = DatasetCard(card_content) + card.push_to_hub( + repo_id=args.repo_id, + repo_type="dataset", + token=token, + commit_message="Add dataset card", + ) + + print( + f"\nSuccess! Dataset available at: https://huggingface.co/datasets/{args.repo_id}" + ) + except Exception as e: + print(f"Error: {e}") + sys.exit(1) + + +if __name__ == "__main__": + main() From eabfe64790261069c2dd331dc8039523a4eb474e Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Fri, 27 Jun 2025 15:24:37 +0100 Subject: [PATCH 12/16] vllm classification example --- examples/text-classification/vllm-classify.py | 276 ++++++++++++++++++ 1 file changed, 276 insertions(+) create mode 100644 examples/text-classification/vllm-classify.py diff --git a/examples/text-classification/vllm-classify.py b/examples/text-classification/vllm-classify.py new file mode 100644 index 0000000..1c3d66d --- /dev/null +++ b/examples/text-classification/vllm-classify.py @@ -0,0 +1,276 @@ +# /// script +# requires-python = ">=3.10" +# dependencies = [ +# "datasets", +# "httpx", +# "huggingface-hub", +# "setuptools", +# "toolz", +# "transformers", +# "vllm", +# ] +# +# [[tool.uv.index]] +# url = "https://wheels.vllm.ai/nightly" +# /// + +import logging +import os +from typing import Optional + +import httpx +import torch +import torch.nn.functional as F +from datasets 
import load_dataset +from huggingface_hub import hf_hub_url, login +from toolz import concat, partition_all, keymap +from tqdm.auto import tqdm +from vllm import LLM +import vllm +import os + +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) +# log vllm version +print(vllm.__version__) + + +def get_model_id2label(hub_model_id: str) -> Optional[dict[str, str]]: + response = httpx.get( + hf_hub_url( + hub_model_id, + filename="config.json", + ) + ) + if response.status_code != 200: + return None + try: + data = response.json() + logger.info(f"Config: {data}") + id2label = data.get("id2label") + if id2label is None: + logger.error("id2label is not found in config.json") + return None + return keymap(int, id2label) + except Exception as e: + logger.error(f"Failed to parse config.json: {e}") + return None + + +def get_top_label(output, label_map: Optional[dict[str, str]] = None): + """ + Given a ClassificationRequestOutput and a label_map (e.g. {'0': 'label0', ...}), + returns the top predicted label (or None if not found) and its confidence score. + """ + logits = torch.tensor(output.outputs.probs) + probs = F.softmax(logits, dim=0) + top_idx = torch.argmax(probs).item() + top_prob = probs[top_idx].item() + label = label_map.get(top_idx) if label_map is not None else top_idx + return label, top_prob + + +def format_prompts(dataset, inference_column, inference_columns, prompt_template, column_separator): + """Format prompts based on the provided arguments.""" + + if inference_columns: + # Multiple columns specified + columns = [col.strip() for col in inference_columns.split(',')] + + # Validate columns exist + for col in columns: + if col not in dataset.column_names: + raise ValueError(f"Column '{col}' not found in dataset. 
Available: {dataset.column_names}") + + if prompt_template: + # Use template formatting + prompts = [] + for row in dataset: + format_dict = {col: row[col] for col in columns} + try: + # Replace \\n with actual newlines in the template + template = prompt_template.replace('\\n', '\n') + prompt = template.format(**format_dict) + prompts.append(prompt) + except KeyError as e: + raise ValueError(f"Template placeholder {e} not found in columns: {columns}") + else: + # Join columns with separator + prompts = [ + column_separator.join(str(row[col]) for col in columns) + for row in dataset + ] + else: + # Single column (backward compatible) + if inference_column not in dataset.column_names: + raise ValueError(f"Column '{inference_column}' not found in dataset") + prompts = dataset[inference_column] + + return prompts + + +def main( + hub_model_id: str, + src_dataset_hub_id: str, + output_dataset_hub_id: str, + inference_column: str = "text", + inference_columns: Optional[str] = None, + prompt_template: Optional[str] = None, + column_separator: str = " ", + batch_size: int = 10_000, + hf_token: Optional[str] = None, +): + HF_TOKEN = hf_token or os.environ.get("HF_TOKEN") + if HF_TOKEN is not None: + login(token=HF_TOKEN) + else: + raise ValueError("HF_TOKEN is not set") + llm = LLM(model=hub_model_id, task="classify") + id2label = get_model_id2label(hub_model_id) + dataset = load_dataset(src_dataset_hub_id, split="train") + + # Format prompts based on arguments + prompts = format_prompts(dataset, inference_column, inference_columns, prompt_template, column_separator) + logger.info(f"Formatted {len(prompts)} prompts") + if prompts: + logger.info(f"Example prompt: {prompts[0][:200]}...") + all_results = [] + all_results.extend( + llm.classify(batch) for batch in tqdm(list(partition_all(batch_size, prompts))) + ) + outputs = list(concat(all_results)) + if id2label is not None: + labels_and_probs = [get_top_label(output, id2label) for output in outputs] + dataset = 
dataset.add_column("label", [label for label, _ in labels_and_probs]) + dataset = dataset.add_column("prob", [prob for _, prob in labels_and_probs]) + else: + # just append raw label index and probs + dataset = dataset.add_column( + "label", [output.outputs.label for output in outputs] + ) + dataset = dataset.add_column( + "prob", [output.outputs.probs for output in outputs] + ) + dataset.push_to_hub(output_dataset_hub_id, token=HF_TOKEN) + + # Create and push dataset card + from huggingface_hub import DatasetCard + + card_content = f"""--- +tags: +- text-classification +- vllm +--- + +# {output_dataset_hub_id} + +This dataset was created by classifying [{src_dataset_hub_id}](https://huggingface.co/datasets/{src_dataset_hub_id}) +using [{hub_model_id}](https://huggingface.co/{hub_model_id}). + +## Prompt Format +""" + + if inference_columns: + card_content += f"Columns used: `{inference_columns}`\n\n" + if prompt_template: + card_content += f"Template:\n```\n{prompt_template}\n```\n\n" + else: + card_content += f"Columns joined with: `{column_separator}`\n\n" + else: + card_content += f"Column used: `{inference_column}`\n\n" + + if id2label: + card_content += f"\n## Labels\n\n{', '.join([f'`{label}`' for label in id2label.values()])}\n" + + card_content += f"\n## Processing Details\n\n- Batch size: {batch_size:,}\n- Date: {os.popen('date').read().strip()}\n" + + card = DatasetCard(card_content) + card.push_to_hub(output_dataset_hub_id, repo_type="dataset", token=HF_TOKEN) + logger.info(f"Dataset and card pushed to: https://huggingface.co/datasets/{output_dataset_hub_id}") + + +if __name__ == "__main__": + import argparse + + parser = argparse.ArgumentParser( + prog="main.py", + description="Classify a dataset using a Hugging Face model and save results to Hugging Face Hub", + ) + parser.add_argument( + "hub_model_id", + type=str, + help="Hugging Face model ID to use for classification", + ) + parser.add_argument( + "src_dataset_hub_id", + type=str, + help="Source 
dataset ID on Hugging Face Hub", + ) + parser.add_argument( + "output_dataset_hub_id", + type=str, + help="Output dataset ID on Hugging Face Hub", + ) + parser.add_argument( + "--inference-column", + type=str, + default="text", + help="Column name containing text to classify (default: text)", + ) + parser.add_argument( + "--inference-columns", + type=str, + help="Comma-separated list of columns to combine (e.g., 'title,abstract')" + ) + parser.add_argument( + "--prompt-template", + type=str, + help="Template string with placeholders (e.g., 'Title: {title}\\nAbstract: {abstract}')" + ) + parser.add_argument( + "--column-separator", + type=str, + default=" ", + help="Separator when joining columns without template (default: space)" + ) + parser.add_argument( + "--batch-size", + type=int, + default=10_000, + help="Batch size for inference (default: 10000)", + ) + parser.add_argument( + "--hf-token", + type=str, + default=None, + help="Hugging Face token (default: None)", + ) + + args = parser.parse_args() + main( + hub_model_id=args.hub_model_id, + src_dataset_hub_id=args.src_dataset_hub_id, + output_dataset_hub_id=args.output_dataset_hub_id, + inference_column=args.inference_column, + inference_columns=args.inference_columns, + prompt_template=args.prompt_template, + column_separator=args.column_separator, + batch_size=args.batch_size, + hf_token=args.hf_token, + ) + +# hfjobs run --flavor l4x1 \ +# --secret HF_TOKEN=hf_*** \ +# ghcr.io/astral-sh/uv:debian \ +# /bin/bash -c " +# export HOME=/tmp && \ +# export USER=dummy && \ +# export TORCHINDUCTOR_CACHE_DIR=/tmp/torch-inductor && \ +# uv run https://huggingface.co/datasets/davanstrien/dataset-creation-scripts/raw/main/vllm-bert-classify-dataset/main.py \ +# davanstrien/ModernBERT-base-is-new-arxiv-dataset \ +# davanstrien/testarxiv \ +# davanstrien/testarxiv-out \ +# --inference-column prompt \ +# --batch-size 100000" \ +# --project vllm-classify \ +# --name testarxiv-classify From 
4efe10a782600009aa9c89633009b5bdae4c9552 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Fri, 27 Jun 2025 15:55:11 +0100 Subject: [PATCH 13/16] add support for multiple gpus --- examples/text-classification/vllm-classify.py | 76 ++++++++++++------- 1 file changed, 48 insertions(+), 28 deletions(-) diff --git a/examples/text-classification/vllm-classify.py b/examples/text-classification/vllm-classify.py index 1c3d66d..aee8e14 100644 --- a/examples/text-classification/vllm-classify.py +++ b/examples/text-classification/vllm-classify.py @@ -2,9 +2,10 @@ # requires-python = ">=3.10" # dependencies = [ # "datasets", +# "hf-transfer", +# "hf-xet", # "httpx", # "huggingface-hub", -# "setuptools", # "toolz", # "transformers", # "vllm", @@ -29,6 +30,7 @@ import vllm import os + logging.basicConfig(level=logging.INFO) logger = logging.getLogger(__name__) # log vllm version @@ -70,18 +72,22 @@ def get_top_label(output, label_map: Optional[dict[str, str]] = None): return label, top_prob -def format_prompts(dataset, inference_column, inference_columns, prompt_template, column_separator): +def format_prompts( + dataset, inference_column, inference_columns, prompt_template, column_separator +): """Format prompts based on the provided arguments.""" - + if inference_columns: # Multiple columns specified - columns = [col.strip() for col in inference_columns.split(',')] - + columns = [col.strip() for col in inference_columns.split(",")] + # Validate columns exist for col in columns: if col not in dataset.column_names: - raise ValueError(f"Column '{col}' not found in dataset. Available: {dataset.column_names}") - + raise ValueError( + f"Column '{col}' not found in dataset. 
Available: {dataset.column_names}" + ) + if prompt_template: # Use template formatting prompts = [] @@ -89,23 +95,24 @@ def format_prompts(dataset, inference_column, inference_columns, prompt_template format_dict = {col: row[col] for col in columns} try: # Replace \\n with actual newlines in the template - template = prompt_template.replace('\\n', '\n') + template = prompt_template.replace("\\n", "\n") prompt = template.format(**format_dict) prompts.append(prompt) except KeyError as e: - raise ValueError(f"Template placeholder {e} not found in columns: {columns}") + raise ValueError( + f"Template placeholder {e} not found in columns: {columns}" + ) from e else: # Join columns with separator prompts = [ column_separator.join(str(row[col]) for col in columns) for row in dataset ] - else: - # Single column (backward compatible) - if inference_column not in dataset.column_names: - raise ValueError(f"Column '{inference_column}' not found in dataset") + elif inference_column in dataset.column_names: prompts = dataset[inference_column] - + + else: + raise ValueError(f"Column '{inference_column}' not found in dataset") return prompts @@ -125,12 +132,23 @@ def main( login(token=HF_TOKEN) else: raise ValueError("HF_TOKEN is not set") - llm = LLM(model=hub_model_id, task="classify") + # Auto-detect number of GPUs + num_gpus = torch.cuda.device_count() + logger.info(f"Detected {num_gpus} GPU(s)") + + # Initialize LLM with tensor parallel size equal to number of GPUs + llm = LLM( + model=hub_model_id, + task="classify", + tensor_parallel_size=num_gpus if num_gpus > 0 else 1, + ) id2label = get_model_id2label(hub_model_id) dataset = load_dataset(src_dataset_hub_id, split="train") - + # Format prompts based on arguments - prompts = format_prompts(dataset, inference_column, inference_columns, prompt_template, column_separator) + prompts = format_prompts( + dataset, inference_column, inference_columns, prompt_template, column_separator + ) logger.info(f"Formatted {len(prompts)} 
prompts") if prompts: logger.info(f"Example prompt: {prompts[0][:200]}...") @@ -152,10 +170,10 @@ def main( "prob", [output.outputs.probs for output in outputs] ) dataset.push_to_hub(output_dataset_hub_id, token=HF_TOKEN) - + # Create and push dataset card from huggingface_hub import DatasetCard - + card_content = f"""--- tags: - text-classification @@ -169,7 +187,7 @@ def main( ## Prompt Format """ - + if inference_columns: card_content += f"Columns used: `{inference_columns}`\n\n" if prompt_template: @@ -178,15 +196,17 @@ def main( card_content += f"Columns joined with: `{column_separator}`\n\n" else: card_content += f"Column used: `{inference_column}`\n\n" - + if id2label: card_content += f"\n## Labels\n\n{', '.join([f'`{label}`' for label in id2label.values()])}\n" - - card_content += f"\n## Processing Details\n\n- Batch size: {batch_size:,}\n- Date: {os.popen('date').read().strip()}\n" - + + card_content += f"\n## Processing Details\n\n- Batch size: {batch_size:,}\n- GPUs used: {num_gpus}\n- Date: {os.popen('date').read().strip()}\n" + card = DatasetCard(card_content) card.push_to_hub(output_dataset_hub_id, repo_type="dataset", token=HF_TOKEN) - logger.info(f"Dataset and card pushed to: https://huggingface.co/datasets/{output_dataset_hub_id}") + logger.info( + f"Dataset and card pushed to: https://huggingface.co/datasets/{output_dataset_hub_id}" + ) if __name__ == "__main__": @@ -220,18 +240,18 @@ def main( parser.add_argument( "--inference-columns", type=str, - help="Comma-separated list of columns to combine (e.g., 'title,abstract')" + help="Comma-separated list of columns to combine (e.g., 'title,abstract')", ) parser.add_argument( "--prompt-template", type=str, - help="Template string with placeholders (e.g., 'Title: {title}\\nAbstract: {abstract}')" + help="Template string with placeholders (e.g., 'Title: {title}\\nAbstract: {abstract}')", ) parser.add_argument( "--column-separator", type=str, default=" ", - help="Separator when joining columns without 
template (default: space)" + help="Separator when joining columns without template (default: space)", ) parser.add_argument( "--batch-size", From cd92f920cf8ed0210599f2b1e5dd3371048d4196 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Thu, 3 Jul 2025 14:46:49 +0100 Subject: [PATCH 14/16] feat: add UV script sharing commands - Add 'hfjobs scripts init' to create HF dataset repos for UV scripts - Add 'hfjobs scripts push' to update scripts in existing repos - Auto-generate README with usage instructions - Add 'hfjobs-uv-script' tag for discovery - Add documentation for UV script sharing This minimal MVP makes it easy to share and run UV scripts with hfjobs --- README.md | 29 ++- docs/uv-script-sharing.md | 138 ++++++++++++ hfjobs/cli.py | 2 + hfjobs/commands/scripts.py | 431 +++++++++++++++++++++++++++++++++++++ 4 files changed, 599 insertions(+), 1 deletion(-) create mode 100644 docs/uv-script-sharing.md create mode 100644 hfjobs/commands/scripts.py diff --git a/README.md b/README.md index ae376c8..df56af3 100644 --- a/README.md +++ b/README.md @@ -14,13 +14,14 @@ pip install hfjobs usage: hfjobs [] positional arguments: - {inspect,logs,ps,run,cancel} + {inspect,logs,ps,run,cancel,scripts} hfjobs command helpers inspect Display detailed information on one or more Jobs logs Fetch the logs of a Job ps List Jobs run Run a Job cancel Cancel a Job + scripts Share and manage UV scripts on Hugging Face Hub options: -h, --help show this help message and exit @@ -95,3 +96,29 @@ Available `--flavor` options: - TPU: `v5e-1x1`, `v5e-2x2`, `v5e-2x4` (updated in 03/25 from Hugging Face [suggested_hardware docs](https://huggingface.co/docs/hub/en/spaces-config-reference)) + +## UV Script Sharing + +Share and run UV scripts easily on the Hugging Face Hub: + +### Share a script + +```bash +# Create a new repository with your UV script +hfjobs scripts init my-awesome-script my-script.py + +# Or create a template to get started +hfjobs scripts init my-new-script +``` + +### Run 
shared scripts + +Once shared, anyone can run your script: + +```bash +hfjobs run ghcr.io/astral-sh/uv:python3.12 \ + uv run https://huggingface.co/datasets/username/my-script/resolve/main/script.py \ + +``` + +See the [UV script sharing guide](docs/uv-script-sharing.md) for more details. diff --git a/docs/uv-script-sharing.md b/docs/uv-script-sharing.md new file mode 100644 index 0000000..c349c9e --- /dev/null +++ b/docs/uv-script-sharing.md @@ -0,0 +1,138 @@ +# UV Script Sharing with hfjobs + +This guide explains how to share UV scripts on the Hugging Face Hub using the new `hfjobs scripts` commands. + +## Overview + +The `hfjobs scripts` commands provide a simple way to: +- Share UV scripts as Hugging Face dataset repositories +- Make scripts easily discoverable with standardized tags +- Generate usage instructions automatically +- Run shared scripts with a simple copy-paste command + +## Quick Start + +### 1. Share a UV Script + +To share an existing UV script: + +```bash +hfjobs scripts init my-awesome-script my-script.py +``` + +This will: +- Create a new dataset repository (e.g., `username/my-awesome-script`) +- Upload your UV script +- Generate a README with usage instructions +- Tag the repository with `hfjobs-uv-script` for discovery + +### 2. Create a Template Script + +If you don't have a script yet, omit the script argument to create a template: + +```bash +hfjobs scripts init my-new-script +``` + +This creates a repository with a template UV script that you can customize. + +### 3. Add More Scripts + +To add additional scripts to an existing repository: + +```bash +hfjobs scripts push another-script.py +``` + +The README will be automatically updated with the new script. 
+ +## Running Shared Scripts + +Once a script is shared, anyone can run it using the command shown in the repository README: + +```bash +hfjobs run ghcr.io/astral-sh/uv:python3.12 \ + uv run https://huggingface.co/datasets/username/my-script/resolve/main/script.py \ + +``` + +## Example UV Script + +Here's a simple UV script that filters a dataset by text length: + +```python +# /// script +# requires-python = ">=3.10" +# dependencies = [ +# "datasets", +# "pandas", +# ] +# /// +"""Filter dataset by text length.""" + +import argparse +from datasets import load_dataset + +def main(): + parser = argparse.ArgumentParser(description="Filter dataset by text length") + parser.add_argument("input_dataset", help="Input dataset from HF Hub") + parser.add_argument("output_dataset", help="Output dataset name") + parser.add_argument("--min-length", type=int, default=10, help="Minimum text length") + + args = parser.parse_args() + + dataset = load_dataset(args.input_dataset, split="train") + filtered = dataset.filter(lambda x: len(x["text"]) >= args.min_length) + filtered.push_to_hub(args.output_dataset) + print("✅ Done!") + +if __name__ == "__main__": + main() +``` + +## Repository Structure + +A UV script repository has a simple structure: + +``` +username/my-script/ +├── script.py # Your UV script(s) +└── README.md # Auto-generated usage instructions +``` + +## Discovery + +All scripts shared with `hfjobs scripts init` are automatically tagged with: +- `hfjobs-uv-script` +- `uv` +- `python` + +This makes them easy to find on the Hugging Face Hub. + +## Best Practices + +1. **Use descriptive repository names** - Make it clear what your script does +2. **Add docstrings** - The first line becomes the description in the README +3. **Include usage examples** - Add examples in your script's docstring +4. **Specify dependencies clearly** - Use the UV script header format +5. 
**Test locally first** - Ensure your script works before sharing + +## Private Scripts + +To create a private repository for internal use: + +```bash +hfjobs scripts init my-private-script script.py --private +``` + +## Tips + +- The last initialized repository is remembered, so you can use `hfjobs scripts push` without specifying `--repo` +- Script dependencies are automatically extracted and shown in the README +- Multiple scripts in one repository are supported and organized in the README + +## Next Steps + +- Browse existing UV scripts: Search for the `hfjobs-uv-script` tag on [Hugging Face Hub](https://huggingface.co/datasets) +- Share your own scripts to help the community +- Contribute improvements to the [hfjobs repository](https://github.com/huggingface/hfjobs) \ No newline at end of file diff --git a/hfjobs/cli.py b/hfjobs/cli.py index 03934ed..2622f7e 100644 --- a/hfjobs/cli.py +++ b/hfjobs/cli.py @@ -5,6 +5,7 @@ from .commands.ps import PsCommand from .commands.run import RunCommand from .commands.cancel import CancelCommand +from .commands.scripts import ScriptsCommand def main(): @@ -17,6 +18,7 @@ def main(): PsCommand.register_subcommand(commands_parser) RunCommand.register_subcommand(commands_parser) CancelCommand.register_subcommand(commands_parser) + ScriptsCommand.register_subcommand(commands_parser) # Let's go args = parser.parse_args() diff --git a/hfjobs/commands/scripts.py b/hfjobs/commands/scripts.py new file mode 100644 index 0000000..5d4da14 --- /dev/null +++ b/hfjobs/commands/scripts.py @@ -0,0 +1,431 @@ +"""UV script sharing commands for hfjobs.""" + +import os +import re +import sys +from pathlib import Path +from typing import Optional, List, Dict, Any + +from huggingface_hub import HfApi, create_repo +from huggingface_hub.utils import RepositoryNotFoundError + +from . 
import BaseCommand + + +class ScriptsCommand(BaseCommand): + """Manage UV scripts on Hugging Face Hub.""" + + @staticmethod + def register_subcommand(parser): + """Register UV script subcommands.""" + scripts_parser = parser.add_parser( + "scripts", + help="Share and manage UV scripts on Hugging Face Hub", + description="Commands for sharing UV scripts as Hugging Face datasets" + ) + + subparsers = scripts_parser.add_subparsers( + dest="scripts_command", + help="Scripts commands", + required=True + ) + + # Init command + init_parser = subparsers.add_parser( + "init", + help="Initialize a new UV script repository", + description="Create a Hugging Face dataset repository for sharing UV scripts" + ) + init_parser.add_argument( + "repo", + help="Repository name (e.g., 'username/my-script' or just 'my-script')" + ) + init_parser.add_argument( + "script", + nargs="?", + help="UV script to upload (creates template if not provided)" + ) + init_parser.add_argument( + "--private", + action="store_true", + help="Make the repository private" + ) + + # Push command + push_parser = subparsers.add_parser( + "push", + help="Push a UV script to an existing repository", + description="Update or add UV scripts to a repository" + ) + push_parser.add_argument( + "script", + help="UV script to push" + ) + push_parser.add_argument( + "--repo", + help="Repository to push to (uses last initialized repo if not specified)" + ) + + def run(self, args): + """Execute scripts command.""" + if args.scripts_command == "init": + self._init_repo(args) + elif args.scripts_command == "push": + self._push_script(args) + + def _init_repo(self, args): + """Initialize a new UV script repository.""" + api = HfApi() + + # Ensure repo name includes username + repo_id = args.repo + if "/" not in repo_id: + user_info = api.whoami() + username = user_info["name"] + repo_id = f"{username}/{repo_id}" + + # Create repository + print(f"Creating repository: {repo_id}") + try: + create_repo( + repo_id, + 
repo_type="dataset", + private=args.private, + exist_ok=False + ) + except Exception as e: + if "already exists" in str(e): + print(f"Error: Repository {repo_id} already exists") + return + raise + + # Upload script or create template + if args.script: + script_path = Path(args.script) + if not script_path.exists(): + print(f"Error: Script not found: {args.script}") + return + + script_name = script_path.name + print(f"Uploading script: {script_name}") + + # Read script content + with open(script_path, 'r') as f: + script_content = f.read() + + # Upload script + api.upload_file( + path_or_fileobj=script_content.encode(), + path_in_repo=script_name, + repo_id=repo_id, + repo_type="dataset" + ) + else: + # Create a template script + script_name = "script.py" + script_content = self._create_template_script() + + print(f"Creating template script: {script_name}") + api.upload_file( + path_or_fileobj=script_content.encode(), + path_in_repo=script_name, + repo_id=repo_id, + repo_type="dataset" + ) + + # Create README + print("Creating README with usage instructions") + readme_content = self._create_readme(repo_id, script_name, script_content) + + api.upload_file( + path_or_fileobj=readme_content.encode(), + path_in_repo="README.md", + repo_id=repo_id, + repo_type="dataset" + ) + + # Save last repo for push command + self._save_last_repo(repo_id) + + print(f"\n✅ Script published to: https://huggingface.co/datasets/{repo_id}") + print(f"\nRun your script with:") + print(f"hfjobs run ghcr.io/astral-sh/uv:python3.12 \\") + print(f" uv run https://huggingface.co/datasets/{repo_id}/resolve/main/{script_name} \\") + print(f" ") + + def _push_script(self, args): + """Push a script to an existing repository.""" + api = HfApi() + + # Get repository + repo_id = args.repo or self._get_last_repo() + if not repo_id: + print("Error: No repository specified and no previous repository found") + print("Use --repo to specify a repository or run 'hfjobs scripts init' first") + return + + # 
Check script exists + script_path = Path(args.script) + if not script_path.exists(): + print(f"Error: Script not found: {args.script}") + return + + # Check repository exists + try: + api.repo_info(repo_id, repo_type="dataset") + except RepositoryNotFoundError: + print(f"Error: Repository not found: {repo_id}") + return + + script_name = script_path.name + + # Read script content + with open(script_path, 'r') as f: + script_content = f.read() + + print(f"Uploading: {script_name}") + + # Upload script + api.upload_file( + path_or_fileobj=script_content.encode(), + path_in_repo=script_name, + repo_id=repo_id, + repo_type="dataset" + ) + + # Update README + print("Updating README...") + try: + # Download existing README + readme_path = api.hf_hub_download( + repo_id=repo_id, + filename="README.md", + repo_type="dataset" + ) + with open(readme_path, 'r') as f: + readme_content = f.read() + + # Update README if this is a new script + if script_name not in readme_content: + readme_content = self._update_readme_for_new_script( + readme_content, repo_id, script_name, script_content + ) + + api.upload_file( + path_or_fileobj=readme_content.encode(), + path_in_repo="README.md", + repo_id=repo_id, + repo_type="dataset" + ) + except Exception as e: + print(f"Warning: Could not update README: {e}") + + print(f"✅ Script added to repository") + print(f"View at: https://huggingface.co/datasets/{repo_id}") + + def _create_template_script(self) -> str: + """Create a template UV script.""" + return '''# /// script +# requires-python = ">=3.10" +# dependencies = [ +# "datasets", +# "tqdm", +# ] +# /// +"""Template UV script for hfjobs. + +This is a template script. Customize it for your needs! 
+ +Usage: + python script.py [--option value] +""" + +import argparse +from datasets import load_dataset +from tqdm import tqdm + + +def main(): + parser = argparse.ArgumentParser(description="Template UV script") + parser.add_argument("input", help="Input dataset") + parser.add_argument("output", help="Output dataset") + parser.add_argument("--text-column", default="text", help="Text column name") + + args = parser.parse_args() + + print(f"Loading dataset: {args.input}") + dataset = load_dataset(args.input, split="train") + + # Your processing logic here + print(f"Processing {len(dataset)} examples...") + + # Example: simple transformation + def process_example(example): + # Add your transformation logic + return example + + processed = dataset.map(process_example, desc="Processing") + + print(f"Pushing to: {args.output}") + processed.push_to_hub(args.output) + + print("✅ Done!") + + +if __name__ == "__main__": + main() +''' + + def _create_readme(self, repo_id: str, script_name: str, script_content: str) -> str: + """Create README content for the repository.""" + # Extract script info + deps = self._extract_dependencies(script_content) + description = self._extract_description(script_content) + + readme = f"""--- +tags: +- hfjobs-uv-script +- uv +- python +--- + +# {repo_id.split('/')[-1]} + +A UV script for hfjobs. 
+ +## Usage + +```bash +hfjobs run ghcr.io/astral-sh/uv:python3.12 \\ + uv run https://huggingface.co/datasets/{repo_id}/resolve/main/{script_name} \\ + +``` + +## Script Details + +**Script:** `{script_name}` +""" + + if description: + readme += f"\n**Description:** {description}\n" + + if deps: + readme += "\n**Dependencies:**\n" + for dep in deps: + readme += f"- {dep}\n" + + readme += """ +--- +*Created with [hfjobs](https://github.com/huggingface/hfjobs)* +""" + + return readme + + def _update_readme_for_new_script( + self, readme: str, repo_id: str, script_name: str, script_content: str + ) -> str: + """Update README when adding a new script.""" + # Extract script info + deps = self._extract_dependencies(script_content) + description = self._extract_description(script_content) + + # Find where to insert new script info + if "## Scripts" not in readme: + # Add Scripts section before the footer + footer_marker = "---\n*Created with" + if footer_marker in readme: + before_footer = readme.split(footer_marker)[0] + footer = footer_marker + readme.split(footer_marker)[1] + else: + before_footer = readme + footer = "\n---\n*Created with [hfjobs](https://github.com/huggingface/hfjobs)*\n" + + scripts_section = "\n## Scripts\n\n" + else: + # Insert into existing Scripts section + parts = readme.split("## Scripts") + before_scripts = parts[0] + "## Scripts" + + # Find next section or footer + remaining = parts[1] + next_section_match = re.search(r'\n## ', remaining) + footer_match = re.search(r'\n---\n\*Created with', remaining) + + if next_section_match: + scripts_content = remaining[:next_section_match.start()] + after_scripts = remaining[next_section_match.start():] + elif footer_match: + scripts_content = remaining[:footer_match.start()] + after_scripts = remaining[footer_match.start():] + else: + scripts_content = remaining + after_scripts = "" + + before_footer = before_scripts + scripts_content + footer = after_scripts + scripts_section = "" + + # Add new script 
info + script_info = f""" +### {script_name} +""" + if description: + script_info += f"{description}\n\n" + + script_info += f"""```bash +hfjobs run ghcr.io/astral-sh/uv:python3.12 \\ + uv run https://huggingface.co/datasets/{repo_id}/resolve/main/{script_name} +``` +""" + + if deps: + script_info += "\n**Dependencies:** " + ", ".join(deps) + "\n" + + return before_footer + scripts_section + script_info + footer + + def _extract_dependencies(self, script_content: str) -> List[str]: + """Extract dependencies from UV script header.""" + deps = [] + in_deps = False + + for line in script_content.split('\n'): + if 'dependencies = [' in line: + in_deps = True + continue + if in_deps: + if ']' in line: + break + dep_match = re.search(r'"([^"]+)"', line) + if dep_match: + deps.append(dep_match.group(1)) + + return deps + + def _extract_description(self, script_content: str) -> Optional[str]: + """Extract description from script docstring.""" + # Look for docstring + docstring_match = re.search(r'"""(.*?)"""', script_content, re.DOTALL) + if docstring_match: + lines = docstring_match.group(1).strip().split('\n') + if lines: + # Return first non-empty line + for line in lines: + line = line.strip() + if line: + return line + return None + + def _save_last_repo(self, repo_id: str): + """Save the last repository for future push commands.""" + config_dir = Path.home() / ".hfjobs" + config_dir.mkdir(exist_ok=True) + + config_file = config_dir / "last_uv_repo" + config_file.write_text(repo_id) + + def _get_last_repo(self) -> Optional[str]: + """Get the last repository used.""" + config_file = Path.home() / ".hfjobs" / "last_uv_repo" + if config_file.exists(): + return config_file.read_text().strip() + return None \ No newline at end of file From aa51a00d19f7ed7813e2c0dc8840a08e0f29ba5b Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Fri, 4 Jul 2025 10:51:26 +0100 Subject: [PATCH 15/16] refactor: rename scripts command to uv for consistency - Rename ScriptsCommand to 
UvCommand - Update imports in cli.py - Maintains all existing functionality --- hfjobs/cli.py | 4 +- hfjobs/commands/scripts.py | 431 ------------------- hfjobs/commands/uv.py | 840 +++++++++++++++++++++++++++++++++++++ 3 files changed, 842 insertions(+), 433 deletions(-) delete mode 100644 hfjobs/commands/scripts.py create mode 100644 hfjobs/commands/uv.py diff --git a/hfjobs/cli.py b/hfjobs/cli.py index 2622f7e..35df323 100644 --- a/hfjobs/cli.py +++ b/hfjobs/cli.py @@ -5,7 +5,7 @@ from .commands.ps import PsCommand from .commands.run import RunCommand from .commands.cancel import CancelCommand -from .commands.scripts import ScriptsCommand +from .commands.uv import UvCommand def main(): @@ -18,7 +18,7 @@ def main(): PsCommand.register_subcommand(commands_parser) RunCommand.register_subcommand(commands_parser) CancelCommand.register_subcommand(commands_parser) - ScriptsCommand.register_subcommand(commands_parser) + UvCommand.register_subcommand(commands_parser) # Let's go args = parser.parse_args() diff --git a/hfjobs/commands/scripts.py b/hfjobs/commands/scripts.py deleted file mode 100644 index 5d4da14..0000000 --- a/hfjobs/commands/scripts.py +++ /dev/null @@ -1,431 +0,0 @@ -"""UV script sharing commands for hfjobs.""" - -import os -import re -import sys -from pathlib import Path -from typing import Optional, List, Dict, Any - -from huggingface_hub import HfApi, create_repo -from huggingface_hub.utils import RepositoryNotFoundError - -from . 
import BaseCommand - - -class ScriptsCommand(BaseCommand): - """Manage UV scripts on Hugging Face Hub.""" - - @staticmethod - def register_subcommand(parser): - """Register UV script subcommands.""" - scripts_parser = parser.add_parser( - "scripts", - help="Share and manage UV scripts on Hugging Face Hub", - description="Commands for sharing UV scripts as Hugging Face datasets" - ) - - subparsers = scripts_parser.add_subparsers( - dest="scripts_command", - help="Scripts commands", - required=True - ) - - # Init command - init_parser = subparsers.add_parser( - "init", - help="Initialize a new UV script repository", - description="Create a Hugging Face dataset repository for sharing UV scripts" - ) - init_parser.add_argument( - "repo", - help="Repository name (e.g., 'username/my-script' or just 'my-script')" - ) - init_parser.add_argument( - "script", - nargs="?", - help="UV script to upload (creates template if not provided)" - ) - init_parser.add_argument( - "--private", - action="store_true", - help="Make the repository private" - ) - - # Push command - push_parser = subparsers.add_parser( - "push", - help="Push a UV script to an existing repository", - description="Update or add UV scripts to a repository" - ) - push_parser.add_argument( - "script", - help="UV script to push" - ) - push_parser.add_argument( - "--repo", - help="Repository to push to (uses last initialized repo if not specified)" - ) - - def run(self, args): - """Execute scripts command.""" - if args.scripts_command == "init": - self._init_repo(args) - elif args.scripts_command == "push": - self._push_script(args) - - def _init_repo(self, args): - """Initialize a new UV script repository.""" - api = HfApi() - - # Ensure repo name includes username - repo_id = args.repo - if "/" not in repo_id: - user_info = api.whoami() - username = user_info["name"] - repo_id = f"{username}/{repo_id}" - - # Create repository - print(f"Creating repository: {repo_id}") - try: - create_repo( - repo_id, - 
repo_type="dataset", - private=args.private, - exist_ok=False - ) - except Exception as e: - if "already exists" in str(e): - print(f"Error: Repository {repo_id} already exists") - return - raise - - # Upload script or create template - if args.script: - script_path = Path(args.script) - if not script_path.exists(): - print(f"Error: Script not found: {args.script}") - return - - script_name = script_path.name - print(f"Uploading script: {script_name}") - - # Read script content - with open(script_path, 'r') as f: - script_content = f.read() - - # Upload script - api.upload_file( - path_or_fileobj=script_content.encode(), - path_in_repo=script_name, - repo_id=repo_id, - repo_type="dataset" - ) - else: - # Create a template script - script_name = "script.py" - script_content = self._create_template_script() - - print(f"Creating template script: {script_name}") - api.upload_file( - path_or_fileobj=script_content.encode(), - path_in_repo=script_name, - repo_id=repo_id, - repo_type="dataset" - ) - - # Create README - print("Creating README with usage instructions") - readme_content = self._create_readme(repo_id, script_name, script_content) - - api.upload_file( - path_or_fileobj=readme_content.encode(), - path_in_repo="README.md", - repo_id=repo_id, - repo_type="dataset" - ) - - # Save last repo for push command - self._save_last_repo(repo_id) - - print(f"\n✅ Script published to: https://huggingface.co/datasets/{repo_id}") - print(f"\nRun your script with:") - print(f"hfjobs run ghcr.io/astral-sh/uv:python3.12 \\") - print(f" uv run https://huggingface.co/datasets/{repo_id}/resolve/main/{script_name} \\") - print(f" ") - - def _push_script(self, args): - """Push a script to an existing repository.""" - api = HfApi() - - # Get repository - repo_id = args.repo or self._get_last_repo() - if not repo_id: - print("Error: No repository specified and no previous repository found") - print("Use --repo to specify a repository or run 'hfjobs scripts init' first") - return - - # 
Check script exists - script_path = Path(args.script) - if not script_path.exists(): - print(f"Error: Script not found: {args.script}") - return - - # Check repository exists - try: - api.repo_info(repo_id, repo_type="dataset") - except RepositoryNotFoundError: - print(f"Error: Repository not found: {repo_id}") - return - - script_name = script_path.name - - # Read script content - with open(script_path, 'r') as f: - script_content = f.read() - - print(f"Uploading: {script_name}") - - # Upload script - api.upload_file( - path_or_fileobj=script_content.encode(), - path_in_repo=script_name, - repo_id=repo_id, - repo_type="dataset" - ) - - # Update README - print("Updating README...") - try: - # Download existing README - readme_path = api.hf_hub_download( - repo_id=repo_id, - filename="README.md", - repo_type="dataset" - ) - with open(readme_path, 'r') as f: - readme_content = f.read() - - # Update README if this is a new script - if script_name not in readme_content: - readme_content = self._update_readme_for_new_script( - readme_content, repo_id, script_name, script_content - ) - - api.upload_file( - path_or_fileobj=readme_content.encode(), - path_in_repo="README.md", - repo_id=repo_id, - repo_type="dataset" - ) - except Exception as e: - print(f"Warning: Could not update README: {e}") - - print(f"✅ Script added to repository") - print(f"View at: https://huggingface.co/datasets/{repo_id}") - - def _create_template_script(self) -> str: - """Create a template UV script.""" - return '''# /// script -# requires-python = ">=3.10" -# dependencies = [ -# "datasets", -# "tqdm", -# ] -# /// -"""Template UV script for hfjobs. - -This is a template script. Customize it for your needs! 
- -Usage: - python script.py [--option value] -""" - -import argparse -from datasets import load_dataset -from tqdm import tqdm - - -def main(): - parser = argparse.ArgumentParser(description="Template UV script") - parser.add_argument("input", help="Input dataset") - parser.add_argument("output", help="Output dataset") - parser.add_argument("--text-column", default="text", help="Text column name") - - args = parser.parse_args() - - print(f"Loading dataset: {args.input}") - dataset = load_dataset(args.input, split="train") - - # Your processing logic here - print(f"Processing {len(dataset)} examples...") - - # Example: simple transformation - def process_example(example): - # Add your transformation logic - return example - - processed = dataset.map(process_example, desc="Processing") - - print(f"Pushing to: {args.output}") - processed.push_to_hub(args.output) - - print("✅ Done!") - - -if __name__ == "__main__": - main() -''' - - def _create_readme(self, repo_id: str, script_name: str, script_content: str) -> str: - """Create README content for the repository.""" - # Extract script info - deps = self._extract_dependencies(script_content) - description = self._extract_description(script_content) - - readme = f"""--- -tags: -- hfjobs-uv-script -- uv -- python ---- - -# {repo_id.split('/')[-1]} - -A UV script for hfjobs. 
- -## Usage - -```bash -hfjobs run ghcr.io/astral-sh/uv:python3.12 \\ - uv run https://huggingface.co/datasets/{repo_id}/resolve/main/{script_name} \\ - -``` - -## Script Details - -**Script:** `{script_name}` -""" - - if description: - readme += f"\n**Description:** {description}\n" - - if deps: - readme += "\n**Dependencies:**\n" - for dep in deps: - readme += f"- {dep}\n" - - readme += """ ---- -*Created with [hfjobs](https://github.com/huggingface/hfjobs)* -""" - - return readme - - def _update_readme_for_new_script( - self, readme: str, repo_id: str, script_name: str, script_content: str - ) -> str: - """Update README when adding a new script.""" - # Extract script info - deps = self._extract_dependencies(script_content) - description = self._extract_description(script_content) - - # Find where to insert new script info - if "## Scripts" not in readme: - # Add Scripts section before the footer - footer_marker = "---\n*Created with" - if footer_marker in readme: - before_footer = readme.split(footer_marker)[0] - footer = footer_marker + readme.split(footer_marker)[1] - else: - before_footer = readme - footer = "\n---\n*Created with [hfjobs](https://github.com/huggingface/hfjobs)*\n" - - scripts_section = "\n## Scripts\n\n" - else: - # Insert into existing Scripts section - parts = readme.split("## Scripts") - before_scripts = parts[0] + "## Scripts" - - # Find next section or footer - remaining = parts[1] - next_section_match = re.search(r'\n## ', remaining) - footer_match = re.search(r'\n---\n\*Created with', remaining) - - if next_section_match: - scripts_content = remaining[:next_section_match.start()] - after_scripts = remaining[next_section_match.start():] - elif footer_match: - scripts_content = remaining[:footer_match.start()] - after_scripts = remaining[footer_match.start():] - else: - scripts_content = remaining - after_scripts = "" - - before_footer = before_scripts + scripts_content - footer = after_scripts - scripts_section = "" - - # Add new script 
info - script_info = f""" -### {script_name} -""" - if description: - script_info += f"{description}\n\n" - - script_info += f"""```bash -hfjobs run ghcr.io/astral-sh/uv:python3.12 \\ - uv run https://huggingface.co/datasets/{repo_id}/resolve/main/{script_name} -``` -""" - - if deps: - script_info += "\n**Dependencies:** " + ", ".join(deps) + "\n" - - return before_footer + scripts_section + script_info + footer - - def _extract_dependencies(self, script_content: str) -> List[str]: - """Extract dependencies from UV script header.""" - deps = [] - in_deps = False - - for line in script_content.split('\n'): - if 'dependencies = [' in line: - in_deps = True - continue - if in_deps: - if ']' in line: - break - dep_match = re.search(r'"([^"]+)"', line) - if dep_match: - deps.append(dep_match.group(1)) - - return deps - - def _extract_description(self, script_content: str) -> Optional[str]: - """Extract description from script docstring.""" - # Look for docstring - docstring_match = re.search(r'"""(.*?)"""', script_content, re.DOTALL) - if docstring_match: - lines = docstring_match.group(1).strip().split('\n') - if lines: - # Return first non-empty line - for line in lines: - line = line.strip() - if line: - return line - return None - - def _save_last_repo(self, repo_id: str): - """Save the last repository for future push commands.""" - config_dir = Path.home() / ".hfjobs" - config_dir.mkdir(exist_ok=True) - - config_file = config_dir / "last_uv_repo" - config_file.write_text(repo_id) - - def _get_last_repo(self) -> Optional[str]: - """Get the last repository used.""" - config_file = Path.home() / ".hfjobs" / "last_uv_repo" - if config_file.exists(): - return config_file.read_text().strip() - return None \ No newline at end of file diff --git a/hfjobs/commands/uv.py b/hfjobs/commands/uv.py new file mode 100644 index 0000000..78ec023 --- /dev/null +++ b/hfjobs/commands/uv.py @@ -0,0 +1,840 @@ +"""UV commands for hfjobs.""" + +import hashlib +import os +import re +import 
sys +from argparse import Namespace +from datetime import datetime +from pathlib import Path +from typing import Optional, List, Dict, Any + +from huggingface_hub import HfApi, create_repo +from huggingface_hub.utils import RepositoryNotFoundError + +from . import BaseCommand +from .run import RunCommand + + +class UvCommand(BaseCommand): + """Manage UV scripts on Hugging Face Hub.""" + + @staticmethod + def register_subcommand(parser): + """Register UV subcommands.""" + uv_parser = parser.add_parser( + "uv", + help="Share and manage UV scripts on Hugging Face Hub", + description="Commands for sharing UV scripts as Hugging Face datasets" + ) + + subparsers = uv_parser.add_subparsers( + dest="uv_command", + help="UV commands", + required=True + ) + + # Init command + init_parser = subparsers.add_parser( + "init", + help="Initialize a new UV script repository", + description="Create a Hugging Face dataset repository for sharing UV scripts" + ) + init_parser.add_argument( + "repo", + help="Repository name (e.g., 'username/my-script' or just 'my-script')" + ) + init_parser.add_argument( + "script", + nargs="?", + help="UV script to upload (creates template if not provided)" + ) + init_parser.add_argument( + "--private", + action="store_true", + help="Make the repository private" + ) + init_parser.set_defaults(func=UvCommand) + + # Push command + push_parser = subparsers.add_parser( + "push", + help="Push a UV script to an existing repository", + description="Update or add UV scripts to a repository" + ) + push_parser.add_argument( + "script", + help="UV script to push" + ) + push_parser.add_argument( + "--repo", + help="Repository to push to (uses last initialized repo if not specified)" + ) + push_parser.set_defaults(func=UvCommand) + + # Sync command + sync_parser = subparsers.add_parser( + "sync", + help="Sync local scripts to repository", + description="Sync all Python scripts from local directory to HF repository" + ) + sync_parser.add_argument( + "files", + 
nargs="*", + help="Specific files to sync (default: all .py files)" + ) + sync_parser.add_argument( + "--dry-run", + action="store_true", + help="Show what would be synced without uploading" + ) + sync_parser.set_defaults(func=UvCommand) + + # Run command + run_parser = subparsers.add_parser( + "run", + help="Run a UV script on HF infrastructure", + description="Upload and execute a UV script using hfjobs" + ) + run_parser.add_argument("script", help="UV script to run") + run_parser.add_argument("script_args", nargs="*", help="Arguments for the script", default=[]) + run_parser.add_argument("--repo", help="Repository for the script") + run_parser.add_argument("--python", default="3.12", help="Python version") + run_parser.add_argument("--flavor", default="cpu-basic", help="Hardware flavor") + run_parser.add_argument("-e", "--env", action="append", help="Environment variables") + run_parser.add_argument("-s", "--secret", action="append", help="Secret environment variables") + run_parser.add_argument("--timeout", help="Max duration") + run_parser.add_argument("-d", "--detach", action="store_true", help="Run in background") + run_parser.add_argument("--token", help="HF token") + run_parser.set_defaults(func=UvCommand) + + def __init__(self, args): + """Initialize the command with parsed arguments.""" + self.args = args + + def run(self): + """Execute UV command.""" + if self.args.uv_command == "init": + self._init_repo(self.args) + elif self.args.uv_command == "push": + self._push_script(self.args) + elif self.args.uv_command == "sync": + self._sync_scripts(self.args) + elif self.args.uv_command == "run": + self._run_script(self.args) + + def _init_repo(self, args): + """Initialize a new UV script repository.""" + api = HfApi() + + # Ensure repo name includes username + repo_id = args.repo + if "/" not in repo_id: + user_info = api.whoami() + username = user_info["name"] + repo_id = f"{username}/{repo_id}" + + # Create local directory + local_dir = 
Path(repo_id.split('/')[-1]) + if local_dir.exists(): + print(f"Error: Directory '{local_dir}' already exists") + return + + # Create repository + print(f"Creating repository: {repo_id}") + try: + create_repo( + repo_id, + repo_type="dataset", + private=args.private, + exist_ok=False + ) + except Exception as e: + if "already exists" in str(e): + print(f"Error: Repository {repo_id} already exists") + return + raise + + # Create local directory structure + print(f"Creating local directory: {local_dir}") + local_dir.mkdir(parents=True) + config_dir = local_dir / ".hfjobs" + config_dir.mkdir() + + # Save config + config_file = config_dir / "config" + config_file.write_text(f"repo={repo_id}\n") + + # Upload script or create template + if args.script: + script_path = Path(args.script) + if not script_path.exists(): + print(f"Error: Script not found: {args.script}") + return + + script_name = script_path.name + print(f"Uploading script: {script_name}") + + # Read script content + with open(script_path, 'r') as f: + script_content = f.read() + + # Save to local directory + local_script = local_dir / script_name + local_script.write_text(script_content) + + # Upload script + api.upload_file( + path_or_fileobj=script_content.encode(), + path_in_repo=script_name, + repo_id=repo_id, + repo_type="dataset" + ) + else: + # Create a template script + script_name = "script.py" + script_content = self._create_template_script() + + print(f"Creating template script: {script_name}") + + # Save to local directory + local_script = local_dir / script_name + local_script.write_text(script_content) + + api.upload_file( + path_or_fileobj=script_content.encode(), + path_in_repo=script_name, + repo_id=repo_id, + repo_type="dataset" + ) + + # Create README + print("Creating README with usage instructions") + readme_content = self._create_readme(repo_id, script_name, script_content) + + # Save README locally + local_readme = local_dir / "README.md" + local_readme.write_text(readme_content) + + 
api.upload_file( + path_or_fileobj=readme_content.encode(), + path_in_repo="README.md", + repo_id=repo_id, + repo_type="dataset" + ) + + # Save last repo for push command + self._save_last_repo(repo_id) + + print(f"\n✅ Created local directory: {local_dir}") + print(f"✅ Script published to: https://huggingface.co/datasets/{repo_id}") + print(f"\nRun your script with:") + print(f"hfjobs run ghcr.io/astral-sh/uv:python3.12 \\") + print(f" uv run https://huggingface.co/datasets/{repo_id}/resolve/main/{script_name} \\") + print(f" ") + print(f"\nLocal directory: cd {local_dir}") + + def _push_script(self, args): + """Push a script to an existing repository.""" + api = HfApi() + + # Check if we're in a local directory with config + local_dir = None + config_file = Path(".hfjobs/config") + if config_file.exists(): + # We're in a scripts directory + local_dir = Path.cwd() + # Read repo from config + config_content = config_file.read_text() + for line in config_content.splitlines(): + if line.startswith("repo="): + repo_id = line.split("=", 1)[1] + break + else: + # Get repository from args or last repo + repo_id = args.repo or self._get_last_repo() + if not repo_id: + print("Error: No repository specified and no previous repository found") + print("Use --repo to specify a repository or run 'hfjobs uv init' first") + return + + # Check script exists + script_path = Path(args.script) + if not script_path.exists(): + print(f"Error: Script not found: {args.script}") + return + + # Check repository exists + try: + api.repo_info(repo_id, repo_type="dataset") + except RepositoryNotFoundError: + print(f"Error: Repository not found: {repo_id}") + return + + script_name = script_path.name + + # Read script content + with open(script_path, 'r') as f: + script_content = f.read() + + print(f"Uploading: {script_name}") + + # Save locally if in a local directory + if local_dir and script_path.parent != local_dir: + local_script = local_dir / script_name +
local_script.write_text(script_content) + print(f"Saved locally: {local_script}") + + # Upload script + api.upload_file( + path_or_fileobj=script_content.encode(), + path_in_repo=script_name, + repo_id=repo_id, + repo_type="dataset" + ) + + # Update README + print("Updating README...") + try: + # Download existing README + readme_path = api.hf_hub_download( + repo_id=repo_id, + filename="README.md", + repo_type="dataset" + ) + with open(readme_path, 'r') as f: + readme_content = f.read() + + # Update README if this is a new script + if script_name not in readme_content: + readme_content = self._update_readme_for_new_script( + readme_content, repo_id, script_name, script_content + ) + + api.upload_file( + path_or_fileobj=readme_content.encode(), + path_in_repo="README.md", + repo_id=repo_id, + repo_type="dataset" + ) + except Exception as e: + print(f"Warning: Could not update README: {e}") + + print(f"✅ Script added to repository") + print(f"View at: https://huggingface.co/datasets/{repo_id}") + + def _sync_scripts(self, args): + """Sync local scripts to remote repository.""" + api = HfApi() + + # Check if we're in a scripts directory with config + config_file = Path(".hfjobs/config") + if not config_file.exists(): + # Try parent directory + config_file = Path("../.hfjobs/config") + if not config_file.exists(): + print("Error: Not in a UV directory. 
Run from within a directory created by 'hfjobs uv init'") + return + + # Read config + config_content = config_file.read_text() + repo_id = None + for line in config_content.splitlines(): + if line.startswith("repo="): + repo_id = line.split("=", 1)[1] + break + + if not repo_id: + print("Error: Could not find repository in config") + return + + # Get local directory + local_dir = config_file.parent.parent + + # Find files to sync + if args.files: + # Specific files provided + files_to_sync = [] + for file_pattern in args.files: + files_to_sync.extend(local_dir.glob(file_pattern)) + else: + # Default to all Python files + files_to_sync = list(local_dir.glob("*.py")) + + if not files_to_sync: + print("No files to sync") + return + + print(f"Repository: {repo_id}") + print(f"Files to sync:") + for file in files_to_sync: + print(f" - {file.name}") + + if args.dry_run: + print("\n--dry-run specified, not uploading") + return + + # Upload each file + for file_path in files_to_sync: + if file_path.is_file() and not file_path.name.startswith('.'): + print(f"\nUploading: {file_path.name}") + with open(file_path, 'r') as f: + content = f.read() + + api.upload_file( + path_or_fileobj=content.encode(), + path_in_repo=file_path.name, + repo_id=repo_id, + repo_type="dataset" + ) + + # Update README if there are new scripts + print("\nUpdating README...") + try: + # Download current README + readme_path = api.hf_hub_download( + repo_id=repo_id, + filename="README.md", + repo_type="dataset" + ) + with open(readme_path, 'r') as f: + readme_content = f.read() + + # Check if any scripts are missing from README + updated = False + for file_path in files_to_sync: + if file_path.suffix == '.py' and file_path.name not in readme_content: + with open(file_path, 'r') as f: + script_content = f.read() + readme_content = self._update_readme_for_new_script( + readme_content, repo_id, file_path.name, script_content + ) + updated = True + + if updated: + # Save updated README locally + 
local_readme = local_dir / "README.md" + local_readme.write_text(readme_content) + + # Upload updated README + api.upload_file( + path_or_fileobj=readme_content.encode(), + path_in_repo="README.md", + repo_id=repo_id, + repo_type="dataset" + ) + print("✅ README updated") + except Exception as e: + print(f"Warning: Could not update README: {e}") + + print(f"\n✅ Sync complete: https://huggingface.co/datasets/{repo_id}") + + def _run_script(self, args): + """Run a UV script on HF infrastructure.""" + api = HfApi() + + # Check script exists + script_path = Path(args.script) + if not script_path.exists(): + print(f"Error: Script not found: {args.script}") + return + + # Determine repository + repo_id = self._determine_repository(args) + is_ephemeral = args.repo is None and not Path(".hfjobs/config").exists() + + # Create repo if needed + try: + api.repo_info(repo_id, repo_type="dataset") + print(f"Using existing repository: {repo_id}") + except RepositoryNotFoundError: + print(f"Creating repository: {repo_id}") + create_repo(repo_id, repo_type="dataset", exist_ok=True) + + # Upload script + print(f"Uploading {script_path.name}...") + with open(script_path, 'r') as f: + script_content = f.read() + + # For MVP, just use original filename + filename = script_path.name + + api.upload_file( + path_or_fileobj=script_content.encode(), + path_in_repo=filename, + repo_id=repo_id, + repo_type="dataset" + ) + + script_url = f"https://huggingface.co/datasets/{repo_id}/resolve/main/{filename}" + repo_url = f"https://huggingface.co/datasets/{repo_id}" + print(f"✓ Script uploaded to: {script_url}") + print(f"✓ Repository: {repo_url}") + + # Create and upload README + if is_ephemeral: + print(f"✓ Temporary repository created: {repo_id}") + # Create minimal README for ephemeral repo + readme_content = self._create_minimal_readme(repo_id, filename, script_content) + else: + # For persistent repos, check if README exists and update it + try: + # Try to download existing README +
readme_path = api.hf_hub_download( + repo_id=repo_id, + filename="README.md", + repo_type="dataset" + ) + with open(readme_path, 'r') as f: + existing_readme = f.read() + # Update existing README with new script + readme_content = self._update_readme_with_script(repo_id, filename, script_content, existing_readme) + except Exception: + # No existing README, create new one + readme_content = self._create_readme(repo_id, filename, script_content) + + # Upload README + api.upload_file( + path_or_fileobj=readme_content.encode(), + path_in_repo="README.md", + repo_id=repo_id, + repo_type="dataset" + ) + + # Prepare docker image + docker_image = f"ghcr.io/astral-sh/uv:python{args.python}-bookworm-slim" + + # Build command + command = ["uv", "run", script_url] + args.script_args + + # Create RunCommand args + run_args = Namespace( + dockerImage=docker_image, + command=command, + env=args.env, + secret=args.secret, + env_file=None, # Not supported in MVP + secret_env_file=None, # Not supported in MVP + flavor=args.flavor, + timeout=args.timeout, + detach=args.detach, + token=args.token + ) + + print("Starting job on HF infrastructure...") + RunCommand(run_args).run() + + def _create_template_script(self) -> str: + """Create a template UV script.""" + return '''# /// script +# requires-python = ">=3.10" +# dependencies = [ +# "datasets", +# "tqdm", +# ] +# /// +"""Template UV script for hfjobs. + +This is a template script. Customize it for your needs! 
+ +Usage: + python script.py [--option value] +""" + +import argparse +from datasets import load_dataset +from tqdm import tqdm + + +def main(): + parser = argparse.ArgumentParser(description="Template UV script") + parser.add_argument("input", help="Input dataset") + parser.add_argument("output", help="Output dataset") + parser.add_argument("--text-column", default="text", help="Text column name") + + args = parser.parse_args() + + print(f"Loading dataset: {args.input}") + dataset = load_dataset(args.input, split="train") + + # Your processing logic here + print(f"Processing {len(dataset)} examples...") + + # Example: simple transformation + def process_example(example): + # Add your transformation logic + return example + + processed = dataset.map(process_example, desc="Processing") + + print(f"Pushing to: {args.output}") + processed.push_to_hub(args.output) + + print("✅ Done!") + + +if __name__ == "__main__": + main() +''' + + def _create_readme(self, repo_id: str, script_name: str, script_content: str) -> str: + """Create README content for the repository.""" + # Extract script info + description = self._extract_description(script_content) or "UV script" + + readme = f"""--- +tags: +- hfjobs-uv-script +- uv +- python +viewer: false +--- + +# {repo_id.split('/')[-1]} + +A collection of UV scripts for hfjobs. + +## Usage + +Run any script using: +```bash +hfjobs uv run --repo {repo_id.split('/')[-1]} +``` + +## Scripts + + +| Script | Description | Command | +|--------|-------------|---------| +| [{script_name}](./blob/main/{script_name}) | {description} | `hfjobs uv run {script_name} --repo {repo_id.split('/')[-1]}` | + + +## Learn More + +Learn more about UV scripts in the [UV documentation](https://docs.astral.sh/uv/guides/scripts/). 
+ +--- +*Created with [hfjobs](https://github.com/huggingface/hfjobs)* +""" + + return readme + + def _create_minimal_readme(self, repo_id: str, script_name: str, script_content: str) -> str: + """Create minimal README content for ephemeral repositories.""" + # Extract script info + description = self._extract_description(script_content) + timestamp = datetime.now().astimezone().strftime("%Y-%m-%d %H:%M:%S %Z") + + readme = f"""--- +tags: +- hfjobs-uv-script +- ephemeral +- uv +- python +viewer: false +--- + +# Ephemeral UV Script Repository + +This is a temporary repository created by `hfjobs uv run` for one-time script execution. + +**Script:** `{script_name}` +**Created:** {timestamp} +""" + + if description: + readme += f"**Description:** {description}\n" + + readme += f""" +## Direct Execution + +This script was executed using: +```bash +hfjobs uv run {script_name} +``` + +## Script URL + +``` +https://huggingface.co/datasets/{repo_id}/resolve/main/{script_name} +``` + +--- +*Created with [hfjobs](https://github.com/huggingface/hfjobs)* +""" + + return readme + + def _update_readme_with_script(self, repo_id: str, script_name: str, script_content: str, existing_readme: str) -> str: + """Update existing README with a new script entry.""" + description = self._extract_description(script_content) or "UV script" + + # Check if script is already in the table + if f"| [{script_name}]" in existing_readme: + # Script already exists, don't add duplicate + return existing_readme + + # Find the auto-generated section + start_marker = "" + end_marker = "" + + start_idx = existing_readme.find(start_marker) + end_idx = existing_readme.find(end_marker) + + if start_idx == -1 or end_idx == -1: + # Markers not found, fallback to creating new README + return self._create_readme(repo_id, script_name, script_content) + + # Extract the table content + table_start = existing_readme.find("\n", start_idx) + 1 + table_end = existing_readme.rfind("\n", 0, end_idx) + + # Add new row to the table +
new_row = f"| [{script_name}](./blob/main/{script_name}) | {description} | `hfjobs uv run {script_name} --repo {repo_id.split('/')[-1]}` |" + + # Reconstruct README with new script + updated_readme = ( + existing_readme[:table_end] + + "\n" + new_row + + existing_readme[table_end:] + ) + + return updated_readme + + def _update_readme_for_new_script( + self, readme: str, repo_id: str, script_name: str, script_content: str + ) -> str: + """Update README when adding a new script.""" + # Extract script info + description = self._extract_description(script_content) + + # Find where to insert new script info + if "## Scripts" not in readme: + # Add Scripts section before the footer + footer_marker = "---\n*Created with" + if footer_marker in readme: + before_footer = readme.split(footer_marker)[0] + footer = footer_marker + readme.split(footer_marker)[1] + else: + before_footer = readme + footer = "\n---\n*Created with [hfjobs](https://github.com/huggingface/hfjobs)*\n" + + scripts_section = "\n## Scripts\n\n" + else: + # Insert into existing Scripts section + parts = readme.split("## Scripts") + before_scripts = parts[0] + "## Scripts" + + # Find next section or footer + remaining = parts[1] + next_section_match = re.search(r'\n## ', remaining) + footer_match = re.search(r'\n---\n\*Created with', remaining) + + if next_section_match: + scripts_content = remaining[:next_section_match.start()] + after_scripts = remaining[next_section_match.start():] + elif footer_match: + scripts_content = remaining[:footer_match.start()] + after_scripts = remaining[footer_match.start():] + else: + scripts_content = remaining + after_scripts = "" + + before_footer = before_scripts + scripts_content + footer = after_scripts + scripts_section = "" + + # Add new script info + script_info = f""" +### {script_name} +""" + if description: + script_info += f"{description}\n\n" + + script_info += f"""```bash +hfjobs run ghcr.io/astral-sh/uv:python3.12 \\ + uv run 
https://huggingface.co/datasets/{repo_id}/resolve/main/{script_name} +``` +""" + + return before_footer + scripts_section + script_info + footer + + def _extract_dependencies(self, script_content: str) -> List[str]: + """Extract dependencies from UV script header.""" + deps = [] + in_deps = False + + for line in script_content.split('\n'): + if 'dependencies = [' in line: + in_deps = True + continue + if in_deps: + if ']' in line: + break + dep_match = re.search(r'"([^"]+)"', line) + if dep_match: + deps.append(dep_match.group(1)) + + return deps + + def _extract_description(self, script_content: str) -> Optional[str]: + """Extract description from script docstring.""" + # Look for docstring + docstring_match = re.search(r'"""(.*?)"""', script_content, re.DOTALL) + if docstring_match: + lines = docstring_match.group(1).strip().split('\n') + if lines: + # Return first non-empty line + for line in lines: + line = line.strip() + if line: + return line + return None + + def _save_last_repo(self, repo_id: str): + """Save the last repository for future push commands.""" + config_dir = Path.home() / ".hfjobs" + config_dir.mkdir(exist_ok=True) + + config_file = config_dir / "last_uv_repo" + config_file.write_text(repo_id) + + def _get_last_repo(self) -> Optional[str]: + """Get the last repository used.""" + config_file = Path.home() / ".hfjobs" / "last_uv_repo" + if config_file.exists(): + return config_file.read_text().strip() + return None + + def _determine_repository(self, args) -> str: + """Determine which repository to use for the script.""" + api = HfApi() + + # Check local directory first + config_file = Path(".hfjobs/config") + if config_file.exists(): + config_content = config_file.read_text() + for line in config_content.splitlines(): + if line.startswith("repo="): + repo_id = line.split("=", 1)[1] + print(f"Using repository from local config: {repo_id}") + return repo_id + + # Use provided repo + if args.repo: + repo_id = args.repo + if "/" not in repo_id: + 
username = api.whoami()["name"] + repo_id = f"{username}/{repo_id}" + return repo_id + + # Create ephemeral repo + username = api.whoami()["name"] + timestamp = datetime.now().strftime("%Y%m%d-%H%M%S") + + # Simple hash for uniqueness + script_hash = hashlib.md5( + Path(args.script).read_bytes() + ).hexdigest()[:8] + + return f"{username}/hfjobs-uv-run-{timestamp}-{script_hash}" \ No newline at end of file From 4b7135f6b1367bedc4bc4af6007e97ea6ec02e64 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Fri, 4 Jul 2025 11:16:35 +0100 Subject: [PATCH 16/16] chore: remove docs for PoC PR - Reset README.md to upstream version - Remove uv-script-sharing.md documentation - Documentation will be provided in PR description --- README.md | 29 +------- docs/uv-script-sharing.md | 138 -------------------------------------- 2 files changed, 1 insertion(+), 166 deletions(-) delete mode 100644 docs/uv-script-sharing.md diff --git a/README.md b/README.md index df56af3..ae376c8 100644 --- a/README.md +++ b/README.md @@ -14,14 +14,13 @@ pip install hfjobs usage: hfjobs [] positional arguments: - {inspect,logs,ps,run,cancel,scripts} + {inspect,logs,ps,run,cancel} hfjobs command helpers inspect Display detailed information on one or more Jobs logs Fetch the logs of a Job ps List Jobs run Run a Job cancel Cancel a Job - scripts Share and manage UV scripts on Hugging Face Hub options: -h, --help show this help message and exit @@ -96,29 +95,3 @@ Available `--flavor` options: - TPU: `v5e-1x1`, `v5e-2x2`, `v5e-2x4` (updated in 03/25 from Hugging Face [suggested_hardware docs](https://huggingface.co/docs/hub/en/spaces-config-reference)) - -## UV Script Sharing - -Share and run UV scripts easily on the Hugging Face Hub: - -### Share a script - -```bash -# Create a new repository with your UV script -hfjobs scripts init my-awesome-script my-script.py - -# Or create a template to get started -hfjobs scripts init my-new-script -``` - -### Run shared scripts - -Once shared, anyone can run 
your script: - -```bash -hfjobs run ghcr.io/astral-sh/uv:python3.12 \ - uv run https://huggingface.co/datasets/username/my-script/resolve/main/script.py \ - -``` - -See the [UV script sharing guide](docs/uv-script-sharing.md) for more details. diff --git a/docs/uv-script-sharing.md b/docs/uv-script-sharing.md deleted file mode 100644 index c349c9e..0000000 --- a/docs/uv-script-sharing.md +++ /dev/null @@ -1,138 +0,0 @@ -# UV Script Sharing with hfjobs - -This guide explains how to share UV scripts on the Hugging Face Hub using the new `hfjobs scripts` commands. - -## Overview - -The `hfjobs scripts` commands provide a simple way to: -- Share UV scripts as Hugging Face dataset repositories -- Make scripts easily discoverable with standardized tags -- Generate usage instructions automatically -- Run shared scripts with a simple copy-paste command - -## Quick Start - -### 1. Share a UV Script - -To share an existing UV script: - -```bash -hfjobs scripts init my-awesome-script my-script.py -``` - -This will: -- Create a new dataset repository (e.g., `username/my-awesome-script`) -- Upload your UV script -- Generate a README with usage instructions -- Tag the repository with `hfjobs-uv-script` for discovery - -### 2. Create a Template Script - -If you don't have a script yet, omit the script argument to create a template: - -```bash -hfjobs scripts init my-new-script -``` - -This creates a repository with a template UV script that you can customize. - -### 3. Add More Scripts - -To add additional scripts to an existing repository: - -```bash -hfjobs scripts push another-script.py -``` - -The README will be automatically updated with the new script. 
- -## Running Shared Scripts - -Once a script is shared, anyone can run it using the command shown in the repository README: - -```bash -hfjobs run ghcr.io/astral-sh/uv:python3.12 \ - uv run https://huggingface.co/datasets/username/my-script/resolve/main/script.py \ - -``` - -## Example UV Script - -Here's a simple UV script that filters a dataset by text length: - -```python -# /// script -# requires-python = ">=3.10" -# dependencies = [ -# "datasets", -# "pandas", -# ] -# /// -"""Filter dataset by text length.""" - -import argparse -from datasets import load_dataset - -def main(): - parser = argparse.ArgumentParser(description="Filter dataset by text length") - parser.add_argument("input_dataset", help="Input dataset from HF Hub") - parser.add_argument("output_dataset", help="Output dataset name") - parser.add_argument("--min-length", type=int, default=10, help="Minimum text length") - - args = parser.parse_args() - - dataset = load_dataset(args.input_dataset, split="train") - filtered = dataset.filter(lambda x: len(x["text"]) >= args.min_length) - filtered.push_to_hub(args.output_dataset) - print("✅ Done!") - -if __name__ == "__main__": - main() -``` - -## Repository Structure - -A UV script repository has a simple structure: - -``` -username/my-script/ -├── script.py # Your UV script(s) -└── README.md # Auto-generated usage instructions -``` - -## Discovery - -All scripts shared with `hfjobs scripts init` are automatically tagged with: -- `hfjobs-uv-script` -- `uv` -- `python` - -This makes them easy to find on the Hugging Face Hub. - -## Best Practices - -1. **Use descriptive repository names** - Make it clear what your script does -2. **Add docstrings** - The first line becomes the description in the README -3. **Include usage examples** - Add examples in your script's docstring -4. **Specify dependencies clearly** - Use the UV script header format -5. 
**Test locally first** - Ensure your script works before sharing - -## Private Scripts - -To create a private repository for internal use: - -```bash -hfjobs scripts init my-private-script script.py --private -``` - -## Tips - -- The last initialized repository is remembered, so you can use `hfjobs scripts push` without specifying `--repo` -- Script dependencies are automatically extracted and shown in the README -- Multiple scripts in one repository are supported and organized in the README - -## Next Steps - -- Browse existing UV scripts: Search for the `hfjobs-uv-script` tag on [Hugging Face Hub](https://huggingface.co/datasets) -- Share your own scripts to help the community -- Contribute improvements to the [hfjobs repository](https://github.com/huggingface/hfjobs) \ No newline at end of file