diff --git a/doc/source/cluster/vms/getting-started.rst b/doc/source/cluster/vms/getting-started.rst index 7dc5cb3109dd..716555d3bfea 100644 --- a/doc/source/cluster/vms/getting-started.rst +++ b/doc/source/cluster/vms/getting-started.rst @@ -29,43 +29,50 @@ Setup Before we start, you will need to install some Python dependencies as follows: -.. tabs:: +.. tab-set:: - .. tab:: Ray Team Supported + .. tab-item:: Ray Team Supported + :sync: Ray Team Supported - .. tabs:: + .. tab-set:: - .. tab:: AWS + .. tab-item:: AWS + :sync: AWS .. code-block:: shell $ pip install -U "ray[default]" boto3 - .. tab:: GCP + .. tab-item:: GCP + :sync: GCP .. code-block:: shell $ pip install -U "ray[default]" google-api-python-client - .. tab:: Community Supported + .. tab-item:: Community Supported + :sync: Community Supported - .. tabs:: + .. tab-set:: - .. tab:: Azure + .. tab-item:: Azure + :sync: Azure .. code-block:: shell $ pip install -U "ray[default]" azure-cli azure-core - .. tab:: Aliyun + .. tab-item:: Aliyun + :sync: Aliyun .. code-block:: shell $ pip install -U "ray[default]" aliyun-python-sdk-core aliyun-python-sdk-ecs - + Aliyun Cluster Launcher Maintainers (GitHub handles): @zhuangzhuang131419, @chenk008 - .. tab:: vSphere + .. tab-item:: vSphere + :sync: vSphere .. code-block:: shell @@ -76,36 +83,43 @@ Before we start, you will need to install some Python dependencies as follows: Next, if you're not set up to use your cloud provider from the command line, you'll have to configure your credentials: -.. tabs:: +.. tab-set:: - .. tab:: Ray Team Supported + .. tab-item:: Ray Team Supported + :sync: Ray Team Supported - .. tabs:: + .. tab-set:: - .. tab:: AWS + .. tab-item:: AWS + :sync: AWS Configure your credentials in ``~/.aws/credentials`` as described in `the AWS docs `_. - .. tab:: GCP + .. tab-item:: GCP + :sync: GCP Set the ``GOOGLE_APPLICATION_CREDENTIALS`` environment variable as described in `the GCP docs `_. - .. tab:: Community Supported + .. tab-item:: Community Supported + :sync: Community Supported - .. tabs:: + .. tab-set:: - .. tab:: Azure + .. tab-item:: Azure + :sync: Azure Log in using ``az login``, then configure your credentials with ``az account set -s ``. - .. tab:: Aliyun + .. tab-item:: Aliyun + :sync: Aliyun Obtain and set the AccessKey pair of the Aliyun account as described in `the docs `__. Make sure to grant the necessary permissions to the RAM user and set the AccessKey pair in your cluster config file. Refer to the provided `aliyun/example-full.yaml `__ for a sample cluster config. - .. tab:: vSphere + .. tab-item:: vSphere + :sync: vSphere .. code-block:: shell @@ -205,18 +219,21 @@ To start a Ray Cluster, first we need to define the cluster configuration. The c A minimal sample cluster configuration file looks as follows: -.. tabs:: +.. tab-set:: - .. tab:: Ray Team Supported + .. tab-item:: Ray Team Supported + :sync: Ray Team Supported - .. tabs:: + .. tab-set:: - .. tab:: AWS + .. tab-item:: AWS + :sync: AWS .. literalinclude:: ../../../../python/ray/autoscaler/aws/example-minimal.yaml :language: yaml - .. tab:: GCP + .. tab-item:: GCP + :sync: GCP .. code-block:: yaml @@ -228,11 +245,13 @@ A minimal sample cluster configuration file looks as follows: type: gcp region: us-west1 - .. tab:: Community Supported + .. tab-item:: Community Supported + :sync: Community Supported - .. tabs:: + .. tab-set:: - .. tab:: Azure + .. tab-item:: Azure + :sync: Azure .. 
code-block:: yaml @@ -254,13 +273,15 @@ A minimal sample cluster configuration file looks as follows: # changes to this should match what is specified in file_mounts ssh_public_key: ~/.ssh/id_rsa.pub - .. tab:: Aliyun + .. tab-item:: Aliyun + :sync: Aliyun - Please refer to `example-full.yaml `__. + Please refer to `example-full.yaml `__. Make sure your account balance is not less than 100 RMB, otherwise you will receive the error `InvalidAccountStatus.NotEnoughBalance`. - .. tab:: vSphere + .. tab-item:: vSphere + :sync: vSphere .. literalinclude:: ../../../../python/ray/autoscaler/vsphere/example-minimal.yaml :language: yaml diff --git a/doc/source/data/batch_inference.rst b/doc/source/data/batch_inference.rst index 32deafe2930a..92c616f57076 100644 --- a/doc/source/data/batch_inference.rst +++ b/doc/source/data/batch_inference.rst @@ -34,9 +34,10 @@ Using Ray Data for offline inference involves four basic steps: For more in-depth examples for your use case, see :ref:`the batch inference examples`. For how to configure batch inference, see :ref:`the configuration guide`. -.. tabs:: +.. tab-set:: - .. group-tab:: HuggingFace + .. tab-item:: HuggingFace + :sync: HuggingFace .. testcode:: @@ -84,7 +85,8 @@ For how to configure batch inference, see :ref:`the configuration guide`. -.. tabs:: +.. tab-set:: - .. group-tab:: HuggingFace + .. tab-item:: HuggingFace + :sync: HuggingFace .. testcode:: @@ -246,7 +250,8 @@ The remaining is the same as the :ref:`Quickstart `. {'data': 'Complete this', 'output': 'Complete this poll. Which one do you think holds the most promise for you?\n\nThank you'} - .. group-tab:: PyTorch + .. tab-item:: PyTorch + :sync: PyTorch .. testcode:: @@ -292,7 +297,8 @@ The remaining is the same as the :ref:`Quickstart `. {'output': array([0.5590901], dtype=float32)} - .. group-tab:: TensorFlow + .. tab-item:: TensorFlow + :sync: TensorFlow .. testcode:: @@ -471,7 +477,7 @@ In this case, use :meth:`XGBoostTrainer.get_model() `_. .. testcode:: - + from typing import Dict import pandas as pd import numpy as np @@ -485,18 +491,18 @@ The rest of the logic looks the same as in the `Quickstart <#quickstart>`_. class XGBoostPredictor: def __init__(self, checkpoint: Checkpoint): self.model = XGBoostTrainer.get_model(checkpoint) - + def __call__(self, data: pd.DataFrame) -> Dict[str, np.ndarray]: dmatrix = xgboost.DMatrix(data) return {"predictions": self.model.predict(dmatrix)} - - + + # Use 2 parallel actors for inference. Each actor predicts on a # different partition of data. scale = ray.data.ActorPoolStrategy(size=2) # Map the Predictor over the Dataset to get predictions. predictions = test_dataset.map_batches( - XGBoostPredictor, + XGBoostPredictor, compute=scale, batch_format="pandas", # Pass in the Checkpoint to the XGBoostPredictor constructor. diff --git a/doc/source/ray-observability/user-guides/cli-sdk.rst b/doc/source/ray-observability/user-guides/cli-sdk.rst index 11f1867acaeb..61f82e54b9b7 100644 --- a/doc/source/ray-observability/user-guides/cli-sdk.rst +++ b/doc/source/ray-observability/user-guides/cli-sdk.rst @@ -6,7 +6,7 @@ Monitoring with the CLI or SDK Monitoring and debugging capabilities in Ray are available through a CLI or SDK. -CLI command ``ray status`` +CLI command ``ray status`` ---------------------------- You can monitor node status and resource usage by running the CLI command, ``ray status``, on the head node. 
It displays @@ -102,9 +102,10 @@ This example uses the following script that runs two Tasks and creates two Actor See the summarized states of tasks. If it doesn't return the output immediately, retry the command. -.. tabs:: +.. tab-set:: - .. group-tab:: CLI (Recommended) + .. tab-item:: CLI (Recommended) + :sync: CLI (Recommended) .. code-block:: bash @@ -126,7 +127,8 @@ See the summarized states of tasks. If it doesn't return the output immediately, 0 task_running_300_seconds RUNNING: 2 NORMAL_TASK 1 Actor.__init__ FINISHED: 2 ACTOR_CREATION_TASK - .. group-tab:: Python SDK (Internal Developer API) + .. tab-item:: Python SDK (Internal Developer API) + :sync: Python SDK (Internal Developer API) .. testcode:: @@ -139,9 +141,10 @@ See the summarized states of tasks. If it doesn't return the output immediately, List all Actors. -.. tabs:: +.. tab-set:: - .. group-tab:: CLI (Recommended) + .. tab-item:: CLI (Recommended) + :sync: CLI (Recommended) .. code-block:: bash @@ -160,7 +163,8 @@ List all Actors. 0 31405554844820381c2f0f8501000000 Actor 96956 ALIVE 1 f36758a9f8871a9ca993b1d201000000 Actor 96955 ALIVE - .. group-tab:: Python SDK (Internal Developer API) + .. tab-item:: Python SDK (Internal Developer API) + :sync: Python SDK (Internal Developer API) .. testcode:: @@ -174,9 +178,10 @@ List all Actors. Get the state of a single Task using the get API. -.. tabs:: +.. tab-set:: - .. group-tab:: CLI (Recommended) + .. tab-item:: CLI (Recommended) + :sync: CLI (Recommended) .. code-block:: bash @@ -196,7 +201,8 @@ Get the state of a single Task using the get API. serialized_runtime_env: '{}' state: ALIVE - .. group-tab:: Python SDK (Internal Developer API) + .. tab-item:: Python SDK (Internal Developer API) + :sync: Python SDK (Internal Developer API) .. testcode:: :skipif: True @@ -207,9 +213,10 @@ Get the state of a single Task using the get API. Access logs through the ``ray logs`` API. -.. tabs:: +.. tab-set:: - .. group-tab:: CLI (Recommended) + .. tab-item:: CLI (Recommended) + :sync: CLI (Recommended) .. code-block:: bash @@ -224,7 +231,8 @@ Access logs through the ``ray logs`` API. :actor_name:Actor Actor created - .. group-tab:: Python SDK (Internal Developer API) + .. tab-item:: Python SDK (Internal Developer API) + :sync: Python SDK (Internal Developer API) .. testcode:: :skipif: True @@ -260,15 +268,17 @@ you can use ``list`` or ``get`` APIs to get more details for an individual abnor **Summarize all actors** -.. tabs:: +.. tab-set:: - .. group-tab:: CLI (Recommended) + .. tab-item:: CLI (Recommended) + :sync: CLI (Recommended) .. code-block:: bash ray summary actors - .. group-tab:: Python SDK (Internal Developer API) + .. tab-item:: Python SDK (Internal Developer API) + :sync: Python SDK (Internal Developer API) .. testcode:: @@ -281,15 +291,17 @@ you can use ``list`` or ``get`` APIs to get more details for an individual abnor **Summarize all tasks** -.. tabs:: +.. tab-set:: - .. group-tab:: CLI (Recommended) + .. tab-item:: CLI (Recommended) + :sync: CLI (Recommended) .. code-block:: bash ray summary tasks - .. group-tab:: Python SDK (Internal Developer API) + .. tab-item:: Python SDK (Internal Developer API) + :sync: Python SDK (Internal Developer API) .. testcode:: @@ -308,15 +320,17 @@ you can use ``list`` or ``get`` APIs to get more details for an individual abnor To get callsite info, set env variable `RAY_record_ref_creation_sites=1` when starting the Ray Cluster RAY_record_ref_creation_sites=1 ray start --head -.. tabs:: +.. tab-set:: - .. 
group-tab:: CLI (Recommended) + .. tab-item:: CLI (Recommended) + :sync: CLI (Recommended) .. code-block:: bash ray summary objects - .. group-tab:: Python SDK (Internal Developer API) + .. tab-item:: Python SDK (Internal Developer API) + :sync: Python SDK (Internal Developer API) .. testcode:: @@ -346,15 +360,17 @@ Get a list of resources. Possible resources include: **List all nodes** -.. tabs:: +.. tab-set:: - .. group-tab:: CLI (Recommended) + .. tab-item:: CLI (Recommended) + :sync: CLI (Recommended) .. code-block:: bash ray list nodes - .. group-tab:: Python SDK (Internal Developer API) + .. tab-item:: Python SDK (Internal Developer API) + :sync: Python SDK (Internal Developer API) .. testcode:: @@ -363,15 +379,17 @@ Get a list of resources. Possible resources include: **List all placement groups** -.. tabs:: +.. tab-set:: - .. group-tab:: CLI (Recommended) + .. tab-item:: CLI (Recommended) + :sync: CLI (Recommended) .. code-block:: bash ray list placement-groups - .. group-tab:: Python SDK (Internal Developer API) + .. tab-item:: Python SDK (Internal Developer API) + :sync: Python SDK (Internal Developer API) .. testcode:: @@ -383,15 +401,17 @@ Get a list of resources. Possible resources include: .. tip:: You can list resources with one or multiple filters: using `--filter` or `-f` -.. tabs:: +.. tab-set:: - .. group-tab:: CLI (Recommended) + .. tab-item:: CLI (Recommended) + :sync: CLI (Recommended) .. code-block:: bash ray list objects -f pid= -f reference_type=LOCAL_REFERENCE - .. group-tab:: Python SDK (Internal Developer API) + .. tab-item:: Python SDK (Internal Developer API) + :sync: Python SDK (Internal Developer API) .. testcode:: @@ -400,15 +420,17 @@ Get a list of resources. Possible resources include: **List alive actors** -.. tabs:: +.. tab-set:: - .. group-tab:: CLI (Recommended) + .. tab-item:: CLI (Recommended) + :sync: CLI (Recommended) .. code-block:: bash ray list actors -f state=ALIVE - .. group-tab:: Python SDK (Internal Developer API) + .. tab-item:: Python SDK (Internal Developer API) + :sync: Python SDK (Internal Developer API) .. testcode:: @@ -417,15 +439,17 @@ Get a list of resources. Possible resources include: **List running tasks** -.. tabs:: +.. tab-set:: - .. group-tab:: CLI (Recommended) + .. tab-item:: CLI (Recommended) + :sync: CLI (Recommended) .. code-block:: bash ray list tasks -f state=RUNNING - .. group-tab:: Python SDK (Internal Developer API) + .. tab-item:: Python SDK (Internal Developer API) + :sync: Python SDK (Internal Developer API) .. testcode:: @@ -434,15 +458,17 @@ Get a list of resources. Possible resources include: **List non-running tasks** -.. tabs:: +.. tab-set:: - .. group-tab:: CLI (Recommended) + .. tab-item:: CLI (Recommended) + :sync: CLI (Recommended) .. code-block:: bash ray list tasks -f state!=RUNNING - .. group-tab:: Python SDK (Internal Developer API) + .. tab-item:: Python SDK (Internal Developer API) + :sync: Python SDK (Internal Developer API) .. testcode:: @@ -451,15 +477,17 @@ Get a list of resources. Possible resources include: **List running tasks that have a name func** -.. tabs:: +.. tab-set:: - .. group-tab:: CLI (Recommended) + .. tab-item:: CLI (Recommended) + :sync: CLI (Recommended) .. code-block:: bash ray list tasks -f state=RUNNING -f name="task_running_300_seconds()" - .. group-tab:: Python SDK (Internal Developer API) + .. tab-item:: Python SDK (Internal Developer API) + :sync: Python SDK (Internal Developer API) .. testcode:: @@ -470,15 +498,17 @@ Get a list of resources. 
Possible resources include: .. tip:: When ``--detail`` is specified, the API can query more data sources to obtain state information in details. -.. tabs:: +.. tab-set:: - .. group-tab:: CLI (Recommended) + .. tab-item:: CLI (Recommended) + :sync: CLI (Recommended) .. code-block:: bash ray list tasks --detail - .. group-tab:: Python SDK (Internal Developer API) + .. tab-item:: Python SDK (Internal Developer API) + :sync: Python SDK (Internal Developer API) .. testcode:: @@ -493,15 +523,17 @@ Get the states of a particular entity (task, actor, etc.) **Get a task's states** -.. tabs:: +.. tab-set:: - .. group-tab:: CLI (Recommended) + .. tab-item:: CLI (Recommended) + :sync: CLI (Recommended) .. code-block:: bash ray get tasks - .. group-tab:: Python SDK (Internal Developer API) + .. tab-item:: Python SDK (Internal Developer API) + :sync: Python SDK (Internal Developer API) .. testcode:: :skipif: True @@ -511,15 +543,17 @@ Get the states of a particular entity (task, actor, etc.) **Get a node's states** -.. tabs:: +.. tab-set:: - .. group-tab:: CLI (Recommended) + .. tab-item:: CLI (Recommended) + :sync: CLI (Recommended) .. code-block:: bash ray get nodes - .. group-tab:: Python SDK (Internal Developer API) + .. tab-item:: Python SDK (Internal Developer API) + :sync: Python SDK (Internal Developer API) .. testcode:: :skipif: True @@ -540,15 +574,17 @@ By default, the API prints logs from a head node. **Get all retrievable log file names from a head node in a cluster** -.. tabs:: +.. tab-set:: - .. group-tab:: CLI (Recommended) + .. tab-item:: CLI (Recommended) + :sync: CLI (Recommended) .. code-block:: bash ray logs cluster - .. group-tab:: Python SDK (Internal Developer API) + .. tab-item:: Python SDK (Internal Developer API) + :sync: Python SDK (Internal Developer API) .. testcode:: :skipif: True @@ -562,9 +598,10 @@ By default, the API prints logs from a head node. **Get a particular log file from a node** -.. tabs:: +.. tab-set:: - .. group-tab:: CLI (Recommended) + .. tab-item:: CLI (Recommended) + :sync: CLI (Recommended) .. code-block:: bash @@ -573,7 +610,8 @@ By default, the API prints logs from a head node. # `ray logs cluster` is alias to `ray logs` when querying with globs. ray logs gcs_server.out --node-id - .. group-tab:: Python SDK (Internal Developer API) + .. tab-item:: Python SDK (Internal Developer API) + :sync: Python SDK (Internal Developer API) .. testcode:: :skipif: True @@ -586,9 +624,10 @@ By default, the API prints logs from a head node. **Stream a log file from a node** -.. tabs:: +.. tab-set:: - .. group-tab:: CLI (Recommended) + .. tab-item:: CLI (Recommended) + :sync: CLI (Recommended) .. code-block:: bash @@ -598,7 +637,8 @@ By default, the API prints logs from a head node. ray logs cluster raylet.out --node-ip --follow - .. group-tab:: Python SDK (Internal Developer API) + .. tab-item:: Python SDK (Internal Developer API) + :sync: Python SDK (Internal Developer API) .. testcode:: :skipif: True @@ -612,15 +652,17 @@ By default, the API prints logs from a head node. **Stream log from an actor with actor id** -.. tabs:: +.. tab-set:: - .. group-tab:: CLI (Recommended) + .. tab-item:: CLI (Recommended) + :sync: CLI (Recommended) .. code-block:: bash ray logs actor --id= --follow - .. group-tab:: Python SDK (Internal Developer API) + .. tab-item:: Python SDK (Internal Developer API) + :sync: Python SDK (Internal Developer API) .. testcode:: :skipif: True @@ -634,15 +676,17 @@ By default, the API prints logs from a head node. **Stream log from a pid** -.. 
tabs:: +.. tab-set:: - .. group-tab:: CLI (Recommended) + .. tab-item:: CLI (Recommended) + :sync: CLI (Recommended) .. code-block:: bash ray logs worker --pid= --follow - .. group-tab:: Python SDK (Internal Developer API) + .. tab-item:: Python SDK (Internal Developer API) + :sync: Python SDK (Internal Developer API) .. testcode:: :skipif: True diff --git a/doc/source/ray-overview/installation.rst b/doc/source/ray-overview/installation.rst index 70e0a56c9e95..d612dfa4a448 100644 --- a/doc/source/ray-overview/installation.rst +++ b/doc/source/ray-overview/installation.rst @@ -139,7 +139,7 @@ You can install the nightly Ray wheels via the following links. These daily rele .. note:: .. If you change the list of wheel links below, remember to update `get_wheel_filename()` in `https://github.com/ray-project/ray/blob/master/python/ray/_private/utils.py`. - + Python 3.11 support is experimental. .. _`Linux Python 3.11 (x86_64) (EXPERIMENTAL)`: https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp311-cp311-manylinux2014_x86_64.whl @@ -518,15 +518,17 @@ required for Ray and its libraries. We publish the dependencies that are installed in our ``ray`` and ``ray-ml`` Docker images for Python 3.9. -.. tabs:: +.. tab-set:: - .. group-tab:: ray (Python 3.9) + .. tab-item:: ray (Python 3.9) + :sync: ray (Python 3.9) Ray version: nightly (`7b8ec8a `_) .. literalinclude:: ./pip_freeze_ray-py39-cpu.txt - .. group-tab:: ray-ml (Python 3.9) + .. tab-item:: ray-ml (Python 3.9) + :sync: ray-ml (Python 3.9) Ray version: nightly (`7b8ec8a `_) diff --git a/doc/source/train/deepspeed.rst b/doc/source/train/deepspeed.rst index 0cf85609c25e..6e91e7107714 100644 --- a/doc/source/train/deepspeed.rst +++ b/doc/source/train/deepspeed.rst @@ -37,7 +37,7 @@ You only need to run your existing training code with a TorchTrainer. You can ex # Start training ... - + from ray.train.torch import TorchTrainer from ray.train import ScalingConfig @@ -49,11 +49,11 @@ You only need to run your existing training code with a TorchTrainer. You can ex trainer.fit() -Below is a simple example of ZeRO-3 training with DeepSpeed only. +Below is a simple example of ZeRO-3 training with DeepSpeed only. -.. tabs:: +.. tab-set:: - .. group-tab:: Example with Ray Data + .. tab-item:: Example with Ray Data .. dropdown:: Show Code @@ -62,7 +62,7 @@ Below is a simple example of ZeRO-3 training with DeepSpeed only. :start-after: __deepspeed_torch_basic_example_start__ :end-before: __deepspeed_torch_basic_example_end__ - .. group-tab:: Example with PyTorch DataLoader + .. tab-item:: Example with PyTorch DataLoader .. dropdown:: Show Code @@ -73,9 +73,9 @@ Below is a simple example of ZeRO-3 training with DeepSpeed only. .. tip:: - To run DeepSpeed with pure PyTorch, you **don't need to** provide any additional Ray Train utilities - like :meth:`~ray.train.torch.prepare_model` or :meth:`~ray.train.torch.prepare_data_loader` in your training funciton. Instead, - keep using `deepspeed.initialize() `_ as usual to prepare everything + To run DeepSpeed with pure PyTorch, you **don't need to** provide any additional Ray Train utilities + like :meth:`~ray.train.torch.prepare_model` or :meth:`~ray.train.torch.prepare_data_loader` in your training funciton. Instead, + keep using `deepspeed.initialize() `_ as usual to prepare everything for distributed training. 
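For illustration, the following is a minimal sketch of what the tip above describes: calling ``deepspeed.initialize()`` directly inside a Ray Train training function, without ``prepare_model`` or ``prepare_data_loader``. The toy model, random dataset, and ZeRO-3 config values here are placeholders and are not taken from the linked examples.

.. code-block:: python

    import deepspeed
    import torch
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    # Placeholder DeepSpeed config; substitute your own ZeRO-3 settings.
    DEEPSPEED_CONFIG = {
        "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
        "zero_optimization": {"stage": 3},
        "train_micro_batch_size_per_gpu": 8,
        "gradient_accumulation_steps": 1,
    }

    def train_func(config):
        # Build the model and dataset as usual; no Ray Train wrappers needed.
        model = torch.nn.Linear(10, 1)  # toy model
        dataset = torch.utils.data.TensorDataset(
            torch.randn(1024, 10), torch.randn(1024, 1)
        )

        # DeepSpeed prepares the distributed model, optimizer, and dataloader.
        engine, optimizer, dataloader, _ = deepspeed.initialize(
            model=model,
            model_parameters=model.parameters(),
            training_data=dataset,
            config=DEEPSPEED_CONFIG,
        )

        for X, y in dataloader:
            X, y = X.to(engine.device), y.to(engine.device)
            loss = torch.nn.functional.mse_loss(engine(X), y)
            engine.backward(loss)
            engine.step()

    trainer = TorchTrainer(
        train_func,
        scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
    )
    trainer.fit()

The ``TorchTrainer`` sets up the Torch process group, so ``deepspeed.initialize()`` picks up the existing distributed environment on each worker.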
Run DeepSpeed with other frameworks diff --git a/doc/source/train/getting-started-pytorch-lightning.rst b/doc/source/train/getting-started-pytorch-lightning.rst index 68062ecad67c..61823ec0db1c 100644 --- a/doc/source/train/getting-started-pytorch-lightning.rst +++ b/doc/source/train/getting-started-pytorch-lightning.rst @@ -36,9 +36,9 @@ For reference, the final code is as follows: Compare a PyTorch Lightning training script with and without Ray Train. -.. tabs:: +.. tab-set:: - .. group-tab:: PyTorch Lightning + .. tab-item:: PyTorch Lightning .. This snippet isn't tested because it doesn't use any Ray code. @@ -59,17 +59,17 @@ Compare a PyTorch Lightning training script with and without Ray Train. self.model = resnet18(num_classes=10) self.model.conv1 = torch.nn.Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False) self.criterion = torch.nn.CrossEntropyLoss() - + def forward(self, x): return self.model(x) - + def training_step(self, batch, batch_idx): x, y = batch outputs = self.forward(x) loss = self.criterion(outputs, y) self.log("loss", loss, on_step=True, prog_bar=True) return loss - + def configure_optimizers(self): return torch.optim.Adam(self.model.parameters(), lr=0.001) @@ -83,9 +83,9 @@ Compare a PyTorch Lightning training script with and without Ray Train. trainer = pl.Trainer(max_epochs=10) trainer.fit(model, train_dataloaders=train_dataloader) - - .. group-tab:: PyTorch Lightning + Ray Train + + .. tab-item:: PyTorch Lightning + Ray Train .. code-block:: python :emphasize-lines: 8-10, 34, 43, 48-50, 52, 53, 55-60 @@ -108,20 +108,20 @@ Compare a PyTorch Lightning training script with and without Ray Train. self.model = resnet18(num_classes=10) self.model.conv1 = torch.nn.Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False) self.criterion = torch.nn.CrossEntropyLoss() - + def forward(self, x): return self.model(x) - + def training_step(self, batch, batch_idx): x, y = batch outputs = self.forward(x) loss = self.criterion(outputs, y) self.log("loss", loss, on_step=True, prog_bar=True) return loss - + def configure_optimizers(self): return torch.optim.Adam(self.model.parameters(), lr=0.001) - + def train_func(config): @@ -149,13 +149,13 @@ Compare a PyTorch Lightning training script with and without Ray Train. # [3] Launch distributed training job. trainer = TorchTrainer(train_func, scaling_config=scaling_config) - result = trainer.fit() + result = trainer.fit() Set up a training function -------------------------- -First, update your training code to support distributed training. +First, update your training code to support distributed training. Begin by wrapping your code in a :ref:`training function `: .. testcode:: @@ -167,7 +167,7 @@ Begin by wrapping your code in a :ref:`training function ` and :ref:`hyperparameter optimization `. +Reporting metrics and checkpoints to Ray Train enables you to support :ref:`fault-tolerant training ` and :ref:`hyperparameter optimization `. Note that the :class:`ray.train.lightning.RayTrainReportCallback` class only provides a simple implementation, and can be :ref:`further customized `. Prepare your Lightning Trainer ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Finally, pass your Lightning Trainer into -:meth:`~ray.train.lightning.prepare_trainer` to validate -your configurations. +:meth:`~ray.train.lightning.prepare_trainer` to validate +your configurations. .. code-block:: diff @@ -340,7 +340,7 @@ For more details, see :ref:`train_scaling_config`. 
Launch a training job --------------------- -Tying this all together, you can now launch a distributed training job +Tying this all together, you can now launch a distributed training job with a :class:`~ray.train.torch.TorchTrainer`. .. testcode:: @@ -376,7 +376,7 @@ information about the training run, including the metrics and checkpoints report .. TODO: Add results guide Next steps ----------- +---------- After you have converted your PyTorch Lightning training script to use Ray Train: @@ -387,9 +387,9 @@ After you have converted your PyTorch Lightning training script to use Ray Train Version Compatibility --------------------- -Ray Train is tested with `pytorch_lightning` versions `1.6.5` and `2.0.4`. For full compatibility, use ``pytorch_lightning>=1.6.5`` . -Earlier versions aren't prohibited but may result in unexpected issues. If you run into any compatibility issues, consider upgrading your PyTorch Lightning version or -`file an issue `_. +Ray Train is tested with `pytorch_lightning` versions `1.6.5` and `2.0.4`. For full compatibility, use ``pytorch_lightning>=1.6.5`` . +Earlier versions aren't prohibited but may result in unexpected issues. If you run into any compatibility issues, consider upgrading your PyTorch Lightning version or +`file an issue `_. .. note:: @@ -400,25 +400,25 @@ Earlier versions aren't prohibited but may result in unexpected issues. If you r LightningTrainer Migration Guide -------------------------------- -Ray 2.4 introduced the `LightningTrainer`, and exposed a -`LightningConfigBuilder` to define configurations for `pl.LightningModule` -and `pl.Trainer`. +Ray 2.4 introduced the `LightningTrainer`, and exposed a +`LightningConfigBuilder` to define configurations for `pl.LightningModule` +and `pl.Trainer`. -It then instantiates the model and trainer objects and runs a pre-defined +It then instantiates the model and trainer objects and runs a pre-defined training function in a black box. -This version of the LightningTrainer API was constraining and limited +This version of the LightningTrainer API was constraining and limited your ability to manage the training functionality. -Ray 2.7 introduced the newly unified :class:`~ray.train.torch.TorchTrainer` API, which offers +Ray 2.7 introduced the newly unified :class:`~ray.train.torch.TorchTrainer` API, which offers enhanced transparency, flexibility, and simplicity. This API is more aligned -with standard PyTorch Lightning scripts, ensuring users have better +with standard PyTorch Lightning scripts, ensuring users have better control over their native Lightning code. -.. tabs:: +.. tab-set:: - .. group-tab:: (Deprecating) LightningTrainer + .. tab-item:: (Deprecating) LightningTrainer .. This snippet isn't tested because it raises a hard deprecation warning. @@ -460,24 +460,24 @@ control over their native Lightning code. ) ray_trainer.fit() - - .. group-tab:: (New API) TorchTrainer + + .. tab-item:: (New API) TorchTrainer .. This snippet isn't tested because it runs with 4 GPUs, and CI is only run with 1. .. testcode:: :skipif: True - + import lightning.pytorch as pl from ray.air import CheckpointConfig, RunConfig from ray.train.torch import TorchTrainer from ray.train.lightning import ( - RayDDPStrategy, + RayDDPStrategy, RayLightningEnvironment, RayTrainReportCallback, prepare_trainer - ) + ) def train_func(config): # [1] Create a Lightning model @@ -485,7 +485,7 @@ control over their native Lightning code. 
# [2] Report Checkpoint with callback ckpt_report_callback = RayTrainReportCallback() - + # [3] Create a Lighting Trainer datamodule = MNISTDataModule(batch_size=32) diff --git a/doc/source/train/getting-started-pytorch.rst b/doc/source/train/getting-started-pytorch.rst index 5cb6b70a2b3e..75f0be4cefab 100644 --- a/doc/source/train/getting-started-pytorch.rst +++ b/doc/source/train/getting-started-pytorch.rst @@ -26,7 +26,7 @@ For reference, the final code is as follows: def train_func(config): # Your PyTorch training code here. - + scaling_config = ScalingConfig(num_workers=2, use_gpu=True) trainer = TorchTrainer(train_func, scaling_config=scaling_config) result = trainer.fit() @@ -37,9 +37,9 @@ For reference, the final code is as follows: Compare a PyTorch training script with and without Ray Train. -.. tabs:: +.. tab-set:: - .. group-tab:: PyTorch + .. tab-item:: PyTorch .. This snippet isn't tested because it doesn't use any Ray code. @@ -74,14 +74,14 @@ Compare a PyTorch training script with and without Ray Train. optimizer.zero_grad() loss.backward() optimizer.step() - - checkpoint_dir = tempfile.gettempdir() + + checkpoint_dir = tempfile.gettempdir() checkpoint_path = checkpoint_dir + "/model.checkpoint" torch.save(model.state_dict(), checkpoint_path) - - .. group-tab:: PyTorch + Ray Train + + .. tab-item:: PyTorch + Ray Train .. code-block:: python :emphasize-lines: 9, 10, 12, 17, 18, 26, 27, 41, 42, 44-49 @@ -122,13 +122,13 @@ Compare a PyTorch training script with and without Ray Train. optimizer.zero_grad() loss.backward() optimizer.step() - - checkpoint_dir = tempfile.gettempdir() + + checkpoint_dir = tempfile.gettempdir() checkpoint_path = checkpoint_dir + "/model.checkpoint" torch.save(model.state_dict(), checkpoint_path) # [3] Report metrics and checkpoint. ray.train.report({"loss": loss.item()}, checkpoint=Checkpoint.from_directory(checkpoint_dir)) - + # [4] Configure scaling and resource requirements. scaling_config = ScalingConfig(num_workers=2, use_gpu=True) @@ -139,7 +139,7 @@ Compare a PyTorch training script with and without Ray Train. Set up a training function -------------------------- -First, update your training code to support distributed training. +First, update your training code to support distributed training. Begin by wrapping your code in a :ref:`training function `: .. testcode:: @@ -163,7 +163,7 @@ Use the :func:`ray.train.torch.prepare_model` utility function to: -from torch.nn.parallel import DistributedDataParallel +import ray.train.torch - def train_func(config): + def train_func(config): ... @@ -175,7 +175,7 @@ Use the :func:`ray.train.torch.prepare_model` utility function to: - model = model.to(device_id or "cpu") - model = DistributedDataParallel(model, device_ids=[device_id]) + model = ray.train.torch.prepare_model(model) - + ... Set up a dataset @@ -183,10 +183,10 @@ Set up a dataset .. TODO: Update this to use Ray Data. -Use the :func:`ray.train.torch.prepare_data_loader` utility function, which: +Use the :func:`ray.train.torch.prepare_data_loader` utility function, which: 1. Adds a ``DistributedSampler`` to your ``DataLoader``. -2. Moves the batches to the right device. +2. Moves the batches to the right device. Note that this step isn't necessary if you're passing in Ray Data to your Trainer. See :ref:`data-ingest-torch`. @@ -202,9 +202,9 @@ See :ref:`data-ingest-torch`. ... dataset = ... 
- + data_loader = DataLoader(dataset, batch_size=worker_batch_size) - - data_loader = DataLoader(dataset, batch_size=worker_batch_size, sampler=DistributedSampler(dataset)) + - data_loader = DataLoader(dataset, batch_size=worker_batch_size, sampler=DistributedSampler(dataset)) + data_loader = ray.train.torch.prepare_data_loader(data_loader) for X, y in data_loader: @@ -265,7 +265,7 @@ For more details, see :ref:`train_scaling_config`. Launch a training job --------------------- -Tying this all together, you can now launch a distributed training job +Tying this all together, you can now launch a distributed training job with a :class:`~ray.train.torch.TorchTrainer`. .. testcode:: @@ -305,4 +305,4 @@ After you have converted your PyTorch training script to use Ray Train: * See :ref:`User Guides ` to learn more about how to perform specific tasks. * Browse the :ref:`Examples ` for end-to-end examples of how to use Ray Train. -* Dive into the :ref:`API Reference ` for more details on the classes and methods used in this tutorial. \ No newline at end of file +* Dive into the :ref:`API Reference ` for more details on the classes and methods used in this tutorial. diff --git a/doc/source/train/getting-started-transformers.rst b/doc/source/train/getting-started-transformers.rst index 6fd223fc3a22..8739addee17a 100644 --- a/doc/source/train/getting-started-transformers.rst +++ b/doc/source/train/getting-started-transformers.rst @@ -24,7 +24,7 @@ For reference, the final code follows: def train_func(config): # Your Transformers training code here. - + scaling_config = ScalingConfig(num_workers=2, use_gpu=True) trainer = TorchTrainer(train_func, scaling_config=scaling_config) result = trainer.fit() @@ -35,9 +35,9 @@ For reference, the final code follows: Compare a Hugging Face Transformers training script with and without Ray Train. -.. tabs:: +.. tab-set:: - .. group-tab:: Hugging Face Transformers + .. tab-item:: Hugging Face Transformers .. This snippet isn't tested because it doesn't use any Ray code. @@ -52,7 +52,7 @@ Compare a Hugging Face Transformers training script with and without Ray Train. from transformers import ( Trainer, TrainingArguments, - AutoTokenizer, + AutoTokenizer, AutoModelForSequenceClassification, ) @@ -95,9 +95,9 @@ Compare a Hugging Face Transformers training script with and without Ray Train. # Start Training trainer.train() - - .. group-tab:: Hugging Face Transformers + Ray Train + + .. tab-item:: Hugging Face Transformers + Ray Train .. code-block:: python :emphasize-lines: 11-13, 15-18, 55-72 @@ -108,7 +108,7 @@ Compare a Hugging Face Transformers training script with and without Ray Train. from transformers import ( Trainer, TrainingArguments, - AutoTokenizer, + AutoTokenizer, AutoModelForSequenceClassification, ) @@ -116,7 +116,7 @@ Compare a Hugging Face Transformers training script with and without Ray Train. from ray.train import ScalingConfig from ray.train.torch import TorchTrainer - # [1] Encapsulate data preprocessing, training, and evaluation + # [1] Encapsulate data preprocessing, training, and evaluation # logic in a training function # ============================================================ def train_func(config): @@ -179,7 +179,7 @@ Compare a Hugging Face Transformers training script with and without Ray Train. Set up a training function -------------------------- -First, update your training code to support distributed training. +First, update your training code to support distributed training. 
You can begin by wrapping your code in a :ref:`training function `: .. testcode:: @@ -188,24 +188,24 @@ You can begin by wrapping your code in a :ref:`training function `. +Reporting metrics and checkpoints to Ray Train ensures that you can use Ray Tune and :ref:`fault-tolerant training `. Note that the :class:`ray.train.huggingface.transformers.RayTrainReportCallback` only provides a simple implementation, and you can :ref:`further customize ` it. @@ -228,8 +228,8 @@ Prepare a Transformers Trainer ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Finally, pass your Transformers Trainer into -:meth:`~ray.train.huggingface.transformers.prepare_trainer` to validate -your configurations and enable Ray Data Integration. +:meth:`~ray.train.huggingface.transformers.prepare_trainer` to validate +your configurations and enable Ray Data Integration. .. code-block:: diff @@ -264,7 +264,7 @@ For more details, see :ref:`train_scaling_config`. Launch a training job --------------------- -Tying this all together, you can now launch a distributed training job +Tying this all together, you can now launch a distributed training job with a :class:`~ray.train.torch.TorchTrainer`. .. testcode:: @@ -300,7 +300,7 @@ information about the training run, including the metrics and checkpoints report .. TODO: Add results guide Next steps ----------- +---------- After you have converted your Hugging Face Transformers training script to use Ray Train: @@ -314,18 +314,18 @@ After you have converted your Hugging Face Transformers training script to use R TransformersTrainer Migration Guide ----------------------------------- -Ray 2.1 introduced the `TransformersTrainer`, which exposes a `trainer_init_per_worker` interface +Ray 2.1 introduced the `TransformersTrainer`, which exposes a `trainer_init_per_worker` interface to define `transformers.Trainer`, then runs a pre-defined training function in a black box. -Ray 2.7 introduced the newly unified :class:`~ray.train.torch.TorchTrainer` API, +Ray 2.7 introduced the newly unified :class:`~ray.train.torch.TorchTrainer` API, which offers enhanced transparency, flexibility, and simplicity. This API aligns more -with standard Hugging Face Transformers scripts, ensuring that you have better control over your +with standard Hugging Face Transformers scripts, ensuring that you have better control over your native Transformers training code. -.. tabs:: +.. tab-set:: - .. group-tab:: (Deprecating) TransformersTrainer + .. tab-item:: (Deprecating) TransformersTrainer .. This snippet isn't tested because it contains skeleton code. @@ -379,9 +379,9 @@ native Transformers training code. datasets={"train": ray_train_ds, "evaluation": ray_eval_ds}, ) result = ray_trainer.fit() - - .. group-tab:: (New API) TorchTrainer + + .. tab-item:: (New API) TorchTrainer .. This snippet isn't tested because it contains skeleton code. @@ -433,7 +433,7 @@ native Transformers training code. weight_decay=0.01, max_steps=100, ) - + trainer = transformers.Trainer( model=model, args=args, diff --git a/doc/source/train/huggingface-accelerate.rst b/doc/source/train/huggingface-accelerate.rst index 706ded67f241..4a3bfca9ec6c 100644 --- a/doc/source/train/huggingface-accelerate.rst +++ b/doc/source/train/huggingface-accelerate.rst @@ -35,7 +35,7 @@ You only need to run your existing training code with a TorchTrainer. You can ex # Start training ... - + from ray.train.torch import TorchTrainer from ray.train import ScalingConfig @@ -48,23 +48,23 @@ You only need to run your existing training code with a TorchTrainer. You can ex .. 
tip:: - Model and data preparation for distributed training is completely handled by the `Accelerator `_ + Model and data preparation for distributed training is completely handled by the `Accelerator `_ object and its `Accelerator.prepare() `_ method. - - Unlike with native PyTorch, PyTorch Lightning, or Hugging Face Transformers, **don't** call any additional Ray Train utilities - like :meth:`~ray.train.torch.prepare_model` or :meth:`~ray.train.torch.prepare_data_loader` in your training function. + + Unlike with native PyTorch, PyTorch Lightning, or Hugging Face Transformers, **don't** call any additional Ray Train utilities + like :meth:`~ray.train.torch.prepare_model` or :meth:`~ray.train.torch.prepare_data_loader` in your training function. Configure Accelerate -------------------- -In Ray Train, you can set configurations through the `accelerate.Accelerator `_ +In Ray Train, you can set configurations through the `accelerate.Accelerator `_ object in your training function. Below are starter examples for configuring Accelerate. -.. tabs:: +.. tab-set:: - .. group-tab:: DeepSpeed + .. tab-item:: DeepSpeed - For example, to run DeepSpeed with Accelerate, create a `DeepSpeedPlugin `_ + For example, to run DeepSpeed with Accelerate, create a `DeepSpeedPlugin `_ from a dictionary: .. testcode:: @@ -99,7 +99,7 @@ object in your training function. Below are starter examples for configuring Acc } def train_func(config): - # Create a DeepSpeedPlugin from config dict + # Create a DeepSpeedPlugin from config dict ds_plugin = DeepSpeedPlugin(hf_ds_config=DEEPSPEED_CONFIG) # Initialize Accelerator @@ -107,7 +107,7 @@ object in your training function. Below are starter examples for configuring Acc ..., deepspeed_plugin=ds_plugin, ) - + # Start training ... @@ -121,9 +121,10 @@ object in your training function. Below are starter examples for configuring Acc ) trainer.fit() - .. group-tab:: FSDP + .. tab-item:: FSDP + :sync: FSDP - For PyTorch FSDP, create a `FullyShardedDataParallelPlugin `_ + For PyTorch FSDP, create a `FullyShardedDataParallelPlugin `_ and pass it to the Accelerator. .. testcode:: @@ -135,11 +136,11 @@ object in your training function. Below are starter examples for configuring Acc def train_func(config): fsdp_plugin = FullyShardedDataParallelPlugin( state_dict_config=FullStateDictConfig( - offload_to_cpu=False, + offload_to_cpu=False, rank0_only=False ), optim_state_dict_config=FullOptimStateDictConfig( - offload_to_cpu=False, + offload_to_cpu=False, rank0_only=False ) ) @@ -163,16 +164,16 @@ object in your training function. Below are starter examples for configuring Acc ) trainer.fit() -Note that Accelerate also provides a CLI tool, `"accelerate config"`, to generate a configuration and launch your training -job with `"accelerate launch"`. However, it's not necessary here because Ray's `TorchTrainer` already sets up the Torch +Note that Accelerate also provides a CLI tool, `"accelerate config"`, to generate a configuration and launch your training +job with `"accelerate launch"`. However, it's not necessary here because Ray's `TorchTrainer` already sets up the Torch distributed environment and launches the training function on all workers. Next, see these end-to-end examples below for more details: -.. tabs:: +.. tab-set:: - .. group-tab:: Example with Ray Data + .. tab-item:: Example with Ray Data .. 
dropdown:: Show Code @@ -181,7 +182,7 @@ Next, see these end-to-end examples below for more details: :start-after: __accelerate_torch_basic_example_start__ :end-before: __accelerate_torch_basic_example_end__ - .. group-tab:: Example with PyTorch DataLoader + .. tab-item:: Example with PyTorch DataLoader .. dropdown:: Show Code @@ -192,8 +193,8 @@ Next, see these end-to-end examples below for more details: .. seealso:: - If you're looking for more advanced use cases, check out this Llama-2 fine-tuning example: - + If you're looking for more advanced use cases, check out this Llama-2 fine-tuning example: + - `Fine-tuning Llama-2 series models with Deepspeed, Accelerate, and Ray Train. `_ You may also find these user guides helpful: @@ -204,16 +205,14 @@ You may also find these user guides helpful: - :ref:`How to use Ray Data with Ray Train ` -AccelerateTrainer Migration Guide +AccelerateTrainer Migration Guide --------------------------------- -Before Ray 2.7, Ray Train's `AccelerateTrainer` API was the -recommended way to run Accelerate code. As a subclass of :class:`TorchTrainer `, -the AccelerateTrainer takes in a configuration file generated by ``accelerate config`` and applies it to all workers. +Before Ray 2.7, Ray Train's `AccelerateTrainer` API was the +recommended way to run Accelerate code. As a subclass of :class:`TorchTrainer `, +the AccelerateTrainer takes in a configuration file generated by ``accelerate config`` and applies it to all workers. Aside from that, the functionality of ``AccelerateTrainer`` is identical to ``TorchTrainer``. -However, this caused confusion around whether this was the *only* way to run Accelerate code. -Because you can express the full Accelerate functionality with the ``Accelerator`` and ``TorchTrainer`` combination, the plan is to deprecate the ``AccelerateTrainer`` in Ray 2.8, -and it's recommend to run your Accelerate code directly with ``TorchTrainer``. - - +However, this caused confusion around whether this was the *only* way to run Accelerate code. +Because you can express the full Accelerate functionality with the ``Accelerator`` and ``TorchTrainer`` combination, the plan is to deprecate the ``AccelerateTrainer`` in Ray 2.8, +and it's recommend to run your Accelerate code directly with ``TorchTrainer``. diff --git a/doc/source/train/user-guides/data-loading-preprocessing.rst b/doc/source/train/user-guides/data-loading-preprocessing.rst index 79dc297c6d1d..02adfcd6d502 100644 --- a/doc/source/train/user-guides/data-loading-preprocessing.rst +++ b/doc/source/train/user-guides/data-loading-preprocessing.rst @@ -3,7 +3,7 @@ Data Loading and Preprocessing ============================== -Ray Train integrates with :ref:`Ray Data ` to offer an efficient, streaming solution for loading and preprocessing large datasets. +Ray Train integrates with :ref:`Ray Data ` to offer an efficient, streaming solution for loading and preprocessing large datasets. We recommend using Ray Data for its ability to performantly support large-scale distributed training workloads - for advantages and comparisons to alternatives, see :ref:`Ray Data Overview `. In this guide, we will cover how to incorporate Ray Data into your Ray Train script, and different ways to customize your data ingestion pipeline. @@ -29,9 +29,9 @@ Data ingestion can be set up with four basic steps: 3. Input the preprocessed Dataset into the Ray Train Trainer. 4. Consume the Ray Dataset in your training function. -.. tabs:: +.. tab-set:: - .. group-tab:: PyTorch + .. tab-item:: PyTorch .. 
testcode:: @@ -92,13 +92,13 @@ Data ingestion can be set up with four basic steps: ... - .. group-tab:: PyTorch Lightning + .. tab-item:: PyTorch Lightning .. code-block:: python :emphasize-lines: 9,10,13,14,25,26 from ray import train - + train_data = ray.data.read_csv("./train.csv") val_data = ray.data.read_csv("./validation.csv") @@ -120,8 +120,8 @@ Data ingestion can be set up with four basic steps: # Feed the Ray dataset iterables to ``pl.Trainer.fit``. trainer.fit( - model, - train_dataloaders=train_dataloader, + model, + train_dataloaders=train_dataloader, val_dataloaders=val_dataloader ) @@ -131,15 +131,15 @@ Data ingestion can be set up with four basic steps: scaling_config=ScalingConfig(num_workers=4), ) trainer.fit() - - .. group-tab:: HuggingFace Transformers + + .. tab-item:: HuggingFace Transformers .. code-block:: python :emphasize-lines: 12,13,16,17,24,25 import ray import ray.train - + ... train_data = ray.data.from_huggingface(hf_train_ds) @@ -207,14 +207,14 @@ Inputting and splitting data Your preprocessed datasets can be passed into a Ray Train Trainer (e.g. :class:`~ray.train.torch.TorchTrainer`) through the ``datasets`` argument. -The datasets passed into the Trainer's ``datasets`` can be accessed inside of the ``train_loop_per_worker`` run on each distributed training worker by calling :meth:`ray.train.get_dataset_shard`. +The datasets passed into the Trainer's ``datasets`` can be accessed inside of the ``train_loop_per_worker`` run on each distributed training worker by calling :meth:`ray.train.get_dataset_shard`. All datasets are split (i.e. sharded) across the training workers by default. :meth:`~ray.train.get_dataset_shard` will return ``1/n`` of the dataset, where ``n`` is the number of training workers. .. note:: - Please be aware that as the evaluation dataset is split, users have to aggregate the evaluation results across workers. - You might consider using `TorchMetrics `_ (:ref:`example `) or + Please be aware that as the evaluation dataset is split, users have to aggregate the evaluation results across workers. + You might consider using `TorchMetrics `_ (:ref:`example `) or utilities available in other frameworks that you can explore. This behavior can be overwritten by passing in the ``dataset_config`` argument. For more information on configuring splitting logic, see :ref:`Splitting datasets `. @@ -265,45 +265,45 @@ At a high level, you can compare these concepts as follows: For more details, see the following sections for each framework. -.. tabs:: +.. tab-set:: - .. tab:: PyTorch Dataset and DataLoader + .. tab-item:: PyTorch Dataset and DataLoader **Option 1 (with Ray Data):** Convert your PyTorch Dataset to a Ray Dataset and pass it into the Trainer via ``datasets`` argument. - Inside your ``train_loop_per_worker``, you can access the dataset via :meth:`ray.train.get_dataset_shard`. + Inside your ``train_loop_per_worker``, you can access the dataset via :meth:`ray.train.get_dataset_shard`. You can convert this to replace the PyTorch DataLoader via :meth:`ray.data.DataIterator.iter_torch_batches`. - + For more details, see the :ref:`Migrating from PyTorch Datasets and DataLoaders `. **Option 2 (without Ray Data):** Instantiate the Torch Dataset and DataLoader directly in the ``train_loop_per_worker``. You can use the :meth:`ray.train.torch.prepare_data_loader` utility to set up the DataLoader for distributed training. - - .. tab:: LightningDataModule + + .. 
tab-item:: LightningDataModule The ``LightningDataModule`` is created with PyTorch ``Dataset``\s and ``DataLoader``\s. You can apply the same logic here. - .. tab:: Hugging Face Dataset + .. tab-item:: Hugging Face Dataset **Option 1 (with Ray Data):** Convert your Hugging Face Dataset to a Ray Dataset and pass it into the Trainer via the ``datasets`` argument. - Inside your ``train_loop_per_worker``, you can access the dataset via :meth:`ray.train.get_dataset_shard`. + Inside your ``train_loop_per_worker``, you can access the dataset via :meth:`ray.train.get_dataset_shard`. For instructions, see :ref:`Ray Data for Hugging Face `. **Option 2 (without Ray Data):** Instantiate the Hugging Face Dataset directly in the ``train_loop_per_worker``. - .. tip:: +.. tip:: - When using Torch or Hugging Face Datasets directly without Ray Data, make sure to instantiate your Dataset *inside* the ``train_loop_per_worker``. - Instatiating the Dataset outside of the ``train_loop_per_worker`` and passing it in via global scope - can cause errors due to the Dataset not being serializable. + When using Torch or Hugging Face Datasets directly without Ray Data, make sure to instantiate your Dataset *inside* the ``train_loop_per_worker``. + Instatiating the Dataset outside of the ``train_loop_per_worker`` and passing it in via global scope + can cause errors due to the Dataset not being serializable. .. _train-datasets-split: Splitting datasets ------------------ -By default, Ray Train splits all datasets across workers using :meth:`Dataset.streaming_split `. Each worker sees a disjoint subset of the data, instead of iterating over the entire dataset. Unless randomly shuffled, the same splits are used for each iteration of the dataset. +By default, Ray Train splits all datasets across workers using :meth:`Dataset.streaming_split `. Each worker sees a disjoint subset of the data, instead of iterating over the entire dataset. Unless randomly shuffled, the same splits are used for each iteration of the dataset. -If want to customize which datasets are split, pass in a :class:`DataConfig ` to the Trainer constructor. +If want to customize which datasets are split, pass in a :class:`DataConfig ` to the Trainer constructor. For example, to split only the training dataset, do the following: @@ -325,7 +325,7 @@ For example, to split only the training dataset, do the following: for _ in range(2): for batch in train_ds.iter_batches(batch_size=128): print("Do some training on batch", batch) - + # Get the unsharded full validation dataset val_ds = train.get_dataset_shard("val") for _ in range(2): @@ -419,7 +419,7 @@ Ray Data has two approaches to random shuffling: 1. Shuffling data blocks and local shuffling on each training worker. This requires less communication at the cost of less randomness (i.e. rows that appear in the same data block are more likely to appear near each other in the iteration order). 2. Full global shuffle, which is more expensive. This will fully decorrelate row iteration order from the original dataset order, at the cost of significantly more computation, I/O, and communication. -For most cases, option 1 suffices. +For most cases, option 1 suffices. First, randomize each :ref:`block ` of your dataset via :meth:`randomize_block_order `. Then, when iterating over your dataset during training, enable local shuffling by specifying a ``local_shuffle_buffer_size`` to :meth:`iter_batches ` or :meth:`iter_torch_batches `. 
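As a rough sketch of the first option, the snippet below combines ``randomize_block_order`` with a ``local_shuffle_buffer_size`` while iterating; the Parquet path, batch size, and buffer size are placeholders.

.. code-block:: python

    import ray
    from ray import train
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    # Placeholder data source; substitute your own dataset.
    ds = ray.data.read_parquet("s3://my-bucket/train-data")

    # 1. Randomize the order of the dataset's blocks.
    ds = ds.randomize_block_order(seed=42)

    def train_func(config):
        train_ds = train.get_dataset_shard("train")
        # 2. Locally shuffle rows within a buffer while iterating over batches.
        for batch in train_ds.iter_torch_batches(
            batch_size=128, local_shuffle_buffer_size=10_000
        ):
            ...

    trainer = TorchTrainer(
        train_func,
        datasets={"train": ds},
        scaling_config=ScalingConfig(num_workers=2),
    )
    trainer.fit()

Larger ``local_shuffle_buffer_size`` values increase randomness at the cost of memory and iteration throughput.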
@@ -489,10 +489,10 @@ When developing or hyperparameter tuning models, reproducibility is important du "s3://anonymous@ray-example-data/sms_spam_collection_subset.txt" ) -**Step 2:** Set a seed for any shuffling operations: +**Step 2:** Set a seed for any shuffling operations: * `seed` argument to :meth:`random_shuffle ` -* `seed` argument to :meth:`randomize_block_order ` +* `seed` argument to :meth:`randomize_block_order ` * `local_shuffle_seed` argument to :meth:`iter_batches ` **Step 3:** Follow the best practices for enabling reproducibility for your training framework of choice. For example, see the `Pytorch reproducibility guide `_. diff --git a/doc/source/train/user-guides/experiment-tracking.rst b/doc/source/train/user-guides/experiment-tracking.rst index 7858c64bb536..00ab612669ea 100644 --- a/doc/source/train/user-guides/experiment-tracking.rst +++ b/doc/source/train/user-guides/experiment-tracking.rst @@ -5,17 +5,17 @@ Experiment Tracking =================== .. note:: - This guide is relevant for all trainers in which you define a custom training loop. - This includes :class:`TorchTrainer ` and + This guide is relevant for all trainers in which you define a custom training loop. + This includes :class:`TorchTrainer ` and :class:`TensorflowTrainer `. -Most experiment tracking libraries work out-of-the-box with Ray Train. -This guide provides instructions on how to set up the code so that your favorite experiment tracking libraries -can work for distributed training with Ray Train. The end of the guide has common errors to aid in debugging +Most experiment tracking libraries work out-of-the-box with Ray Train. +This guide provides instructions on how to set up the code so that your favorite experiment tracking libraries +can work for distributed training with Ray Train. The end of the guide has common errors to aid in debugging the setup. -The following pseudo code demonstrates how to use the native experiment tracking library calls -inside of Ray Train: +The following pseudo code demonstrates how to use the native experiment tracking library calls +inside of Ray Train: .. testcode:: :skipif: True @@ -30,9 +30,9 @@ inside of Ray Train: trainer = TorchTrainer(train_func, scaling_config=scaling_config) result = trainer.fit() -Ray Train lets you use native experiment tracking libraries by customizing the tracking -logic inside the :ref:`train_func` function. -In this way, you can port your experiment tracking logic to Ray Train with minimal changes. +Ray Train lets you use native experiment tracking libraries by customizing the tracking +logic inside the :ref:`train_func` function. +In this way, you can port your experiment tracking logic to Ray Train with minimal changes. Getting Started =============== @@ -41,13 +41,13 @@ Let's start by looking at some code snippets. The following examples uses Weights & Biases (W&B) and MLflow but it's adaptable to other frameworks. -.. tabs:: +.. tab-set:: - .. tab:: W&B + .. tab-item:: W&B .. testcode:: :skipif: True - + import ray from ray import train import wandb @@ -80,11 +80,11 @@ The following examples uses Weights & Biases (W&B) and MLflow but it's adaptable if train.get_context().get_world_rank() == 0: wandb.finish() - .. tab:: MLflow + .. tab-item:: MLflow .. 
testcode:: :skipif: True - + from ray import train import mlflow @@ -104,7 +104,7 @@ The following examples uses Weights & Biases (W&B) and MLflow but it's adaptable loss = optimize() metrics = {"loss": loss} - # Only report the results from the first worker to MLflow + # Only report the results from the first worker to MLflow to avoid duplication # Step 3 @@ -113,9 +113,9 @@ The following examples uses Weights & Biases (W&B) and MLflow but it's adaptable .. tip:: - A major difference between distributed and non-distributed training is that in distributed training, - multiple processes are running in parallel and under certain setups they have the same results. If all - of them report results to the tracking backend, you may get duplicated results. To address that, + A major difference between distributed and non-distributed training is that in distributed training, + multiple processes are running in parallel and under certain setups they have the same results. If all + of them report results to the tracking backend, you may get duplicated results. To address that, Ray Train lets you apply logging logic to only the rank 0 worker with the following method: :meth:`ray.train.get_context().get_world_rank() `. @@ -129,7 +129,7 @@ The following examples uses Weights & Biases (W&B) and MLflow but it's adaptable # Add your logging logic only for rank0 worker. ... -The interaction with the experiment tracking backend within the :ref:`train_func` +The interaction with the experiment tracking backend within the :ref:`train_func` has 4 logical steps: #. Set up the connection to a tracking backend @@ -145,60 +145,60 @@ Step 1: Connect to your tracking backend First, decide which tracking backend to use: W&B, MLflow, TensorBoard, Comet, etc. If applicable, make sure that you properly set up credentials on each training worker. -.. tabs:: +.. tab-set:: - .. tab:: W&B - - W&B offers both *online* and *offline* modes. + .. tab-item:: W&B + + W&B offers both *online* and *offline* modes. **Online** - For *online* mode, because you log to W&B's tracking service, ensure that you set the credentials - inside of :ref:`train_func`. See :ref:`Set up credentials` + For *online* mode, because you log to W&B's tracking service, ensure that you set the credentials + inside of :ref:`train_func`. See :ref:`Set up credentials` for more information. .. testcode:: :skipif: True - + # This is equivalent to `os.environ["WANDB_API_KEY"] = "your_api_key"` wandb.login(key="your_api_key") **Offline** - For *offline* mode, because you log towards a local file system, - point the offline directory to a shared storage path that all nodes can write to. + For *offline* mode, because you log towards a local file system, + point the offline directory to a shared storage path that all nodes can write to. See :ref:`Set up a shared file system` for more information. - + .. testcode:: :skipif: True os.environ["WANDB_MODE"] = "offline" - wandb.init(dir="some_shared_storage_path/wandb") + wandb.init(dir="some_shared_storage_path/wandb") + + .. tab-item:: MLflow - .. tab:: MLflow - - MLflow offers both *local* and *remote* (for example, to Databrick's MLflow service) modes. + MLflow offers both *local* and *remote* (for example, to Databrick's MLflow service) modes. **Local** - For *local* mode, because you log to a local file - system, point offline directory to a shared storage path. that all nodes can write - to. See :ref:`Set up a shared file system` for more information. 
-            
+            For *local* mode, because you log to a local file
+            system, point the offline directory to a shared storage path that all nodes can write
+            to. See :ref:`Set up a shared file system` for more information.
+
             .. testcode::
                 :skipif: True
 
                 mlflow.start_run(tracking_uri="file:some_shared_storage_path/mlruns")
 
             **Remote, hosted by Databricks**
-            
-            Ensure that all nodes have access to the Databricks config file. 
+
+            Ensure that all nodes have access to the Databricks config file.
             See :ref:`Set up credentials` for more information.
-            
+
             .. testcode::
                 :skipif: True
 
-                # The MLflow client looks for a Databricks config file 
+                # The MLflow client looks for a Databricks config file
                 # at the location specified by `os.environ["DATABRICKS_CONFIG_FILE"]`.
                 os.environ["DATABRICKS_CONFIG_FILE"] = config["databricks_config_file"]
                 mlflow.set_tracking_uri("databricks")
@@ -212,7 +212,7 @@ Set up credentials
 Refer to each tracking library's API documentation on setting up credentials.
 This step usually involves setting an environment variable or accessing a config file.
 
-The easiest way to pass an environment variable credential to training workers is through 
+The easiest way to pass an environment variable credential to training workers is through
 :ref:`runtime environments `, where you initialize with the following code:
 
 .. testcode::
@@ -230,19 +230,19 @@ One way to do this is by setting up a shared storage. Another way is to save a c
 Set up a shared file system
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Set up a network filesystem accessible to all nodes in the cluster. 
+Set up a network filesystem accessible to all nodes in the cluster.
 For example, AWS EFS or Google Cloud Filestore.
 
-Step 2: Configure and start the run 
+Step 2: Configure and start the run
 -----------------------------------
 
 This step usually involves picking an identifier for the run and associating it with a project.
-Refer to the tracking libraries' documentation for semantics. 
+Refer to the tracking libraries' documentation for semantics.
 
-.. To conveniently link back to Ray Train run, you may want to log the persistent storage path 
+.. To conveniently link back to Ray Train run, you may want to log the persistent storage path
 .. of the run as a config.
-.. 
+..
 .. testcode::
 
     def train_func(config):
@@ -250,10 +250,10 @@ Refer to the tracking libraries' documentation for semantics.
             wandb.init(..., config={"ray_train_persistent_storage_path": "TODO: fill in when API stablizes"})
 
 .. tip::
-    
-    When performing **fault-tolerant training** with auto-restoration, use a 
+
+    When performing **fault-tolerant training** with auto-restoration, use a
    consistent ID to configure all tracking runs that logically belong to the same training run.
-    One way to acquire an unique ID is with the following method: 
+    One way to acquire a unique ID is with the following method:
    :meth:`ray.train.get_context().get_trial_id() `.
 
     .. testcode::
@@ -269,41 +269,41 @@ Refer to the tracking libraries' documentation for semantics.
         ...
 
         trainer = TorchTrainer(
-            train_func, 
+            train_func,
             run_config=RunConfig(failure_config=FailureConfig(max_failures=3))
         )
         trainer.fit()
-        
+
 Step 3: Log metrics
 -------------------
 
-You can customize how to log parameters, metrics, models, or media contents, within 
-:ref:`train_func`, just as in a non-distributed training script. 
-You can also use native integrations that a particular tracking framework has with 
-specific training frameworks. For example, ``mlflow.pytorch.autolog()``, 
-``lightning.pytorch.loggers.MLFlowLogger``, etc. 
+You can customize how to log parameters, metrics, models, or media contents, within +:ref:`train_func`, just as in a non-distributed training script. +You can also use native integrations that a particular tracking framework has with +specific training frameworks. For example, ``mlflow.pytorch.autolog()``, +``lightning.pytorch.loggers.MLFlowLogger``, etc. Step 4: Finish the run ---------------------- -This step ensures that all logs are synced to the tracking service. Depending on the implementation of -various tracking libraries, sometimes logs are first cached locally and only synced to the tracking -service in an asynchronous fashion. -Finishing the run makes sure that all logs are synced by the time training workers exit. +This step ensures that all logs are synced to the tracking service. Depending on the implementation of +various tracking libraries, sometimes logs are first cached locally and only synced to the tracking +service in an asynchronous fashion. +Finishing the run makes sure that all logs are synced by the time training workers exit. + +.. tab-set:: -.. tabs:: + .. tab-item:: W&B - .. tab:: W&B - .. testcode:: :skipif: True # https://docs.wandb.ai/ref/python/finish wandb.finish() - .. tab:: MLflow + .. tab-item:: MLflow .. testcode:: :skipif: True @@ -311,13 +311,13 @@ Finishing the run makes sure that all logs are synced by the time training worke # https://mlflow.org/docs/1.2.0/python_api/mlflow.html mlflow.end_run() - .. tab:: Comet + .. tab-item:: Comet .. testcode:: :skipif: True # https://www.comet.com/docs/v2/api-and-sdk/python-sdk/reference/Experiment/#experimentend - Experiment.end() + Experiment.end() Examples ======== @@ -345,10 +345,10 @@ PyTorch PyTorch Lightning ----------------- -You can use the native Logger integration in PyTorch Lightning with W&B, CometML, MLFlow, +You can use the native Logger integration in PyTorch Lightning with W&B, CometML, MLFlow, and Tensorboard, while using Ray Train's TorchTrainer. -The following example walks you through the process. The code here is runnable. +The following example walks you through the process. The code here is runnable. .. dropdown:: W&B @@ -382,7 +382,7 @@ The following example walks you through the process. The code here is runnable. :start-after: __lightning_experiment_tracking_comet_start__ .. dropdown:: TensorBoard - + .. literalinclude:: ../../../../python/ray/train/examples/experiment_tracking/lightning_exp_tracking_model_dl.py :language: python :start-after: __model_dl_start__ @@ -398,15 +398,15 @@ Common Errors Missing Credentials ------------------- -**I have already called `wandb login` cli, but am still getting** +**I have already called `wandb login` cli, but am still getting** .. code-block:: none wandb: ERROR api_key not configured (no-tty). call wandb.login(key=[your_api_key]). This is probably due to wandb credentials are not set up correctly -on worker nodes. Make sure that you run ``wandb.login`` -or pass ``WANDB_API_KEY`` to each training function. +on worker nodes. Make sure that you run ``wandb.login`` +or pass ``WANDB_API_KEY`` to each training function. See :ref:`Set up credentials ` for more details. Missing Configurations @@ -418,7 +418,7 @@ Missing Configurations databricks_cli.utils.InvalidConfigurationError: You haven't configured the CLI yet! -This is usually caused by running ``databricks configure`` which +This is usually caused by running ``databricks configure`` which generates ``~/.databrickscfg`` only on head node. 
Move this file to a shared location or copy it to each node. See :ref:`Set up credentials ` for more details.
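If copying ``~/.databrickscfg`` to every node is impractical, one possible alternative is to authenticate each worker
through environment variables. The snippet below is a minimal sketch, assuming ``DATABRICKS_HOST`` and
``DATABRICKS_TOKEN`` are already set on the driver; it forwards them to all training workers through a Ray runtime
environment so that the MLflow client can use the ``databricks`` tracking URI without a config file on the workers.

.. code-block:: python

    import os
    import ray

    # Assumed to be set on the driver, for example:
    #   DATABRICKS_HOST=https://<your-workspace>.cloud.databricks.com
    #   DATABRICKS_TOKEN=<personal-access-token>
    databricks_env = {
        "DATABRICKS_HOST": os.environ["DATABRICKS_HOST"],
        "DATABRICKS_TOKEN": os.environ["DATABRICKS_TOKEN"],
    }

    # Forward the credentials to every Ray worker as environment variables.
    ray.init(runtime_env={"env_vars": databricks_env})

    def train_func():
        import mlflow

        # With DATABRICKS_HOST and DATABRICKS_TOKEN present in the worker's
        # environment, the MLflow client can authenticate against Databricks
        # without reading ~/.databrickscfg.
        mlflow.set_tracking_uri("databricks")
        ...

Whether you copy the config file or forward environment variables, the requirement is the same: every training
worker, not just the head node, must be able to authenticate against the tracking backend.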