12 changes: 12 additions & 0 deletions doc/source/_static/css/ray-libraries.css
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
.sd-card.body {
display: flex;
flex-direction: column;
}
.sd-card-body > figure {
height: 10em;
}
.card-figure {
object-fit: contain;
width: 100%;
height: 100%;
}
14 changes: 14 additions & 0 deletions doc/source/_static/css/ray-train.css
@@ -0,0 +1,14 @@
#train-logo > #train-logo-icon > path, #train-logo > #train-logo-icon > circle {
stroke: var(--ray-blue);
stroke-width: 10;
fill: transparent;
}

#train-logo > #train-logo-text {
stroke: var(--pst-color-text-base);
fill: var(--pst-color-text-base);
}

#train-logo {
margin: 3em 0em;
}
Expand Up @@ -97,7 +97,7 @@ kubectl get service
- name: RAY_PROMETHEUS_HOST
value: http://prometheus-kube-prometheus-prometheus.prometheus-system.svc:9090
```
* Note that we do not deploy Grafana in the head Pod, so we need to set both `RAY_GRAFANA_IFRAME_HOST` and `RAY_GRAFANA_HOST`.
`RAY_GRAFANA_HOST` is used by the head Pod to send health-check requests to Grafana in the backend.
`RAY_GRAFANA_IFRAME_HOST` is used by your browser to fetch the Grafana panels from the Grafana server rather than from the head Pod.
Because we forward the port of Grafana to `127.0.0.1:3000` in this example, we set `RAY_GRAFANA_IFRAME_HOST` to `http://127.0.0.1:3000`.
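Putting the pieces together, the head container's environment for this example might look like the following sketch (the Grafana service hostname is an assumption for illustration; adjust it to your own Helm release and namespace):

```yaml
# Env vars on the Ray head container (hostnames assume the setup in this example).
env:
  - name: RAY_GRAFANA_HOST
    # Used by the head Pod to send backend health-check requests to Grafana.
    value: http://prometheus-grafana.prometheus-system.svc:80
  - name: RAY_GRAFANA_IFRAME_HOST
    # Used by your browser; matches the port-forward to 127.0.0.1:3000 above.
    value: http://127.0.0.1:3000
  - name: RAY_PROMETHEUS_HOST
    value: http://prometheus-kube-prometheus-prometheus.prometheus-system.svc:9090
```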
Expand Down Expand Up @@ -135,8 +135,7 @@ spec:
* See [ServiceMonitor official document](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#servicemonitor) for more details about the configurations.
* `release: $HELM_RELEASE`: Prometheus can only detect ServiceMonitor with this label.

<div id="prometheus-can-only-detect-this-label" ></div>

(prometheus-can-only-detect-this-label)=
```sh
helm ls -n prometheus-system
# ($HELM_RELEASE is "prometheus".)
Expand Down Expand Up @@ -241,8 +240,8 @@ spec:
)
```

* The PromQL expression above is:
$$\frac{ number\ of\ update\ resource\ usage\ RPCs\ that\ have\ RTT\ smaller\ than\ 20ms\ in\ last\ 30\ days\ }{total\ number\ of\ update\ resource\ usage\ RPCs\ in\ last\ 30\ days\ } \times 100 $$


* The recording rule above is one of the rules defined in [prometheusRules.yaml](https://github.com/ray-project/kuberay/blob/master/config/prometheus/rules/prometheusRules.yaml), and **install.sh** creates it, so you don't need to create anything here.
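For reference, a recording rule of this shape might look like the following sketch (the rule and metric names are modeled on the expression above; verify against prometheusRules.yaml itself before relying on them):

```yaml
# Sketch of a Prometheus recording rule for GCS availability over 30 days.
groups:
  - name: ray.rules
    rules:
      - record: ray_gcs_availability_30d
        expr: |
          100 * (
            sum(rate(ray_gcs_update_resource_usage_time_bucket{container="ray-head", le="20.0"}[30d]))
            /
            sum(rate(ray_gcs_update_resource_usage_time_count{container="ray-head"}[30d]))
          )
```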
Expand Down Expand Up @@ -356,4 +355,4 @@ kubectl port-forward --address 0.0.0.0 svc/raycluster-embed-grafana-head-svc 826
# Visit http://127.0.0.1:8265/#/metrics in your browser.
```

![Ray Dashboard with Grafana panels](../images/ray_dashboard_embed_grafana.png)
![Ray Dashboard with Grafana panels](../images/ray_dashboard_embed_grafana.png)
Expand Up @@ -3,7 +3,7 @@
# RayService troubleshooting

RayService is a Custom Resource Definition (CRD) designed for Ray Serve. In KubeRay, creating a RayService will first create a RayCluster and then
create Ray Serve applications once the RayCluster is ready. If the issue pertains to the data plane, specifically your Ray Serve scripts
or Ray Serve configurations (`serveConfigV2`), troubleshooting may be challenging. This section provides some tips to help you debug these issues.

## Observability
Expand Down Expand Up @@ -104,7 +104,7 @@ Some tips to help you debug the `serveConfigV2` field:

* Check [the documentation](serve-api) for the schema about
the Ray Serve Multi-application API `PUT "/api/serve/applications/"`.
* Unlike `serveConfig`, `serveConfigV2` adheres to the snake case naming convention. For example, `numReplicas` is used in `serveConfig`, while `num_replicas` is used in `serveConfigV2`.

(kuberay-raysvc-issue3-1)=
### Issue 3-1: The Ray image does not include the required dependencies.
Expand Down Expand Up @@ -222,9 +222,9 @@ You may encounter the following error message when KubeRay tries to get Serve ap
Get "http://${HEAD_SVC_FQDN}:52365/api/serve/applications/": dial tcp $HEAD_IP:52365: connect: connection refused"
```

As mentioned in [Issue 5](#issue-5-fail-to-create--update-serve-applications), the KubeRay operator submits a `Put` request to the RayCluster for creating Serve applications once the head Pod is ready.
After the successful submission of the `Put` request to the dashboard agent, a `Get` request is sent to the dashboard agent port (i.e., 52365).
The successful submission indicates that all the necessary components, including the dashboard agent, are fully operational.
As mentioned in [Issue 5](#kuberay-raysvc-issue5), the KubeRay operator submits a `Put` request to the RayCluster for creating Serve applications once the head Pod is ready.
After the successful submission of the `Put` request to the dashboard agent, a `Get` request is sent to the dashboard agent port (i.e., 52365).
The successful submission indicates that all the necessary components, including the dashboard agent, are fully operational.
Therefore, unlike Issue 5, the failure of the `Get` request is not expected.
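To check whether the dashboard agent itself is reachable, you can probe the port directly from inside the head Pod (a sketch; the label selector is KubeRay's standard head-node label, and the port follows the error message above):

```sh
# Find the head Pod and probe the dashboard agent port from inside it.
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl exec -it $HEAD_POD -- curl -s http://localhost:52365/api/serve/applications/
```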

If you consistently encounter this issue, there are several possible causes:
Expand Down Expand Up @@ -323,7 +323,7 @@ However, Ray Serve does not support deploying both API V1 and API V2 in the clus
Hence, if users want to perform in-place upgrades by replacing `serveConfig` with `serveConfigV2`, they may encounter the following error message:

```
ray.serve.exceptions.RayServeException: You are trying to deploy a multi-application config, however a single-application
config has been deployed to the current Serve instance already. Mixing single-app and multi-app is not allowed. Please either
redeploy using the single-application config format `ServeApplicationSchema`, or shutdown and restart Serve to submit a
multi-app config of format `ServeDeploySchema`. If you are using the REST API, you can submit a multi-app config to the
Expand Down
Expand Up @@ -11,17 +11,17 @@ If you don't find an answer to your question here, please don't hesitate to conn
- [Worker init container](#worker-init-container)
- [Cluster domain](#cluster-domain)
- [RayService](#rayservice)
- [GPU multi-tenancy](#gpu-multitenancy)
- [Other questions](#questions)

## Upgrade KubeRay

If you have issues upgrading KubeRay, refer to the [upgrade guide](#kuberay-upgrade-guide).
Most issues are about the CRD version.

(worker-init-container)=
## Worker init container

The KubeRay operator injects a default [init container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) into every worker Pod.
This init container is responsible for waiting until the Global Control Service (GCS) on the head Pod is ready before establishing a connection to the head.
The init container continuously runs `ray health-check` to check the GCS server status.
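Conceptually, the injected init container waits in a loop along these lines (a simplified sketch, not the operator's exact command; the head service FQDN and GCS port are illustrative):

```sh
# Retry until the GCS on the head Pod reports healthy.
until ray health-check --address raycluster-sample-head-svc.default.svc.cluster.local:6379; do
  echo "Waiting for GCS to be ready..."
  sleep 5
done
```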

Expand All @@ -46,6 +46,7 @@ To disable the injection, set the `ENABLE_INIT_CONTAINER_INJECTION` environment
Please refer to [#1069](https://github.com/ray-project/kuberay/pull/1069) and the [KubeRay Helm chart](https://github.com/ray-project/kuberay/blob/ddb5e528c29c2e1fb80994f05b1bd162ecbaf9f2/helm-chart/kuberay-operator/values.yaml#L83-L87) for instructions on how to set the environment variable.
Once disabled, you can add your custom init container to the worker Pod template.

(cluster-domain)=
## Cluster domain

In KubeRay, we use Fully Qualified Domain Names (FQDNs) to establish connections between workers and the head.
Expand All @@ -58,12 +59,14 @@ To set a custom cluster domain, adjust the `CLUSTER_DOMAIN` environment variable
Helm chart users can make this modification [here](https://github.com/ray-project/kuberay/blob/ddb5e528c29c2e1fb80994f05b1bd162ecbaf9f2/helm-chart/kuberay-operator/values.yaml#L88-L91).
Refer to [#951](https://github.com/ray-project/kuberay/pull/951) and [#938](https://github.com/ray-project/kuberay/pull/938) for more details.
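For Helm chart users, the override is a sketch like the following in the kuberay-operator `values.yaml` (the domain value is an example; substitute your cluster's actual domain):

```yaml
# kuberay-operator values.yaml: pass a custom cluster domain to the operator.
env:
  - name: CLUSTER_DOMAIN
    value: "example.org"
```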

(rayservice)=
## RayService

RayService is a Custom Resource Definition (CRD) designed for Ray Serve. In KubeRay, creating a RayService will first create a RayCluster and then
create Ray Serve applications once the RayCluster is ready. If the issue pertains to the data plane, specifically your Ray Serve scripts
or Ray Serve configurations (`serveConfigV2`), troubleshooting may be challenging. See [rayservice-troubleshooting](kuberay-raysvc-troubleshoot) for more details.

(questions)=
## Questions

### Why are changes to the RayCluster or RayJob CR not taking effect?
Expand Down
12 changes: 7 additions & 5 deletions doc/source/cluster/kubernetes/user-guides/rayservice.md
Expand Up @@ -69,7 +69,7 @@ kubectl apply -f ray_v1alpha1_rayservice.yaml
deployments: ...
```

## Step 4: Verify the Kubernetes cluster status

```sh
# Step 4.1: List all RayService custom resources in the `default` namespace.
Expand Down Expand Up @@ -113,7 +113,7 @@ If you don't use `rayservice-sample-head-svc`, you need to update the ingress configuration
However, if you use `rayservice-sample-head-svc`, KubeRay automatically updates the selector to point to the new head Pod, eliminating the need to update the ingress configuration.


> Note: Default ports and their definitions.

| Port | Definition |
|-------|---------------------|
Expand Down Expand Up @@ -180,8 +180,9 @@ curl -X POST -H 'Content-Type: application/json' rayservice-sample-serve-svc:800
# [Expected output]: "15 pizzas please!"
```

* `rayservice-sample-serve-svc` is HA in general. It routes traffic among all the workers that have Serve deployments, and always tries to point to the healthy cluster, even during upgrades or failures.

(step-7-in-place-update-for-ray-serve-applications)=
## Step 7: In-place update for Ray Serve applications

You can update the configurations for the applications by modifying `serveConfigV2` in the RayService config file. Reapplying the modified config with `kubectl apply` reapplies the new configurations to the existing RayCluster instead of creating a new RayCluster.
Expand Down Expand Up @@ -214,15 +215,16 @@ curl -X POST -H 'Content-Type: application/json' rayservice-sample-serve-svc:800
# [Expected output]: 8
```

(step-8-zero-downtime-upgrade-for-ray-clusters)=
## Step 8: Zero downtime upgrade for Ray clusters

In Step 7, modifying `serveConfigV2` doesn't trigger a zero downtime upgrade for Ray clusters.
Instead, it reapplies the new configurations to the existing RayCluster.
However, if you modify `spec.rayClusterConfig` in the RayService YAML file, it triggers a zero downtime upgrade for Ray clusters.
RayService temporarily creates a new RayCluster and waits for it to be ready, then switches traffic to the new RayCluster by updating the selector of the head service managed by RayService (that is, `rayservice-sample-head-svc`) and terminates the old one.

During the zero downtime upgrade process, RayService creates a new RayCluster temporarily and waits for it to become ready.
Once the new RayCluster is ready, RayService updates the selector of the head service managed by RayService (that is, `rayservice-sample-head-svc`) to point to the new RayCluster to switch the traffic to the new RayCluster.
Finally, the old RayCluster is terminated.
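One way to watch the switch-over happen (resource names follow this example):

```sh
# In one terminal: watch the old and new RayClusters side by side during the upgrade.
kubectl get raycluster -w
# In another: check which cluster's head Pod the head service currently selects.
kubectl describe svc rayservice-sample-head-svc | grep Selector
```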

Certain exceptions don't trigger a zero downtime upgrade.
Expand Down
22 changes: 13 additions & 9 deletions doc/source/data/examples/batch_training.ipynb
Expand Up @@ -142,7 +142,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating a Dataset <a class=\"anchor\" id=\"create_ds\"></a>"
"(create_ds)=\n",
"## Creating a Dataset"
]
},
{
Expand Down Expand Up @@ -230,7 +231,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filtering a Dataset on Read <a class=\"anchor\" id=\"filter_ds\"></a>\n",
"(filter_ds)=\n",
"### Filtering a Dataset on Read\n",
"\n",
"Normally there is some last-mile data processing required before training. Let's just assume we know the data processing steps are:\n",
"- Drop negative trip distances, 0 fares, 0 passengers.\n",
Expand Down Expand Up @@ -309,7 +311,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Inspecting a Dataset <a class=\"anchor\" id=\"inspect_ds\"></a>\n",
"(inspect_ds)=\n",
"### Inspecting a Dataset\n",
"\n",
"Let's get some basic statistics about our newly created Dataset.\n",
"\n",
Expand Down Expand Up @@ -394,11 +397,11 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Transforming a Dataset in parallel using custom functions <a class=\"anchor\" id=\"transform_ds\"></a>\n",
"(transform_ds)=\n",
"### Transforming a Dataset in parallel using custom functions\n",
"\n",
"Ray Data allows you to specify custom data transform functions. These [user defined functions (UDFs)](transforming_data) can be called using `Dataset.map_batches(my_function)`. The transformation will be conducted in parallel for each data batch.\n",
"\n",
Expand Down Expand Up @@ -496,7 +499,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Batch training with Ray Data <a class=\"anchor\" id=\"batch_train_ds\"></a>"
"(batch_train_ds)=\n",
"## Batch training with Ray Data"
]
},
{
Expand Down Expand Up @@ -668,7 +672,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
Expand Down Expand Up @@ -1028,7 +1031,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Re-load a model and perform batch prediction <a class=\"anchor\" id=\"load_model\"></a>"
"(load_model)=\n",
"## Re-load a model and perform batch prediction"
]
},
{
Expand Down Expand Up @@ -1292,7 +1296,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
"version": "3.10.8"
},
"vscode": {
"interpreter": {
Expand Down
2 changes: 1 addition & 1 deletion doc/source/data/key-concepts.rst
Expand Up @@ -165,7 +165,7 @@ or remote filesystems.

.. testoutput::
:options: +MOCK

['..._000000.parquet', '..._000001.parquet']


Expand Down
11 changes: 2 additions & 9 deletions doc/source/ray-contribute/docs.ipynb
Expand Up @@ -287,8 +287,7 @@
"Here's an example:\n",
"\n",
"````markdown\n",
"```{code-cell} python3\n",
":tags: [hide-cell]\n",
"```python\n",
"\n",
"import ray\n",
"import ray.rllib.agents.ppo as ppo\n",
Expand Down Expand Up @@ -316,11 +315,7 @@
"cell_type": "code",
"execution_count": null,
"id": "78cac353",
"metadata": {
"tags": [
"hide-cell"
]
},
"metadata": {},
"outputs": [],
"source": [
"import ray\n",
Expand All @@ -347,8 +342,6 @@
"id": "d716d0bd",
"metadata": {},
"source": [
"As you can see, the code block is hidden, but you can expand it by click on the \"+\" button.\n",
"\n",
"### Tags for your notebook\n",
"\n",
"What makes this work is the `:tags: [hide-cell]` directive in the `code-cell`.\n",
Expand Down