|**[Roadmap](https://github.com/ai-dynamo/dynamo/issues/762)**|**[User Guides](https://docs.nvidia.com/dynamo/latest/index.html)**|**[Support Matrix](docs/support_matrix.md)**|**[Architecture and Features](docs/architecture/architecture.md)**|**[APIs](lib/bindings/python/README.md)**|**[SDK](deploy/dynamo/sdk/README.md)**|
### 📢 **Please join us for our [first Dynamo in-person meetup with vLLM and SGLang leads](https://events.nvidia.com/nvidiadynamousermeetups) on 6/5 (Thu) in SF!**
---

**docs/API/sdk.md** (1 addition, 9 deletions)
# Dynamo SDK
## Introduction
Dynamo is a flexible and performant distributed inferencing solution for large-scale deployments. It is an ecosystem of tools, frameworks, and abstractions that makes the design, customization, and deployment of frontier-level models onto datacenter-scale infrastructure easy to reason about and optimized for your specific inferencing workloads. Dynamo's core is written in Rust and contains a set of well-defined Python bindings. See [Python Bindings](./python_bindings.md).
Dynamo SDK is a layer on top of the core. It is a Python framework that makes it easy to create inference graphs and deploy them locally and onto a target K8s cluster. The SDK was heavily inspired by [BentoML's](https://github.com/bentoml/BentoML) open source deployment patterns. The Dynamo CLI is a companion tool that allows you to spin up an inference pipeline locally, containerize it, and deploy it. You can find a toy hello-world example and instructions for deploying it [here](../examples/hello_world.md).
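The inference-graph idea can be sketched in plain Python. The `Service` class and the Frontend/Middle/Backend names below are illustrative stand-ins, not the SDK's real API; see the hello-world example linked above for the actual decorators and deployment flow:

```python
# A toy sketch of composing services into an inference graph: each service
# transforms a request and optionally forwards it to a downstream service.
# This mirrors the hello-world pipeline's Frontend -> Middle -> Backend layout.
class Service:
    def __init__(self, name, fn, downstream=None):
        self.name = name
        self.fn = fn
        self.downstream = downstream

    def __call__(self, request):
        out = self.fn(request)
        return self.downstream(out) if self.downstream else out

backend = Service("Backend", lambda s: s + " world!")
middle = Service("Middle", lambda s: s + " from middle", downstream=backend)
frontend = Service("Frontend", lambda s: s, downstream=middle)

print(frontend("Hello"))  # Hello from middle world!
```

The real SDK expresses the same composition declaratively and handles wiring the services up locally or on a Kubernetes cluster.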
---

**docs/architecture/architecture.md** (6 additions, 6 deletions)
Dynamo is NVIDIA's high-throughput, low-latency inference framework that's designed to serve generative AI and reasoning models in multi-node distributed environments. It's inference engine agnostic, supporting TRT-LLM, vLLM, SGLang and others, while capturing essential LLM capabilities:
- **Disaggregated prefill & decode inference** – Maximizes GPU throughput and helps you balance throughput and latency
- **Dynamic GPU scheduling** – Optimizes performance based on real-time demand
- **Accelerated data transfer** – Reduces inference response time using NIXL
- **KV cache offloading** – Uses multiple memory hierarchies for higher system throughput and lower latency
Built in Rust for performance and in Python for extensibility, Dynamo is fully open source and driven by a transparent, Open Source Software (OSS)-first development approach.
---

**docs/architecture/distributed_runtime.md** (5 additions, 5 deletions)
Dynamo uses a `Client` object to call an endpoint. When a `Client` object is created, it is given the names of the `Namespace`, `Component`, and `Endpoint`. It then sets up an etcd watcher to monitor the prefix `/services/{namespace}/{component}/{endpoint}`. The etcd watcher continuously updates the `Client` with information about the available `Endpoint`s, including their `lease_id`s and NATS subjects.
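The discovery scheme above can be sketched with a stand-in for the etcd watcher: the client derives its watch prefix from the three names and keeps a local map of live endpoints. The class and method names here are illustrative, not Dynamo's actual API:

```python
# Minimal sketch of endpoint discovery: a Client watches one etcd prefix and
# mirrors put/delete events into a local lease_id -> NATS subject map.
def watch_prefix(namespace: str, component: str, endpoint: str) -> str:
    return f"/services/{namespace}/{component}/{endpoint}"

class Client:
    def __init__(self, namespace: str, component: str, endpoint: str):
        self.prefix = watch_prefix(namespace, component, endpoint)
        self.endpoints: dict[int, str] = {}  # lease_id -> NATS subject

    def on_put(self, key: str, lease_id: int, subject: str) -> None:
        # The watcher only delivers keys under our prefix; guard anyway.
        if key.startswith(self.prefix):
            self.endpoints[lease_id] = subject

    def on_delete(self, lease_id: int) -> None:
        # A worker's lease expired or it shut down; drop it from the map.
        self.endpoints.pop(lease_id, None)

client = Client("dynamo", "backend", "generate")
client.on_put(client.prefix + "/1001", 1001, "dynamo.backend.generate-1001")
print(client.prefix)  # /services/dynamo/backend/generate
```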
The user can decide which load balancing strategy to use when calling the `Endpoint` from the `Client`, which is done in [push_router.rs](../../lib/runtime/src/pipeline/network/egress/push_router.rs). Dynamo supports three load balancing strategies:
- `random`: randomly select an endpoint to hit
- `round_robin`: select endpoints in round-robin order
- `direct`: direct the request to a specific endpoint by specifying the `lease_id` of the endpoint
After selecting which endpoint to hit, the `Client` sends the serialized request to the NATS subject of the selected `Endpoint`. The `Endpoint` receives the request and creates a TCP response stream using the connection information from the request, which establishes a direct TCP connection to the `Client`. Then, as the worker generates responses, it serializes each response chunk and sends the serialized data over the TCP connection.
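The streaming leg can be sketched with a socket pair standing in for the direct TCP connection; the length-prefixed JSON framing below is an assumption for illustration, not Dynamo's actual wire format:

```python
import json
import socket

# socketpair() stands in for the direct TCP connection between worker and client.
client_sock, worker_sock = socket.socketpair()

# Worker side: serialize each response chunk, length-prefix it, and stream it;
# closing the socket signals end-of-stream.
for chunk in ["Hello", " world"]:
    payload = json.dumps({"text": chunk}).encode()
    worker_sock.sendall(len(payload).to_bytes(4, "big") + payload)
worker_sock.close()

# Client side: read length-prefixed chunks until the stream closes.
chunks = []
while True:
    header = client_sock.recv(4)
    if not header:
        break
    size = int.from_bytes(header, "big")
    body = b""
    while len(body) < size:
        body += client_sock.recv(size - len(body))
    chunks.append(json.loads(body)["text"])

print("".join(chunks))  # Hello world
```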
We provide native Rust and Python (through bindings) examples for basic usage of `DistributedRuntime`:
- Python: `/lib/bindings/python/examples/`. We also provide a complete example of using `DistributedRuntime` for communication and Dynamo's LLM library for prompt templates and (de)tokenization to deploy a vLLM-based service. Please refer to `lib/bindings/python/examples/hello_world/server_vllm.py` for details.
```{note}
Building a vLLM docker image for ARM machines currently involves building vLLM from source, which is known to be slow and requires extensive system RAM; see [vLLM Issue 8878](https://github.com/vllm-project/vllm/issues/8878).
You can tune the number of parallel build jobs for building vLLM from source on ARM based on your available cores and system RAM with `VLLM_MAX_JOBS`.
```
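For example, one conservative heuristic (an assumption, not official guidance) is to allow one build job per ~8 GiB of RAM, capped at the core count, so the from-source build does not exhaust memory:

```shell
# Derive a memory-safe job count for the vLLM source build (heuristic).
CORES=$(nproc)
MEM_GIB=$(awk '/MemTotal/ {print int($2 / 1048576)}' /proc/meminfo)
JOBS=$(( MEM_GIB / 8 ))          # ~8 GiB of RAM per parallel compile job
[ "$JOBS" -lt 1 ] && JOBS=1      # always run at least one job
[ "$JOBS" -gt "$CORES" ] && JOBS=$CORES  # never exceed the core count
export VLLM_MAX_JOBS=$JOBS
echo "VLLM_MAX_JOBS=$VLLM_MAX_JOBS"
```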