commit for docs site (#60)
Co-authored-by: Sri Tikkireddy <[email protected]>
stikkireddy and Sri Tikkireddy authored Sep 23, 2024
1 parent 8e6d4df commit 131909d
Showing 30 changed files with 2,376 additions and 62 deletions.
29 changes: 29 additions & 0 deletions .github/workflows/onpush-docs.yml
@@ -0,0 +1,29 @@
name: docs-publish
on:
  push:
    branches:
      - master
      - main
permissions:
  contents: write
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure Git Credentials
        run: |
          git config user.name github-actions[bot]
          git config user.email 41898282+github-actions[bot]@users.noreply.github.com
      - uses: actions/setup-python@v5
        with:
          python-version: 3.x
      - run: echo "cache_id=$(date --utc '+%V')" >> $GITHUB_ENV
      - uses: actions/cache@v4
        with:
          key: mkdocs-material-${{ env.cache_id }}
          path: .cache
          restore-keys: |
            mkdocs-material-
      - run: pip install mkdocs-material mkdocs-jupyter
      - run: mkdocs gh-deploy --force
9 changes: 7 additions & 2 deletions Makefile
@@ -15,7 +15,8 @@ INTEGRATION ?= false


BDIST := $(PYTHON) setup.py bdist_wheel sdist
PIP_INSTALL := $(PYTHON) -m pip install
MKDOCS := mkdocs
PYTEST := pytest -s -n auto
BLACK := black --line-length 88
ISORT := isort --profile black --line-length 88
@@ -58,6 +59,10 @@ clean:
	@$(FIND) $(SCRIPTS_DIR) -name \*.pyo -exec rm -f {} \;
	@echo "Finishing cleaning up intermediate artifacts."

docs:
	@echo "Serving documentation..."
	@$(MKDOCS) serve

distclean: clean
	@echo "Cleaning up distribution artifacts..."
	@$(RM) $(DIST_DIR)
@@ -114,4 +119,4 @@ help:

	@true

.PHONY: build upload check fmt clean distclean test coverage help docs
78 changes: 42 additions & 36 deletions docs/custom_server_architecture.md
@@ -1,70 +1,76 @@
# EzDeploy Architecture

This document goes over how custom server frameworks like vLLM, SGLang, and Ray Serve work in the context of
mlflow-extensions. It explains the architecture of the custom server frameworks and how they are integrated
into the default scoring server for mlflow and Databricks deployments.

## Supported Custom Server Frameworks

1. vLLM
2. SGLang
3. Ray Serve [WIP]
4. Ollama

## Custom Pyfunc Implementation

All of these frameworks typically run in their own processes and have their own command to boot up. Every framework is
wrapped under a generic pyfunc that proxies /invocations requests to the custom server. The custom pyfunc also
serializes and deserializes httpx request and response objects between the client (OpenAI, httpx, SGLang) and the
custom server implementation.

Here is a high level diagram of the architecture:

![mlflow-extensions-server.png](static/mlflow-extensions-server.png)


Mlflow pyfuncs are split into three main parts: `__init__`, `load_context`, and `predict`. The constructor in this case
is a very simple implementation. The `load_context` function is where the interesting bits happen.
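
As a rough illustration of this split, here is a minimal, hypothetical pyfunc wrapper. The class name, port, server
command, and payload format are assumptions for illustration and do not mirror the actual mlflow-extensions classes.

```python
# Hypothetical sketch of a proxying pyfunc; names, port, and payload format are
# illustrative assumptions, not the actual mlflow-extensions implementation.
import json
import subprocess

import httpx
import mlflow.pyfunc


class CustomServerPyfunc(mlflow.pyfunc.PythonModel):
    def __init__(self, server_command: list, port: int = 9989):
        # Trivial constructor: just remember how to boot the server.
        self.server_command = server_command
        self.port = port
        self.client = None

    def load_context(self, context):
        # Boot the inference server (vLLM, SGLang, ...) as its own process.
        subprocess.Popen(self.server_command)
        self.client = httpx.Client(base_url=f"http://127.0.0.1:{self.port}")

    def predict(self, context, model_input, params=None):
        # model_input carries a serialized httpx request; forward it to the
        # custom server and return the serialized response.
        payload = json.loads(model_input["request"][0])
        resp = self.client.request(
            method=payload["method"],
            url=payload["path"],
            headers=payload.get("headers"),
            content=payload.get("body"),
        )
        return [json.dumps({"status": resp.status_code, "body": resp.text})]
```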


### Understanding `load_context`

Before we dive into `load_context`, it is important to understand that when a model spawns in the scoring server it
spawns multiple workers; in the case of mlflow, gunicorn is used as the server framework. Each worker will have its
own instance of the model. This is important because some large models won't have enough resources if you spawn
more than one instance of the model. Instead, `load_context` on each worker acquires a file lock (using the filelock
library) on a filesystem file. This allows the various gunicorn processes to coordinate on a single leader, which
spawns the instance of the model serving framework (vLLM, SGLang, etc.). This matters because the model serving
framework has its own process and is able to handle multiple requests at once. The last part of the `load_context`
function is to run a separate process that ensures the server framework is healthy, and up and running. This is
important because the SOTA server frameworks can sometimes be a bit unstable and throw segfaults, crash, etc. This
health check restarts the server if it is down.

Some room for improvement here would be to ensure that when the model is not in a healthy state, requests let the
clients know that the model is not healthy. This is important because the serving endpoint may be up while the model
is down and in a recovering state.
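
A rough sketch of the leader election and health check described above, using the filelock library. The lock path,
port, health endpoint, and `vllm serve` command are assumptions for illustration, not the real implementation.

```python
# Illustrative sketch of filelock-based leader election plus a health check;
# the lock path, port, health endpoint, and server command are assumptions.
import subprocess
import threading
import time

import httpx
from filelock import FileLock, Timeout

LOCK_PATH = "/tmp/model-server.lock"         # shared by all gunicorn workers
HEALTH_URL = "http://127.0.0.1:9989/health"  # assumed health endpoint


def start_server() -> subprocess.Popen:
    return subprocess.Popen(["vllm", "serve", "some-model", "--port", "9989"])


def health_loop(proc: subprocess.Popen) -> None:
    # Restart the server if it crashes or stops responding (the doc describes
    # a separate process; a daemon thread is used here for brevity).
    while True:
        time.sleep(10)
        try:
            httpx.get(HEALTH_URL, timeout=5).raise_for_status()
        except Exception:
            if proc.poll() is None:
                proc.kill()
            proc = start_server()


def coordinate_workers() -> None:
    lock = FileLock(LOCK_PATH)
    try:
        # Only the worker that grabs the lock becomes the leader and boots
        # the serving framework; the others just proxy to it.
        lock.acquire(timeout=0)
        proc = start_server()
        threading.Thread(target=health_loop, args=(proc,), daemon=True).start()
    except Timeout:
        pass  # follower: wait for the leader's server to come up
```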

### Understanding `predict`

The goal of `predict` is to proxy the request to the custom server. This is done by serializing the httpx Request into
a json string and sending it to the custom server. The custom server then deserializes the json string back into an
httpx Request object, runs the request, and returns the proper response back to the scoring server. The scoring server
then serializes the response back into a json string and sends it back to the client. The client finally deserializes
the json string back into an httpx Response object. The serialization and deserialization of the httpx Request and
Response objects is the biggest overhead in the system. This is important to note because the custom server has to be
able to handle the serialization and deserialization of the httpx objects efficiently. We have not currently measured
the overhead of this.
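
For illustration, the serialization round trip might look roughly like this; the exact JSON field names used by
mlflow-extensions are assumptions.

```python
# Rough sketch of the request/response round trip; the JSON field names here
# are assumptions, not the wire format actually used by mlflow-extensions.
import base64
import json

import httpx


def serialize_request(req: httpx.Request) -> str:
    # Client side: turn an httpx Request into a json string for /invocations.
    return json.dumps({
        "method": req.method,
        "url": str(req.url),
        "headers": dict(req.headers),
        "body": base64.b64encode(req.read()).decode(),
    })


def deserialize_response(payload: str) -> httpx.Response:
    # Client side: rebuild an httpx Response from the scoring server's json.
    data = json.loads(payload)
    return httpx.Response(
        status_code=data["status"],
        headers=data.get("headers"),
        content=base64.b64decode(data["body"]),
    )
```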

The trade-off, though, is that the mlflow scoring server does not support dynamic/adaptive batching. The client has to
handle batching requests to the scoring server. Most models like LLMs, transformers, etc. have a batching strategy to
load batches of text, images, etc. into the model, and this framework will allow you to do that. So this is the
trade-off you will have to make when using the custom server frameworks (e2e latency over dynamic batching).
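
Since batching is left to the caller, one simple option is to fan out concurrent requests from the client. A hedged
sketch, assuming a Databricks serving endpoint URL and bearer-token auth as placeholders:

```python
# Minimal client-side batching via concurrent requests; endpoint URL, auth, and
# payload shape are placeholders to replace with your own workspace values.
import asyncio

import httpx

ENDPOINT = "https://<workspace>/serving-endpoints/<endpoint-name>/invocations"


async def score_batch(payloads: list, token: str) -> list:
    headers = {"Authorization": f"Bearer {token}"}
    async with httpx.AsyncClient(headers=headers, timeout=120) as client:
        tasks = [client.post(ENDPOINT, json=p) for p in payloads]
        responses = await asyncio.gather(*tasks)
        return [r.json() for r in responses]


# Example: asyncio.run(score_batch([payload_1, payload_2], token="<databricks-token>"))
```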

### Compatibility Clients

The `mlflow_extensions.serving.compat` module has a few clients that are compatible with the custom server frameworks.
These clients are `OpenAI`, `ChatOpenAI`, `RuntimeEndpoint` (sglang), and the standard httpx sync and async clients.
These clients are used to serialize and deserialize requests to the custom server.
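
As a hypothetical usage sketch: the exact import path and constructor arguments below are assumptions based on the
module described above, so check the package for the real signatures.

```python
# Hypothetical usage of the compat OpenAI client; the import path and
# constructor arguments are assumptions, not verified against the package.
from mlflow_extensions.serving.compat.openai import OpenAI  # path may differ

client = OpenAI(
    base_url="https://<workspace>/serving-endpoints/<endpoint-name>/invocations",
    api_key="<databricks-token>",
)
response = client.chat.completions.create(
    model="default",  # model name registered with the serving framework
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```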

### Request flow

@@ -81,5 +87,5 @@ In this the pyfunc wrapper refers to the gunicorn workers/scoring server (mlflow
9. Client deserializes the json string back into an httpx Response object
10. Compatibility OpenAI Client receives the response and transforms it into a chat completion response

This is the detailed, step-by-step process of how the request flows through the system. It all happens very quickly,
but this is the indirection that is happening in the system.