
[Doc] Added cerebrium as Integration option (vllm-project#5553)
milo157 authored and robertgshaw2-redhat committed Jun 23, 2024
1 parent 3d24777 commit c5ef2f9
Showing 2 changed files with 110 additions and 0 deletions.
109 changes: 109 additions & 0 deletions docs/source/serving/deploying_with_cerebrium.rst
@@ -0,0 +1,109 @@
.. _deploying_with_cerebrium:

Deploying with Cerebrium
============================

.. raw:: html

<p align="center">
<img src="https://i.ibb.co/hHcScTT/Screenshot-2024-06-13-at-10-14-54.png" alt="vLLM_plus_cerebrium"/>
</p>

vLLM can be run on a cloud-based GPU machine with `Cerebrium <https://www.cerebrium.ai/>`__, a serverless AI infrastructure platform that makes it easier for companies to build and deploy AI-based applications.

To install the Cerebrium client, run:

.. code-block:: console

    $ pip install cerebrium
    $ cerebrium login

Next, create your Cerebrium project by running:

.. code-block:: console

    $ cerebrium init vllm-project

Next, to install the required packages, add the following to your cerebrium.toml:

.. code-block:: toml

    [cerebrium.dependencies.pip]
    vllm = "latest"

Next, let us add the code that handles inference for the LLM of your choice (`mistralai/Mistral-7B-Instruct-v0.1` in this example). Add the following code to your `main.py`:

.. code-block:: python

    from vllm import LLM, SamplingParams

    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")

    def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
        sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
        outputs = llm.generate(prompts, sampling_params)

        # Collect the outputs.
        results = []
        for output in outputs:
            prompt = output.prompt
            generated_text = output.outputs[0].text
            results.append({"prompt": prompt, "generated_text": generated_text})

        return {"results": results}

Then, run the following command to deploy it to the cloud:

.. code-block:: console

    $ cerebrium deploy

If successful, you should be returned a CURL command with which you can call inference. Just remember to end the URL with the function name you are calling (in our case, `/run`).

.. code-block:: console

    curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
         -H 'Content-Type: application/json' \
         -H 'Authorization: <JWT TOKEN>' \
         --data '{
           "prompts": [
             "Hello, my name is",
             "The president of the United States is",
             "The capital of France is",
             "The future of AI is"
           ]
         }'

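Equivalently, you can call the endpoint from Python. The sketch below assumes the `requests` package is installed; the URL and JWT token are the same placeholders as in the CURL command above and must be replaced with the values returned by `cerebrium deploy`.

.. code-block:: python

    import requests

    # Placeholder endpoint and token: substitute the values from `cerebrium deploy`.
    url = "https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run"
    headers = {
        "Content-Type": "application/json",
        "Authorization": "<JWT TOKEN>",
    }
    payload = {
        "prompts": [
            "Hello, my name is",
            "The president of the United States is",
            "The capital of France is",
            "The future of AI is",
        ]
    }

    response = requests.post(url, headers=headers, json=payload)
    response.raise_for_status()
    print(response.json())
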
You should get a response like:

.. code-block:: json

    {
        "run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
        "result": {
            "result": [
                {
                    "prompt": "Hello, my name is",
                    "generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of"
                },
                {
                    "prompt": "The president of the United States is",
                    "generated_text": " elected every four years. This is a democratic system.\n\n5. What"
                },
                {
                    "prompt": "The capital of France is",
                    "generated_text": " Paris.\n"
                },
                {
                    "prompt": "The future of AI is",
                    "generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective."
                }
            ]
        },
        "run_time_ms": 152.53663063049316
    }

You now have an autoscaling endpoint where you only pay for the compute you use!

1 change: 1 addition & 0 deletions docs/source/serving/integrations.rst
@@ -8,6 +8,7 @@ Integrations
deploying_with_kserve
deploying_with_triton
deploying_with_bentoml
deploying_with_cerebrium
deploying_with_lws
deploying_with_dstack
serving_with_langchain
