Merged
35 changes: 8 additions & 27 deletions docs/source/getting_started/quickstart.rst
@@ -11,6 +11,14 @@ This guide shows how to use vLLM to:

Be sure to complete the :ref:`installation instructions <installation>` before continuing with this guide.

.. note::

By default, vLLM downloads models from `HuggingFace <https://huggingface.co/>`_. To use models from `ModelScope <https://www.modelscope.cn>`_ in the following examples, set the environment variable:

.. code-block:: shell

export VLLM_USE_MODELSCOPE=True

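The same switch can be flipped from Python before vLLM is imported; a minimal sketch (the model name and revision are illustrative, taken from the sections this change removes below):

```python
import os

# Opt in to ModelScope downloads; set this before vLLM reads its
# environment configuration (i.e. before importing vllm).
os.environ["VLLM_USE_MODELSCOPE"] = "True"

# Illustrative ModelScope model; any model hosted on
# www.modelscope.cn is named the same way:
# from vllm import LLM
# llm = LLM(model="qwen/Qwen-7B-Chat", revision="v1.1.8",
#           trust_remote_code=True)
```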
Offline Batched Inference
-------------------------

@@ -40,16 +48,6 @@ Initialize vLLM's engine for offline inference with the ``LLM`` class and the `O

llm = LLM(model="facebook/opt-125m")

Use model from www.modelscope.cn

.. code-block:: shell

export VLLM_USE_MODELSCOPE=True

.. code-block:: python

llm = LLM(model="qwen/Qwen-7B-Chat", revision="v1.1.8", trust_remote_code=True)

Call ``llm.generate`` to generate the outputs. It adds the input prompts to the vLLM engine's waiting queue and runs the engine to produce the outputs with high throughput. The outputs are returned as a list of ``RequestOutput`` objects, which include all of the generated output tokens.

.. code-block:: python
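The body of the Python block above is collapsed in this diff view; as a hedged sketch of the elided call, with the vLLM-specific lines commented out so only the illustrative inputs run (the prompts and sampling settings here are assumptions, not the quickstart's exact values):

```python
# Illustrative prompts; the quickstart's actual code is collapsed above.
prompts = ["Hello, my name is", "The capital of France is"]

# Sketch of the generate call (requires a constructed ``llm``):
# from vllm import SamplingParams
# sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# outputs = llm.generate(prompts, sampling_params)
# for output in outputs:
#     print(output.prompt, output.outputs[0].text)
```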
@@ -77,16 +75,6 @@ Start the server:

$ python -m vllm.entrypoints.api_server

Use model from www.modelscope.cn

.. code-block:: console

$ VLLM_USE_MODELSCOPE=True python -m vllm.entrypoints.api_server \
$ --model="qwen/Qwen-7B-Chat" \
$ --revision="v1.1.8" \
$ --trust-remote-code


By default, this command starts the server at ``http://localhost:8000`` with the OPT-125M model.

Query the model in shell:
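The query example itself is collapsed in this diff; as a stand-in, a sketch of the JSON body the demo server accepts (the ``/generate`` endpoint, prompt, and parameter names are assumed from vLLM's quickstart curl example):

```python
import json

# Request body for POST http://localhost:8000/generate
# (field names assumed from the quickstart's curl example).
payload = {
    "prompt": "San Francisco is a",
    "use_beam_search": True,
    "n": 4,
    "temperature": 0,
}
body = json.dumps(payload)
```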
Expand Down Expand Up @@ -116,13 +104,6 @@ Start the server:
$ python -m vllm.entrypoints.openai.api_server \
$ --model facebook/opt-125m

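Once this server is up, it speaks the OpenAI completions protocol; a sketch of a request body for ``/v1/completions`` (field names follow the OpenAI API schema, values here are illustrative):

```python
import json

# Body for POST http://localhost:8000/v1/completions on the
# OpenAI-compatible server started above.
payload = {
    "model": "facebook/opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0,
}
body = json.dumps(payload)
```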
Use model from www.modelscope.cn

.. code-block:: console

$ VLLM_USE_MODELSCOPE=True python -m vllm.entrypoints.openai.api_server \
$ --model="qwen/Qwen-7B-Chat" --revision="v1.1.8" --trust-remote-code

By default, the server uses a predefined chat template stored in the tokenizer. You can override this template by using the ``--chat-template`` argument:

.. code-block:: console