1. Input data is passed to :class:`~vllm.LLMEngine` (or :class:`~vllm.AsyncLLMEngine`).
2. Tokenize the data if necessary.
3. Process the inputs using :meth:`INPUT_REGISTRY.process_input <vllm.inputs.registry.InputRegistry.process_input>`.

   - For example, add placeholder tokens to reserve KV cache for multi-modal embeddings.

4. Send the processed inputs to :class:`~vllm.executor.executor_base.ExecutorBase`.
5. Distribute the inputs via :class:`~vllm.worker.worker_base.WorkerBase` to :class:`~vllm.worker.model_runner_base.ModelRunnerBase`.
6. If the data contains multi-modal data, convert it into keyword arguments using :meth:`MULTIMODAL_REGISTRY.map_input <vllm.multimodal.MultiModalRegistry.map_input>`.

   - For example, convert a :class:`PIL.Image.Image` input to its pixel values for a vision language model.
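The registry dispatch in steps 3 and 6 can be sketched as a toy version of the pattern: processors are registered per model type and applied to the inputs before they reach the executor. This is an illustrative stand-in, not vLLM's actual implementation; the class and function names here (``ToyInputRegistry``, ``add_image_placeholders``) are invented for the example.

```python
from typing import Callable, Dict


class ToyInputRegistry:
    """Illustrative stand-in for vLLM's INPUT_REGISTRY dispatch (not the real API)."""

    def __init__(self) -> None:
        self._processors: Dict[str, Callable[[dict], dict]] = {}

    def register(self, model_type: str, fn: Callable[[dict], dict]) -> None:
        self._processors[model_type] = fn

    def process_input(self, model_type: str, inputs: dict) -> dict:
        # Models without a registered processor pass through unchanged.
        return self._processors.get(model_type, lambda x: x)(inputs)


def add_image_placeholders(inputs: dict) -> dict:
    # Mimics step 3's example: prepend placeholder token ids (id 0 here,
    # purely illustrative) to reserve KV-cache slots for image embeddings.
    n = inputs.get("num_image_tokens", 0)
    inputs["prompt_token_ids"] = [0] * n + inputs["prompt_token_ids"]
    return inputs


registry = ToyInputRegistry()
registry.register("toy-vlm", add_image_placeholders)

processed = registry.process_input(
    "toy-vlm", {"prompt_token_ids": [101, 2023], "num_image_tokens": 3}
)
print(processed["prompt_token_ids"])  # [0, 0, 0, 101, 2023]
```

The key design point this mirrors is that per-model input processing is decoupled from the engine: the engine calls a single ``process_input`` hook, and each model registers its own behavior.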
docs/source/models/adding_model.rst (+2 −2)
@@ -37,7 +37,7 @@ For instance, vLLM's `OPT model <https://github.com/vllm-project/vllm/blob/main/
 2. Rewrite the :code:`forward` methods
 --------------------------------------
 
-Next, you need to rewrite the :code:`forward` methods of your model by following these steps:
+Next, you need to rewrite the :meth:`~torch.nn.Module.forward` method of your model by following these steps:
 
 1. Remove any unnecessary code, such as the code only used for training.
 2. Change the input parameters:
@@ -75,7 +75,7 @@ Next, you need to rewrite the :code:`forward` methods of your model by following
 
 If your model is too large to fit into a single GPU, you can use tensor parallelism to manage it.
 To do this, substitute your model's linear and embedding layers with their tensor-parallel versions.
-For the embedding layer, you can simply replace :code:`nn.Embedding` with :code:`VocabParallelEmbedding`. For the output LM head, you can use :code:`ParallelLMHead`.
+For the embedding layer, you can simply replace :class:`torch.nn.Embedding` with :code:`VocabParallelEmbedding`. For the output LM head, you can use :code:`ParallelLMHead`.
 When it comes to the linear layers, we provide the following options to parallelize them:
 
 * :code:`ReplicatedLinear`: Replicates the inputs and weights across multiple GPUs. No memory saving.
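The idea behind a vocab-parallel embedding (what ``VocabParallelEmbedding`` does on real GPUs) can be sketched in plain Python: each rank stores only a slice of the embedding table, a lookup returns zeros for ids that live on another rank, and summing across ranks (the all-reduce) recovers the full lookup. Everything below is a toy model of the technique, not vLLM's implementation; the table values and sizes are arbitrary.

```python
# Toy vocab-parallel embedding: the vocabulary is sharded across ranks.
VOCAB, DIM, WORLD = 8, 4, 2          # vocab size, embedding dim, number of ranks
SHARD = VOCAB // WORLD               # rows of the table held by each rank

# Full table, used only to build the shards and to check the result.
full_table = [[float(v * DIM + d) for d in range(DIM)] for v in range(VOCAB)]
shards = [full_table[r * SHARD:(r + 1) * SHARD] for r in range(WORLD)]


def shard_lookup(rank: int, token_id: int) -> list:
    """Look up a token on one rank; out-of-shard ids contribute zeros."""
    lo, hi = rank * SHARD, (rank + 1) * SHARD
    if lo <= token_id < hi:
        return shards[rank][token_id - lo]
    return [0.0] * DIM               # this id lives on another rank


def all_reduce(vectors: list) -> list:
    """Element-wise sum across ranks, standing in for a real all-reduce."""
    return [sum(col) for col in zip(*vectors)]


token_id = 5                         # held by rank 1 when WORLD == 2
embedding = all_reduce([shard_lookup(r, token_id) for r in range(WORLD)])
assert embedding == full_table[token_id]
```

Each rank thus stores only ``VOCAB // WORLD`` rows, which is where the memory saving comes from; the all-reduce is the communication cost paid per lookup.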