143 changes: 135 additions & 8 deletions docs/advanced_features/lora.ipynb
@@ -394,21 +394,21 @@
]
},
{
"cell_type": "markdown",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### OpenAI-compatible API usage\n",
"\n",
"You can use LoRA adapters via the OpenAI-compatible APIs by specifying the adapter in the `model` field using the `base-model:adapter-name` syntax (for example, `qwen/qwen2.5-0.5b-instruct:adapter_a`). For more details and examples, see the “Using LoRA Adapters” section in the OpenAI API documentation: [openai_api_completions.ipynb](../basic_usage/openai_api_completions.ipynb).\n"
"terminate_process(server_process)"
]
},
{
"cell_type": "code",
"execution_count": null,
"cell_type": "markdown",
"metadata": {},
"outputs": [],
"source": [
"terminate_process(server_process)"
"### OpenAI-compatible API usage\n",
"\n",
"You can use LoRA adapters via the OpenAI-compatible APIs by specifying the adapter in the `model` field using the `base-model:adapter-name` syntax (for example, `qwen/qwen2.5-0.5b-instruct:adapter_a`). For more details and examples, see the “Using LoRA Adapters” section in the OpenAI API documentation: [openai_api_completions.ipynb](../basic_usage/openai_api_completions.ipynb).\n"
]
},
{
@@ -569,6 +569,133 @@
"terminate_process(server_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## LoRA Overlap Loading"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By using the `--enable-lora-overlap-loading` server argument, the SGLang engine is able to overlap the loading of LoRA weights with prefill and decode compute, essentially hiding the data movement for LoRA weights behind GPU computation. Our benchmarks show that under adversarial conditions, enabling this feature can result in a ~35% reduction in median TTFT - (see the [LoRA overlap loading PR](https://github.com/sgl-project/sglang/pull/15512) for detailed benchmarks)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"lora0 = \"Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16\"\n",
"lora1 = \"algoprog/fact-generation-llama-3.1-8b-instruct-lora\"\n",
"lora2 = \"philschmid/code-llama-3-1-8b-text-to-sql-lora\"\n",
"\n",
"\n",
"server_process, port = launch_server_cmd(\n",
" \"\"\"\n",
" python3 -m sglang.launch_server \\\n",
" --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
" --enable-lora \\\n",
" --enable-lora-overlap-loading \\\n",
" --lora-paths lora0=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16 \\\n",
" lora1=algoprog/fact-generation-llama-3.1-8b-instruct-lora \\\n",
" lora2=philschmid/code-llama-3-1-8b-text-to-sql-lora \\\n",
" --max-lora-rank 256 \\\n",
" --max-loras-per-batch 2 \\\n",
" --max-loaded-loras 4\n",
" \"\"\"\n",
")\n",
"\n",
"url = f\"http://127.0.0.1:{port}\"\n",
"wait_for_server(url)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"json_data = {\n",
" \"text\": [\n",
" \"Write a very long fairy-tale.\",\n",
" \"List 3 countries and their capitals.\",\n",
" \"List 3 countries and their capitals.\",\n",
" ],\n",
" \"sampling_params\": [\n",
" {\"max_new_tokens\": 1024, \"temperature\": 0},\n",
" {\"max_new_tokens\": 64, \"temperature\": 0},\n",
" {\"max_new_tokens\": 64, \"temperature\": 0},\n",
" ],\n",
" \"lora_path\": [\"lora0\", \"lora1\", \"lora2\"],\n",
"}\n",
"\n",
"# lora0 and lora1 will be loaded into the memory pool first, and because max_loras_per_batch = 2, lora2's request will remain in the queue.\n",
"# lora1's request will likely finish first, and once it does, lora2 will be loaded. With --enable-lora-overlap-loading, this loading will\n",
"# occur asynchronously and thus decoding for lora0's request won't be blocked.\n",
"response = requests.post(\n",
" url + \"/generate\",\n",
" json=json_data,\n",
")\n",
"\n",
"for i in range(3):\n",
" print(f\"Output from lora{i}: \\n{response.json()[i]['text']}\\n\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(server_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Limitations of LoRA Overlap Loading"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, LoRA overlap loading is not free and comes with two important caveats:\n",
"\n",
"1. **Pinned CPU memory requirement**:\n",
" Asynchronous H2D memory copies require LoRA weights to be pinned in CPU memory, which is a finite system resource. To mitigate excessive pinned-memory usage, SGLang currently restricts `max_loaded_loras` to be at most 2× `max_loras_per_batch` when LoRA overlap loading is enabled.\n",
"\n",
"2. **Reduced multi-adapter prefill batching**:\n",
" With overlap loading, adapters become available on the GPU at different times because each adapter is loaded asynchronously. This can reduce the scheduler’s ability to form multi-adapter prefill batches, since only requests whose adapters are currently loaded can be grouped together. As a result, requests for different adapters will be scheduled in separate (or smaller) prefill batches, which can increase TTFT when adapter load time is small compared to prefill compute time. This is why LoRA overlap loading is disabled by default: it should only be enabled when users have determined that LoRA weight loading is a bottleneck (EG high adapter churn, heavy adapter weights, or PCIe-bottlenecked workloads).\n"
]
},
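{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following is a minimal sketch (not part of the engine API) of the pinned-memory constraint from caveat 1: with `--enable-lora-overlap-loading`, `max_loaded_loras` may be at most 2× `max_loras_per_batch`. The variable names below are illustrative only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative check of the pinned-memory constraint described above.\n",
"# These are plain Python variables mirroring the server arguments, not engine APIs.\n",
"max_loras_per_batch = 2\n",
"max_loaded_loras = 4  # at most 2 * max_loras_per_batch when overlap loading is enabled\n",
"\n",
"assert max_loaded_loras <= 2 * max_loras_per_batch, (\n",
"    \"With --enable-lora-overlap-loading, keep max_loaded_loras <= 2 * max_loras_per_batch \"\n",
"    \"to bound pinned CPU memory usage.\"\n",
")"
]
},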
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Example When Overlap Loading Results in Higher Latency"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance, suppose we have four LoRA adapters: `lora0`, `lora1`, `lora2`, and `lora3`. Loading any adapter takes 2ms, while the prefill step for requests for that adapter takes 20ms.\n",
"\n",
"1. **Baseline**:\n",
" The engine loads all four adapters synchronously, then runs one combined prefill batch, giving us a total time of ≈ `2 * 4 + 20 = 28ms`\n",
"\n",
"2. **With LoRA overlap loading enabled**:\n",
" The engine begins loading `lora0` and, once it is ready, schedules a prefill batch containing only `lora0` while `lora1` loads in the background. Then it schedules `lora1`’s prefill while `lora2` loads, and so on. In the worst case where prefill cannot be batched across adapters, total time is ≈ `2 + 4 * 20 = 82ms`\n",
"\n",
"In this scenario, overlap loading reduces adapter-load overhead, but the loss of multi-adapter prefill batching dominates and leads to higher TTFT."
]
},
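{
"cell_type": "markdown",
"metadata": {},
"source": [
"The arithmetic above can be reproduced with a toy timing model. This is a sketch under the stated assumptions (2ms per adapter load, 20ms per prefill batch, four adapters), not a measurement of actual engine behavior."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy timing model for the example above (illustrative, not measured).\n",
"load_ms, prefill_ms, num_adapters = 2, 20, 4\n",
"\n",
"# Baseline: all adapters load synchronously, then one combined prefill batch.\n",
"baseline_ttft = num_adapters * load_ms + prefill_ms\n",
"\n",
"# Overlap loading, worst case: only the first load is exposed, but prefill\n",
"# cannot be batched across adapters, so each adapter prefills separately.\n",
"overlap_ttft = load_ms + num_adapters * prefill_ms\n",
"\n",
"print(f\"baseline TTFT ~ {baseline_ttft} ms, overlap worst case ~ {overlap_ttft} ms\")"
]
},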
{
"cell_type": "markdown",
"metadata": {},
3 changes: 2 additions & 1 deletion python/sglang/srt/lora/backend/lora_registry.py
@@ -1,4 +1,5 @@
import logging
from typing import Type

from sglang.srt.lora.backend.base_backend import BaseLoRABackend

@@ -50,7 +51,7 @@ def create_flashinfer_backend():
)


def get_backend_from_name(name: str) -> BaseLoRABackend:
def get_backend_from_name(name: str) -> Type[BaseLoRABackend]:
"""
Get corresponding backend class from backend's name
"""
2 changes: 1 addition & 1 deletion python/sglang/srt/lora/lora_manager.py
@@ -55,13 +55,13 @@ def __init__(
max_loras_per_batch: int,
load_config: LoadConfig,
dtype: torch.dtype,
server_args: ServerArgs,
lora_backend: str = "triton",
tp_size: int = 1,
tp_rank: int = 0,
max_lora_rank: Optional[int] = None,
target_modules: Optional[Iterable[str]] = None,
lora_paths: Optional[List[LoRARef]] = None,
server_args: Optional[ServerArgs] = None,
):
self.base_model: torch.nn.Module = base_model
self.base_hf_config: AutoConfig = base_hf_config
4 changes: 0 additions & 4 deletions python/sglang/srt/managers/tp_worker.py
@@ -195,10 +195,6 @@ def load_lora_adapter_from_tensors(
)
return result

def can_run_lora_batch(self, lora_ids: list[str]) -> bool:
lora_ids_set = set(lora_ids) if isinstance(lora_ids, list) else lora_ids
return self.model_runner.lora_manager.validate_lora_batch(lora_ids_set)

def forward_batch_embedding(self, model_worker_batch: ModelWorkerBatch):
forward_batch = ForwardBatch.init_new(model_worker_batch, self.model_runner)
logits_output = self.model_runner.forward(forward_batch).logits_output
2 changes: 1 addition & 1 deletion python/sglang/srt/model_executor/model_runner.py
@@ -1424,13 +1424,13 @@ def init_lora_manager(self):
max_loras_per_batch=self.server_args.max_loras_per_batch,
load_config=self.load_config,
dtype=self.dtype,
server_args=self.server_args,
lora_backend=self.server_args.lora_backend,
tp_size=self.tp_size,
tp_rank=self.tp_rank,
max_lora_rank=self.server_args.max_lora_rank,
target_modules=self.server_args.lora_target_modules,
lora_paths=self.server_args.lora_paths,
server_args=self.server_args,
)

def load_lora_adapter(self, lora_ref: LoRARef):