diff --git a/README.md b/README.md
index 7466706..db04556 100644
--- a/README.md
+++ b/README.md
@@ -17,7 +17,7 @@
**Fara-7B** is Microsoft's first **agentic small language model (SLM)** designed specifically for computer use. With only 7 billion parameters, Fara-7B is an ultra-compact Computer Use Agent (CUA) that achieves state-of-the-art performance within its size class and is competitive with larger, more resource-intensive agentic systems.
-Try Fara-7B locally as follows (see [Installation](##Installation) for detailed instructions) or via Magentic-UI:
+Try Fara-7B locally as follows (see [Installation](#installation) for detailed instructions, including Windows setup) or via Magentic-UI:
```bash
# 1. Clone repository
@@ -44,7 +44,7 @@ To try Fara-7B inside Magentic-UI, please follow the instructions here [Magentic
Notes:
-- If you're using Windows, we highly recommend using WSL2 (Windows Subsystem for Linux).
+- If you're using Windows, we highly recommend using WSL2 (Windows Subsystem for Linux). Please see the Windows instructions in the [Installation](#installation) section.
- You might need to do `--tensor-parallel-size 2` with vllm command if you run out of memory
@@ -156,27 +156,45 @@ Our evaluation setup leverages:
---
-## Installation
+# Installation
-Install the package using either UV or pip:
-```bash
-uv sync --all-extras
-```
+## Linux
+
+The following instructions are for Linux systems; see the Windows section below for Windows-specific steps.
-or
+Install the package using pip and set up the environment with Playwright:
```bash
-pip install -e .
+# 1. Clone repository
+git clone https://github.com/microsoft/fara.git
+cd fara
+
+# 2. Setup environment
+python3 -m venv .venv
+source .venv/bin/activate
+pip install -e .[vllm]
+playwright install
```
-Then install Playwright browsers:
+Note: If you plan on hosting with Azure Foundry only, you can skip the `[vllm]` extra and just run `pip install -e .`.
+
+
+## Windows
+
+For Windows, we highly recommend using WSL2 (Windows Subsystem for Linux) to provide a Linux-like environment. However, if you prefer to run natively on Windows, follow these steps:
```bash
-playwright install
-```
+# 1. Clone repository
+git clone https://github.com/microsoft/fara.git
+cd fara
----
+# 2. Setup environment
+python -m venv .venv
+.venv\Scripts\activate
+pip install -e .
+python -m playwright install
+```
## Hosting the Model
@@ -189,11 +207,10 @@ Deploy Fara-7B on [Azure Foundry](https://ai.azure.com/explore/models/Fara-7B/ve
**Setup:**
1. Deploy the Fara-7B model on Azure Foundry and obtain your endpoint URL and API key
-2. Add your endpoint details to the existing `endpoint_configs/` directory (example configs are already provided):
-```bash
-# Edit one of the existing config files or create a new one
-# endpoint_configs/fara-7b-hosting-ansrz.json (example format):
+2. Create an endpoint configuration JSON file (e.g., `azure_foundry_config.json`):
+
+```json
{
"model": "Fara-7B",
"base_url": "https://your-endpoint.inference.ml.azure.com/",
@@ -201,62 +218,55 @@ Deploy Fara-7B on [Azure Foundry](https://ai.azure.com/explore/models/Fara-7B/ve
}
```
-3. Run the Fara agent:
+3. Run the Fara agent with this endpoint configuration:
```bash
-fara-cli --task "how many pages does wikipedia have" --start_page "https://www.bing.com"
+fara-cli --task "how many pages does wikipedia have" --endpoint_config azure_foundry_config.json [--headful]
```
-That's it! No GPU or model downloads required.
+Note: you can also pass the endpoint details directly with `--base_url [your_base_url] --api_key [your_api_key] --model [your_model_name]` instead of using a config JSON file.
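+
+For example, a minimal invocation using the placeholder values from the JSON config above (substitute your deployment's endpoint, key, and model name):
+
+```bash
+fara-cli --task "how many pages does wikipedia have" \
+  --base_url "https://your-endpoint.inference.ml.azure.com/" \
+  --api_key "your-api-key" \
+  --model "Fara-7B"
+```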
-### Self-hosting with VLLM
-
-If you have access to GPU resources, you can self-host Fara-7B using VLLM. This requires a GPU machine with sufficient VRAM.
-
-All that is required is to run the following command to start the VLLM server:
+Note: If you see an error that the `fara-cli` command is not found, then try:
```bash
-vllm serve "microsoft/Fara-7B" --port 5000 --dtype auto
+python -m fara.run_fara --task "how many pages does wikipedia have" --endpoint_config azure_foundry_config.json
```
-### Testing the Fara Agent
+That's it! No GPU or model downloads required.
-Run the test script to see Fara in action:
+### Self-hosting with vLLM or LM Studio / Ollama
+
+**If you have access to GPU resources, you can self-host Fara-7B using vLLM. This requires a GPU machine with sufficient VRAM (e.g., 24GB or more).**
+
+On Linux, all that is required is to run the following command to start the vLLM server:
```bash
-fara-cli --task "how many pages does wikipedia have" --start_page "https://www.bing.com" --endpoint_config endpoint_configs/azure_foundry_config.json [--headful] [--downloads_folder "/path/to/downloads"] [--save_screenshots] [--max_rounds 100] [--browserbase]
+vllm serve "microsoft/Fara-7B" --port 5000 --dtype auto
```
+For quantized models or lower VRAM GPUs, please see [Fara-7B GGUF on HuggingFace](https://huggingface.co/bartowski/microsoft_Fara-7B-GGUF).
-In self-hosting scenario the `endpoint_config` points to `endpoint_configs/vllm_config.json` from the VLLM server above.
+**For Windows/Mac, vLLM is not natively supported. On Windows you can use WSL2 to run the above command, or use LM Studio / Ollama as described below.**
-If you set `--browserbase`, export environment variables for the API key and project ID.
+Otherwise, you can use [LM Studio](https://lmstudio.ai/) or [Ollama](https://ollama.com/) to host the model locally. We currently recommend the GGUF builds at [Fara-7B GGUF on HuggingFace](https://huggingface.co/bartowski/microsoft_Fara-7B-GGUF) for use with LM Studio or Ollama. Select the largest quantization that fits your GPU, and make sure the context length is set to at least 15000 tokens and the temperature to 0 for best results.
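+
+As a minimal sketch of the Ollama route (the quantization tag below is an assumption; pick whichever quant from the GGUF repo fits your GPU):
+
+```bash
+# Pull a GGUF quant directly from HuggingFace (Q4_K_M is an example tag)
+ollama pull hf.co/bartowski/microsoft_Fara-7B-GGUF:Q4_K_M
+
+# Ollama then exposes an OpenAI-compatible API at http://localhost:11434/v1,
+# which fara-cli can target via --base_url (see the example further below).
+```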
-#### Expected Output
-
-```
-Initializing Browser...
-Browser Running... Starting Fara Agent...
-##########################################
-Task: how many pages does wikipedia have
-##########################################
-Running Fara...
+Then run the Fara agent, pointing it at your local server:
-Thought #1: To find the current number of Wikipedia pages, I'll search for the latest Wikipedia page count statistics.
-Action #1: executing tool 'web_search' with arguments {"action": "web_search", "query": "Wikipedia total number of articles"}
-Observation#1: I typed 'Wikipedia total number of articles' into the browser search bar.
+```bash
+fara-cli --task "what is the weather in new york now"
+```
-Thought #2: Wikipedia currently has 7,095,446 articles.
-Action #2: executing tool 'terminate' with arguments {"action": "terminate", "status": "success"}
-Observation#2: Wikipedia currently has 7,095,446 articles.
+If you are not hosting with vLLM on its default port, specify the correct `--base_url [your_base_url] --api_key [your_api_key] --model [your_model_name]` for your local server, as in the sketch below.
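+
+For example, a hedged sketch against Ollama's default local endpoint (the port, API key, and model name are placeholders/assumptions; LM Studio's default endpoint is typically `http://localhost:1234/v1`):
+
+```bash
+fara-cli --task "what is the weather in new york now" \
+  --base_url "http://localhost:11434/v1" \
+  --api_key "not-needed" \
+  --model "your-local-model-name"  # as reported by your local server (e.g., `ollama list`)
+```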
-Final Answer: Wikipedia currently has 7,095,446 articles.
+If you see an error that the `fara-cli` command is not found, then try:
-Enter another task (or press Enter to exit):
+```bash
+python -m fara.run_fara --task "what is the weather in new york now"
```
----
-
# Reproducibility
We provide a framework in `webeval/` to reproduce our results on WebVoyager and OnlineMind2Web.
diff --git a/pyproject.toml b/pyproject.toml
index 8e858d9..6566c44 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -32,14 +32,20 @@ dependencies = [
"pyyaml",
"jsonschema",
"browserbase",
- "vllm>=0.10.0"
]
+
+
+
[project.urls]
Homepage = "https://github.com/microsoft/fara"
Repository = "https://github.com/microsoft/fara"
Issues = "https://github.com/microsoft/fara/issues"
+[project.optional-dependencies]
+vllm = ["vllm>=0.10.0"]
+lmstudio = ["lmstudio"]
+ollama = ["ollama"]
[project.scripts]
fara-cli = "fara.run_fara:main"
diff --git a/src/fara/browser/browser_bb.py b/src/fara/browser/browser_bb.py
index d0303ca..d4a2e6f 100644
--- a/src/fara/browser/browser_bb.py
+++ b/src/fara/browser/browser_bb.py
@@ -6,7 +6,7 @@
import subprocess
import time
from typing import Any, Dict, Optional, Callable
-
+import platform
import browserbase
from browserbase import Browserbase
from playwright.async_api import (
@@ -48,7 +48,7 @@ def __init__(
self.single_tab_mode = single_tab_mode
self.use_browser_base = use_browser_base
self.logger = logger or logging.getLogger("browser_manager")
-
+ self.is_linux = platform.system() == "Linux"
self._viewport_height = viewport_height
self._viewport_width = viewport_width
@@ -194,7 +194,8 @@ async def delayed_resume():
async def _init_regular_browser(self, channel: str = "chromium") -> None:
"""Initialize regular browser according to the specified channel."""
- if not self.headless:
+ if not self.headless and self.is_linux:
+            self.logger.info("Starting Xvfb for headful mode on Linux")
self.start_xvfb()
launch_args: Dict[str, Any] = {"headless": self.headless}
@@ -218,7 +219,7 @@ async def _init_regular_browser(self, channel: str = "chromium") -> None:
async def _init_persistent_browser(self) -> None:
"""Initialize persistent browser with data directory."""
- if not self.headless:
+ if not self.headless and self.is_linux:
self.start_xvfb()
launch_args: Dict[str, Any] = {"headless": self.headless}
diff --git a/src/fara/fara_agent.py b/src/fara/fara_agent.py
index 09d52b5..ab6b953 100644
--- a/src/fara/fara_agent.py
+++ b/src/fara/fara_agent.py
@@ -15,7 +15,7 @@
import asyncio
from .browser.playwright_controller import PlaywrightController
from ._prompts import get_computer_use_system_prompt
-from .types import (
+from .fara_types import (
LLMMessage,
SystemMessage,
UserMessage,
@@ -379,15 +379,20 @@ async def run(self, user_message: str) -> Tuple:
thoughts, action_dict = self._parse_thoughts_and_action(raw_response)
action_args = action_dict.get("arguments", {})
action = action_args["action"]
- self.logger.info(f"\nThought #{i+1}: {thoughts}\nAction #{i+1}: executing tool '{action}' with arguments {json.dumps(action_args)}")
-
+ self.logger.debug(
+ f"\nThought #{i+1}: {thoughts}\nAction #{i+1}: executing tool '{action}' with arguments {json.dumps(action_args)}"
+ )
+ print(
+ f"\nThought #{i+1}: {thoughts}\nAction #{i+1}: executing tool '{action}' with arguments {json.dumps(action_args)}"
+ )
(
is_stop_action,
new_screenshot,
action_description,
) = await self.execute_action(function_call)
all_observations.append(action_description)
- self.logger.info(f"Observation#{i+1}: {action_description}")
+ self.logger.debug(f"Observation#{i+1}: {action_description}")
+ print(f"Observation#{i+1}: {action_description}")
if is_stop_action:
final_answer = thoughts
break
@@ -564,7 +569,7 @@ async def execute_action(
elif args["action"] == "pause_and_memorize_fact":
fact = str(args.get("fact"))
self._facts.append(fact)
- action_description= f"I memorized the following fact: {fact}"
+ action_description = f"I memorized the following fact: {fact}"
elif args["action"] == "stop" or args["action"] == "terminate":
action_description = args.get("thoughts")
is_stop_action = True
diff --git a/src/fara/types.py b/src/fara/fara_types.py
similarity index 100%
rename from src/fara/types.py
rename to src/fara/fara_types.py
diff --git a/src/fara/run_fara.py b/src/fara/run_fara.py
index 3e1e207..9525828 100755
--- a/src/fara/run_fara.py
+++ b/src/fara/run_fara.py
@@ -1,8 +1,8 @@
import asyncio
import argparse
import os
-from fara import FaraAgent
-from fara.browser.browser_bb import BrowserBB
+from .fara_agent import FaraAgent
+from .browser.browser_bb import BrowserBB
import logging
from typing import Dict
from pathlib import Path
@@ -11,8 +11,8 @@
# Configure logging to only show logs from fara.fara_agent
logging.basicConfig(
- level=logging.CRITICAL,
- format="%(message)s",
+ level=logging.CRITICAL,
+ format="%(message)s",
)
# Enable INFO level only for fara.fara_agent
@@ -159,21 +159,51 @@ def main():
default=None,
help="Path to the endpoint configuration JSON file. By default, tries local vllm on 5000 port",
)
+ parser.add_argument(
+ "--api_key",
+ type=str,
+ default=None,
+ help="API key for the model endpoint (overrides endpoint_config)",
+ )
+ parser.add_argument(
+ "--base_url",
+ type=str,
+ default=None,
+ help="Base URL for the model endpoint (overrides endpoint_config)",
+ )
+ parser.add_argument(
+ "--model",
+ type=str,
+ default=None,
+ help="Model name to use (overrides endpoint_config)",
+ )
args = parser.parse_args()
if args.browserbase:
- assert os.environ.get("BROWSERBASE_API_KEY"), (
- "BROWSERBASE_API_KEY environment variable must be set to use browserbase"
- )
- assert os.environ.get("BROWSERBASE_PROJECT_ID"), (
- "BROWSERBASE_API_KEY and BROWSERBASE_PROJECT_ID environment variables must be set to use browserbase"
- )
+ assert os.environ.get(
+ "BROWSERBASE_API_KEY"
+ ), "BROWSERBASE_API_KEY environment variable must be set to use browserbase"
+ assert os.environ.get(
+ "BROWSERBASE_PROJECT_ID"
+        ), "BROWSERBASE_PROJECT_ID environment variable must be set to use browserbase"
endpoint_config = DEFAULT_ENDPOINT_CONFIG
if args.endpoint_config:
with open(args.endpoint_config, "r") as f:
endpoint_config = json.load(f)
+ assert (
+ "api_key" in endpoint_config
+ and "base_url" in endpoint_config
+ and "model" in endpoint_config
+ ), "endpoint_config file must contain api_key, base_url, and model fields"
+ # Override with command-line arguments if provided
+ if args.api_key:
+ endpoint_config["api_key"] = args.api_key
+ if args.base_url:
+ endpoint_config["base_url"] = args.base_url
+ if args.model:
+ endpoint_config["model"] = args.model
asyncio.run(
run_fara_agent(
diff --git a/src/fara/vllm/az_vllm.py b/src/fara/vllm/az_vllm.py
index d34b9ac..01ceada 100644
--- a/src/fara/vllm/az_vllm.py
+++ b/src/fara/vllm/az_vllm.py
@@ -28,10 +28,15 @@
def _is_azure_blob_url(model_path: str) -> bool:
- return model_path.startswith(("https://", "http://")) and "blob.core.windows.net" in model_path
+ return (
+ model_path.startswith(("https://", "http://"))
+ and "blob.core.windows.net" in model_path
+ )
-def _download_model_from_hf(output_dir: Path, model_id: str = DEFAULT_HF_MODEL_ID) -> str:
+def _download_model_from_hf(
+ output_dir: Path, model_id: str = DEFAULT_HF_MODEL_ID
+) -> str:
"""Download model from HuggingFace Hub if not already present."""
if snapshot_download is None:
raise ImportError(
@@ -63,13 +68,15 @@ def _download_model_from_hf(output_dir: Path, model_id: str = DEFAULT_HF_MODEL_I
def _extract_model_name(model_url: str) -> str:
"""Extract model name from URL for consistent naming."""
- url_parts = model_url.rstrip('/').split('/')
+ url_parts = model_url.rstrip("/").split("/")
return url_parts[-1] if url_parts else model_url
def _cache_model(model_url: str) -> str:
if AzFolder is None:
- raise RuntimeError("Azure support not available. Install aztool or run without --cache.")
+ raise RuntimeError(
+ "Azure support not available. Install aztool or run without --cache."
+ )
cache_root = Path(args.cache_dir or os.path.expanduser("~/.cache/vllm_models"))
cache_root.mkdir(parents=True, exist_ok=True)
@@ -120,8 +127,18 @@ def _prepare_cached_model(model_url: str) -> str:
raise FileNotFoundError(f"Local model directory not found: {model_url}")
return str(model_path.resolve())
+
class AzVllm:
- def __init__(self, model_url, port, device_id, max_n_images, dtype='auto', enforce_eager=False, use_external_endpoint=False):
+ def __init__(
+ self,
+ model_url,
+ port,
+ device_id,
+ max_n_images,
+ dtype="auto",
+ enforce_eager=False,
+ use_external_endpoint=False,
+ ):
self.model_az = None
self.local_model_path = None
self.vllm = None
@@ -141,7 +158,9 @@ def __init__(self, model_url, port, device_id, max_n_images, dtype='auto', enfor
if not model_path.exists():
# Auto-download from HuggingFace if path doesn't exist
logging.warning(f"Local model directory not found: {model_url}")
- logging.info(f"Attempting to download {DEFAULT_HF_MODEL_ID} from HuggingFace...")
+ logging.info(
+ f"Attempting to download {DEFAULT_HF_MODEL_ID} from HuggingFace..."
+ )
self.local_model_path = _download_model_from_hf(model_path)
else:
self.local_model_path = str(model_path.resolve())
@@ -150,7 +169,7 @@ def __init__(self, model_url, port, device_id, max_n_images, dtype='auto', enfor
def __enter__(self):
# No-op if using external endpoint
if self.use_external_endpoint:
- print('Using external endpoint, skipping VLLM startup')
+ print("Using external endpoint, skipping VLLM startup")
return self
if self.model_az:
@@ -162,33 +181,35 @@ def __enter__(self):
for file in files:
print(f"\t{os.path.join(root, file)}")
self.vllm = VLLM(
- model_path = self.context.path,
- port = self.port,
- device_id = self.device_id,
- max_n_images = self.max_n_images,
- dtype = self.dtype,
- enforce_eager = self.enforce_eager
+ model_path=self.context.path,
+ port=self.port,
+ device_id=self.device_id,
+ max_n_images=self.max_n_images,
+ dtype=self.dtype,
+ enforce_eager=self.enforce_eager,
)
self.vllm.start()
- print('VLLM has started')
+ print("VLLM has started")
elif self.local_model_path:
- print(f"VLLM using on-disk model at path {self.local_model_path}, contents:")
+ print(
+ f"VLLM using on-disk model at path {self.local_model_path}, contents:"
+ )
### sometimes need to ls the directory or else huggingface will complain a config.json doesn't exist
for root, dirs, files in os.walk(self.local_model_path):
for file in files:
print(f"\t{os.path.join(root, file)}")
self.vllm = VLLM(
- model_path = self.local_model_path,
- port = self.port,
- device_id = self.device_id,
- max_n_images = self.max_n_images,
- dtype = self.dtype,
- enforce_eager = self.enforce_eager
+ model_path=self.local_model_path,
+ port=self.port,
+ device_id=self.device_id,
+ max_n_images=self.max_n_images,
+ dtype=self.dtype,
+ enforce_eager=self.enforce_eager,
)
self.vllm.start()
- print('VLLM has started')
+ print("VLLM has started")
return self
-
+
def __exit__(self, exc_type, exc_val, exc_tb):
if self.vllm:
if self.vllm and (self.vllm.status == Status.Running):
@@ -196,6 +217,7 @@ def __exit__(self, exc_type, exc_val, exc_tb):
if self.context:
self.context.unmount()
+
@asynccontextmanager
async def lifespan(app: FastAPI):
cached_vllm: Optional[VLLM] = None
@@ -214,17 +236,18 @@ async def lifespan(app: FastAPI):
device_id=args.device_id,
max_n_images=args.max_n_images,
dtype=args.dtype,
- enforce_eager=args.enforce_eager
+ enforce_eager=args.enforce_eager,
)
cached_vllm.start()
else:
az_vllm = AzVllm(
- model_url = args.model_url,
- port = args.vllm_port,
- device_id = args.device_id,
- max_n_images = args.max_n_images,
- dtype = args.dtype,
- enforce_eager = args.enforce_eager)
+ model_url=args.model_url,
+ port=args.vllm_port,
+ device_id=args.device_id,
+ max_n_images=args.max_n_images,
+ dtype=args.dtype,
+ enforce_eager=args.enforce_eager,
+ )
az_vllm.__enter__()
app.state.resolved_model_path = args.model_url
app.state.model_name = _extract_model_name(args.model_url)
@@ -239,7 +262,7 @@ async def lifespan(app: FastAPI):
app.state.model_name = None
-app = FastAPI(lifespan = lifespan)
+app = FastAPI(lifespan=lifespan)
@app.post("/v1/chat/completions")
@@ -247,21 +270,18 @@ async def post_v1_chat_completions(request: Request):
body = await request.body()
async with httpx.AsyncClient() as client:
resp = await client.post(
- f'http://localhost:{args.vllm_port}/v1/chat/completions',
+ f"http://localhost:{args.vllm_port}/v1/chat/completions",
content=body,
headers=dict(request.headers),
- timeout=None
+ timeout=None,
)
return Response(
- content=resp.content,
- status_code=resp.status_code,
- headers=resp.headers
+ content=resp.content, status_code=resp.status_code, headers=resp.headers
)
@app.get("/model")
async def get_model():
-
return {"model": _extract_model_name(args.model_url), "model_url": args.model_url}
@@ -271,12 +291,37 @@ async def get_model():
parser.add_argument("--port", type=int, default=5000, help="port")
parser.add_argument("--vllm_port", type=int, default=5001, help="vllm port")
parser.add_argument("--device_id", type=str, default="0", help="device id")
- parser.add_argument("--max_n_images", type=int, default=3, help="Maximum number of images to process")
- parser.add_argument('--dtype', type=str, choices=['auto', 'half', 'float16', 'bfloat16', 'float', 'float32'], default='auto', help='Data type for VLLM model (default: auto)')
- parser.add_argument('--enforce_eager', action='store_true', help='Enforce eager execution mode for compatibility')
- parser.add_argument('--cache', action='store_true', help='Enable caching / local path serving instead of Azure mount')
- parser.add_argument('--cache_dir', type=str, default=None, help='Directory to cache downloaded models (default: ~/.cache/vllm_models)')
+ parser.add_argument(
+ "--max_n_images",
+ type=int,
+ default=3,
+ help="Maximum number of images to process",
+ )
+ parser.add_argument(
+ "--dtype",
+ type=str,
+ choices=["auto", "half", "float16", "bfloat16", "float", "float32"],
+ default="auto",
+ help="Data type for VLLM model (default: auto)",
+ )
+ parser.add_argument(
+ "--enforce_eager",
+ action="store_true",
+ help="Enforce eager execution mode for compatibility",
+ )
+ parser.add_argument(
+ "--cache",
+ action="store_true",
+ help="Enable caching / local path serving instead of Azure mount",
+ )
+ parser.add_argument(
+ "--cache_dir",
+ type=str,
+ default=None,
+ help="Directory to cache downloaded models (default: ~/.cache/vllm_models)",
+ )
args = parser.parse_args()
import uvicorn
- uvicorn.run(app, host="0.0.0.0", port=args.port)
\ No newline at end of file
+
+ uvicorn.run(app, host="0.0.0.0", port=args.port)
diff --git a/src/fara/vllm/vllm_facade.py b/src/fara/vllm/vllm_facade.py
index 73cb8d4..ae745e6 100644
--- a/src/fara/vllm/vllm_facade.py
+++ b/src/fara/vllm/vllm_facade.py
@@ -5,32 +5,39 @@
import threading
import time
+
class Status(Enum):
NotStarted = 0
Running = 1
Stopped = 2
+
class VLLM:
- cmd_template = ' '.join([
- "python -O -u -m vllm.entrypoints.openai.api_server",
- "--host={host}",
- "--port={port}",
- "--model={model_dir}",
- "--served-model-name {model_name}",
- "--tensor-parallel-size {tensor_parallel_size}",
- "--gpu-memory-utilization 0.95",
- "--trust-remote-code",
- "--dtype {dtype}"
- ])
- def __init__(self,
- model_path,
- max_n_images,
- device_id = "0",
- host = "0.0.0.0",
- port = 5000,
- model_name = "gpt-4o-mini-2024-07-18",
- dtype = "auto",
- enforce_eager = False):
+ cmd_template = " ".join(
+ [
+ "python -O -u -m vllm.entrypoints.openai.api_server",
+ "--host={host}",
+ "--port={port}",
+ "--model={model_dir}",
+ "--served-model-name {model_name}",
+ "--tensor-parallel-size {tensor_parallel_size}",
+ "--gpu-memory-utilization 0.95",
+ "--trust-remote-code",
+ "--dtype {dtype}",
+ ]
+ )
+
+ def __init__(
+ self,
+ model_path,
+ max_n_images,
+ device_id="0",
+ host="0.0.0.0",
+ port=5000,
+ model_name="gpt-4o-mini-2024-07-18",
+ dtype="auto",
+ enforce_eager=False,
+ ):
self.model_path = model_path
self.device_id = device_id
self.host = host
@@ -43,10 +50,12 @@ def __init__(self,
# new versions of vllm require dictionary-like arguments for this
# see https://docs.vllm.ai/en/latest/configuration/engine_args.html#multimodalconfig
self.cmd += f" --limit-mm-per-prompt.image {self.max_n_images}"
- if enforce_eager: # Most helpful for float32 cases when attention backends are incompatible
+        # Most helpful for float32 cases when attention backends are incompatible
+        if enforce_eager:
self.cmd += " --enforce-eager"
self.model_name = model_name
- self.tensor_parallel_size = len(str(device_id).split(','))
+ self.tensor_parallel_size = len(str(device_id).split(","))
self.status = Status.NotStarted
self.process = None
self.logs = []
@@ -57,12 +66,13 @@ def endpoint(self):
def start(self):
def _drain(pipe):
- for line in iter(pipe.readline, ''):
+ for line in iter(pipe.readline, ""):
self.logs.append(line)
- print(line, end='')
+ print(line, end="")
+
env = os.environ.copy()
- env['CUDA_VISIBLE_DEVICES'] = self.device_id
- env['NCCL_DEBUG'] = "TRACE"
+ env["CUDA_VISIBLE_DEVICES"] = self.device_id
+ env["NCCL_DEBUG"] = "TRACE"
self.process = subprocess.Popen(
self.cmd.format(
host=self.host,
@@ -70,13 +80,13 @@ def _drain(pipe):
model_dir=self.model_path,
model_name=self.model_name,
tensor_parallel_size=self.tensor_parallel_size,
- dtype=self.dtype
+ dtype=self.dtype,
).split(),
- stdout = subprocess.PIPE,
- stderr = subprocess.STDOUT,
- text = True,
- shell = False,
- env = env
+ stdout=subprocess.PIPE,
+ stderr=subprocess.STDOUT,
+ text=True,
+ shell=False,
+ env=env,
)
t = threading.Thread(target=_drain, args=(self.process.stdout,), daemon=True)
t.start()
@@ -88,9 +98,9 @@ def _drain(pipe):
if "Application startup complete." in line:
logging.info("VLLM process started successfully.")
self.status = Status.Running
- return True
-
+ return True
+
def stop(self):
if self.process:
self.process.terminate()
- self.status = Status.Stopped
\ No newline at end of file
+ self.status = Status.Stopped