[Feature] Support AsyncLLM #4458
base: develop
Conversation
Thanks for your contribution!
Pull Request Overview
Adds a new asynchronous LLM engine with streaming output and concurrency support, and wires it into the scheduling loop; also provides integration tests for concurrent generation.
- Introduces AsyncLLMEngine with async request submission, output processing, and graceful shutdown.
- Updates scheduler to be more robust when the request-fetch thread pool is shut down.
- Adds tests validating single and concurrent prompt generation via the async engine.
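A minimal usage sketch of the new API, based only on what is visible in this review (`from_engine_args`, a streaming `generate()` async generator, and an `output.finished` flag); the keyword names and how `engine_args`/`sampling_params` are built are assumptions, not confirmed by the PR:

```python
import asyncio

from fastdeploy.engine.async_llm import AsyncLLMEngine

async def main(engine_args, sampling_params):
    # from_engine_args() and the streaming generate() loop appear in this diff;
    # the exact keyword argument names here are inferred from add_request().
    engine = AsyncLLMEngine.from_engine_args(engine_args)
    async for output in engine.generate(
        prompt="Hello, world", sampling_params=sampling_params
    ):
        print(output)  # chunks stream back incrementally; output.finished marks the end

# asyncio.run(main(engine_args, sampling_params))  # built via FastDeploy's EngineArgs
```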
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| tests/engine/test_async_llm.py | Adds integration-style async tests for single and concurrent generations, plus setup/teardown hygiene. |
| fastdeploy/engine/common_engine.py | Makes scheduler thread-pool submission resilient around shutdown; minor logging in scheduler loop. |
| fastdeploy/engine/async_llm.py | Implements the new AsyncLLMEngine, async output routing, worker/process lifecycle management, and request handling. |
fastdeploy/engine/async_llm.py (Outdated)
```python
# Create output queue
output_queue = AsyncRequestQueue(request_id)

if request_id is None:
    request_id = str(uuid.uuid4())
```
Copilot AI (Oct 16, 2025)
The request_id is generated after the AsyncRequestQueue is created, leaving output_queue.request_id as None while the registered key uses the generated UUID. This also causes abort_request(request_id) in generate() to use None when the caller doesn't pass a request_id, so the request cannot be aborted properly. Generate the request_id before creating the queue and ensure generate() uses the effective request_id.
Suggested change:

```diff
-# Create output queue
-output_queue = AsyncRequestQueue(request_id)
-if request_id is None:
-    request_id = str(uuid.uuid4())
+# Ensure request_id is set before creating output queue
+if request_id is None:
+    request_id = str(uuid.uuid4())
+output_queue = AsyncRequestQueue(request_id)
```
fastdeploy/engine/async_llm.py (Outdated)
```python
current_file_path = os.path.abspath(__file__)
current_dir_path = os.path.split(current_file_path)[0]
# TODO
uncache_worker_stdout = "" if os.getenv("UNCACHE_WORKER_STDOUT", "0") == 1 else "-u"
```
Copilot AI (Oct 16, 2025)
os.getenv returns a string; comparing it to the integer 1 always evaluates to False, so -u is always used. Compare against the string '1' or cast to int.
Suggested change:

```diff
-uncache_worker_stdout = "" if os.getenv("UNCACHE_WORKER_STDOUT", "0") == 1 else "-u"
+uncache_worker_stdout = "" if os.getenv("UNCACHE_WORKER_STDOUT", "0") == "1" else "-u"
```
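A more defensive variant (a hypothetical helper, not part of this PR) that normalizes the string before comparing:

```python
import os

def env_flag(name: str, default: str = "0") -> bool:
    # os.getenv always returns a string (or the default), so compare strings,
    # accepting a few common truthy spellings.
    return os.getenv(name, default).strip().lower() in ("1", "true", "yes")

uncache_worker_stdout = "" if env_flag("UNCACHE_WORKER_STDOUT") else "-u"
```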
```python
    self.worker_init_status["weight_loadding"] = eval(match.group(1)) * 1.0 / 100
elif (match := re.search(r"Start load layer (\d+)", line)) or (
    match := re.search(r"set state for layer (\d+)", line)
):
    progress = eval(match.group(1)) * 1.0 / self.cfg.model_config.num_hidden_layers
```
Copilot AI (Oct 16, 2025)
Avoid eval() on text parsed from logs; use int(match.group(1)) instead to prevent code execution risk.
Suggested change:

```diff
-    self.worker_init_status["weight_loadding"] = eval(match.group(1)) * 1.0 / 100
-elif (match := re.search(r"Start load layer (\d+)", line)) or (
-    match := re.search(r"set state for layer (\d+)", line)
-):
-    progress = eval(match.group(1)) * 1.0 / self.cfg.model_config.num_hidden_layers
+    self.worker_init_status["weight_loadding"] = int(match.group(1)) * 1.0 / 100
+elif (match := re.search(r"Start load layer (\d+)", line)) or (
+    match := re.search(r"set state for layer (\d+)", line)
+):
+    progress = int(match.group(1)) * 1.0 / self.cfg.model_config.num_hidden_layers
```
```python
# Check if the thread pool is still available to avoid submitting tasks to a shutdown thread pool.
try:
    if not get_request_pool._shutdown:
        get_request_pool.submit(_fetch_request)
except RuntimeError as e:
    if "shutdown" in str(e):
        llm_logger.info("Thread pool shutdown detected, exiting scheduler loop")
```
Copilot AI (Oct 16, 2025)
Accessing the private attribute _shutdown on ThreadPoolExecutor is brittle. Rely on catching RuntimeError from submit() and use the instance logger for consistency. Suggested change: remove the _shutdown check and replace llm_logger with self.llm_logger.
Suggested change:

```diff
-# Check if the thread pool is still available to avoid submitting tasks to a shutdown thread pool.
-try:
-    if not get_request_pool._shutdown:
-        get_request_pool.submit(_fetch_request)
-except RuntimeError as e:
-    if "shutdown" in str(e):
-        llm_logger.info("Thread pool shutdown detected, exiting scheduler loop")
+# Try to submit the task; catch RuntimeError if the pool is shut down.
+try:
+    get_request_pool.submit(_fetch_request)
+except RuntimeError as e:
+    if "shutdown" in str(e):
+        self.llm_logger.info("Thread pool shutdown detected, exiting scheduler loop")
```
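An alternative that avoids both the private `_shutdown` attribute and matching on the exception message (a sketch, not from this PR; it assumes the service owns the pool's lifecycle, and all names are hypothetical):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class SchedulerLoop:
    def __init__(self) -> None:
        self._shutting_down = threading.Event()  # set by the shutdown path
        self._pool = ThreadPoolExecutor(max_workers=1)

    def submit_fetch(self, fetch_fn) -> bool:
        """Submit a fetch task unless shutdown has begun; return success."""
        if self._shutting_down.is_set():
            return False
        try:
            self._pool.submit(fetch_fn)
            return True
        except RuntimeError:
            # Lost the race: the pool shut down between the check and submit().
            return False

    def shutdown(self) -> None:
        self._shutting_down.set()
        self._pool.shutdown(wait=True)
```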
fastdeploy/engine/async_llm.py (Outdated)
```
Creates an LLM engine from the provided engine arguments.
Args:
    engine_args (EngineArgs): Engine arguments object.
Returns:
    LLMEngine: Instance of the LLMEngine class.
```
Copilot AI (Oct 16, 2025)
The return type in the docstring is incorrect; this factory returns AsyncLLMEngine, not LLMEngine.
Suggested change:

```diff
-Creates an LLM engine from the provided engine arguments.
-Args:
-    engine_args (EngineArgs): Engine arguments object.
-Returns:
-    LLMEngine: Instance of the LLMEngine class.
+Creates an AsyncLLMEngine from the provided engine arguments.
+Args:
+    engine_args (EngineArgs): Engine arguments object.
+Returns:
+    AsyncLLMEngine: Instance of the AsyncLLMEngine class.
```
```python
self.checking_worker_status_thread.start()

# display weight loadding progress
with tqdm(total=100, desc="Loading Weights") as pbar:
```
Copilot AI (Oct 16, 2025)
Correct spelling of 'loadding' to 'loading' in related variable names within this section for clarity and consistency.
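For example, the comment and dictionary key would become the following (a sketch that assumes the rename is applied everywhere `worker_init_status` is read, and folds in the `int()` fix from the eval comment above):

```python
# display weight loading progress
self.worker_init_status["weight_loading"] = int(match.group(1)) / 100
```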
```python
    return False

# display layer loadding progress
with tqdm(total=100, desc="Loading Layers") as pbar:
```
Copilot AI (Oct 16, 2025)
Correct spelling of 'loadding' to 'loading' in related variable names within this section for clarity and consistency.
```python
    llm_logger.info(f"Request {request_id} generator exit (outer)")
    return
except Exception as e:
    await self.abort_request(request_id)
```
Copilot AI (Oct 16, 2025)
On exception, generate() aborts using the caller-supplied request_id, which may be None when the engine generated the ID internally. Capture and use the effective request_id (e.g., from the result of add_request or output_queue.request_id) to ensure the correct request is aborted.
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
```python
try:
    # Ensure output processor is running
    self._start_output_handler()

    # Async add request
    output_queue = await self.add_request(request_id, prompt, sampling_params, **kwargs)

    finished = False

    while not finished:
        # Prefer non-blocking get first
        output = output_queue.get_nowait() or await output_queue.get()
        finished = output.finished
        yield output

except EngineError:
    raise
except GeneratorExit:
    llm_logger.info(f"Request {request_id} generator exit (outer)")
    return
except Exception as e:
    await self.abort_request(request_id)
    llm_logger.error(f"Request {request_id} failed: {e}")
    raise EngineError(str(e), error_code=500) from e
```
Copilot AI (Oct 16, 2025)
If request_id is None (common case), add_request will generate a new ID, but generate() continues to log and abort using the original None value. This prevents aborting the actual request and produces misleading logs. Assign the actual ID after add_request, e.g., request_id = output_queue.request_id, and use it for logging/abort.
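A minimal sketch of that fix, assuming `AsyncRequestQueue` exposes the effective ID as a `request_id` attribute (as the first review comment implies):

```python
# Async add request; the engine may generate a UUID when request_id is None.
output_queue = await self.add_request(request_id, prompt, sampling_params, **kwargs)
# Capture the ID actually registered with the engine so later logging and
# abort_request() target the real request even when the caller passed None.
request_id = output_queue.request_id
```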
```python
    self.worker_init_status["weight_loadding"] = eval(match.group(1)) * 1.0 / 100
elif (match := re.search(r"Start load layer (\d+)", line)) or (
    match := re.search(r"set state for layer (\d+)", line)
):
    progress = eval(match.group(1)) * 1.0 / self.cfg.model_config.num_hidden_layers
```
Copilot AI (Oct 16, 2025)
Using eval on log-derived text is unsafe and unnecessary. Replace eval(match.group(1)) with int(match.group(1)) to parse integers safely.
Suggested change:

```diff
-    self.worker_init_status["weight_loadding"] = eval(match.group(1)) * 1.0 / 100
-elif (match := re.search(r"Start load layer (\d+)", line)) or (
-    match := re.search(r"set state for layer (\d+)", line)
-):
-    progress = eval(match.group(1)) * 1.0 / self.cfg.model_config.num_hidden_layers
+    self.worker_init_status["weight_loadding"] = int(match.group(1)) * 1.0 / 100
+elif (match := re.search(r"Start load layer (\d+)", line)) or (
+    match := re.search(r"set state for layer (\d+)", line)
+):
+    progress = int(match.group(1)) * 1.0 / self.cfg.model_config.num_hidden_layers
```
```python
def check_health(self, time_interval_threashold=30):
    """
    Check the health of the model server by checking whether all workers are alive.
    """
    if self.engine_service.worker_healthy_live_signal.value[0]:
        elapsed_time = time.time() - self.engine_service.worker_healthy_live_signal.value[0]
        if elapsed_time > time_interval_threashold:
```
Copilot AI (Oct 16, 2025)
Correct the parameter name typo 'time_interval_threashold' to 'time_interval_threshold'.
Suggested change:

```diff
-def check_health(self, time_interval_threashold=30):
+def check_health(self, time_interval_threshold=30):
     """
     Check the health of the model server by checking whether all workers are alive.
     """
     if self.engine_service.worker_healthy_live_signal.value[0]:
         elapsed_time = time.time() - self.engine_service.worker_healthy_live_signal.value[0]
-        if elapsed_time > time_interval_threashold:
+        if elapsed_time > time_interval_threshold:
```
```python
if not results:
    # No results, minimal delay to yield control
    await asyncio.sleep(0)
```
Copilot AI (Oct 16, 2025)
Using await asyncio.sleep(0) can create a near-busy loop and waste CPU. Consider a small backoff (e.g., await asyncio.sleep(0.001)) or a condition-based await to reduce spin.
Suggested change:

```diff
-await asyncio.sleep(0)
+await asyncio.sleep(0.001)
```
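A sketch of the condition-based alternative the comment mentions, assuming the output handler can own an `asyncio.Event` that the producer side sets whenever new results arrive (names are hypothetical, not from this PR):

```python
import asyncio

class OutputPump:
    def __init__(self) -> None:
        self._results_ready = asyncio.Event()

    def notify_results(self) -> None:
        # Called by the producer whenever new results are enqueued.
        self._results_ready.set()

    async def wait_for_results(self, timeout: float = 0.05) -> None:
        # Block until notified instead of spinning on sleep(0); the timeout
        # keeps periodic wakeups so shutdown checks still run.
        try:
            await asyncio.wait_for(self._results_ready.wait(), timeout)
        except asyncio.TimeoutError:
            pass
        self._results_ready.clear()
```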
```python
pd_cmd = pd_cmd + arguments + f" 2>{log_dir}/launch_worker.log"
llm_logger.info(f"Launch worker service command: {pd_cmd}")
p = subprocess.Popen(
    pd_cmd,
    stdout=subprocess.PIPE,
    shell=True,
```
Copilot AI (Oct 16, 2025)
Avoid shell=True with string commands where possible to reduce injection risk. Build an argument list and invoke Popen with shell=False (e.g., use a list for python, -m, launch, flags, and script arguments) or carefully sanitize inputs.
Suggested change:

```diff
-pd_cmd = pd_cmd + arguments + f" 2>{log_dir}/launch_worker.log"
-llm_logger.info(f"Launch worker service command: {pd_cmd}")
-p = subprocess.Popen(
-    pd_cmd,
-    stdout=subprocess.PIPE,
-    shell=True,
+# Build the command as a list of arguments
+pd_cmd_list = pd_cmd.split()
+if self.cfg.nnode > 1:
+    pd_cmd_list += ["--ips", str(ips), "--nnodes", str(len(self.cfg.ips))]
+for worker_flag, value in worker_append_flag.items():
+    if value:
+        pd_cmd_list.append(f"--{worker_flag}")
+# Open the log file for stderr redirection
+log_file = open(f"{log_dir}/launch_worker.log", "w")
+llm_logger.info(f"Launch worker service command: {' '.join(pd_cmd_list)}")
+p = subprocess.Popen(
+    pd_cmd_list,
+    stdout=subprocess.PIPE,
+    stderr=log_file,
+    shell=False,
```
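One caveat with the suggestion above: `str.split()` breaks arguments that contain quoted spaces. `shlex.split` handles shell-style quoting while keeping `shell=False` safe:

```python
import shlex

# Unlike pd_cmd.split(), shlex.split keeps quoted arguments intact,
# e.g. --model "path with spaces/model".
pd_cmd_list = shlex.split(pd_cmd)
```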
```python
        get_request_pool.submit(_fetch_request)
except RuntimeError as e:
    if "shutdown" in str(e):
        llm_logger.info("Thread pool shutdown detected, exiting scheduler loop")
```
Copilot AI (Oct 16, 2025)
[nitpick] Use the instance logger self.llm_logger for consistency with the rest of EngineService instead of the module-level llm_logger.
Suggested change:

```diff
-llm_logger.info("Thread pool shutdown detected, exiting scheduler loop")
+self.llm_logger.info("Thread pool shutdown detected, exiting scheduler loop")
```
Implements the asynchronous engine async_llm with support for concurrent requests.