Description
Since lots of people are running into low frame rates (1-3 FPS) when using the live webcam mode, I thought I'd document a simple way to improve this. It is especially relevant on macOS, where GPU acceleration via MPS isn't fully supported, but it might also help some Nvidia GPU setups. It is definitely useful if you have no GPU (or a very weak one) and have to run everything on the CPU.
Explanation of the problem and solution
Essentially, the live webcam mode is slow because it does everything sequentially and mostly on the CPU, meaning it has to skip a lot of frames while it is calculating. The problem lies in using just one CPU core to first find the face, then align it, then swap it, on repeat. Most machines have multiple CPU cores, and for a pre-recorded video they can easily be utilized by computing multiple frames in parallel (which the app already does). However, when streaming frames live from the webcam, we don't have access to future frames (after all, they haven't happened yet), so we can't start computing them in parallel. We could wait for some to arrive and then compute those in parallel, but that introduces a delay/latency that feels just as bad as a low frame rate.
The solution is pipelining (the same technique that your CPU already uses at the instruction level to optimize memory fetching, etc.!). In simplified terms: start finding the face in the first frame as soon as it arrives; when the second frame arrives, start finding the face in it while concurrently aligning the face from the first frame; then swap the face in the first frame while concurrently aligning the face in the second frame and finding the face in the third frame; and so on. That roughly looks like this:

In this simplified representation, the pipelined approach finishes 3 frames in the same time that the sequential approach finishes 1.67 frames, both with the same latency (the time between something happening in the camera feed and the corresponding change being rendered in the face-swapped stream). Of course, in reality there are many more tasks than just find, align, and swap, and not all of the tasks take the same amount of time, so they won't fit as nicely between the available CPU cores. Moreover, the frames definitely won't be streaming in at a rate that lines up with when the CPU cores finish processing them, so frames must be skipped and distributed smartly across the available cores to produce an even stream that doesn't freeze and jump in time. This can be done by keeping track of a moving average of the computation time per frame and distributing the CPU cores evenly across frames within that timespan.
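As a rough illustration of that pacing idea (this is only a sketch, not code from the repo; `FramePacer` and its parameters are invented for the example, and the actual code below uses a simpler heuristic based on the measured FPS): keep an exponential moving average of the per-frame processing time and only admit a new frame to the workers about every `average_time / num_workers` seconds.

```python
import time


class FramePacer:
    """Tracks a moving average of per-frame processing time and decides when
    the next frame may be handed to a worker, so that with N workers the
    finished frames come out roughly evenly spaced (illustrative sketch)."""

    def __init__(self, num_workers: int, smoothing: float = 0.9):
        self.num_workers = num_workers
        self.smoothing = smoothing
        self.avg_frame_time = 0.0   # exponential moving average of processing time
        self.last_admit = 0.0       # when we last admitted a frame

    def record(self, frame_time: float) -> None:
        # Update the moving average with the latest measured processing time.
        if self.avg_frame_time == 0.0:
            self.avg_frame_time = frame_time
        else:
            self.avg_frame_time = (
                self.smoothing * self.avg_frame_time
                + (1.0 - self.smoothing) * frame_time
            )

    def should_admit(self) -> bool:
        # Spread admissions evenly: with N workers and an average processing
        # time T, admit roughly one frame every T / N seconds and drop the rest.
        now = time.perf_counter()
        interval = self.avg_frame_time / max(self.num_workers, 1)
        if now - self.last_admit >= interval:
            self.last_admit = now
            return True
        return False
```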
To try this out for yourself, set up the repo following the manual installation, open `modules/ui.py`, and replace the `create_webcam_preview` function with the following code:
The code that makes the live webcam mode use pipelining
```python
import threading
import queue


def process_frame_pipeline(frame, source_image, frame_processors, width, height):
    temp_frame = frame.copy()
    if modules.globals.live_mirror:
        temp_frame = cv2.flip(temp_frame, 1)
    temp_frame = fit_image_to_size(
        temp_frame,
        width,
        height,
    )
    if not modules.globals.map_faces:
        for frame_processor in frame_processors:
            if frame_processor.NAME == "DLC.FACE-ENHANCER":
                if modules.globals.fp_ui["face_enhancer"]:
                    temp_frame = frame_processor.process_frame(None, temp_frame)
            else:
                temp_frame = frame_processor.process_frame(source_image, temp_frame)
    else:
        modules.globals.target_path = None
        for frame_processor in frame_processors:
            if frame_processor.NAME == "DLC.FACE-ENHANCER":
                if modules.globals.fp_ui["face_enhancer"]:
                    temp_frame = frame_processor.process_frame_v2(temp_frame)
            else:
                temp_frame = frame_processor.process_frame_v2(temp_frame)
    return temp_frame


def create_webcam_preview(camera_index: int):
    global preview_label, PREVIEW

    if not modules.globals.source_path:
        update_status("Please select a source image first")
        return

    cap = VideoCapturer(camera_index)
    if not cap.start(PREVIEW_DEFAULT_WIDTH, PREVIEW_DEFAULT_HEIGHT, 60):
        update_status("Failed to start camera")
        return

    preview_label.configure(width=PREVIEW_DEFAULT_WIDTH, height=PREVIEW_DEFAULT_HEIGHT)
    PREVIEW.deiconify()

    frame_processors = get_frame_processors_modules(modules.globals.frame_processors)
    source_image = get_one_face(cv2.imread(modules.globals.source_path))

    prev_time = time.perf_counter()
    fps_update_interval = 0.5
    frame_count = 0
    fps = 0

    NUM_WORKERS = modules.globals.execution_threads
    QUEUE_SIZE = 1
    frame_queue = queue.Queue(maxsize=QUEUE_SIZE)
    result_queue = queue.Queue(maxsize=QUEUE_SIZE)
    stop_event = threading.Event()

    def worker(
        frame_queue: queue.Queue,
        result_queue: queue.Queue,
        stop_event: threading.Event,
        width,
        height,
    ):
        while not stop_event.is_set():
            try:
                frame, start = frame_queue.get(timeout=0.1)
            except queue.Empty:
                continue
            processed = process_frame_pipeline(
                frame, source_image, frame_processors, width, height
            )
            result_queue.put((processed, time.perf_counter() - start))

    # Launch the staggered worker threads (one per execution thread)
    workers = [
        threading.Thread(
            target=worker,
            args=[
                frame_queue,
                result_queue,
                stop_event,
                PREVIEW.winfo_width(),
                PREVIEW.winfo_height(),
            ],
            daemon=True,
        )
        for _ in range(NUM_WORKERS)
    ]
    for w in workers:
        w.start()

    current_time = time.perf_counter()
    while True:
        ret, frame = cap.read()
        if not ret:
            break

        try:
            if frame_queue.empty() or (
                time.perf_counter() - current_time >= (1 / (fps + 1))
            ):
                frame_queue.put_nowait((frame, time.perf_counter()))
                # TODO: maintain order
        except queue.Full:
            pass  # drop frame if workers busy

        try:
            temp_frame, latency = result_queue.get_nowait()
            # print(latency)
        except queue.Empty:
            continue

        # FPS calculation
        current_time = time.perf_counter()
        frame_count += 1
        if current_time - prev_time >= fps_update_interval:
            fps = frame_count / (current_time - prev_time)
            frame_count = 0
            prev_time = current_time

        if modules.globals.show_fps:
            cv2.putText(
                temp_frame,
                f"FPS: {fps:.1f}",
                (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX,
                1,
                (0, 255, 0),
                2,
            )

        # Render
        image = cv2.cvtColor(temp_frame, cv2.COLOR_BGR2RGB)
        image = Image.fromarray(image)
        image = ImageOps.contain(
            image, (temp_frame.shape[1], temp_frame.shape[0]), Image.LANCZOS
        )
        image = ctk.CTkImage(image, size=image.size)
        preview_label.configure(image=image)
        ROOT.update()

        if PREVIEW.state() == "withdrawn":
            break

    stop_event.set()
    cap.release()
    PREVIEW.withdraw()
```
You'll need to experiment a bit with the number of threads you make available, depending on your system. On my M1 Pro with 10 CPU cores and 16 GB RAM, about 4 threads seems to be the sweet spot, so I run the app with `python run.py --execution-provider coreml --live-mirror --execution-threads 4`. Feel free to share what works on your setup to help others find a good number of threads.
Of course this is a rather primitive fix. There are a number of ways it could/should be improved before being merged into the repo.
For one, it currently does not do any sequencing of frames, so occasionally the output feed will jump back/forth in time. For another, it only uses pipelining with either CPU cores or GPU cores, not both. A quick improvement would be combining both CPU and GPU cores. Finally, only FPS is improved for a smoother stream. The latency is not improved. For that, less work must be done per frame, either by reusing work (and doing quick estimates of some of the values from previous frames) or making it faster (e.g., quantization, parallelizing at the model graph level).