Description
Since lots of people are running into low frame rates (1-3 FPS) when using the live webcam mode, I thought I'd document a simple way to improve this. It is especially relevant on macOS, where GPU acceleration via MPS isn't fully supported, but it might also help some Nvidia GPU setups. It is definitely useful if you have no GPU (or a very weak one) and have to run everything on the CPU.
Explanation of the problem and solution
Essentially, the live webcam mode is slow because it does everything sequentially and mostly on the CPU, meaning it has to skip a lot of frames while it is calculating. The problem lies in using just one CPU core to first find the face, then align it, then swap it, on repeat. Most machines have multiple CPU cores, and for a pre-recorded video they can easily be utilized by computing multiple frames in parallel (which the app already does). However, when streaming frames live from the webcam, we don't have access to future frames (after all, they haven't happened yet), so we can't start computing them in parallel. We could wait for some to arrive and then compute those in parallel, but that introduces a delay/latency that feels just as bad as a low frame rate.
The solution is pipelining (the same technique that your CPU already uses at the instruction level to optimize memory fetching, etc.!). In simplified terms: start finding the face in the first frame as soon as it arrives; when the second frame arrives, start finding the face in it while concurrently aligning the face from the first frame; then swap the face in the first frame while concurrently aligning the face in the second frame and finding the face in the third frame; and so on. That roughly looks like this:

In this simplified representation, the pipelined approach finishes 3 frames in the same time that the sequential approach finishes 1.67 frames, both with the same latency (the time between something happening in the camera feed and the corresponding change being rendered in the face-swapped stream). Of course, in reality there are many more tasks than just find, align, and swap, and not all of the tasks take the same amount of time, so they won't fit as nicely between the available CPU cores. Moreover, the frames definitely won't be streaming in at a rate that lines up with when the CPU cores finish processing them, so frames must be skipped and distributed smartly across the available cores to produce an even stream that doesn't freeze and jump in time. This can be done by keeping track of a moving average of the computation time per frame and distributing the CPU cores evenly across frames within that timespan.
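As a rough illustration of that pacing idea (this is only a sketch, not code from the repo; `FramePacer` and its parameters are invented for the example, and the actual code below uses a simpler heuristic based on the measured FPS): keep an exponential moving average of the per-frame processing time and only admit a new frame to the workers about every `average_time / num_workers` seconds.

```python
import time


class FramePacer:
    """Tracks a moving average of per-frame processing time and decides when
    the next frame may be handed to a worker, so that with N workers the
    finished frames come out roughly evenly spaced (illustrative sketch)."""

    def __init__(self, num_workers: int, smoothing: float = 0.9):
        self.num_workers = num_workers
        self.smoothing = smoothing
        self.avg_frame_time = 0.0   # exponential moving average of processing time
        self.last_admit = 0.0       # when we last admitted a frame

    def record(self, frame_time: float) -> None:
        # Update the moving average with the latest measured processing time.
        if self.avg_frame_time == 0.0:
            self.avg_frame_time = frame_time
        else:
            self.avg_frame_time = (
                self.smoothing * self.avg_frame_time
                + (1.0 - self.smoothing) * frame_time
            )

    def should_admit(self) -> bool:
        # Spread admissions evenly: with N workers and an average processing
        # time T, admit roughly one frame every T / N seconds and drop the rest.
        now = time.perf_counter()
        interval = self.avg_frame_time / max(self.num_workers, 1)
        if now - self.last_admit >= interval:
            self.last_admit = now
            return True
        return False
```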
To try this out for yourself, set up the repo following the manual installation, open `modules/ui.py`, and replace the `create_webcam_preview` function with the following code:
The code that makes the live webcam mode use pipelining
```python
import threading
import queue


def process_frame_pipeline(frame, source_image, frame_processors, width, height):
    temp_frame = frame.copy()
    if modules.globals.live_mirror:
        temp_frame = cv2.flip(temp_frame, 1)
    temp_frame = fit_image_to_size(
        temp_frame,
        width,
        height,
    )
    if not modules.globals.map_faces:
        for frame_processor in frame_processors:
            if frame_processor.NAME == "DLC.FACE-ENHANCER":
                if modules.globals.fp_ui["face_enhancer"]:
                    temp_frame = frame_processor.process_frame(None, temp_frame)
            else:
                temp_frame = frame_processor.process_frame(source_image, temp_frame)
    else:
        modules.globals.target_path = None
        for frame_processor in frame_processors:
            if frame_processor.NAME == "DLC.FACE-ENHANCER":
                if modules.globals.fp_ui["face_enhancer"]:
                    temp_frame = frame_processor.process_frame_v2(temp_frame)
            else:
                temp_frame = frame_processor.process_frame_v2(temp_frame)
    return temp_frame


def create_webcam_preview(camera_index: int):
    global preview_label, PREVIEW

    if not modules.globals.source_path:
        update_status("Please select a source image first")
        return

    cap = VideoCapturer(camera_index)
    if not cap.start(PREVIEW_DEFAULT_WIDTH, PREVIEW_DEFAULT_HEIGHT, 60):
        update_status("Failed to start camera")
        return

    preview_label.configure(width=PREVIEW_DEFAULT_WIDTH, height=PREVIEW_DEFAULT_HEIGHT)
    PREVIEW.deiconify()

    frame_processors = get_frame_processors_modules(modules.globals.frame_processors)
    source_image = get_one_face(cv2.imread(modules.globals.source_path))

    prev_time = time.perf_counter()
    fps_update_interval = 0.5
    frame_count = 0
    fps = 0

    NUM_WORKERS = modules.globals.execution_threads
    QUEUE_SIZE = 1
    frame_queue = queue.Queue(maxsize=QUEUE_SIZE)
    result_queue = queue.Queue(maxsize=QUEUE_SIZE)
    stop_event = threading.Event()

    def worker(
        frame_queue: queue.Queue,
        result_queue: queue.Queue,
        stop_event: threading.Event,
        width,
        height,
    ):
        while not stop_event.is_set():
            try:
                frame, start = frame_queue.get(timeout=0.1)
            except queue.Empty:
                continue
            processed = process_frame_pipeline(
                frame, source_image, frame_processors, width, height
            )
            result_queue.put((processed, time.perf_counter() - start))

    # Launch the staggered worker threads (one per execution thread)
    workers = [
        threading.Thread(
            target=worker,
            args=[
                frame_queue,
                result_queue,
                stop_event,
                PREVIEW.winfo_width(),
                PREVIEW.winfo_height(),
            ],
            daemon=True,
        )
        for _ in range(NUM_WORKERS)
    ]
    for w in workers:
        w.start()

    current_time = time.perf_counter()
    while True:
        ret, frame = cap.read()
        if not ret:
            break

        try:
            if frame_queue.empty() or (
                time.perf_counter() - current_time >= (1 / (fps + 1))
            ):
                frame_queue.put_nowait((frame, time.perf_counter()))
                # TODO: maintain order
        except queue.Full:
            pass  # drop frame if workers busy

        try:
            temp_frame, latency = result_queue.get_nowait()
            # print(latency)
        except queue.Empty:
            continue

        # FPS calculation
        current_time = time.perf_counter()
        frame_count += 1
        if current_time - prev_time >= fps_update_interval:
            fps = frame_count / (current_time - prev_time)
            frame_count = 0
            prev_time = current_time

        if modules.globals.show_fps:
            cv2.putText(
                temp_frame,
                f"FPS: {fps:.1f}",
                (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX,
                1,
                (0, 255, 0),
                2,
            )

        # Render
        image = cv2.cvtColor(temp_frame, cv2.COLOR_BGR2RGB)
        image = Image.fromarray(image)
        image = ImageOps.contain(
            image, (temp_frame.shape[1], temp_frame.shape[0]), Image.LANCZOS
        )
        image = ctk.CTkImage(image, size=image.size)
        preview_label.configure(image=image)
        ROOT.update()

        if PREVIEW.state() == "withdrawn":
            break

    stop_event.set()
    cap.release()
    PREVIEW.withdraw()
```
You'll need to experiment a bit with the number of threads you make available, depending on your system. On my M1 Pro with 10 CPU cores and 16 GB RAM, about 4 threads seems to be the sweet spot, so I run the app with `python run.py --execution-provider coreml --live-mirror --execution-threads 4`. Feel free to share what works on your setup to help others find a good number of threads.
Of course this is a rather primitive fix. There are a number of ways it could/should be improved before being merged into the repo.
For one, it currently does not do any sequencing of frames, so occasionally the output feed will jump back/forth in time. For another, it only uses pipelining with either CPU cores or GPU cores, not both. A quick improvement would be combining both CPU and GPU cores. Finally, only FPS is improved for a smoother stream. The latency is not improved. For that, less work must be done per frame, either by reusing work (and doing quick estimates of some of the values from previous frames) or making it faster (e.g., quantization, parallelizing at the model graph level).