Apple Silicon + Windows CUDA perf: 4-5x FPS, wider capture, platform routing #1775
Merged
Conversation
…uting

Bundles CoreML graph rewrites, GPU-accelerated pipeline work, Windows CUDA fixes, and Mac/Windows runtime routing into a single drop.

CoreML (Apple Silicon):
- Decompose Pad(reflect) → Slice+Concat in inswapper_128 so the model runs in one CoreML partition instead of 14 (TEMPORARY: fixed upstream in microsoft/onnxruntime#28073, drop when ORT >= 1.26.0).
- Fold Shape/Gather chains to constants in det_10g (21ms → 4ms).
- Decompose Split(axis=1) → Slice pairs in GFPGAN (155ms → 89ms).
- Route detection model to GPU so the ANE is free for the swap model.
- Centralize provider/config selection in create_onnx_session.

Pipeline (all platforms):
- Parallelize face landmark + recognition post-detection; skip landmark_2d_106 when only face_swapper is active.
- Pipeline face detection with swap for ANE overlap.
- GPU-accelerated paste_back, MJPEG capture, zero-copy display path.
- Standalone pipeline benchmark script.

Windows / CUDA:
- CUDA graphs + FP16 model + all-GPU pipeline for 1080p 60 FPS.
- Auto-detect GPU provider and fix DLL discovery for Windows CUDA execution.

Cross-platform:
- platform_info helper for Mac/Windows runtime routing.
- GFPGAN 30 fps + MSMF camera 60 fps with adaptive pipeline tuning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
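The Pad(reflect) → Slice+Concat rewrite works because reflect padding is expressible with only slices and a concat. A minimal numpy sketch of that identity (the actual rewrite operates on ONNX Slice/Concat nodes in the graph, not numpy arrays; `reflect_pad_1d` is an illustrative name):

```python
import numpy as np

def reflect_pad_1d(x: np.ndarray, p: int) -> np.ndarray:
    """Reflect-pad using only slicing and concatenation, the same
    arithmetic the Slice+Concat graph rewrite performs in place of
    a single Pad(mode=reflect) op. Assumes p <= len(x) - 1."""
    left = x[1:p + 1][::-1]      # mirror of the p elements just inside the left edge
    right = x[-p - 1:-1][::-1]   # mirror of the p elements just inside the right edge
    return np.concatenate([left, x, right])

x = np.array([1, 2, 3, 4, 5])
assert np.array_equal(reflect_pad_1d(x, 2), np.pad(x, 2, mode="reflect"))
```

The 2-D case used by inswapper_128 applies the same slice-flip-concat step once per spatial axis.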
Two issues surfaced in post-squash review of f65aeae:

1. CUDA-graph replay buffers were shared across threads with no lock. `_cuda_graph_swap_inference` mutates module-level ort_input/ort_latent and runs run_with_iobinding — concurrent swap calls on Windows/CUDA could overwrite each other's bound input buffers before replay, producing wrong-face output. Added `_cuda_graph_lock` around the full update/run/read sequence.

2. The face enhancer loop unconditionally broke after the first face, so `many_faces=True` silently enhanced only one face. The single-slot temporal cache would also paste the same enhancement onto every target if reused in many-faces mode. Gated the break on `not many_faces_mode` and disabled the cache path in that mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PERFORMANCE.md documents measured gains on MacBook Pro M3 Max vs hacksider/Deep-Live-Cam main@64d3f06:

- Face swap only: <5 FPS -> >20 FPS
- Face swap + GFPGAN: <2 FPS -> >10 FPS
- Camera: 640x480 -> 960x540 MJPEG @ 60fps

Breaks down the contributors (camera negotiation, CoreML graph rewrites with before/after op latencies, pipeline overlap, GFPGAN temporal cache, paste-back optimization, platform routing, Windows CUDA path) and how to reproduce.

REVIEW_TODOS.md captures 12 findings from two independent reviews (Claude in-tree + Codex second opinion) grouped as Blockers / Should-fix / Consider, each with file:line and suggested fix. The two Blocker/Should-fix items are addressed in the preceding commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
Reviewer's Guide

This PR substantially reworks the live video pipeline for Apple Silicon and Windows/CUDA: it adds CoreML-oriented ONNX graph optimization and platform-detection plumbing, optimizes face detection/swap/enhancement to overlap work and minimize copies, negotiates better camera formats and empirically measures FPS, introduces CUDA graph replay and FP16 model selection on NVIDIA, and wires UI/capture logic to use the new paths while documenting performance and open review items.

Sequence diagram for live webcam processing with pipelined detection and cached faces:

sequenceDiagram
actor User
participant UI as modules.ui
participant VC as VideoCapturer
participant CapTh as _capture_thread_func
participant ProcTh as _processing_thread_func
participant FA as face_analyser
participant FS as face_swapper
participant FE as face_enhancer
User->>UI: create_webcam_preview(camera_index)
UI->>VC: start(width=1920,height=1080,fps=60)
VC->>VC: open camera (MSMF/DShow), set MJPG
VC->>VC: measure actual_fps
VC-->>UI: success, actual_fps
UI->>CapTh: start capture thread
UI->>ProcTh: start processing thread(camera_fps)
loop Capture loop
CapTh->>VC: cap.read()
VC-->>CapTh: frame
CapTh->>CapTh: queue.put_nowait(frame) (drop oldest if full)
end
loop Processing loop per frame
ProcTh->>CapTh: capture_queue.get()
CapTh-->>ProcTh: temp_frame (BGR)
ProcTh->>ProcTh: update det_count
alt det_count % det_interval == 0
alt many_faces
ProcTh->>FA: detect_many_faces_fast(frame)
FA-->>ProcTh: cached_many_faces
else single_face
ProcTh->>FA: detect_one_face_fast(frame)
FA-->>ProcTh: cached_target_face
end
end
ProcTh->>ProcTh: build _cached_faces from cache
alt FE enabled
ProcTh->>FE: process_frame(None,temp_frame,detected_faces=_cached_faces)
FE->>FE: enhance_face(temp_frame,detected_faces)
FE-->>ProcTh: enhanced frame
end
alt FS enabled
ProcTh->>FS: process_frame(source_face,temp_frame,target_face=cached_target_face)
FS->>FS: swap_face(source_face,target_face,temp_frame)
FS-->>ProcTh: swapped frame
end
ProcTh->>ProcTh: cv2.cvtColor(BGR->RGB)
ProcTh->>ProcTh: processed_queue.put_nowait(rgb_frame)
end
loop Display loop via ROOT.after
UI->>UI: processed_queue.get_nowait()
UI->>UI: fit_image_to_size(rgb_frame)
UI->>UI: create CTkImage and update preview_label
end
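The capture loop above enqueues with `queue.put_nowait(frame)` and drops the oldest entry when the queue is full, so the processing thread always works on a recent frame instead of falling behind. A sketch of that pattern, assuming a single capture (producer) thread:

```python
import queue

def put_latest(q, frame):
    """Drop-oldest enqueue: keep the queue bounded so consumers
    always see a recent frame rather than a growing backlog."""
    try:
        q.put_nowait(frame)
    except queue.Full:
        try:
            q.get_nowait()        # discard the stale frame
        except queue.Empty:
            pass                  # consumer drained it first
        q.put_nowait(frame)       # safe with a single producer

q = queue.Queue(maxsize=2)
for i in range(5):
    put_latest(q, i)
# Only the two most recent frames survive.
assert [q.get_nowait(), q.get_nowait()] == [3, 4]
```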
Class diagram for new and modified core modules:

classDiagram
class platform_info {
<<module>>
+bool IS_WINDOWS
+bool IS_MACOS
+bool IS_LINUX
+bool IS_APPLE_SILICON
+bool HAS_TORCH_CUDA
+List~str~ ONNX_PROVIDERS
+bool HAS_CUDA_PROVIDER
+bool HAS_COREML_PROVIDER
+bool HAS_DML_PROVIDER
+List~(int,int)~ camera_backends()
+str accelerator_label()
+void print_banner()
}
class onnx_optimize {
<<module>>
+bool IS_APPLE_SILICON
+str optimize_for_coreml(model_path,str input_shape)
+bool _fold_shape_gather(model, input_shape)
+bool _decompose_reflect_pad(model)
+bool _decompose_split(model)
+void _preserve_emap_position(model, numpy_helper)
}
class VideoCapturer {
-int device_index
-threading.Thread capture_thread
-threading.Event _frame_ready
-bool is_running
-cv2.VideoCapture cap
+int actual_width
+int actual_height
+float actual_fps
+__init__(device_index:int)
+bool start(width:int, height:int, fps:int)
+void release()
+float _measure_fps(warmup:int, sample:int, fallback:float)
+void set_frame_callback(callback)
}
class FaceAnalyserModule {
<<module>>
+Any FACE_ANALYSER
+threading.Lock FACE_ANALYSER_LOCK
+tuple DET_SIZE
+Any get_face_analyser()
+void _optimize_det_model(fa:Any, providers)
+bool _needs_landmark()
+bool _is_dml()
+list _analyse_faces(frame)
+Any get_one_face(frame)
+Any get_many_faces(frame)
+Any detect_one_face_fast(frame)
+Any detect_many_faces_fast(frame)
}
class FaceSwapperModule {
<<module>>
+Any FACE_SWAPPER
+threading.Lock THREAD_LOCK
+bool _HAS_TORCH_CUDA
+dict _paste_cache
+dict _cuda_graph_session
+threading.Lock _cuda_graph_lock
+Any get_face_swapper()
+void _init_cuda_graph_session(model_path:str, swapper)
+np.ndarray _cuda_graph_swap_inference(blob:np.ndarray, latent:np.ndarray)
+Frame _fast_paste_back(target_img:Frame, bgr_fake:np.ndarray, aimg:np.ndarray, M:np.ndarray)
+Frame swap_face(source_face:Face, target_face:Face, temp_frame:Frame)
+Frame apply_post_processing(current_frame:Frame, swapped_face_bboxes:List)
}
class FaceEnhancerModule {
<<module>>
+onnxruntime.InferenceSession FACE_ENHANCER
+threading.Semaphore THREAD_SEMAPHORE
+bool _HAS_TORCH_CUDA
+dict _enhancer_cache
+dict _enh_live_cache
+int _ENH_INTERVAL
+onnxruntime.InferenceSession get_face_enhancer()
+tuple _align_face(frame:Frame, landmarks:np.ndarray, output_size:int)
+Frame _paste_back(frame:Frame, enhanced_face:np.ndarray, affine_matrix:np.ndarray, output_size:int)
+np.ndarray _preprocess_face(aligned_face:np.ndarray)
+np.ndarray _postprocess_face(output:np.ndarray)
+Frame enhance_face(temp_frame:Frame, detected_faces)
+Frame process_frame(source_face:Face, temp_frame:Frame, detected_faces)
+Frame process_frame_v2(temp_frame:Frame, detected_faces)
}
class OnnxEnhHelperModule {
<<module>>
+list build_provider_config(providers)
+np.ndarray run_inference(session, input_name:str, input_tensor:np.ndarray)
+onnxruntime.InferenceSession create_onnx_session(model_path:str)
}
platform_info --> VideoCapturer : camera_backends()
platform_info --> FaceSwapperModule : IS_APPLE_SILICON, HAS_TORCH_CUDA
platform_info --> FaceEnhancerModule : IS_APPLE_SILICON
onnx_optimize --> FaceSwapperModule : optimize_for_coreml()
onnx_optimize --> FaceAnalyserModule : optimize_for_coreml()
onnx_optimize --> OnnxEnhHelperModule : optimize_for_coreml()
FaceAnalyserModule --> FaceSwapperModule : detect_one_face_fast()
FaceAnalyserModule --> FaceEnhancerModule : get_many_faces()
OnnxEnhHelperModule --> FaceEnhancerModule : create_onnx_session()
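`create_onnx_session` centralizes provider/config selection so every model goes through one code path. A minimal sketch of the selection idea, assuming a fixed preference order; the provider names are real ONNX Runtime identifiers, but the ordering and helper name are illustrative (the real module would derive availability from `onnxruntime.get_available_providers()`):

```python
# Preference order: GPU/accelerator providers first, CPU as fallback.
_PREFERENCE = [
    "CUDAExecutionProvider",
    "CoreMLExecutionProvider",
    "DmlExecutionProvider",
    "CPUExecutionProvider",
]

def pick_providers(available):
    """Return the available providers in preference order,
    guaranteeing at least the CPU fallback."""
    chosen = [p for p in _PREFERENCE if p in available]
    return chosen or ["CPUExecutionProvider"]

assert pick_providers(["CPUExecutionProvider", "CoreMLExecutionProvider"]) == [
    "CoreMLExecutionProvider",
    "CPUExecutionProvider",
]
```

The returned list would then be passed as the `providers` argument when constructing the `InferenceSession`.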
Contributor
Hey - I've left some high level feedback:
- In the pipelined `_run_pipe_pipeline` path you pass the same `frame` object to `get_one_face` via a background `ThreadPoolExecutor` while also mutating `frame` through the frame processors in the main loop; consider either copying the array for the detection task or doing detection on a separate frame buffer to avoid subtle data races and inconsistent face boxes.
- The CUDA-graph integration monkey-patches `swapper.session.run` in `_init_cuda_graph_session`, which is brittle if insightface ever recreates or swaps out the session; it would be safer to wrap the call site (e.g. in `swap_face`) or add a thin adapter method on the swapper rather than replacing `session.run` in-place.
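A minimal sketch of the first suggestion: snapshot the frame before handing it to the background detector, so the result no longer depends on thread scheduling. `detect` is a hypothetical stand-in for `get_one_face`:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def detect(frame):
    # Stand-in for face detection; just summarizes the buffer it saw.
    return float(frame.sum())

executor = ThreadPoolExecutor(max_workers=1)
frame = np.ones((4, 4))

# Submit a copy, not the live buffer the frame processors mutate.
future = executor.submit(detect, frame.copy())
frame *= 0                      # main loop mutates the frame in place
assert future.result() == 16.0  # detector saw the pre-mutation snapshot
executor.shutdown()
```

Because the copy is taken before `submit` returns, the assertion holds regardless of when the worker thread actually runs; passing `frame` directly would make the result depend on the race.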
…raph monkey-patch

- core._run_pipe_pipeline: hand the background detector its own copy of the frame. The frame processors mutate in place via paste-back, which was racing with concurrent face detection on the same buffer.
- face_swapper._init_cuda_graph_session: replace the `swapper.session.run` monkey-patch with a `_CudaGraphSessionAdapter` that proxies every attribute to the underlying session and only overrides `.run()`. Guarded so repeat init does not double-wrap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Owner
Awesome! You're the man @maxwbuckley!
Summary
- CoreML graph rewrites: `Pad(reflect)` → `Slice+Concat` (inswapper_128), `Shape`/`Gather` chains folded to constants (det_10g), `Split(axis=1)` → `Slice` pairs (GFPGAN). All cached to disk with a `_coreml` suffix; one-time cost per model per machine.
- Camera negotiation: requests the `MJPG` fourcc + 960×540 @ 60 fps and measures actual FPS empirically (`CAP_PROP_FPS` lies on DirectShow).
- cuDNN DLL discovery from pip-installed `nvidia-*` wheels on Windows.
- `modules/platform_info.py` centralizes OS/accelerator detection with a startup banner confirming which code path the app took.

Measured gains (MacBook Pro M3 Max vs upstream `main@64d3f06`)

Full per-contributor breakdown with before/after op latencies in PERFORMANCE.md.

Known issues (from post-review)

Two independent code reviews (Claude in-tree + Codex second opinion) produced 12 findings, cataloged in REVIEW_TODOS.md and grouped as Blockers / Should-fix / Consider. The two highest-severity items are fixed in this PR (CUDA-graph replay race, `many_faces` enhancer loop breaking after the first face). The remaining 10 items are correctness hardening that should be addressed in follow-ups; none are merge-blockers for this PR's claimed wins.

Future cleanup

`_decompose_reflect_pad` in `modules/onnx_optimize.py` is marked `TODO(ort>=1.26)` and is deletable once the ORT floor hits 1.26.0 (fixed upstream by microsoft/onnxruntime#28073). Code-only deletion, no perf change: native MIL `pad(mode="reflect")` matches the Slice+Concat rewrite to within noise (27.2 vs 27.4 ms on this machine).

Important

All numbers in this PR were measured on Apple Silicon (M3 Max). The Windows/CUDA code paths (CUDA graphs, FP16 model selection, MSMF→DSHOW camera fallback, DLL discovery for torch/lib + nvidia-*) need end-to-end reverification on a Windows + NVIDIA machine before this is ready for upstream merge.
Specifically check:
- `_cuda_graph_lock` addition (commit 4d04e83).
- `CAP_MSMF` → `CAP_DSHOW` fallback opens the camera.
- run.py finds cuDNN from both torch/lib and pip-installed `nvidia-*` wheels.
- Races under `multi_process_frame` (the CUDA-graph lock should prevent them; confirm empirically).
- `VideoCapturer`: 960x540 @ ~60fps, inswapper runs as one CoreML partition, face swap >20 FPS, face swap + GFPGAN >10 FPS.
- `many_faces` mode: GFPGAN enhances every detected face (the fix in 4d04e83 restores this).

🤖 Generated with Claude Code
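The test plan's FPS checks rely on the empirical measurement mentioned in the summary, since `CAP_PROP_FPS` is unreliable on some backends. A sketch of that probe, with the signature assumed from the class diagram's `_measure_fps(warmup, sample, fallback)` (the fake camera below is purely illustrative):

```python
import time

def measure_fps(read_frame, warmup=5, sample=20, fallback=30.0):
    """Time `sample` frame reads after discarding `warmup` reads,
    rather than trusting the driver-reported CAP_PROP_FPS value."""
    for _ in range(warmup):
        read_frame()                      # let exposure/buffering settle
    start = time.perf_counter()
    for _ in range(sample):
        read_frame()
    elapsed = time.perf_counter() - start
    return sample / elapsed if elapsed > 0 else fallback

# Fake camera delivering roughly 100 fps for the sketch.
fps = measure_fps(lambda: time.sleep(0.01))
assert 50 < fps < 150
```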
Summary by Sourcery
Improve real-time face swap/enhancement performance and platform handling across Apple Silicon and Windows/CUDA, including CoreML/CUDA optimizations, camera negotiation, and pipeline overlap.