Conversation

@cavusmustafa (Owner) commented Oct 15, 2025

  • Some updates in CMakeLists and requirements to resolve build/export issues with the latest version.
  • New pipeline which reduces idle time for each stage and improves overall throughput: while the NPU or GPU is busy performing inference on one frame, the CPU can simultaneously preprocess the next frame and postprocess the previous one (a rough sketch of the scheduling pattern is included after the results below).
Configuration    FPS
XNNPACK          3.5
CPU FP32         6.9
CPU INT8        13.8
GPU FP16        52.3
NPU FP16        64.5

CPU: Intel(R) Core(TM) Ultra 5 238V
Model: Yolo12s
Model input size: 640x640
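
For illustration, here is a minimal, self-contained C++ sketch of the scheduling pattern described above. The preprocess/run_inference/postprocess functions are hypothetical stand-ins (the real example uses OpenCV letterboxing, an ExecuTorch Module forward call, and YOLO decoding); only the way the stages overlap is the point, and this is not the exact code added in the PR.

#include <future>
#include <iostream>
#include <mutex>

struct Frame { int id; };
struct Tensor { int id; };
struct Detections { int id; };

// Hypothetical stand-ins for the real stages.
Tensor preprocess(Frame f) { return Tensor{f.id}; }              // CPU: resize/pad
Detections run_inference(Tensor t) { return Detections{t.id}; }  // NPU/GPU
void postprocess(Detections d) { std::cout << "frame " << d.id << " done\n"; }  // CPU: decode/draw

int main() {
  constexpr int num_frames = 8;
  std::mutex infer_mutex;  // inference itself stays sequential, as in the PR

  std::future<Tensor> pre = std::async(std::launch::async, preprocess, Frame{0});
  std::future<void> post;  // postprocessing of the previous frame, if any

  for (int i = 0; i < num_frames; ++i) {
    Tensor input = pre.get();      // wait for frame i's preprocessing
    if (i + 1 < num_frames) {      // immediately start preprocessing frame i+1
      pre = std::async(std::launch::async, preprocess, Frame{i + 1});
    }
    // Infer frame i while frame i+1 is preprocessed and frame i-1 is postprocessed.
    std::future<Detections> infer = std::async(std::launch::async, [&, input] {
      std::lock_guard<std::mutex> guard(infer_mutex);
      return run_inference(input);
    });
    if (post.valid()) post.get();  // finish frame i-1's postprocessing
    post = std::async(std::launch::async, postprocess, infer.get());
  }
  if (post.valid()) post.get();    // drain the last frame
  return 0;
}

With this pattern, the NPU/GPU is kept busy while the CPU handles the neighbouring frames, which is where the throughput gain in the table above comes from.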

Comment on lines +32 to +34
find_package(absl CONFIG REQUIRED PATHS ${EXECUTORCH_ROOT}/cmake-out)
find_package(re2 CONFIG REQUIRED PATHS ${EXECUTORCH_ROOT}/cmake-out)
find_package(tokenizers CONFIG REQUIRED PATHS ${EXECUTORCH_ROOT}/cmake-out)

Why do we need tokenizers and other dependencies for YOLO?

@cavusmustafa (Owner, Author):

Are you able to build it without these? The YOLO example doesn't use them directly, but I thought some of the dependencies needed them. I will check again whether we can build without them.

@cavusmustafa (Owner, Author):

Without these I see the error below:

CMake Error at CMakeLists.txt:37 (find_package):
  Found package configuration file:

    /home/mcavus/executorch/executorch/cmake-out/lib/cmake/ExecuTorch/executorch-config.cmake

  but it set executorch_FOUND to FALSE so package "executorch" is considered
  to be NOT FOUND.  Reason given by package:

  The following imported targets are referenced, but are missing:
  tokenizers::tokenizers

@daniil-lyakhov commented Oct 16, 2025

Yes, the example was fully functional when it was merged. That's a bit strange; how do you build the example? You can find a test script here: https://github.com/pytorch/executorch/blob/main/.ci/scripts/test_yolo12.sh

I think the Meta folks could potentially help with that.

@cavusmustafa (Owner, Author):

The build commands below should work, right? I can reproduce the issue with the main branch. I see a similar error with either the OV backend or XNNPACK, by the way. I will ping them on Discord.

rm -rf build
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DUSE_XNNPACK_BACKEND=OFF -DUSE_OPENVINO_BACKEND=ON ..
make -j$(nproc)

Comment on lines +131 to +138
while (!ready_q.empty() && scale_q.size() < frame_queue_size) {
  frame_ctx *scale_f = ready_q.front();
  scale_q.push(std::make_pair(
      scale_f,
      std::async(std::launch::async, scale_with_padding,
                 std::ref(scale_f->frame), &(scale_f->pad_x),
                 &(scale_f->pad_y), &(scale_f->scale), img_dims)));
  ready_q.pop();
}
const et_timestamp_t after_execute = et_pal_current_ticks();
time_spent_executing += after_execute - before_execute;
iters++;

if (!(iters % progress_bar_tick)) {
  const int precent_ready = (100 * iters) / video_lenght;
  std::cout << iters << " out of " << video_lenght
            << " frames are processed (" << precent_ready << "%)"
            << std::endl;
  while (!scale_q.empty() && input_q.size() < frame_queue_size) {
    auto status = scale_q.front().second.wait_for(std::chrono::milliseconds(1));
    if (status == std::future_status::ready) {

@daniil-lyakhov commented Oct 15, 2025

General questions:

  1. It looks like you are implementing an inference request queue; is it possible to use the standard OpenVINO API for this somehow?
  2. This is a real-time demo: the data is streamed sequentially and should be consumed sequentially. How does that work with your update?
  3. I believe it is unfair to collect only the model inference time, without pre- and post-processing, and present it as FPS stats. In a real application, pre- and post-processing will affect the FPS.

In general, could you please state the motivation behind this PR? What are the purpose and improvements it introduces?

@cavusmustafa (Owner, Author) commented Oct 15, 2025

  1. We could try using an async call inside the OpenVINO backend. That way we could simply call the forward function from the ExecuTorch application and let OpenVINO schedule the tasks (a sketch of that API is included after this list for context). But I found two issues with it (explained below), and I don't think we need it anyway: model inference still executes sequentially because we hold a mutex lock around that part, and we don't need asynchronous model inference for this use case. I explain this further in 2.
    • We claim to support XNNPACK with this application as well. We might need to add a lot of OpenVINO-only customizations in that case.
    • A single ExecuTorch module seems to use the same output buffer for all executions, so an upcoming task may overwrite the result of the previous one. This seems risky (and it fails for XNNPACK). We could create multiple ExecuTorch modules, but then I don't know whether they would share the same OpenVINO backend object. If they don't, we may not be able to use async execution as intended and we may have additional memory overhead.
  2. We can assume the data is streamed sequentially and consumed sequentially, but we can still pipeline preprocess, inference, and postprocess, which was the intention of this PR. As the first frame completes preprocessing on the CPU and starts model execution on the GPU (or NPU), we can also start preprocessing the second frame as soon as it is ready. Once the first frame completes on the GPU, the second frame can be assigned to the GPU while the first frame starts postprocessing.
    Also, in a real-time stream it will be better to limit the size of the ready queue (maybe 2 or even 1). A larger ready queue can cause delays in the output video.
  3. I didn't understand this part. The time measurement should already cover the end-to-end object detection process (timing is collected before and after the whole while loop). iters is incremented only when a frame retires. At the end, FPS is calculated from the total while-loop time and the total number of frames retired (a minimal sketch of this accounting also follows the list).
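
For context on point 1, a minimal sketch of the standard OpenVINO asynchronous API, as used directly outside ExecuTorch, is below. The model path, device name, and input shape are placeholders, and this is not how the example in this PR is built; it only illustrates the interface the question refers to.

#include <openvino/openvino.hpp>

int main() {
  ov::Core core;
  // Placeholder model path and device name.
  auto model = core.read_model("yolo12s.xml");
  ov::CompiledModel compiled = core.compile_model(model, "NPU");
  ov::InferRequest request = compiled.create_infer_request();

  // Placeholder input tensor; the real example would fill it with the
  // preprocessed (letterboxed) frame.
  ov::Tensor input(ov::element::f32, ov::Shape{1, 3, 640, 640});
  request.set_input_tensor(input);

  request.start_async();  // the device runs inference in the background
  // ... the CPU is free here to preprocess the next frame or postprocess
  // the previous one ...
  request.wait();         // block until the device finishes

  ov::Tensor output = request.get_output_tensor();
  (void)output;           // YOLO decoding / NMS would happen here
  return 0;
}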
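For point 3, here is a minimal sketch of the end-to-end FPS accounting described above, using std::chrono instead of the et_pal_current_ticks() platform helpers the example actually uses (so the timing primitives here are an assumption, not the example's exact code):

#include <chrono>
#include <cstdint>
#include <iostream>

int main() {
  const auto start = std::chrono::steady_clock::now();
  std::uint64_t frames_retired = 0;

  // ... the whole pipelined while loop runs here; frames_retired is
  // incremented only when a frame finishes postprocessing ("retires") ...

  const auto end = std::chrono::steady_clock::now();
  const double seconds = std::chrono::duration<double>(end - start).count();
  // End-to-end FPS: pre- and post-processing are inside the timed region.
  const double fps = frames_retired / seconds;
  std::cout << "FPS: " << fps << std::endl;
  return 0;
}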

Got it, thanks
