Add support for TBB multi-threading backend in addition to OpenMP #1236

p12tic · 2022-09-23T00:06:06Z

Currently AliceVision uses OpenMP as its only multi-threading backend. This is problematic for several reasons:

OpenMP support is not uniform among compilers. In particular, Apple mobile platforms do not support OpenMP. As a result, the performance of the algorithms on those platforms is as much as 8 times lower than possible.
OpenMP critical sections are global. That is, all unnamed #pragma omp critical sections lock the same global mutex. As a consequence, it is inefficient to run multiple instances of the same parallelized AliceVision algorithm: these instances share the same mutex even though data races are possible only among the threads running a single instance of the algorithm. Ideally each instance would have its own mutex.
It is not possible to efficiently integrate third-party libraries that use another multi-threading framework, because OpenMP assumes that it is the only user of the CPU. As a result, the CPU is oversubscribed, which leads to poor performance. Note that, as currently used, OpenMP already oversubscribes the CPU all by itself if multiple instances of the same parallelized algorithm are invoked in parallel.
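
To illustrate the point about critical sections, here is a minimal sketch of the "one mutex per instance" idea using only the C++ standard library. The `Accumulator` class and the `fill` helper are hypothetical names made up for this example; they are not part of AliceVision or of this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical algorithm object: each instance owns its own mutex, so two
// instances running concurrently never contend on the same lock (unlike
// unnamed `#pragma omp critical` sections, which share one global lock).
class Accumulator
{
public:
    // Appends `value`; safe to call from several threads at once.
    void add(int value)
    {
        std::lock_guard<std::mutex> lock(_mutex);
        _values.push_back(value);
    }

    // Returns the number of values stored so far.
    std::size_t size() const
    {
        std::lock_guard<std::mutex> lock(_mutex);
        return _values.size();
    }

private:
    mutable std::mutex _mutex;  // protects _values of this instance only
    std::vector<int> _values;
};

// Starts `threadCount` threads that each add `perThread` values to `acc`,
// then waits for all of them to finish.
void fill(Accumulator& acc, int threadCount, int perThread)
{
    assert(threadCount >= 0 && perThread >= 0);
    std::vector<std::thread> threads;
    for (int t = 0; t < threadCount; ++t)
    {
        threads.emplace_back([&acc, perThread] {
            for (int i = 0; i < perThread; ++i)
                acc.add(i);
        });
    }
    for (auto& th : threads)
        th.join();
}
```

With this layout, calling `fill` on two different `Accumulator` objects from two different threads serializes only the threads that touch the same object.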

This PR takes inspiration from OpenCV and hides the use of the multi-threading framework behind an API. This will eventually allow supporting multiple multi-threading frameworks. For more details on how this works in OpenCV, see this document.

This PR implements the following:

Migrate off OpenMP synchronization primitives to standard mutexes, atomics and boost::atomic_ref (once we can use C++20 we can migrate to std::atomic_ref).
Move #pragma omp parallel uses behind an interface that exposes the system::parallelFor and system::parallelLoop functions.
Implement support for multiple underlying multi-threading backends.
Implement support for the oneTBB library as an underlying multi-threading backend.
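
As a rough illustration of the first item, the sketch below shows a shared counter updated through `std::atomic`, which is the kind of standard primitive that can take the place of `#pragma omp atomic` on a plain integer. The `countEven` function is a hypothetical example, not code from this PR, and it spawns raw `std::thread` workers only to stay self-contained.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Counts the even numbers in `data` using `threadCount` worker threads.
// The shared counter is a std::atomic<int>, playing the role that
// `#pragma omp atomic` on a plain int would play in OpenMP code.
int countEven(const std::vector<int>& data, int threadCount)
{
    assert(threadCount > 0);
    std::atomic<int> count{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threadCount; ++t)
    {
        workers.emplace_back([&, t] {
            // Strided partition: thread t handles indices t, t + threadCount, ...
            for (std::size_t i = t; i < data.size(); i += threadCount)
            {
                if (data[i] % 2 == 0)
                    count.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    // join() makes every increment visible before the final load.
    for (auto& w : workers)
        w.join();
    return count.load();
}
```

When the counter has to stay a plain variable owned by other code, an atomic reference type such as the `boost::atomic_ref` mentioned above serves the same purpose without changing the variable's type.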

As a result, OpenMP code can be converted as follows. For example, take this loop:

#pragma omp parallel for
for (int i = 10; i < size; ++i)
{
    doStuff(i);
}

The equivalent implementation of this loop using system::parallelFor is:

system::parallelFor(10, size, [&](int i)
{
    doStuff(i);
});
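
For readers wondering what sits behind such a call, the following is a minimal sketch of a `parallelFor`-style wrapper built on `std::thread`. It is an assumption-laden illustration: the `sketch` namespace, the contiguous chunking strategy and the `std::function` signature are all made up here and do not describe how the PR implements `system::parallelFor` or its OpenMP and oneTBB backends.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

namespace sketch {

// Hypothetical stand-in for a parallelFor wrapper: calls body(i) for every i
// in [begin, end), splitting the range into contiguous chunks, one per thread.
void parallelFor(int begin, int end, const std::function<void(int)>& body)
{
    assert(body && "body must be callable");
    if (begin >= end)
        return;

    const int total = end - begin;
    // hardware_concurrency() may return 0, so fall back to one thread.
    const int hw = static_cast<int>(std::max(1u, std::thread::hardware_concurrency()));
    const int threadCount = std::min(hw, total);
    const int chunk = (total + threadCount - 1) / threadCount;  // ceiling division

    std::vector<std::thread> workers;
    for (int t = 0; t < threadCount; ++t)
    {
        const int first = begin + t * chunk;
        const int last = std::min(end, first + chunk);
        if (first >= last)
            break;  // remaining chunks would be empty
        workers.emplace_back([first, last, &body] {
            for (int i = first; i < last; ++i)
                body(i);
        });
    }
    for (auto& w : workers)
        w.join();
}

} // namespace sketch
```

The caller only sees the range and the lambda, so the same call site can be served by a different backend without being rewritten, which is the point of hiding the framework behind an interface.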

As a result, performance has improved in several cases that the libgomp implementation of OpenMP currently does not handle well (many requests to parallelize relatively small problems). This reduces the need for PRs like #1277, as the multi-threading runtime handles a wider variety of tasks. For example, compared with the code before #1277, this PR made the following tests faster on a machine with an AMD 2990WX with turbo boost disabled:

test_voctree_vocabularyTreeBuild: before ~5.1-7s (extremely variable), after ~1.4s
test_voctree_kmeans: before ~5.8-10s (extremely variable), after ~4.7s
test_sfm_sequentialSfM: before ~1.7s, after ~1.5s

The PR is split into a large number of commits to allow easy bisection in case a bug slips through. This keeps the risk of the PR low, since any bug can be traced to a small commit and fixed.

fabiencastan · 2022-10-07T09:08:19Z

need a rebase

p12tic · 2022-10-07T21:34:42Z

@fabiencastan Rebased, thanks.

p12tic · 2022-10-11T04:46:09Z

@fabiencastan I've expanded the scope of this PR and it now includes full support for the oneTBB multi-threading backend. This will allow running AliceVision with multi-threading enabled on macOS.

p12tic mentioned this pull request Sep 23, 2022

Fix broken openmp atomic usage #1234

Merged

p12tic force-pushed the wrap-openmp branch from 9a5d2d1 to 6bf36af Compare September 23, 2022 00:15

p12tic changed the title ~~Wrap OpenMP invocations in an interface to support other parallelization backends in the future~~ Wrap OpenMP invocations in an interface to support other multi-threading backends in the future Sep 23, 2022

p12tic force-pushed the wrap-openmp branch 5 times, most recently from f903a85 to 54bfb78 Compare September 24, 2022 05:19

p12tic force-pushed the wrap-openmp branch from 54bfb78 to 4b8f286 Compare October 2, 2022 00:18

p12tic force-pushed the wrap-openmp branch 2 times, most recently from dc4760d to 21e395e Compare October 7, 2022 13:12

p12tic force-pushed the wrap-openmp branch 3 times, most recently from 4045093 to 8ade13b Compare October 11, 2022 04:15

p12tic changed the title ~~Wrap OpenMP invocations in an interface to support other multi-threading backends in the future~~ Add support for TBB multi-threading backend in addition to OpenMP Oct 11, 2022

p12tic force-pushed the wrap-openmp branch 2 times, most recently from 2bbcc95 to 5c5da60 Compare October 12, 2022 06:42

p12tic added 10 commits October 20, 2022 10:23

[system] Implement wrapper for openmp parallel constructs

c6de28a

[system] Implement tbb parallelism backend

0f7370a

[camera] Replace openmp parallel for with system::parallelFor()

9adc8f7

[depthMap] Replace openmp parallel for with system::parallelFor()

247054c

[feature] Replace openmp parallel for with system::parallelFor()

41ce13e

[fuseCut] Replace openmp parallel for with system::parallelFor()

a3841a0

[hdr] Replace openmp parallel for with system::parallelFor()

016071d

[image] Replace openmp parallel for with system::parallelFor()

e22616a

[imageMatching] Replace openmp parallel for with system::parallelFor()

9556fd6

[localization] Replace openmp parallel for with system::parallelFor()

0480905

p12tic added 15 commits October 20, 2022 10:23

[matching] Replace openmp parallel for with system::parallelFor()

62e780a

[matchingImageCollection] Replace openmp parallel for with parallelFor()

1c8a5af

[mesh] Replace openmp parallel for with parallelFor()

0993f57

[pipeline] Remove useless per-thread accumulator

f5159f8

The function spends more than 99.98% of its time in Estimate_T_triplet(), even in tests, which presumably operate on smaller datasets than production does. It is not worth complicating the code with a per-thread accumulator in this case.

[sfm] Replace openmp parallel for with parallelFor()

52a0999

[sfmData] Replace openmp parallel for with parallelFor()

f3acd65

[track] Replace openmp parallel for with parallelFor()

78b8668

[voctree] Replace openmp parallel for with parallelFor()

783b3cf

[pipeline] Replace openmp parallel for with parallelFor()

c15c6d9

[utils] Replace openmp parallel for with parallelFor()

284c380

[export] Replace openmp parallel for with parallelFor()

6746924

[sfm] Replace openmp parallel with parallelLoop()

aaa2573

This commit is best reviewed with whitespace changes ignored.

[track] Replace openmp parallel with parallelLoop()

9dc27fc

This commit is best reviewed with whitespace changes ignored.

[voctree] Remove unused omp parallel comment

769bbac

Remove no longer needed alicevision_omp.hpp includes

1f56df7

p12tic force-pushed the wrap-openmp branch from 5c5da60 to 1f56df7 Compare October 20, 2022 07:24

natowi mentioned this pull request Jan 2, 2023

[request] Optimize alicevision/Meshroom#1825

Closed

github-actions bot added the stale label Oct 16, 2023

github-actions bot closed this Oct 23, 2023
