Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize voctree build by up to 40 times for certain problems #1277

Merged
merged 3 commits into from
Oct 11, 2022

Conversation

p12tic
Copy link
Contributor

@p12tic p12tic commented Oct 10, 2022

This PR fixes 3 performance problems in voctree which cause inefficiency for relatively small problems:

  • all center updates depend on a single lock during clustering
  • thread pool is created per-iteration during clustering
  • thread pool is created per-iteration during k-means initialization

This PR disables parallelism in case problem size is small and introduces a lock for each cluster center to help when this is not the case. Additionally, k-means initialization now runs in parallel for trials which reduces the thread fanout when calculating new distance sum.

I haven't tested this with large problems, but at least vocabularyTreeBuild test became faster by up to 40 times on a machine with AMD 2990WX (~4s -> 0.1s). The runtime of another relevant test voctree_kmeans has been improved from ~3.7s to ~2.7s.

Note that AMD 2990WX has 32 cores/64 threads, so any lock contention problems are exacerbated there. It's not top of the line machine though, as 128 core/256 thread machines have been available for several years already.

@p12tic p12tic changed the title Optimize voctree build by up to 10 times for certain problems Optimize voctree build by up to 40 times for certain problems Oct 10, 2022
@p12tic
Copy link
Contributor Author

p12tic commented Oct 10, 2022

A further optimization pushed the improvement 10 -> 40 times. I've edited the PR description to reflect this.

@p12tic
Copy link
Contributor Author

p12tic commented Oct 10, 2022

@simogasp I've addressed your comment.

Currently al updates to new_centers array use a single lock defined by
an OpenMP critical section. This introduces unnecessary lock contention
because updates to information about different centers are independent.

Reduction of thread contention improved the runtime of
voctree_vocabularyTreeBuild test on AMD 2990WX by roughly 1.2x from
~4.0s to ~3.5s.
New thread setup dominates the runtime of clustering because it is done
for each iteration and thus thread team is repeatedly setup and torn
down. To avoid this problem, parallelism is disabled if problem is
relatively small.

This improves the runtime of voctree_vocabularyTreeBuild test on AMD
2990WX by roughly 7-10 times from ~3.5s to ~0.3-0.5s.
Currently each kmeans trial is run sequentially which is later
parallelized across all cores when summing distances. This has higher
overhead compared to performing all trials in parallel and then
parallelizing for summing because less threads are started up. numTrials
is currently 5, so the number of thread startups is up to 5 times less
than before.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants