-
-
Notifications
You must be signed in to change notification settings - Fork 822
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize voctree build by up to 40 times for certain problems #1277
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
p12tic
force-pushed
the
optimize-voctree
branch
from
October 10, 2022 03:22
085885f
to
9b0aa70
Compare
p12tic
changed the title
Optimize voctree build by up to 10 times for certain problems
Optimize voctree build by up to 40 times for certain problems
Oct 10, 2022
A further optimization pushed the improvement 10 -> 40 times. I've edited the PR description to reflect this. |
simogasp
reviewed
Oct 10, 2022
p12tic
force-pushed
the
optimize-voctree
branch
from
October 10, 2022 19:17
9b0aa70
to
fd55c80
Compare
@simogasp I've addressed your comment. |
Currently al updates to new_centers array use a single lock defined by an OpenMP critical section. This introduces unnecessary lock contention because updates to information about different centers are independent. Reduction of thread contention improved the runtime of voctree_vocabularyTreeBuild test on AMD 2990WX by roughly 1.2x from ~4.0s to ~3.5s.
New thread setup dominates the runtime of clustering because it is done for each iteration and thus thread team is repeatedly setup and torn down. To avoid this problem, parallelism is disabled if problem is relatively small. This improves the runtime of voctree_vocabularyTreeBuild test on AMD 2990WX by roughly 7-10 times from ~3.5s to ~0.3-0.5s.
Currently each kmeans trial is run sequentially which is later parallelized across all cores when summing distances. This has higher overhead compared to performing all trials in parallel and then parallelizing for summing because less threads are started up. numTrials is currently 5, so the number of thread startups is up to 5 times less than before.
p12tic
force-pushed
the
optimize-voctree
branch
from
October 11, 2022 04:53
fd55c80
to
ae283c2
Compare
simogasp
approved these changes
Oct 11, 2022
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR fixes 3 performance problems in voctree which cause inefficiency for relatively small problems:
This PR disables parallelism in case problem size is small and introduces a lock for each cluster center to help when this is not the case. Additionally, k-means initialization now runs in parallel for trials which reduces the thread fanout when calculating new distance sum.
I haven't tested this with large problems, but at least vocabularyTreeBuild test became faster by up to 40 times on a machine with AMD 2990WX (~4s -> 0.1s). The runtime of another relevant test
voctree_kmeans
has been improved from ~3.7s to ~2.7s.Note that AMD 2990WX has 32 cores/64 threads, so any lock contention problems are exacerbated there. It's not top of the line machine though, as 128 core/256 thread machines have been available for several years already.