feat: Parallelised folding in Gemini #550
Conversation
I'm still digging into the details, but I have to admit I'm finding the code quite hard to read. I think you said that you originally tried simply parallelizing each round of folding but that the gain was not significant. It seems like your current approach is to split portions of the d-many folding rounds across individual threads. Is this a significant improvement over the first approach? I ask because, as far as I can tell, the first approach would literally be a one-line change. If the difference is not significant, perhaps the added complexity is not worth it.
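For reference, the "one-line change" version of a single folding round might look roughly like the sketch below. The `parallel_for` here is a hypothetical stand-in for the helper in the codebase (signature assumed: iteration count plus a callback), and `double` stands in for a field element purely for illustration.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for the codebase's parallel_for helper:
// runs func(0), ..., func(num_iterations - 1), one thread per iteration.
void parallel_for(size_t num_iterations, const std::function<void(size_t)>& func)
{
    std::vector<std::thread> threads;
    threads.reserve(num_iterations);
    for (size_t i = 0; i < num_iterations; ++i) {
        threads.emplace_back(func, i);
    }
    for (auto& t : threads) {
        t.join();
    }
}

// "One-line change" version of one folding round: each output entry
// A_{l+1}[j] = (1 - u) * A_l[2j] + u * A_l[2j+1] is independent of the
// others, so the serial loop body can be handed to parallel_for directly.
std::vector<double> fold_round(const std::vector<double>& A_l, double u)
{
    const size_t n = A_l.size() / 2;
    std::vector<double> A_next(n);
    parallel_for(n, [&](size_t j) { A_next[j] = (1 - u) * A_l[2 * j] + u * A_l[2 * j + 1]; });
    return A_next;
}
```

The trade-off under discussion is exactly this: the version above parallelizes within each round (simple, but pays synchronization costs d times), whereas the PR splits chunks of each round's work across threads with explicit chunk-size bookkeeping.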
ledwards2225
left a comment
Very nice. Most of my comments are just points for general discussion around multithreading.
const size_t num_variables = mle_opening_point.size(); // m

const size_t num_threads = get_num_cpus_pow2();
constexpr size_t efficient_operations_per_thread = 64; // A guess of the number of operation for which there
We can always play with this later, but I'm just curious whether this is motivated by some experimentation or if it's really just a guess.
It's a guess. I think it would be helpful to actually check what parallel_for is useful for. I'll add an issue to concretely evaluate the size of computation for which it is efficient.
// A_l_fold = Aₗ₊₁(X) = (1-uₗ)⋅even(Aₗ)(X) + uₗ⋅odd(Aₗ)(X)
auto A_l_fold = fold_polynomials[l + offset_to_folded].data();

parallel_for(num_used_threads, [&](size_t i) {
As far as I can tell, parallel_for will determine the number of threads to use via OpenMP and essentially just ignores the input num_used_threads. Maybe this is fine, but I wonder if we should update parallel_for to call omp_set_num_threads() so the number of threads that is input is actually enforced.
I suppose since parallel_for defines a loop based on the number of threads it gets as input, num_used_threads will indeed be the maximum number of threads used for multithreading; there is just no guarantee that it won't be fewer.
Yes, that's why I get the value of the number of available threads first and then compute num_used_threads based on that.
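To illustrate the point being made here: if parallel_for dispatches one task per iteration index (the stand-in below is hypothetical; the real helper may use fewer threads), then passing num_used_threads as the iteration count makes it an upper bound on the worker threads actually used.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// Hypothetical stand-in parallel_for: one task per iteration index.
void parallel_for(size_t num_iterations, const std::function<void(size_t)>& func)
{
    std::vector<std::thread> threads;
    threads.reserve(num_iterations);
    for (size_t i = 0; i < num_iterations; ++i) {
        threads.emplace_back(func, i);
    }
    for (auto& t : threads) {
        t.join();
    }
}

// Count the distinct worker-thread ids observed: with the stand-in above,
// this equals the iteration count, i.e. num_used_threads bounds the
// parallelism, as noted in the review thread.
size_t count_worker_threads(size_t num_used_threads)
{
    std::mutex m;
    std::set<std::thread::id> ids;
    parallel_for(num_used_threads, [&](size_t /*chunk_index*/) {
        std::lock_guard<std::mutex> guard(m);
        ids.insert(std::this_thread::get_id());
    });
    return ids.size();
}
```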
size_t num_used_threads = std::min(n_l / efficient_operations_per_thread, num_threads);
num_used_threads = num_used_threads ? num_used_threads : 1;
size_t chunk_size = n_l / num_used_threads;
size_t last_chunk_size = (n_l % chunk_size) ? (n_l % num_used_threads) : chunk_size;
I think this robustness is nice to have but I wonder if in practice we can always assume (otherwise, enforce) that num_threads is a power of 2 and thus divides n_l.
Lots of 6-core laptops out there :)
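As a concrete illustration of why the remainder handling matters on non-power-of-2 machines: with 6 threads, a power-of-2 round size n_l is never evenly divisible. The sketch below mirrors the names in the diff; folding the division remainder into the last chunk is one simple way (an assumption of this sketch, not necessarily the PR's exact formula) to keep the chunk sizes summing to n_l.

```cpp
#include <algorithm>
#include <cstddef>

struct Chunking {
    size_t num_used_threads;
    size_t chunk_size;      // size handled by threads 0 .. num_used_threads - 2
    size_t last_chunk_size; // size handled by the last thread
};

// Sketch of the chunking arithmetic for a folding round of size n_l:
// use at most one thread per `efficient_operations_per_thread` operations
// (and at least one thread), then give the remainder of the division to
// the last thread so non-power-of-2 thread counts still cover all of n_l.
Chunking compute_chunking(size_t n_l, size_t num_threads, size_t efficient_operations_per_thread)
{
    size_t num_used_threads = std::min(n_l / efficient_operations_per_thread, num_threads);
    num_used_threads = num_used_threads ? num_used_threads : 1;
    const size_t chunk_size = n_l / num_used_threads;
    const size_t last_chunk_size = chunk_size + (n_l % num_used_threads);
    return { num_used_threads, chunk_size, last_chunk_size };
}
```

For example, n_l = 1024 with 6 available threads gives five chunks of 170 and a last chunk of 174, which together cover all 1024 entries.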
Description
The folding operation in Gemini now uses parallelism. I also split gemini.hpp into gemini.cpp and gemini.hpp for faster compilation.
Checklist:
- Functions are documented with a `@brief` describing the intended functionality.
- Necessary `include` directives have been added.