dynamic backend implementation #126
Conversation
Nice work getting 5x speedup on the small scales!
utilities/include/par.h (outdated)

```cpp
THRUST_DYNAMIC_BACKEND_VOID(gather, )
THRUST_DYNAMIC_BACKEND_VOID(gather_if, )
THRUST_DYNAMIC_BACKEND_VOID(remove_if, _void)
THRUST_DYNAMIC_BACKEND_VOID(unique, _void)
```
What's the story with just a few having `_void`?
To avoid name collisions with the ones defined below with `THRUST_DYNAMIC_BACKEND`. The reason for having these void variants is to avoid needing to specify the return type via `remove_if<RET>` when we don't need it. I did this a few days ago but I'm not sure it's still needed; I'll have a look and remove these variants if they are unused.
```cpp
THRUST_DYNAMIC_BACKEND(remove, void)
THRUST_DYNAMIC_BACKEND(copy_if, void)
THRUST_DYNAMIC_BACKEND(remove_if, void)
THRUST_DYNAMIC_BACKEND(unique, void)
```
What's the difference between `THRUST_DYNAMIC_BACKEND_VOID(copy, )` and e.g. `THRUST_DYNAMIC_BACKEND(copy_if, void)`?
Thanks for the simplification, but I still don't have an answer to this question. Also, I thought these functions do have return values, so `void` seems odd?
Oh, I forgot about this. I was lazy when implementing these functions and forgot to give them a return type. Will fix.
```cpp
void check_cuda_available() {
  int device_count = 0;
  cudaError_t error = cudaGetDeviceCount(&device_count);
  CUDA_ENABLED = device_count != 0;
}
```
👍
utilities/include/par.h (outdated)

```cpp
      thrust::NAME(thrust::cuda::par, args...); \
      break;                                    \
    case ExecutionPolicy::Par:                  \
      thrust::NAME(thrust::omp::par, args...);  \
```
So, if you wanted TBB, that would replace OMP here, right? Maybe for build rules we need something like `ParUnseq = CUDA | NONE` and `Par = OMP | TBB | NONE`? I guess `NONE` is mostly about building for compilers that don't support these.
Yes
I guess I will update this and the CMake script after #125 is merged.
Ah, forgot that the build now depends on OMP; will fix the CMake script.
The OMP test failure is weird; not exactly sure what is going on. Managed to reproduce it in Docker with the settings in the CI.
I tried running the program with thread sanitizer enabled. On my machine (NixOS), I only got data race warnings for
as MSVC does not support them and we don't really need to use these.
This PR is looking good to me; I'd like to test it locally a bit first. Do you feel like it's ready to merge? What behavior are you seeing from the OMP build on Ubuntu?
I think this should be ready to merge after fixing the minor issue above (forgot to specify the return type of some functions) and cleaning up the formatting. For the OMP build on Ubuntu, the Boolean.Gyroid test failed with:

Only with
Oh, that happens a lot; just change the value to 43 and call it good. It just serves to check that the numbers don't suddenly blow up (in which case edge collapse is broken), but the actual value is arbitrary and tends to shift.
BTW, do you want to add a formatting commit to this? And if you do, do you have a clang-format file that you use, or should I just use the default?
Looks like an issue with the GitHub API. Quite a lot of outages recently.
A format commit is a good idea. I just have my VSCode set up to apply Google style with clang-format every time I save. But probably better to put it in the CI. 👍
Also, should the CI just build the limited version of Assimp and avoid the other formats, now that you have that flag? In fact, that could even be the default...
Yes, it is now the default.
The tests are looking good on my side, thanks!
@pca006132 After a couple days of baffled debugging, I finally found the root of a crazy problem that showed up in a test for #187. I thought you might appreciate what I found; I went to bed convincing myself the compiler was broken, but woke up realizing the truth. I was seeing manifold errors in the Boolean (printed assertions like

My first guess was that my broad phase had an error, so I wasn't checking for all the intersections I should be. I tested this by eliminating it and forcing a brute-force check of every intersection. Sure enough, this made CUDA work fine again. But delving into the code, I couldn't find an error anywhere. Finally I got far enough down that I found a case where I was getting two slightly different results from my

Finally I remembered: we're using two compilers, one for CPU and one for CUDA. And the choice of which version of each function to use is determined at run time by the vector length. It turns out one pass was long, running CUDA, and one was short, running on the CPU. They must use subtly different operations or ordering that occasionally result in slightly different rounding errors. Overriding the policy to force them onto the same compiled function fixed the issue and verified my hunch. It also explained why my brute-force approach seemed to fix the problem: it made all the vectors long enough to choose CUDA.

Anyway, I'm going to change the policy system in #187, so that the policy becomes a member of the Boolean class, evaluated only once or twice and then used consistently throughout the internal operations. I think this may even improve performance a touch, as the real cost of CUDA is in moving the data between the host and device; even short vectors are fast as long as the data is already on the device. This should help us be more consistent.
@elalish Nice! This is actually the way I wanted to implement it, to avoid moving data many times, but I was a bit too lazy to modify the Boolean class to make it store the policy. It seems pretty weird to me, though, that a slight rounding difference between CPU and GPU would cause such a failure: they should both abide by IEEE 754, so the difference probably comes from increased intermediate precision. As mentioned in the CUDA documentation, there are quite a lot of low-level quirks that may give slightly different (slightly more precise) rounding. I wonder if it is possible to make it more robust against this kind of error. Can you point me to the part of the paper that has such a limitation on floating-point operations?
So, the sensitivity isn't really due to the paper, but to the way I optimized it. The issue is Shadow 01 (vertex-edge); the first time I implemented it, I calculated these values once and stored them, which meant there was no sensitivity. However, that was a lot of memory, and I found it quite a bit faster to recalculate them as needed rather than storing them. That meant I needed repeatable values (the value doesn't matter, only that they match). That also wasn't a problem until we started interchanging the two compilers.
Oh OK, this makes sense to me now. I guess we should also document it so we'll be aware of this in the future.
dynamic backend implementation
By reinventing the wheel, we can switch backends dynamically depending on the workload, get better performance, and run a CUDA-enabled build on machines without CUDA (it switches to OMP automatically).
This patch is large because I had to change every single `thrust::ALGORITHM` call to a custom function that determines the backend from the execution policy, where the execution policy is determined by the workload size (simply comparing against some constant). We cannot simply pass a different execution policy to Thrust, because each execution policy has a different type, so the implementation here uses macros to build the functions and a switch to choose the correct invocation.

CUDA GPU detection is done by checking `cudaGetDeviceCount` and setting a global variable. If there are no CUDA GPU devices available, the GPU code path is skipped and we only use the OpenMP or sequential implementation.

This patch also changed the `VecDH` implementation from `std::vector` + `thrust::universal_vector` to a custom vector implementation, which allows building vectors with uninitialized memory, performs `uninitialized_copy`/`uninitialized_fill` on GPU/CPU based on data size, and also performs memory prefetching to reduce the number of cache misses. This custom vector implementation is required because the old implementation caused tons of page faults due to the use of `thrust::uninitialized_fill(thrust::device, ...)` when initializing the vector, which slowed performance a lot.

Benchmark:
For the CPP and OMP backends, the difference is not very significant. Below is the result for CUDA:
Most importantly, the time required for small operations is reduced significantly.
TODO: