Parallel assembly in JuliaFEM #219
Would it be possible to implement parallel assembly at a higher level than inside each problem? Here is the point where assembling starts for each problem. So if we divide:

```julia
for (element_type, elements) in group_by_element_type(elements)
    elements_by_color = group_by_color(elements) # or something similar
    @threads for elements in elements_by_color
        assemble_elements!(problem, assembly, elements, time)
    end
end
```
Doing it at a higher level would be desirable, yes. In the loop there though, aren't different threads working on different colors at the same time? The threaded loop has to be over a set of elements with the same color. I think a complication now is that each problem includes the element loop in its assembly:

```julia
function assemble!(problem)
    buf = allocate_local_buffers(problem)
    for ele in elements
        # ... assemble for each element ...
    end
end
```

Instead, I think it would be better if a problem defined two things: its local buffers, and how to assemble a single element. The code you linked would then be written (in parallel) as:

```julia
for (element_type, elements) in group_by_element_type(elements)
    local_buffers = [ProblemLocalBuffers(problem, element_type) for i in 1:Threads.nthreads()]
    elements_by_color = group_by_color(elements) # or something similar
    for elements_with_one_color in elements_by_color
        Threads.@threads for element in elements_with_one_color
            assemble_element!(problem, assembly, element, time, local_buffers[Threads.threadid()])
        end
    end
end
```
The structure above is quite similar to how the threaded JuAFEM assembly is done:

```julia
fill!(K, 0.0)
fill!(f, 0.0)
for color in colors
    # Each color is safe to assemble threaded
    Threads.@threads for i in 1:length(color)
        scratchvalue = get_scratchvalue!(scratchvalues, K, f, dh)
        element = color[i]
        assemble_cell!(scratchvalue, element, K, dh, prob_data, quad_data, u)
    end
end
```
First, we actually had a function to assemble only one element at a time, but because we need to allocate workspace to reduce memory usage, we moved to the strategy of assembling all elements of the same type at once. However, there are situations where we cannot assemble only one element at a time, because we need certain information about the geometry around the element (e.g. in contact mechanics, finding the segmentation before assembly). We could store that information beforehand. The safest option would be to still support assembling all elements at once alongside the per-element interface.
Hi! I am a user of FEMSparse and it works quite well on a single core. But it seems that the package currently does not work with multiple threads. I am wondering if it is necessary to modify it by assembling n sparse matrices and adding them together at the end.
What kind of problems are you experiencing? If you create a sparsity pattern first and then assemble in parallel using coloring, it should work.
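A minimal sketch of the "sparsity pattern first" step, assuming a simple connectivity format where each element is given as a vector of its global dof numbers (`create_sparsity_pattern` here is a hypothetical helper, not the FEMSparse API):

```julia
using SparseArrays

# Build the full sparsity pattern up front from the element connectivity.
# The stored zeros mark every location assembly may later write to, so
# parallel assembly never has to insert new structural entries.
function create_sparsity_pattern(element_dofs::Vector{Vector{Int}}, ndofs::Int)
    I = Int[]; J = Int[]
    for dofs in element_dofs
        for i in dofs, j in dofs
            push!(I, i); push!(J, j)
        end
    end
    return sparse(I, J, zeros(length(I)), ndofs, ndofs)
end

K = create_sparsity_pattern([[1, 2], [2, 3]], 3)
# K now has stored (zero-valued) entries at every (i, j) touched by an element.
```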
I checked my code and fixed it. I used to store both Kglobal and fglobal in a mutable struct, but now I store them in two individual arrays and there is no problem. Thank you for your reply!
Ok, great. If you get some results on how well your code scales when you increase the number of threads, I would be interested to see them.
I just noticed that it only worked because I did not use the multi-threading macro :( Now the problem occurs again:
...Finally, I found the reason it cannot assemble in parallel. I create an individual assembler for each thread and add the assembler.K matrices together. However, the add operation for SparseMatrixCSC seems to drop the zero entries, so in the next iteration FEMSparse cannot work at the locations that have not been defined... I have rewritten a function for the sparse matrix addition and it now seems to work again.
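One way such an addition can be sketched so that it never drops stored entries, assuming `A`'s sparsity pattern is a superset of `B`'s (which holds when all per-thread assemblers share the same pattern):

```julia
using SparseArrays

# Add B into A in place without touching A's sparsity pattern. Since every
# entry of B already exists as a stored entry in A, no structural entries
# are created or dropped, even when a sum happens to be zero.
function add_preserving_pattern!(A::SparseMatrixCSC, B::SparseMatrixCSC)
    rows = rowvals(B)
    vals = nonzeros(B)
    for j in 1:size(B, 2)
        for k in nzrange(B, j)
            A[rows[k], j] += vals[k]
        end
    end
    return A
end
```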
The following is a roadmap/outline for how threaded assembly could start to be implemented in JuliaFEM. The scope is limited to shared memory parallel assembly. It will be continuously updated.
Sample assembly code
In order to make it easier to exemplify things, here is some sample code for an assembly routine taken from JuliaFEM:
Goal
The goal of parallel assembly is, of course, to exploit all threads on the computer to speed up the assembly. Linear solvers are in practice always able to use multiple threads, so if the assembly process is serial, it might become a bottleneck in the FEM simulation. There are two main difficulties with parallel assembly: getting it correct and getting it fast.
Correctness
As always with parallel code, one has to be careful not to introduce data races, where two threads write to the same location at the same time. This means that every write operation must be audited to see if there is potential for another thread to write to that location in the same threaded loop. In the assembly routine above we have writes to two types of data structures, which I call "local buffers" and "global buffers".
Local buffers
Local buffers are small enough in size that allocating a copy of them on each thread is not prohibitively expensive. In the code above these would be `bi`, `Ke` and `fe`. For a parallel assembly these would be allocated once per thread. To simplify things, I would suggest that each problem defines a struct holding its local buffers. We could then generate buffers for all threads as an array with one buffer per thread.
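For example (a hypothetical sketch; the field names and sizes are assumptions, not JuliaFEM API):

```julia
# Per-problem local buffers, one instance per thread.
struct HeatAssemblyBuffer
    bi::Vector{Float64}   # basis function values
    Ke::Matrix{Float64}   # local stiffness matrix
    fe::Vector{Float64}   # local force vector
end

HeatAssemblyBuffer(ndofs::Int) =
    HeatAssemblyBuffer(zeros(ndofs), zeros(ndofs, ndofs), zeros(ndofs))

# One buffer per thread; each thread only ever writes to its own copy.
buffers = [HeatAssemblyBuffer(4) for _ in 1:Threads.nthreads()]
```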
If each thread uses its own `HeatAssemblyBuffer`, then there should be no data race when writing to local buffers.

Global buffers
Global buffers are data structures that are large enough that we don't want to allocate them on every thread. These include the global stiffness matrix and the global "force" vector. If we do a parallel assembly naively by just threading the loop over the elements, the data race is obvious: two threads working on elements that share degrees of freedom will try to e.g. assemble into the global force vector at the same location at the same time. The solution is to make sure that this never happens.
One way of solving the above problem is "mesh coloring". By assigning a "color" to each element such that no two elements with the same color share degrees of freedom, we can assemble all elements of one color in parallel. We then wait for all threads to finish before moving on to the next color. There are many algorithms for coloring meshes, but to start with, a greedy one that goes element by element and takes the first allowed color should be good enough. An implementation of this can be found at https://github.com/KristofferC/JuAFEM.jl/blob/master/src/Grid/coloring.jl.
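A minimal sketch of such a greedy coloring, assuming each element is represented simply as a vector of its node (or dof) ids:

```julia
# Two elements conflict if they share a node; conflicting elements must
# end up in different colors so they are never assembled concurrently.
function greedy_color(elements::Vector{Vector{Int}})
    colors = zeros(Int, length(elements))
    used_by_node = Dict{Int, Set{Int}}()  # colors already used at each node
    for (i, nodes) in enumerate(elements)
        forbidden = Set{Int}()
        for n in nodes
            union!(forbidden, get(used_by_node, n, Set{Int}()))
        end
        c = 1
        while c in forbidden
            c += 1
        end
        colors[i] = c
        for n in nodes
            push!(get!(used_by_node, n, Set{Int}()), c)
        end
    end
    return colors
end

# Elements 1 and 2 share node 2 and get different colors; element 3 is
# independent of element 1 and can reuse its color.
greedy_color([[1, 2], [2, 3], [4, 5]])  # -> [1, 2, 1]
```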
Another way to solve the problem is to not assemble into a global structure, but to have e.g. a COO structure for the global stiffness that each thread pushes into, and then, at the end, sum all of those together. This is, however, quite likely bad for performance.
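A sketch of that alternative, with hypothetical `element_dofs`/`element_matrix` callbacks standing in for the real per-element routines (`sparse` sums duplicate `(i, j)` entries, which performs the final accumulation):

```julia
using SparseArrays, Base.Threads

# Each thread pushes triplets into its own COO buffers, so there is no
# write sharing; a single sparse() call at the end merges everything.
function assemble_coo_threaded(nelem, ndofs, element_dofs, element_matrix)
    Is = [Int[] for _ in 1:nthreads()]
    Js = [Int[] for _ in 1:nthreads()]
    Vs = [Float64[] for _ in 1:nthreads()]
    @threads for e in 1:nelem
        tid = threadid()
        dofs = element_dofs(e)
        Ke = element_matrix(e)
        for (a, i) in enumerate(dofs), (b, j) in enumerate(dofs)
            push!(Is[tid], i); push!(Js[tid], j); push!(Vs[tid], Ke[a, b])
        end
    end
    # Duplicate (i, j) pairs are summed, both within and across threads.
    return sparse(reduce(vcat, Is), reduce(vcat, Js), reduce(vcat, Vs), ndofs, ndofs)
end
```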
Performance
Allocations: local buffers
Julia has a "stop the world" GC, which means that every time the GC runs, all threads are stopped and wait for it to finish. If the assembly loop allocates a lot and we run multiple threads, the GC will have to run quite often, likely removing much of the speedup from using multiple threads. It is therefore very important that the loop over the elements doesn't allocate. I would argue it should allocate nothing at all.
As an example, even if we rewrite the relevant line in a more efficient form, it still needs to allocate a temporary for `(dN' * dN)`. There are some solutions to this. I would suggest trying to do point 1 or 2, because having to allocate so many in-place buffers can make the code hard to read, and static data structures fit this problem well.
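As a sketch of the static-data-structures option (assumed element shapes; using the StaticArrays package, which is a separate dependency, not part of JuliaFEM itself):

```julia
using StaticArrays

# With plain Arrays, dN' * dN allocates a fresh temporary matrix on every
# call. With StaticArrays the sizes are known at compile time and the
# product lives on the stack, so the hot loop allocates nothing.
dN = @SMatrix rand(2, 4)      # shape-function gradients: 2D, 4-node element
Ke = @SMatrix zeros(4, 4)
Ke = Ke + dN' * dN            # stack-allocated; no GC pressure
```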
Allocations: global buffers
Using the coloring technique to assemble directly into a CSC sparse matrix means that there is zero allocation in this assembly step, and the same matrix can be reused for all time steps. Using the method where each thread has its own COO matrix to assemble into, we need to convert to CSC and sum all the contributions together on every assembly step, which will likely be expensive and require a lot of memory.