
Planetary impact simulation super slow with large dead time #22

Open
JingyaoDOU opened this issue Oct 27, 2021 · 43 comments

@JingyaoDOU

Dear SWIFT team,
I'm having a serious slowdown issue when trying to use swift_mpi to run planetary impact simulations on our HPC. The issue only happens when I use a specific number of nodes and CPUs (basically whenever I try to use 2 or 4 full nodes). I've listed some situations I've tested below:

  • HPC1 -----10^5 particles 2node56cpu----Failed
  • HPC1------10^5 particles 2node16cpu----good
  • HPC1------10^5 particles 1node28cpu----good
  • HPC2------10^5 particles 2node48cpu----good
  • HPC2------10^6 particles 4node96cpu----Failed
    The timestep plot of the HPC-2 10^6-particle simulation is shown below:
    [Image: bp_issue_65e5_4node96cpu]
    After some steps, the dead time becomes extremely large, taking up basically 98% of the CPU time of each step.

Here is my configure and running recipe:

./configure --with-hydro=planetary --with-equation-of-state=planetary --enable-compiler-warnings=yes 
--enable-ipo CC=icc MPICC=mpiicc --with-tbbmalloc --with-gravity=basic

My submit scripts look like this:

##HPC-2 10^6 particles 4node96cpu
#PBS -l select=4:ncpus=24:mpiprocs=2:ompthreads=12:mem=150gb

time mpirun -np 8 ./swift_mpi -a -v 1 -s -G -t 12 parameters_impact.yml 2>&1 | tee $SCRATCH/output_${PBS_JOBNAME}.log
##HPC-1 10^5 particles 2node56cpu
#SBATCH --cpus-per-task=14
#SBATCH --tasks-per-node=2
#SBATCH --nodes 2
#SBATCH --mem=120G 

time mpirun -np 4 ./swift_mpi_intel -a -v 1 -s -G -t 14 parameters_impact.yml 2>&1 | tee $SCRATCH/output_${SLURM_JOB_NAME}.log

parameters file and initial condition:
parameters yml file
Initial condition

The ParMETIS lib is currently unavailable on both HPCs, so I can't use that. Also, HPC-2 doesn't have a parallel HDF5 lib, but the same issue happened on HPC-1, which does have parallel HDF5 loaded.

Here below is the log file from 10^6 particles simulation on HPC2:
10^6 output log file
rank_cpu_balance.log
rank_memory_balance.log
task_level_0000_0.txt
timesteps_96.txt

I'm doing some not very serious benchmark tests with SWIFT and Gadget running planetary simulations. The plot below shows some results, which suggest SWIFT always runs a little slower than Gadget2 during the period of 1.5~8 h (simulation time). The gif shows the period where this sluggishness happens. It looks like SWIFT gets slower when the particle positions change dramatically. After this period the two planets have merged into a single one and don't behave as drastically as in the first stage. I was expecting SWIFT to always be faster during the simulation; do you have any suggestions on how I could improve the performance of the code?
[Images: benchmark2; ezgif com-gif-maker]

@JingyaoDOU JingyaoDOU changed the title Planetary impact imulation super slow with large dead time Planetary impact simulation super slow with large dead time Oct 27, 2021
@MatthieuSchaller
Member

Hi,

Glad to see that you are interested in our code and that you already have results.

This is a known limitation of the code in its current form. Specifically, the particles ejected from the system keep trying to search for neighbours and, by doing so, also slow down the neighbour search for everything else. A change to the neighbour-finding algorithm has been in the works for a while, but it's a non-trivial modification of some of the core algorithms.

Things that may help now:

  • MPI is only going to make things worse here. Your cases are small enough that they should easily fit on a single node or even a laptop. Anything bigger is just going to waste more CPU time and electricity here. At least until we have made the improvements hinted at above.
  • You can help the code a bit by reducing SPH:h_max in the parameter file. This sets the maximal smoothing length. Particles with h>h_max are capped and are effectively treated as just ballistic objects. Their density and other forces are effectively meaningless anyway.
    Capping the maximal h helps avoid the problematic situation described above.
    How small? Well, h_max also effectively sets the minimal density you are happy to resolve (1 particle in an h_max-radius sphere). See what level is acceptable for your science. (A minimal parameter-file sketch follows this list.)
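
For illustration, a minimal sketch of the relevant parameter-file block (the value here is a placeholder in internal units; pick it based on the minimal density you want to resolve):

SPH:
    h_max:    0.1    # placeholder: maximal allowed smoothing length (internal units)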

@JingyaoDOU
Author

Dear Matthieu,

Thank you so much for your advice; I'll try reducing h_max. Are you suggesting that an unreasonably large h_max is what causes the MPI version of the code to freeze? Sometimes, at some point, the code just enters a "frozen" state where it gives no error but doesn't stop either. And it happens at basically the same number of steps each time. The same IC and yml file work well if I use fewer CPUs or only one node. I first thought it had something to do with domain decomposition; since ParMETIS is not available for now, only the "grid" strategy can be used. I'm not sure if it's the poor domain decomposition method that freezes the code. Could you please give me some suggestions on this issue? Thank you very much!

@MatthieuSchaller
Member

Could you expand on what you see happening in these MPI runs?
Is the code hanging or progressing very slowly?

Note that parmetis will be unlikely to help significantly here.

@JingyaoDOU
Author

Normally it takes around 1 min to generate a snapshot, then at some point it suddenly begins to take 10 mins, then 30 mins, an hour, a whole night, and I hit the time limit. I guess it becomes super slow at some point and gets worse as the steps go on.

@jkeger
Member

jkeger commented Oct 29, 2021

Hi Jingyao,

Likewise it's great to hear that you're using the code, and it looks like maybe WoMa/SEAGen too? Are you working with Zoe Leinhardt perhaps? I've never met her but have had some email contact with others in the Bristol group, it's great that some other planetary impact work is happening in the UK!

And thank you for raising this issue, we're actively continuing to develop swift so it's helpful to have more than just our small planetary group within Durham testing the code in different situations.

Matthieu knows much more about the inner workings and performance of the code than me, but from my experience I'd guess there's a chance this is not related to h_max. I felt the large-h_max slowdown we've seen made sense because it kicked in soon after the initial impact -- right when particles start flying away from the initial planets and begin having to search very far away for neighbours if not corralled by h_max.

So the fact that this is happening at much later times surprises me. But I suppose there's no harm in testing it. I have less intuition for 10^5 particles, but perhaps 10% or even 5% of the planet radius would be a place to start? If that is the problem then I hope we aren't too far away from merging in the fix, and the "good" news is that SPH probably shouldn't be trusted when the density is low enough that neighbours are so far away, so if that change affects the science results then that could be a hint that the original resolution is too low regardless.

We very rarely run planetary simulations over MPI (mostly since cosma nodes at Durham have many cores so a single node is much more efficient in our case), so in that context I'd also not be surprised if there are other sources of slowdown (my guess might also be the decomposition like you, though I'm not an expert on that part of the code) that we haven't encountered ourselves.

We have also not done many speed tests with as few as 10^5 particles. I hope and suspect that swift will outperform gadget with higher resolution simulations.

I hope some of that long reply might help you or Matthieu! I'm currently in a US time zone getting started at a new post so may not be able to respond immediately, but I'll keep checking in when I can.

Best,

Jacob

@JingyaoDOU
Author

Dear Jacob,

Thank you very much for your kind reply! Yes, I'm Zoe's student and I'm using WoMa and SWIFT. Thanks to your brilliant code, my life has gotten much easier.

Sorry for the late reply; it took me some time to test h_max on our HPC. It turns out that scaling down h_max does fix the problem, and the code now runs well. I really appreciate the advice from you and Matthieu.

As you suggested, I was using too many CPUs for a very low resolution, which gives each cell very few particles and may cause the slowdown problem.

Again, many thanks for your patience and time!

@MatthieuSchaller
Member

That's great news. Glad you got it to work as intended.

Parameters are not always easy to figure out. :)

@jkeger
Member

jkeger commented Nov 1, 2021

Great news, and you're very welcome, it's great that other people are starting to make use of the codes -- and to help us find and resolve issues like this one!

Hopefully we'll soon be able to finish testing and merge the fixes that stop larger h_max values from slowing the code in the first place.

I can at least say from our tests that using a smaller value doesn't affect the major (resolution-appropriate) outcomes of simulations, as you'd intuitively hope. The issue would be trying to inspect something like the detailed thermodynamic states of low-density debris.

I can't always promise immediate responses, but do stay in touch and let us know if anything comes up, including requests for future features in either code -- and keep an eye out for ones that appear from time to time anyway :)

And of course especially being in the UK (or at least the rest of the Durham group besides me now!) we could always meet up to chat or collaborate, besides this occasional tech support haha. At any rate I'll hope to maybe meet you and Zoe et al at a conference some time.

@JingyaoDOU
Author

Dear all,
SWIFT gives me some weird results when I try to use a very large h_max. Since Gadget doesn't have this h_max limit, and I want the two simulations to be as close as possible, I set h_max in SWIFT as large as the box size, which means there will always be around 48 neighbours for each particle no matter how far it is from the others. As you can see from the plot, after some time gaps appear along the x=0 and y=0 axes, and the planet body is split into 4 parts.
I also did some other tests with h_max varying from 20 Earth radii (which is 0.2 of the box size) to 100 Earth radii. This kind of situation happened in all these tests. The box size was always 100 Earth radii and all the tests were done using only one node, so no MPI was involved. This seems to be related to the neighbour-finding method, but I'm not sure. Could you please tell me if there is any limit to h_max?
[Images: benchmark7; benchmark8; impact_05715_0468_5e4_9d7_055_100box_100h]

@MatthieuSchaller
Member

I don't know precisely what is going on here.

My initial guess is that by allowing a very large h_max, you effectively let h be as large as the code allows, which is 1/3 of the box size.
When h increases at the highest level of the tree, the code needs to rebuild the entire gravity tree and construct it from a new base. This is a very dramatic event. It never happens in normal simulations, so it is possible that we are hitting a bug in the code at this point.

Could you share your parameter file?

As a side note, Gadget does not have an h_max, but when the particle density becomes very low the calculation is wrong anyway (or rather leads to a breakdown of the method). So by setting an h_max you set a hard limit rather than possibly fooling yourself into thinking you have the resolution to get infinitely small densities.

@JingyaoDOU
Author

The parameter file and initial condition file are attached here; please have a look.

You are absolutely right about h_max; I didn't realize this problem until you mentioned it last time. Without an h_max limit, Gadget can indeed produce ridiculous smoothing lengths, and that should be avoided. In my current impact simulation, where I use around 8% of the largest planet radius as the value of h_max, after 20 h of simulation time around 10%~12% of particles hit this h_max limit, whereas all other particles have a reasonable smoothing-length distribution. I think a comparison between this small-h_max run and a reasonably-large-h_max run would help me identify whether the density error introduced by these 10% of particles greatly affects the final result.
A test with h_max equal to 1/5 of the box size still shows the x=0, y=0 gaps, while an h_max of 1/10 of the box size doesn't have this issue.

@MatthieuSchaller
Member

From the logs of the run, can you check whether the code performed a regriding?

@JingyaoDOU
Author

Sorry for the late reply; the previous simulation didn't have verbose output, so I ran it again. Below is the log for one step around the time when the gaps begin to appear, and it shows a space regrid.

 5991   2.431934e+04    1.0000000    0.0000000   8.789062e+00   43   43         5553         5553            0            0            0               223.613      1               164.881
[45684.9] engine_step: Writing step info to files took 0.011 ms
[45684.9] engine_step: Updating general quantities took 0.001 ms
[45684.9] engine_prepare: Communicating rebuild flag took 0.001 ms.
[45684.9] space_synchronize_particle_positions: took 0.643 ms.
[45684.9] engine_drift_all: took 9.250 ms.
[45684.9] scheduler_report_task_times: *** CPU time spent in different task categories:
[45684.9] scheduler_report_task_times: ***                drift:     0.33 ms (0.01 %)
[45684.9] scheduler_report_task_times: ***                sorts:    84.23 ms (1.44 %)
[45684.9] scheduler_report_task_times: ***               resort:     0.00 ms (0.00 %)
[45684.9] scheduler_report_task_times: ***                hydro:  1034.86 ms (17.74 %)
[45684.9] scheduler_report_task_times: ***              gravity:    90.39 ms (1.55 %)
[45684.9] scheduler_report_task_times: ***             feedback:     0.00 ms (0.00 %)
[45684.9] scheduler_report_task_times: ***          black holes:     0.00 ms (0.00 %)
[45684.9] scheduler_report_task_times: ***              cooling:     0.00 ms (0.00 %)
[45684.9] scheduler_report_task_times: ***       star formation:     0.00 ms (0.00 %)
[45684.9] scheduler_report_task_times: ***              limiter:     0.00 ms (0.00 %)
[45684.9] scheduler_report_task_times: ***                 sync:     0.00 ms (0.00 %)
[45684.9] scheduler_report_task_times: ***     time integration:     5.97 ms (0.10 %)
[45684.9] scheduler_report_task_times: ***                  mpi:     0.00 ms (0.00 %)
[45684.9] scheduler_report_task_times: ***                 pack:     0.00 ms (0.00 %)
[45684.9] scheduler_report_task_times: ***                  fof:     0.00 ms (0.00 %)
[45684.9] scheduler_report_task_times: ***               others:     0.00 ms (0.00 %)
[45684.9] scheduler_report_task_times: ***             neutrino:     0.00 ms (0.00 %)
[45684.9] scheduler_report_task_times: ***                 sink:     0.00 ms (0.00 %)
[45684.9] scheduler_report_task_times: ***                   RT:     0.00 ms (0.00 %)
[45684.9] scheduler_report_task_times: ***                 CSDS:     0.00 ms (0.00 %)
[45684.9] scheduler_report_task_times: ***            dead time:  4616.41 ms (79.15 %)
[45684.9] scheduler_report_task_times: ***                total:  5832.19 ms (100.00 %)
[45684.9] scheduler_report_task_times: took 0.236 ms.
[45684.9] space_regrid: h_max is 1.737e+01 (cell_min=7.071e+00).
[45684.9] space_free_cells: took 0.102 ms.
[45684.9] space_regrid: took 0.106 ms.
[45684.9] space_parts_get_cell_index: took 0.375 ms.
[45684.9] space_gparts_get_cell_index: took 0.288 ms.
[45684.9] space_rebuild: Moving non-local particles took 0.001 ms.
[45684.9] space_rebuild: Have 8 local top-level cells with particles (total=8)
[45684.9] space_rebuild: Have 8 local top-level cells (total=8)
[45684.9] space_rebuild: hooking up cells took 0.005 ms.
[45684.9] space_split: took 3.332 ms.
[45684.9] space_rebuild: took 4.567 ms.
[45684.9] engine_rebuild: Nr. of top-level cells: 8 Nr. of local cells: 1438 memory use: 1 MB.
[45684.9] engine_rebuild: Nr. of top-level mpoles: 8 Nr. of local mpoles: 1438 memory use: 0 MB.
[45684.9] engine_rebuild: Space has memory for 104034/104034/0/0/0 part/gpart/spart/sink/bpart (12/8/0/0/0 MB)
[45684.9] engine_rebuild: Space holds 104034/104034/0/0/0 part/gpart/spart/sink/bpart (fracs: 1.000000/1.000000/0.000000/0.000000/0.000000)
[45684.9] engine_rebuild: updating particle counts took 0.001 ms.
[45684.9] engine_estimate_nr_tasks: tasks per cell given as: 2.00, so maximum tasks: 2876
[45684.9] engine_maketasks: Making hydro tasks took 0.031 ms.
[45684.9] engine_maketasks: Making gravity tasks took 0.059 ms.
[45684.9] scheduler_splittasks: space_subsize_self_hydro= 32000
[45684.9] scheduler_splittasks: space_subsize_pair_hydro= 256000000
[45684.9] scheduler_splittasks: space_subsize_self_stars= 32000
[45684.9] scheduler_splittasks: space_subsize_pair_stars= 256000000
[45684.9] scheduler_splittasks: space_subsize_self_grav= 32000
[45684.9] scheduler_splittasks: space_subsize_pair_grav= 256000000
[45684.9] engine_maketasks: Splitting tasks took 0.067 ms.
[45684.9] engine_maketasks: Counting and linking tasks took 0.083 ms.
[45684.9] engine_maketasks: Setting super-pointers took 0.077 ms.
[45684.9] engine_maketasks: Making extra hydroloop tasks took 0.029 ms.
[45684.9] engine_maketasks: Linking gravity tasks took 0.020 ms.
[45684.9] engine_maketasks: Nr. of tasks: 853 allocated tasks: 2876 ratio: 0.296593 memory use: 0 MB.
[45684.9] engine_maketasks: Nr. of links: 136 allocated links: 880 ratio: 0.154545 memory use: 0 MB.
[45684.9] engine_maketasks: Actual usage: tasks/cell: 0.593185 links/task: 0.159437
[45684.9] engine_maketasks: Setting unlocks took 0.026 ms.
[45684.9] engine_maketasks: Ranking the tasks took 0.014 ms.
[45684.9] scheduler_reweight: took 0.034 ms.
[45684.9] engine_maketasks: took 0.809 ms (including reweight).
[45684.9] space_list_useful_top_level_cells: Have 8 local top-level cells with tasks (total=8)
[45684.9] space_list_useful_top_level_cells: Have 8 top-level cells with particles (total=8)
[45684.9] space_list_useful_top_level_cells: took 0.005 ms.
[45684.9] engine_marktasks: took 0.615 ms.
[45684.9] engine_print_task_counts: Total = 853 (per cell = 0.59)
[45684.9] engine_print_task_counts: task counts are [ none=0 sort=8 self=14 pair=40 sub_self=10 sub_pair=16 init_grav=8 init_grav_out=40 ghost_in=8 ghost=487 ghost_out=8 extra_ghost=0 drift_part=8 drift_spart=0 drift_sink=0 drift_bpart=0 drift_gpart=8 drift_gpart_out=40 end_hydro_force=8 kick1=8 kick2=8 timestep=8 timestep_limiter=0 timestep_sync=0 send=0 recv=0 pack=0 unpack=0 grav_long_range=8 grav_mm=0 grav_down_in=40 grav_down=8 grav_end_force=8 cooling=0 cooling_in=0 cooling_out=0 star_formation=0 star_formation_in=0 star_formation_out=0 star_formation_sink=0 csds=0 stars_in=0 stars_out=0 stars_ghost_in=0 stars_density_ghost=0 stars_ghost_out=0 stars_prep_ghost1=0 hydro_prep_ghost1=0 stars_prep_ghost2=0 stars_sort=0 stars_resort=0 bh_in=0 bh_out=0 bh_density_ghost=0 bh_swallow_ghost1=0 bh_swallow_ghost2=0 bh_swallow_ghost3=0 fof_self=0 fof_pair=0 neutrino_weight=0 sink_in=0 sink_ghost=0 sink_out=0 rt_in=0 rt_out=0 sink_formation=0 rt_ghost1=0 rt_ghost2=0 rt_transport_out=0 rt_tchem=0 skipped=62 ]
[45684.9] engine_print_task_counts: nr_parts = 104034.
[45684.9] engine_print_task_counts: nr_gparts = 104034.
[45684.9] engine_print_task_counts: nr_sink = 0.
[45684.9] engine_print_task_counts: nr_sparts = 0.
[45684.9] engine_print_task_counts: nr_bparts = 0.
[45684.9] engine_print_task_counts: took 0.123 ms.
[45684.9] engine_rebuild: took 6.377 ms.
[45684.9] engine_prepare: took 15.631 ms (including unskip, rebuild and reweight).
[45686.7] engine_launch: (tasks) took 1768.267 ms.
[45686.7] engine_collect_end_of_step: took 0.111 ms.

@MatthieuSchaller
Member

MatthieuSchaller commented Nov 30, 2021

Ok, so that's what I was worried about. It did indeed regrid all the way down to the minimum possible 2x2x2 top-level grid.

That's a rather unusual scenario, and there are two things I am not sure we are ready for:

  • Such a small grid
  • The regriding itself.

You could try setting h_max to 1/6 of the box size, and then set Scheduler:max_top_level_cells to 3. In this way you start the simulation already on the smallest possible grid (so no regriding), and h_max is set such that no regriding happens later on.

This may be slower but that's all I can suggest for now before we have time to have a deeper look at the gravity solver behaviour when regriding.

You can also make the box size larger such that 1/6 of it is as large as you need h_max to be.
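
For illustration, the two parameter entries involved might look like this (values are placeholders assuming a 100 R_earth box in internal units of R_earth; adjust to your own box size and unit system):

SPH:
    h_max:                 16.7    # placeholder: ~1/6 of the box size (internal units)

Scheduler:
    max_top_level_cells:   3       # start on the smallest usable top-level grid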

@JingyaoDOU
Author

Thank you very much for the advice, I will give it a try.

  • I'm a little confused about the regrid and rebuild processes here. It seems regriding happened at the very beginning of the simulation but with a large number of local top-level cells. As time goes on, the number of local top-level cells becomes smaller and smaller and finally hits 8 at some point. Does "regrid" here mean regenerating the neighbour tree and "rebuild" mean redistributing the particles in each cell?

  • When running SWIFT with a reasonably small h_max, how large should I set max_top_level_cells? It seems from Jacob's GitLab issue that a larger max_top_level_cells will speed up the simulation. Right now, I just set this number equal to the number of CPUs per chip, for no particular reason.

  • Also, could you please tell me the units of h_max and cell_min in the log file; are they in internal units or something else? The value is not always the one I set initially in the parameter file.

Thank you very much for the help.

@MatthieuSchaller
Member

Rebuild == normal tree construction. Happens all the time and is safe.

Regrid == much more severe change. This is a procedure where we change the size of the base grid used. The tree is built on top of this base grid. This never happens in normal simulations.

The base grid cells cannot be smaller than 2 * max(h). So if you let h grow without limit, at some point the code has to create a new, coarser base grid.
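
As a rough worked example (a sketch only, not SWIFT's exact formula, which also involves the kernel support and a small stretch factor), this rule already explains the 8-cell grid in the log above:

# Sketch: estimate the top-level grid dimension from the box size and the
# largest smoothing length, assuming cells cannot be smaller than ~2*max(h).
# Not SWIFT's exact formula; illustration only.
import math

box_size = 100.0   # assumed: a 100 R_earth box, in internal units of R_earth
h_max = 17.37      # max(h) reported by space_regrid in the log above

cells_per_dim = max(2, math.floor(box_size / (2.0 * h_max)))
print(cells_per_dim, cells_per_dim ** 3)   # -> 2 8, i.e. the minimal 2x2x2 grid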


Scaling the number of cells with the number of CPUs makes no sense. I'd make it 16x16x16 and only make it larger if the gravity solver gets very slow.


h_max is in internal units. i.e. whatever unit system you set in the parameter file.
cell_min is also in internal units.

cell_min is not a parameter.

@JingyaoDOU
Author

Really appreciate your help, it's much clearer now! Thank you again for your time and patience.

@JingyaoDOU
Author

Dear Matthieu and Jacob,
Yes, it's me again :), sorry to bother you before the holiday. Can I ask about two simple things I recently found confusing?

  • Following Matthieu's advice, I turned off the regriding and used some very large box sizes with h_max set to 1/6 of the box size. To my surprise, with a box size of 100 Earth radii it took almost 3 days to finish the impact simulation (1e5 particles in total), while the simulation with a 900 Earth radii box only takes about 5 h to finish. Some test results can be seen below:

    boxsize / R_earth   100    300     600    900     1200
    1node28cpu          64h    >15h    10h    4.9h    4.15h
    Particle removed    True   True    True   False   False

    The final results, though, don't differ much, and all tests were done on the same HPC with 1 node.
    Do you have any idea why this happens? Does it have something to do with the removed particles?

  • I'm currently trying to cool several planetesimals with 1e7 particles generated by WoMa. Sometimes the code hangs at the beginning and, after a relatively long time, raises an error: scheduler.c:scheduler_write_task_level():2607: Cell is too deep, you need to increase max_depth. This does not happen if I slightly change the radius profile when generating the planetesimal snapshot. My guess is that some profiles place too many particles around the same location and SWIFT doesn't like this. In this circumstance, do you think I should go into the source code of scheduler.c and change max_depth (which is 30 in the source code), or should I change cell_split_size or cell_subdepth_diff_grav in the parameter file?

Thank you very much for your time and help.

Happy Christmas!

@MatthieuSchaller
Member

Can you tell me what the max_top_level_cells values are for your five runs?

Also, 30 levels deep seems implausible given how few particles you have. The part of the code where you get the error is only used for diagnostics, but it is a very bad sign if you need that many levels in the tree.
I would think that you have many particles at exactly the same position in your ICs.
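
One quick way to check for that (a minimal sketch, assuming the ICs follow the usual SWIFT/GADGET-style HDF5 layout with gas positions in PartType0/Coordinates; the filename is hypothetical):

# Count particle positions that occur more than once in an HDF5 IC file.
import h5py
import numpy as np

with h5py.File("planetesimal_1e7.hdf5", "r") as f:   # hypothetical filename
    pos = f["PartType0/Coordinates"][:]

unique_pos, counts = np.unique(pos, axis=0, return_counts=True)
print(np.sum(counts > 1), "positions are shared by more than one particle")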

@JingyaoDOU
Author

Hi Matthieu, I use 3 for max_top_level_cells, as you suggested before, to turn off the regriding. h_max is around 1/6 of the box size: I use 49 R_earth for the 300 R_earth box and 99 R_earth for the 600 R_earth box, so it's not exactly 1/6.

It could be that WoMa somehow randomly placed too many particles at the same position, since the warning didn't always show. Do you think I should change max_depth in the source code to some smaller value, since I only use SWIFT for planetary impact simulations and the maximum number of particles won't exceed 1e8?

@MatthieuSchaller
Member

No, I would not change the max depth. It is a sign that something is not right with the ICs. I would have a look at them to make sure everything is sensible.

@JingyaoDOU
Author

Thank you very much for the quick reply. Do you have any idea why the simulation time would vary with boxsize?

@MatthieuSchaller
Member

Not yet. Likely the neighbour finding being silly in some configurations. All these setups are very far from the normal way we operate.

@JingyaoDOU
Author

No worries, these are just some extreme tests; we basically won't use such a large h_max and box size in real simulations. We just want to understand under what circumstances SWIFT will give us trustworthy results for planetary simulations.

@MatthieuSchaller
Member

I'll have a more detailed look in the new year.

@MatthieuSchaller
Member

Regarding #22 (comment), could you try your test once more, but using the code branch fix_small_range_gravity?

@JingyaoDOU
Author

Hi Matthieu, thank you for the fixed version of the code. I have tried the new branch with h_max equal to the box size and max_top_level_cells=16; it appears that the gaps around x=0 and y=0 still show up.

@MatthieuSchaller
Member

Could you verify whether the latest version of the code in master fixes the issue?

@JingyaoDOU
Author

Sorry for the late reply; the HPC is very busy these days.

The issue is fixed using the new master branch. The gaps around x=0 and y=0 disappear, and the simulation result with the large h_max is similar to that with the smaller ones. The simulation time is still very long for the large-h_max run.

@MatthieuSchaller
Member

Great. Thanks for confirming. At least the bug is out. We'll keep working on the speed.

@Francyrad

MPI is only going to make things worse here. Your cases are small enough that they should easily fit on a single node or even a laptop.

@MatthieuSchaller do you suggest using just ./swift instead of mpirun for a 64-core, 128-thread machine (single node)? Thank you in advance for the answer.

@MatthieuSchaller
Member

Yes, most definitely. Don't use the mpi version unless you actually need to use more than one node.
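
For example, a single-node threaded run might look like this (thread count and file names are placeholders; the flags are the ones used earlier in this thread):

./swift -a -s -G -t 64 parameters_impact.yml 2>&1 | tee output.log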

@Francyrad

I'll take the opportunity to ask one more question:

I compiled with:

./configure --enable-compiler-warnings --with-tbbmalloc --with-parmetis=/home/francesco-radica/Documenti/parmetis-4.0.3 --with-hydro=planetary --with-equation-of-state=planetary

I'm running my simulation with the command:

../../../swift -s -G -t 128 earth_impact.yml 2>&1 | tee output.log

and on my system I have:

swift_parameters: -s -G
swift_threads: 128

My CPU has 64 cores and 128 threads in total. The point is that, according to the task manager, it's not working at 100% (while my M1 Pro worked at 100% every time). This doesn't happen with mpirun (for other programs), where I can clearly see that it is insanely faster.

[Image: task manager screenshot]

The usage fluctuates (and I also think that, proportionally, it is not as fast as it should be compared to my laptop (M1 Pro)). It's surely faster, but compared with the old CPU it should be something like 10x faster at least.
My question is whether I'm doing something wrong with the setup and commands.
Thank you in advance for the answer.

@mladenivkovic
Contributor

Hi

Maybe try running without hyperthreading, i.e. using a single thread per core. IIRC in most situations our code is memory bound, so hyperthreading shouldn't do it any favours in terms of speed. On the contrary, I'd expect a performance penalty.

When you say "insanely faster", do you mean in terms of how fast the simulation advances, or in terms of how much CPU usage your system reports?

@Francyrad

Francyrad commented Apr 17, 2024

Thank you for the answer. When I say insanely faster I mean how fast the simulation advances.
I tried running with 64 threads instead of 128 threads and the simulation was slower.

Should I try a lower number (10, 20, 30?)? Should I just compile with planetary hydro and the planetary EoS, without using tbbmalloc and ParMETIS? I also have enough RAM to handle the whole thing.

The point is that I'm a little confused, because the COSMA supercomputer, which uses thousands of cores, can handle thousands of simulations with 10 million particles each, so I would expect that using more threads and cores simply means more speed.

I'll wait for suggestions and explanations; thank you in advance!

Francesco

@mladenivkovic
Contributor

mladenivkovic commented Apr 17, 2024

I did a try running with 64 threads instaead of 128 threads and the simulation was slower.

Have you made sure you were running on 64 cores?

idk, should I try a slower number (10-20-30?),

No, not lower. One thread per core is optimal.

should i just compile with planetary planetary eos and without using tbb and parametis?

tbbmalloc should be an improvement. Metis and parmetis are only used when running with MPI, so in your case, it shouldn't matter.

I should expect that using more threads and cores means just more speed.

Neglecting MPI effects, that should indeed be the case. Unless we have some strange bottleneck that I'm not aware of.
By the way, what hardware are you running on?

One more thing that comes to mind that might help is to run with pinned threads. You can activate that with the -a or --pin flag.

@Francyrad

I'm doing a run with the command:

../../../swift -s -G -t 64 -v 2 simulation.yml 2>&1 | tee output.log

and:

input:

  • simulation.yml

output:

  • simulation.pdf

swift_parameters: -s -G
swift_threads: 64

This is the task manager:

[Image: task manager screenshot]

and it says that I'm running at around 12% of CPU.

It's also creating 1 snapshot every minute, like the 128-thread run (maybe because the collision hasn't started yet? I don't know...).

I'm running on a Threadripper 7980X and I have 200 GB of RAM.

Tomorrow I'll do a run with --pin and 128 threads, I promise.

@MatthieuSchaller
Member

How many particles do you have?

@Francyrad

Francyrad commented Apr 18, 2024

One more thing that comes to mind that might help is to run with pinned threads. You can activate that with the -a or --pin flag.

I'm running with the command:
../../../swift -s -G -t 128 -v 2 --pin simulation.yml 2>&1 | tee output.log

The speed seems the same, no matter whether I use 64 or 128 threads...

How many particles do you have?

I have 1.5 million particles.

Also, the simulation is very fast at first; it starts to slow down immediately after the collision and then becomes faster again once the system has settled.

units:

InternalUnitSystem:
    UnitMass_in_cgs:        1e27        # Sets Earth mass = 5.972
    UnitLength_in_cgs:      1e8         # Sets Earth radius = 6.371
    UnitVelocity_in_cgs:    1e8         # Sets time in seconds
    UnitCurrent_in_cgs:     1           # Amperes
    UnitTemp_in_cgs:        1           # Kelvin

# Parameters for the hydrodynamics scheme
SPH:
    resolution_eta:     1.2348          # Target smoothing length in units of the mean inter-particle separation (1.2348 == 48Ngbs with the cubic spline kernel).
    delta_neighbours:   0.1             # The tolerance for the targeted number of neighbours.
    CFL_condition:      0.2             # Courant-Friedrich-Levy condition for time integration.
    h_max:              1.2          # Maximal allowed smoothing length (in internal units).
    viscosity_alpha:    1.5             # Override for the initial value of the artificial viscosity.

# Parameters for the self-gravity scheme
Gravity:
    eta:                            0.025       # Constant dimensionless multiplier for time integration.
    MAC:                            adaptive    # Choice of multipole acceptance criterion: 'adaptive' OR 'geometric'.
    epsilon_fmm:                    0.001       # Tolerance parameter for the adaptive multipole acceptance criterion.
    theta_cr:                       0.5         # Opening angle for the purely geometric criterion.
    max_physical_baryon_softening:  0.05        # Physical softening length (in internal units).

# Parameters for the task scheduling
Scheduler:
    max_top_level_cells:    16        # Maximal number of top-level cells in any dimension. The nu

@MatthieuSchaller
Member

Ok, that's not many particles for this number of cores. At some point you won't have enough particles per core to make good use of the extra resources.

When it slows down around the collision, is it because the time-step size drops?

@Francyrad

Ok, that's not many particles for this number of cores. At some point you won't have enough particles / core to make good use of the extra resources.

Do you mean that if I use a larger number of particles the speed won't collapse?

Here is some info about the timesteps. Do you need anything else?
timesteps.txt

@MatthieuSchaller
Member

no, I mainly mean that for a fixed number of particles, there is always a point beyond which using more cores won't help.

From this file, it looks like it's taking ~2s per step consistently so not bad I'd think.

I would use the Intel compiler btw. That will speed things up.

@Francyrad

Thank you for the suggestion. I installed the Intel compilers and it configured successfully:

./configure --with-tbbmalloc --with-hydro=planetary --with-equation-of-state=planetary CC=icx CXX=icpc

After make, I get this error:

libtool: link: icx -I./src -I./argparse -I/usr/include -I/usr/include/hdf5/serial "-DENGINE_POLICY=engine_policy_keep | engine_policy_setaffinity" -ipo -O3 -ansi-alias -march=skylake-avx512 -fma -ftz -fomit-frame-pointer -axCORE-AVX512 -mavx512vbmi -Qunused-arguments -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free -Wall -Wextra -Wno-unused-parameter -Wshadow -Werror -Wstrict-prototypes -ipo -o swift swift-swift.o  -L/usr/lib/x86_64-linux-gnu/hdf5/serial src/.libs/libswiftsim.a argparse/.libs/libargparse.a -lgsl -lgslcblas -lhdf5 -lcrypto -lcurl -lsz -lz -ldl -lfftw3_threads -lfftw3 -lnuma -ltbbmalloc_proxy -ltbbmalloc -lpthread -lm -pthread
/usr/bin/ld: src/.libs/libswiftsim.a: error adding symbols: archive has no index; run ranlib to add one
icx: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [Makefile:710: swift] Errore 1
make[2]: uscita dalla directory «/home/francesco-radica/Documenti/swiftsim»
make[1]: *** [Makefile:829: all-recursive] Errore 1
make[1]: uscita dalla directory «/home/francesco-radica/Documenti/swiftsim»
make: *** [Makefile:598: all] Errore 2

I ran ranlib libswiftsim.a in the correct folder (.libs), then make clean and make again. It doesn't work; honestly, I have no idea how to solve this...

Thank you for the suggestions and help. Do you think that Intel's compilers make every simulator that uses OpenMPI faster in general, or is it just a SWIFT-specific thing?

boson112358 referenced this issue in boson112358/SWIFT Oct 10, 2024