Single node performance

Performance of base Newtonian solver configurations

Typical performance figures for Athena++ are presented in Tables 1 and 2 for serial single-core and MPI-parallelized full-node runs on several target architectures; all figures are reported in millions of zone-cycles per second (MZone-cycles/sec). The default second-order temporally accurate predictor-corrector scheme (time/integrator=vl2) was used in all cases, while the reconstruction method of the corrector step was set to either PLM (time/xorder=2) or PPM (time/xorder=3). Several Riemann solvers were tested.
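
These settings live in the <time> block of the input file; the snippet below is a minimal sketch of that block, assuming the shock-tube input files shipped with the code (only integrator and xorder correspond to the settings quoted above; the other parameter values are illustrative placeholders). The Riemann solver is chosen at configure time rather than at runtime (see the example configure invocation further below).

```
# Minimal sketch of the <time> input block controlling the integrator and
# reconstruction order.  Only integrator and xorder match the settings above;
# cfl_number and tlim are illustrative placeholders.
<time>
cfl_number = 0.3    # Courant-Friedrichs-Lewy number
integrator = vl2    # second-order van Leer predictor-corrector (default)
xorder     = 2      # 2 = PLM reconstruction, 3 = PPM reconstruction
tlim       = 0.25   # simulation stop time
```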

These figures serve as the benchmark values for the basic hydrodynamics and magnetohydrodynamics solver capabilities of the code; the performance of optional functionality such as non-Cartesian and/or nonuniform Coordinate Systems and Meshes, Special Relativity, General Relativity, Shearing Box, Self Gravity with FFT, etc. should be measured relative to the values below.

Table 1: Single-core performance (MZone-cycles/sec)

| Physics | Problem | Reconstruction | Riemann solver | Xeon Phi KNL 7210 | Broadwell E5-2680 v4 | Skylake-SP Gold 6148 |
|---------|---------|----------------|----------------|-------------------|----------------------|----------------------|
| Hydro   | Sod     | PLM            | HLLC           | 1.533             | 2.730                | 4.503                |
| Hydro   | Sod     | PLM            | HLLE           | 1.618             | 2.868                | 4.880                |
| Hydro   | Sod     | PLM            | Roe            | 1.555             | 2.872                | 4.654                |
| Hydro   | Sod     | PPM            | HLLC           | 0.752             | 1.336                | 2.411                |
| Hydro   | Sod     | PPM            | HLLE           | 0.762             | 1.365                | 2.528                |
| Hydro   | Sod     | PPM            | Roe            | 0.762             | 1.361                | 2.424                |
| MHD     | Brio-Wu | PLM            | HLLD           | 0.705             | 1.340                | 2.403                |
| MHD     | Brio-Wu | PLM            | HLLE           | 0.803             | 1.406                | 2.307                |
| MHD     | Brio-Wu | PLM            | Roe            | 0.649             | 1.143                | 1.921                |
| MHD     | Brio-Wu | PPM            | HLLD           | 0.392             | 0.719                | 1.291                |
| MHD     | Brio-Wu | PPM            | HLLE           | 0.419             | 0.749                | 1.259                |
| MHD     | Brio-Wu | PPM            | Roe            | 0.373             | 0.666                | 1.119                |
Table 2: Full-node performance (MZone-cycles/sec)

| Physics | Problem | Reconstruction | Riemann solver | Xeon Phi KNL 7210 | (2x) Broadwell E5-2680 v4 | (2x) Skylake-SP Gold 6148 |
|---------|---------|----------------|----------------|-------------------|---------------------------|---------------------------|
| Hydro   | Sod     | PLM            | HLLC           | 66.908            | 29.444                    | 40.750                    |
| Hydro   | Sod     | PLM            | HLLE           | 67.405            | 29.440                    | 40.764                    |
| Hydro   | Sod     | PLM            | Roe            | 67.094            | 29.418                    | 40.820                    |
| Hydro   | Sod     | PPM            | HLLC           | 40.279            | 20.269                    | 32.248                    |
| Hydro   | Sod     | PPM            | HLLE           | 40.182            | 20.328                    | 32.403                    |
| Hydro   | Sod     | PPM            | Roe            | 40.196            | 20.336                    | 32.234                    |
| MHD     | Brio-Wu | PLM            | HLLD           | 30.886            | 16.244                    | 22.711                    |
| MHD     | Brio-Wu | PLM            | HLLE           | 32.526            | 16.483                    | 22.757                    |
| MHD     | Brio-Wu | PLM            | Roe            | 29.145            | 15.140                    | 22.673                    |
| MHD     | Brio-Wu | PPM            | HLLD           | 19.378            | 11.123                    | 17.733                    |
| MHD     | Brio-Wu | PPM            | HLLE           | 20.430            | 11.313                    | 17.684                    |
| MHD     | Brio-Wu | PPM            | Roe            | 18.972            | 10.623                    | 17.495                    |

Notes on methodology:

  • Both benchmark problems are 3D shock tubes using the adiabatic equation of state.
  • Each table entry is the mean of 20 trials, each run in an independent, exclusive compute-node Slurm allocation on clusters managed by Princeton Research Computing.
  • The solver was configured with --nghost=2 for all PLM tests and --nghost=4 for all PPM tests.
  • The Intel C++ Compiler version 18.0.3 was used to generate all of these results. The only compiler flags used were those defined by the --cxx=icc [Configuring] option of the current configure script (an example configure invocation is sketched after this list).
    • Similarly, the Intel MPI library 2018 Revision 3 was the only MPI library used for the multicore study.
  • Table 2 uses the same problem size per core as the single-core tests in Table 1. Flat MPI is used to parallelize the problem, with one rank assigned per physical core.
    • The multicore tests on the KNL were the only set to use hybrid MPI+OpenMP parallelization. Assigning 4 OpenMP threads per MPI rank (each thread assigned a 64x32x32 MeshBlock) achieved high performance by utilizing the 4-way hyperthreading of the 64 physical cores (256 logical cores) on these nodes. See the discussion in Using MPI and OpenMP.
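
As a concrete illustration of the build and launch settings above, the commands below sketch how the MHD/PPM/HLLD configuration in Table 2 might be built and run on a KNL node. The configure.py options shown (--prob, --flux, -b, -mpi, -omp, --nghost, --cxx) are standard Athena++ configure options, but the input file name, the Slurm invocation, and the parameter overrides are illustrative assumptions rather than the exact commands used to generate the tables.

```
# Example configure line for the MHD / PPM / HLLD row (Intel compiler, MPI+OpenMP):
python configure.py --prob=shock_tube -b --flux=hlld --nghost=4 --cxx=icc -mpi -omp
make clean && make

# Hypothetical Slurm launch on a single KNL node: 64 MPI ranks x 4 OpenMP threads,
# with each thread assigned a 64x32x32 MeshBlock as described above.  The input
# file name and parameter overrides are illustrative.
srun -n 64 ./bin/athena -i athinput.brio-wu \
    mesh/num_threads=4 \
    meshblock/nx1=64 meshblock/nx2=32 meshblock/nx3=32
```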

KNL-specific details:

  • Flat memory mode. Cache memory mode was approximated by prepending numactl -p 1 to the binary call (numactl -p 1 ./athena ...) so that Athena++ preferentially allocated memory from the ~16 GB of on-package MCDRAM.
  • Quadrant clustering mode.

Performance cost of optional code features

Under construction.
