Single node performance

Performance of base Newtonian solver configurations

Typical performance figures for Athena++ are presented in Tables 1 and 2 for serial single-core and MPI-parallelized full-node runs on several target architectures; all figures are reported in millions of zone-cycles per second (MZone-cycles/sec). The default second-order temporally accurate predictor-corrector scheme (time/integrator=vl2) was used in all cases, while the reconstruction method of the corrector step was set to either PLM (time/xorder=2) or PPM (time/xorder=3). Several Riemann solvers were tested.
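
These settings live in the <time> block of the input file; the snippet below is a minimal sketch of that block, assuming the shock-tube input files shipped with the code (only integrator and xorder correspond to the settings quoted above; the other parameter values are illustrative placeholders). The Riemann solver is chosen at configure time rather than at runtime (see the example configure invocation further below).

```
# Minimal sketch of the <time> input block controlling the integrator and
# reconstruction order.  Only integrator and xorder match the settings above;
# cfl_number and tlim are illustrative placeholders.
<time>
cfl_number = 0.3    # Courant-Friedrichs-Lewy number
integrator = vl2    # second-order van Leer predictor-corrector (default)
xorder     = 2      # 2 = PLM reconstruction, 3 = PPM reconstruction
tlim       = 0.25   # simulation stop time
```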

These figures serve as the benchmark values for the basic hydrodynamics and magnetohydrodynamics solver capabilities of the code; the performance of optional functionality such as non-Cartesian and/or nonuniform Coordinate Systems and Meshes, Special Relativity, General Relativity, Shearing Box, Self Gravity with FFT, etc. should be measured relative to the values below.

Table 1: Single-core performance (MZone-cycles/sec)

| Physics | Problem | Reconstruction | Riemann solver | Xeon Phi KNL 7210 | Broadwell E5-2680 v4 | Skylake-SP Gold 6148 |
|---------|---------|----------------|----------------|-------------------|----------------------|----------------------|
| Hydro   | Sod     | PLM            | HLLC           | 1.533             | 2.730                | 4.503                |
| Hydro   | Sod     | PLM            | HLLE           | 1.618             | 2.868                | 4.880                |
| Hydro   | Sod     | PLM            | Roe            | 1.555             | 2.872                | 4.654                |
| Hydro   | Sod     | PPM            | HLLC           | 0.752             | 1.336                | 2.411                |
| Hydro   | Sod     | PPM            | HLLE           | 0.762             | 1.365                | 2.528                |
| Hydro   | Sod     | PPM            | Roe            | 0.762             | 1.361                | 2.424                |
| MHD     | Brio-Wu | PLM            | HLLD           | 0.705             | 1.340                | 2.403                |
| MHD     | Brio-Wu | PLM            | HLLE           | 0.803             | 1.406                | 2.307                |
| MHD     | Brio-Wu | PLM            | Roe            | 0.649             | 1.143                | 1.921                |
| MHD     | Brio-Wu | PPM            | HLLD           | 0.392             | 0.719                | 1.291                |
| MHD     | Brio-Wu | PPM            | HLLE           | 0.419             | 0.749                | 1.259                |
| MHD     | Brio-Wu | PPM            | Roe            | 0.373             | 0.666                | 1.119                |
Table 2: Full-node performance (MZone-cycles/sec)

| Physics | Problem | Reconstruction | Riemann solver | Xeon Phi KNL 7210 | (2x) Broadwell E5-2680 v4 | (2x) Skylake-SP Gold 6148 |
|---------|---------|----------------|----------------|-------------------|---------------------------|---------------------------|
| Hydro   | Sod     | PLM            | HLLC           | 66.908            | 29.444                    | 40.750                    |
| Hydro   | Sod     | PLM            | HLLE           | 67.405            | 29.440                    | 40.764                    |
| Hydro   | Sod     | PLM            | Roe            | 67.094            | 29.418                    | 40.820                    |
| Hydro   | Sod     | PPM            | HLLC           | 40.279            | 20.269                    | 32.248                    |
| Hydro   | Sod     | PPM            | HLLE           | 40.182            | 20.328                    | 32.403                    |
| Hydro   | Sod     | PPM            | Roe            | 40.196            | 20.336                    | 32.234                    |
| MHD     | Brio-Wu | PLM            | HLLD           | 30.886            | 16.244                    | 22.711                    |
| MHD     | Brio-Wu | PLM            | HLLE           | 32.526            | 16.483                    | 22.757                    |
| MHD     | Brio-Wu | PLM            | Roe            | 29.145            | 15.140                    | 22.673                    |
| MHD     | Brio-Wu | PPM            | HLLD           | 19.378            | 11.123                    | 17.733                    |
| MHD     | Brio-Wu | PPM            | HLLE           | 20.430            | 11.313                    | 17.684                    |
| MHD     | Brio-Wu | PPM            | Roe            | 18.972            | 10.623                    | 17.495                    |

Notes on methodology:

  • Both benchmark problems are 3D shock tubes using the adiabatic equation of state.
  • Each table entry is the mean of 20 trials, each run in an independent, exclusive compute-node Slurm allocation on clusters managed by Princeton Research Computing.
  • The solver was configured with --nghost=2 for all PLM tests and --nghost=4 for all PPM tests.
  • The Intel C++ Compiler version 18.0.3 was used to generate all of these results. The only compiler flags used were those defined by the --cxx=icc [Configuring] option of the current configure script (an example configure invocation is sketched after this list).
    • Similarly, the Intel MPI library 2018 Revision 3 was the only MPI library used for the multicore study.
  • Table 2 uses the same problem size per core as the single-core tests in Table 1. Flat MPI is used to parallelize the problem, with one rank assigned per physical core.
    • The multicore tests on the KNL were the only set to use hybrid MPI+OpenMP parallelization. Assigning 4 OpenMP threads per MPI rank (each thread assigned a 64x32x32 MeshBlock) achieved high performance by utilizing the 4-way hyperthreading of the 64 physical cores (256 logical cores) on these nodes. See the discussion in Using MPI and OpenMP.
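
As a concrete illustration of the build and launch settings above, the commands below sketch how the MHD/PPM/HLLD configuration in Table 2 might be built and run on a KNL node. The configure.py options shown (--prob, --flux, -b, -mpi, -omp, --nghost, --cxx) are standard Athena++ configure options, but the input file name, the Slurm invocation, and the parameter overrides are illustrative assumptions rather than the exact commands used to generate the tables.

```
# Example configure line for the MHD / PPM / HLLD row (Intel compiler, MPI+OpenMP):
python configure.py --prob=shock_tube -b --flux=hlld --nghost=4 --cxx=icc -mpi -omp
make clean && make

# Hypothetical Slurm launch on a single KNL node: 64 MPI ranks x 4 OpenMP threads,
# with each thread assigned a 64x32x32 MeshBlock as described above.  The input
# file name and parameter overrides are illustrative.
srun -n 64 ./bin/athena -i athinput.brio-wu \
    mesh/num_threads=4 \
    meshblock/nx1=64 meshblock/nx2=32 meshblock/nx3=32
```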

KNL-specific details:

  • Flat memory mode. Cache memory mode was approximated by prepending numactl -p 1 to the binary call (numactl -p 1 ./athena ...) so that Athena++ preferentially allocated memory from the ~16 GB of on-package MCDRAM.
  • Quadrant clustering mode.

Performance cost of optional code features

Under construction.
