Terry Sun; Arch Linux, Intel i5-4670, GTX 750

![](images/nbody_perf_plot.png)

The graph shows time taken (in ms) to update one frame at block sizes from 16
to 1024 in steps of 8, for various values of N (planets in the system).

I measured performance by disabling visualization and using `cudaEvent_t`s to
time the kernel invocations (measuring the time elapsed for both
`kernUpdateVelPos` and `kernUpdateAcc`). The recorded value is an average over
100 frames.

Code for performance measurement can be found on the `performance` branch.

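A minimal sketch of this kind of `cudaEvent_t` timing loop, for illustration:
the kernel names are the ones mentioned above, but their arguments, the launch
configuration, and the surrounding loop are assumptions, not the actual code on
the `performance` branch.

```cuda
// Hypothetical sketch: argument lists and launch configuration are placeholders.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

const int numFrames = 100;   // average over 100 frames, as described above
float totalMs = 0.0f;

for (int frame = 0; frame < numFrames; ++frame) {
    cudaEventRecord(start);
    kernUpdateAcc<<<fullBlocksPerGrid, blockSize>>>(N, dev_pos, dev_acc);
    kernUpdateVelPos<<<fullBlocksPerGrid, blockSize>>>(N, dt, dev_pos, dev_vel, dev_acc);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // block until both kernels have finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    totalMs += ms;
}

printf("average time per frame: %.3f ms\n", totalMs / numFrames);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Timing with events rather than host wall-clock time keeps visualization and
other host-side work out of the measurement, matching the approach described
above.
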
Changing the number of planets, as expected, increases the time elapsed for the
kernels, due to a for-loop in the acceleration calculation whose cost grows
linearly with the total number of planets in the system. More interestingly, it
also changes the way that performance reacts to block size (see N=4096 in the
above plot): the variation in performance across block sizes is much larger at
higher N, and also exhibits different behavior.

At certain block sizes the time per frame drops sharply, e.g. at N=4096 with
block sizes of 1024, 512, 256, and 128. These are points where N is an exact
multiple of the block size, so every block is saturated (i.e. no unneeded
threads are launched).

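For context on why those block sizes line up: a common way to size the grid
(an assumed launch setup, not taken from this project's source) rounds the
block count up, so no threads are wasted only when N is an exact multiple of
the block size.

```cuda
// Assumed (typical) launch sizing: round up so every planet gets a thread.
int fullBlocksPerGrid = (N + blockSize - 1) / blockSize;

// N = 4096, blockSize = 128 -> 32 blocks, 4096 threads launched,   0 idle
// N = 4096, blockSize = 800 ->  6 blocks, 4800 threads launched, 704 idle
```
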
I have no idea what's going on with the spikes peaking at N=4096, block
size ~800 or N=3072, block size ~600.

# Part 2: An Even More Basic Matrix Library
