Project 1: Memory Hierarchies and Performance of Serial Applications

Due: EOD, Sep 10

Learning goals

In this project you will explore how memory hierarchies affect performance for serial applications.

  • arithmetic intensity: how to compute it for a kernel, and how divides and memory accesses enter the count
  • performance on at least two different system architectures from ICE, compared and contrasted

Note that all of these aspects of serial applications are transferable to parallel applications, as we shall see later in the course.

In addition, you will explore leveraging GPT tools for HPC development. You may use ChatGPT, Gemini, or similar GPT tools for brainstorming and boilerplate. For each place you used GPT, include in a GPT Notes appendix (≤1 page total):

  • the prompt you used,
  • a one-sentence summary of the answer, and
  • how you verified/modified it.

You are still responsible for correctness and performance measurement.

Warm-up

Review the section in HPSC2020 on computing arithmetic intensity for given compute kernels. Then, as a group, compute the arithmetic intensities of the following kernels in units of FLOPs/byte, assuming 8 bytes per double.

  Y[j] += Y[j] + A[j][i] * B[i]
  s += A[i] * A[i]
  s += A[i] * B[i]
  Y[i] = A[i] + C*B[i]
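
As a worked example of the bookkeeping (a sketch under the stated assumptions: doubles, s kept in a register, and each array element loaded from memory once), the dot-product kernel performs one multiply and one add against two 8-byte loads per iteration:

  s += A[i] * B[i]:  AI = 2 FLOPs / (2 × 8 bytes) = 1/8 FLOP/byte

An intensity this low typically places a kernel on the bandwidth-limited side of the roofline on modern CPUs.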

Include a table in your project report listing the arithmetic intensities for these kernels.

Ask GPT: “What is the arithmetic intensity (FLOPs/byte) of s += A[i] * B[i] assuming doubles, and what does that imply about likely bottlenecks?” Paste a 2–3 sentence summary into GPT Notes.

In one sentence, state whether each kernel tends to be compute-bound or bandwidth-bound on a typical CPU, and why. You may use GPT to figure these out.

Part 1: Matrix-matrix Multiplication

In this first part of the project, you will test the performance of the basic matrix-matrix multiplication operation as a function of matrix size on multiple compute platforms. Complete the following using at least two different compute architectures (e.g., two different node architectures on ICE, such as Intel vs. AMD).

  1. With your group, write a program that multiplies two matrices together (see: http://mathworld.wolfram.com/MatrixMultiplication.html), i.e., C = A*B. Use GitHub to manage the code development and version history. (A minimal sketch appears after this list.)

  2. For a given matrix size N, what is the total number of floating point operations performed by this computation?

  3. Using the supplied C routine get_walltime.c, or any other accurate means of measuring time, compute the performance in Gflop/s of the matrix-matrix multiply for N=128. Be sure to perform enough repeat calculations of the timing to overcome any statistical noise in the measurement.

  4. For the system you are running on, determine the clock speed of the processor and the cache size/layout. Use this information to estimate the theoretical peak performance of the system, assuming that the processor is capable of one flop per clock cycle (generally NOT true on modern architectures). How does the performance you measured in (3) compare to the theoretical peak performance of your system?

  5. Now repeat the performance measurement for a range of matrix size N from 64 to 8,192. Make a plot of the resulting measured Gflop/s vs. N. On this plot place a horizontal line representing the theoretical peak performance based upon your system's clock speed.

  6. How does the measured performance for multiple N's compare to peak? Are there any "features" in your plot? Explain them in the context of the hardware architecture of your system. Include in your write-up a description of your system's architecture (processor, cache, etc.).

  7. Time two loop orderings (e.g., i-j-k vs. i-k-j) at N=512 only. Report which is faster and give a one-sentence reason about memory access order (row-major layout vs. data reuse). You may use GPT to figure out the reasoning. (See the loop-ordering sketch after this list.)

  8. Compare your result at N=256 against dgemm (OpenBLAS on ICE). Report the relative error |C_ref-C|/|C_ref|. You may use GPT to generate such code and the proper compile command on ICE. (A hedged dgemm-check sketch follows this list.)
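
To make items 1–3 concrete, here is a minimal sketch of a naive triple-loop multiply with a timing harness. It assumes the supplied get_walltime() (from get_walltime.c) fills in wall-clock seconds as a double, and it uses the standard 2*N^3 operation count for a dense N×N multiply; the compile command in the comment is a suggestion, not the required one.

  /* mmm.c -- naive matrix-matrix multiply with a timing harness.
   * Sketch only. Assumes get_walltime() from the supplied get_walltime.c;
   * example compile (a suggestion): gcc -O3 mmm.c get_walltime.c -o mmm
   */
  #include <stdio.h>
  #include <stdlib.h>

  void get_walltime(double *t);   /* provided by get_walltime.c */

  int main(int argc, char **argv) {
      int N = (argc > 1) ? atoi(argv[1]) : 128;
      double *A = malloc((size_t)N * N * sizeof(double));
      double *B = malloc((size_t)N * N * sizeof(double));
      double *C = malloc((size_t)N * N * sizeof(double));
      if (!A || !B || !C) { fprintf(stderr, "allocation failed\n"); return 1; }
      for (long i = 0; i < (long)N * N; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

      double t0, t1;
      get_walltime(&t0);
      /* i-j-k ordering: C[i][j] = sum_k A[i][k] * B[k][j], row-major */
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++) {
              double sum = 0.0;
              for (int k = 0; k < N; k++)
                  sum += A[(size_t)i * N + k] * B[(size_t)k * N + j];
              C[(size_t)i * N + j] = sum;
          }
      get_walltime(&t1);

      /* 2*N^3 flops: one multiply and one add per innermost iteration */
      double gflops = 2.0 * N * N * N / (t1 - t0) / 1e9;
      printf("N=%d  time=%.6f s  perf=%.3f Gflop/s\n", N, t1 - t0, gflops);
      free(A); free(B); free(C);
      return 0;
  }

For small N, wrap the timed region in a repeat loop and divide by the repetition count, so the measurement rises above timer noise (item 3).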
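
For item 7, swapping the two inner loops (i-k-j) makes the innermost accesses to B and C unit-stride in row-major storage, which is the kind of reasoning the question asks for. A sketch of the swapped loops, using the same layout as above (C must start zeroed):

  /* i-k-j ordering: the innermost loop walks B and C with unit stride. */
  for (int i = 0; i < N; i++)
      for (int k = 0; k < N; k++) {
          double a = A[(size_t)i * N + k];   /* reused across the j loop */
          for (int j = 0; j < N; j++)
              C[(size_t)i * N + j] += a * B[(size_t)k * N + j];
      }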
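
For item 8, one workable reference is cblas_dgemm from OpenBLAS, with the relative error read as a Frobenius-norm ratio. This is a sketch under assumptions you should verify on ICE: that <cblas.h> is on the include path and that -lopenblas (after loading an appropriate OpenBLAS module) links it.

  /* check_dgemm.c -- cross-check a hand-rolled multiply against OpenBLAS
   * dgemm at N=256. Example compile (verify the exact command on ICE):
   *   gcc -O3 check_dgemm.c -o check_dgemm -lopenblas
   */
  #include <cblas.h>
  #include <math.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(void) {
      const int N = 256;
      double *A = malloc((size_t)N * N * sizeof(double));
      double *B = malloc((size_t)N * N * sizeof(double));
      double *C = calloc((size_t)N * N, sizeof(double));
      double *C_ref = calloc((size_t)N * N, sizeof(double));
      if (!A || !B || !C || !C_ref) { fprintf(stderr, "allocation failed\n"); return 1; }
      for (long i = 0; i < (long)N * N; i++) {
          A[i] = (double)rand() / RAND_MAX;
          B[i] = (double)rand() / RAND_MAX;
      }

      /* your multiply (naive i-k-j shown; substitute your own kernel) */
      for (int i = 0; i < N; i++)
          for (int k = 0; k < N; k++)
              for (int j = 0; j < N; j++)
                  C[(size_t)i * N + j] += A[(size_t)i * N + k] * B[(size_t)k * N + j];

      /* reference: C_ref = 1.0 * A * B + 0.0 * C_ref, row-major */
      cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                  N, N, N, 1.0, A, N, B, N, 0.0, C_ref, N);

      /* Frobenius-norm relative error ||C_ref - C|| / ||C_ref|| */
      double num = 0.0, den = 0.0;
      for (long i = 0; i < (long)N * N; i++) {
          double d = C_ref[i] - C[i];
          num += d * d;
          den += C_ref[i] * C_ref[i];
      }
      printf("relative error = %e\n", sqrt(num) / sqrt(den));
      free(A); free(B); free(C); free(C_ref);
      return 0;
  }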

To your project git repo, commit your code for performing the matrix-matrix multiply performance measurements, the plots of your results, and a brief write-up (in plain text or markdown) addressing the above questions and discussing your results. Pay particular attention to the comparison between the different architectures and explain the differences you observe.

Part 2: The Roofline Model

In this part, you will explore the roofline model and the Empirical Roofline Tool (ERT) for analyzing the interplay between arithmetic intensity and memory bandwidth for architectures with complex memory hierarchies. Complete the following exercises on the SAME compute architectures that you used in Part 1 above.

  1. First, watch a short NERSC video "Introduction to the Roofline Model". Then reference the materials on the Roofline Performance model at https://crd.lbl.gov/divisions/amcr/computer-science-amcr/par/research/roofline/. In particular, look through "Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures" and the slides at https://crd.lbl.gov/assets/pubs_presos/parlab08-roofline-talk.pdf. Nothing needs to be turned in.

  2. Clone the CS Roofline Toolkit, git clone https://bitbucket.org/berkeleylab/cs-roofline-toolkit.git. Modify one of the config files in Empirical_Roofline_Tool-1.1.0/Config as necessary for the machine you are using.

  3. Run the ERT in serial mode on one of your machines. Report the peak performances and bandwidths (for all cache levels as well as DRAM). Where is the "ridge point" of the roofline for the various cases?

  4. Consider the four FP kernels in "Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures" (see their Table 2). Assuming the high end of operational (i.e., "arithmetic") intensity, how would these kernels perform on the platforms you are testing? What optimization strategy would you recommend to increase performance of these kernels?

  5. Address the same questions in (4) for the four kernels given in the Warm-up above.

  6. Compare your results for the roofline model to what you obtained for the matrix-matrix multiplication operation from Part 1. How are the rooflines of memory bandwidth related to the features in the algorithmic performance as a function of matrix size?

  7. Estimate the arithmetic intensity of your matrix-multiply kernel using a simple memory traffic model (assume 8 bytes per double and write-allocate for stores). Using your system's measured or reported memory bandwidth (GB/s) and theoretical compute roof (GFLOP/s from clock speed × FLOPs/cycle), use the Roofline model to predict the expected performance (GFLOP/s). State whether the kernel is predicted to be compute-bound or bandwidth-bound. Measure the performance of your MMM code at N=2048. Compare the prediction and the measurement. If your measurement is significantly lower, give one or two possible reasons (e.g., lack of vectorization, unblocked loop order, cache behavior). (A worked form of the prediction follows this list.)
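
The prediction in item 7 is the minimum of the two roofs. As a sketch with placeholder numbers (the 2.5 GHz clock, 16 FLOPs/cycle, and 20 GB/s bandwidth are illustrative assumptions only; substitute your own measured values):

  P_predicted = min(P_peak, AI × BW)
              = min(2.5 GHz × 16 FLOPs/cycle, AI × 20 GB/s)
              = min(40 GFLOP/s, AI × 20 GFLOP/s)

On that hypothetical machine, a kernel with AI below 40/20 = 2 FLOPs/byte sits under the bandwidth roof and is predicted bandwidth-bound; above 2 FLOPs/byte, the compute roof binds.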

To your project write-up, add your plots of the roofline model for the systems you tested, and responses addressing the above questions.

HPC practice

  • Compile with optimization -O2 or -O3
  • Run with one process (serial)
  • Average over at least 3–5 runs for stability
  • Find the required CPU information via lscpu

Policy

GPT-generated code must be compiled, run, and verified by you (e.g., BLAS check at N=256).

Hint

On ICE, I suggest installing your own Python 2 and Gnuplot. The easiest way to do this is via conda after loading the Anaconda module using module load anaconda3. I recommend creating an environment named py27 for Python 2.7. Then, install Gnuplot using conda install conda-forge::gnuplot.
