
Commit ad823dc

fix typo (#51)
* fix typo
* correct calculation of the total data size consumed in broadcast add optimization
* fix typo
* fix typo

Co-authored-by: zhaiyi <>
1 parent d5b2275 commit ad823dc

4 files changed: +7 -6 lines changed

chapter_cpu_schedules/arch.md (+3 -2)
@@ -151,8 +151,9 @@ the latency to access L1 cache is less than 1 ns, the L2 cache's latency is arou
 :label:`fig_cpu_memory`

 A brief memory subsystem layout is illustrated in :numref:`fig_cpu_memory`.
-L1 and L2 caches are exclusive to each CPU core, and L3 cache is shared across the cores of the same CPU processor
-To processing on some data, a CPU will first check if the data exist at L1 cache, if not check L2 cache, if not check L3 cache, if not go to the main memory to retrieve the data and bring it all the way through L3 cache, L2 cache, and L1 cache, finally to the CPU registers.
+L1 and L2 caches are exclusive to each CPU core, and L3 cache is shared across the cores of the same CPU processor.
+
+To process on some data, a CPU will first check if the data exist at L1 cache, if not check L2 cache, if not check L3 cache, if not go to the main memory to retrieve the data and bring it all the way through L3 cache, L2 cache, and L1 cache, finally to the CPU registers.
 This looks very expensive but luckily in practice, the programs have the [data locality patterns](https://en.wikipedia.org/wiki/Locality_of_reference) which will accelerate the data retrieving procedure. There are two types of locality: temporal locality and spatial locality.
 Temporal locality means that the data we just used usually would be used in the near future so that they may be still in cache. Spatial locality means that the adjacent data of the ones we just used are likely to be used in the near future. As the system always brings a block of values to the cache each time (see the concept of [cache lines](https://en.wikipedia.org/wiki/CPU_cache#CACHE-LINES)), those adjacent data may be still in cache when referenced to.
 Leveraging the advantage brought by data locality is one of the most important performance optimization principles we will describe in detail later.
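
For reference, a small NumPy sketch (not part of this commit or the book's source) of the spatial-locality effect the paragraph above describes: summing the same number of float32 values from contiguous memory versus from a strided view that uses only one value per cache line.

```python
# Hypothetical illustration: sum 2M float32 values that sit next to each other,
# then 2M values spaced one cache line apart, and compare the timings.
import timeit
import numpy as np

x = np.random.normal(size=2**25).astype('float32')  # 128 MB, well beyond a typical L3

contiguous = x[:2**21]   # adjacent values: every 64-byte cache line is fully used
strided = x[::16]        # values 64 bytes apart: roughly one value per cache line

t_cont = timeit.timeit(lambda: contiguous.sum(), number=20)
t_strided = timeit.timeit(lambda: strided.sum(), number=20)
print('contiguous: %.4f s, strided: %.4f s' % (t_cont, t_strided))
```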

chapter_cpu_schedules/block_matmul.md (+1 -1)
@@ -43,7 +43,7 @@ In each submatrix computation, we need to write a `[tx, ty]` shape matrix, and r

 Let's implement this idea. In the following code block, we choose `tx=ty=32` and `tk=4` so that the submatrix to write has a size of `32*32*4=4KB` and the total size of the two submatrices to read is `2*32*4*4=1KB`. The three matrices together can fit into our L1 cache easily. The tiling is implemented by the `tile` primitive.

-After tiling, we merge the outer width and height axes into a single one using the `fuse` primitive, so we can parallelize it. It means that we will compute blocks in parallel. Within a block, we split the reduced axis, reorder the axes as we did in:numref:`ch_matmul_cpu`, and then vectorize the innermost axis using SIMD instructions, and unroll the second innermost axis using the `unroll` primitive, namely the inner reduction axis.
+After tiling, we merge the outer width and height axes into a single one using the `fuse` primitive, so we can parallelize it. It means that we will compute blocks in parallel. Within a block, we split the reduced axis, reorder the axes as we did in :numref:`ch_matmul_cpu`, and then vectorize the innermost axis using SIMD instructions, and unroll the second innermost axis using the `unroll` primitive, namely the inner reduction axis.

 ```{.python .input n=10}
 tx, ty, tk = 32, 32, 4 # tile sizes
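
The book's cell is cut off by the diff; the following is only a rough, self-contained sketch (my own variable names, not necessarily the book's exact code) of a schedule that follows the steps the changed paragraph describes, written against TVM's te API.

```python
# Hypothetical sketch of the tiled, fused, vectorized and unrolled matmul schedule.
import tvm
from tvm import te

n, tx, ty, tk = 1024, 32, 32, 4  # problem size and tile sizes

k = te.reduce_axis((0, n), name='k')
A = te.placeholder((n, n), name='A')
B = te.placeholder((n, n), name='B')
C = te.compute((n, n), lambda x, y: te.sum(A[x, k] * B[k, y], axis=k), name='C')

s = te.create_schedule(C.op)
# Tile the output into (tx, ty) blocks, then fuse and parallelize the outer block axes.
xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], tx, ty)
xy = s[C].fuse(xo, yo)
s[C].parallel(xy)
# Within a block: split the reduction axis, reorder, vectorize the innermost axis,
# and unroll the inner reduction axis.
ko, ki = s[C].split(k, factor=tk)
s[C].reorder(ko, xi, ki, yi)
s[C].vectorize(yi)
s[C].unroll(ki)

print(tvm.lower(s, [A, B, C], simple_mode=True))
```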

chapter_cpu_schedules/broadcast_add.md (+2 -2)
@@ -30,7 +30,7 @@ np_bcast_add = lambda s1, s2: timeit.Timer(setup='import numpy as np\n'
 exe_times = [d2ltvm.bench_workload(np_bcast_add((n, 1), (n, n)).timeit) for n in sizes]
 np_gflops = sizes * sizes / 1e9 / np.array(exe_times)
 # data size in MB
-x_axis_sizes = (sizes * sizes * 2 + sizes * sizes) * 4 / 1e6
+x_axis_sizes = (sizes * sizes * 2 + sizes * 1) * 4 / 1e6
 d2ltvm.plot_gflops(x_axis_sizes, [np_gflops], ['numpy'], xlabel='Size (MB)')
 ```
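
The corrected line reflects that the benchmark adds an `(n, 1)` array to an `(n, n)` array: per call it reads n*n values, writes n*n values, and reads only n values for the broadcast operand. A quick sanity check (not part of the commit, names are mine):

```python
# Hypothetical check of the fixed formula: total bytes moved by one broadcast
# add of an (n, 1) array and an (n, n) array, all in float32.
import numpy as np

n = 64
a = np.ones((n, 1), dtype='float32')   # broadcast operand: n values
b = np.ones((n, n), dtype='float32')   # dense operand: n*n values
c = a + b                              # output: n*n values

total_bytes = a.nbytes + b.nbytes + c.nbytes
assert total_bytes == (n * n * 2 + n * 1) * 4   # matches the new x_axis_sizes line
print('data size: %.6f MB' % (total_bytes / 1e6))
```
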
@@ -77,7 +77,7 @@ s, args = good_schedule(64)
 print(tvm.lower(s, args, simple_mode=True))
 ```

-Now the C-like pseudo code should be familiar to you. One notable difference from :numref:ch_vector_add_cpu is that we broadcast `A[x]` to a vectorized register (i.e. `x64(A[x]`) for vectorized add.
+Now the C-like pseudo code should be familiar to you. One notable difference from :numref:`ch_vector_add_cpu` is that we broadcast `A[x]` to a vectorized register (i.e. `x64(A[x]`) for vectorized add.

 Let's benchmark the good schedule.
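
A minimal, self-contained sketch (my own names, not the book's `good_schedule`) of why vectorizing the inner axis turns the scalar operand into a broadcast such as `x64(A[x])` in the lowered pseudo code:

```python
# Hypothetical example: an (n, 1) + (n, n) broadcast add scheduled so each output
# row is a single 64-lane vector operation; the scalar A[x, 0] must then be
# broadcast into a vector register, which tvm.lower prints as x64(...).
import tvm
from tvm import te

n = 64
A = te.placeholder((n, 1), name='A')
B = te.placeholder((n, n), name='B')
C = te.compute((n, n), lambda x, y: A[x, 0] + B[x, y], name='C')

s = te.create_schedule(C.op)
x, y = C.op.axis
s[C].parallel(x)     # one output row per thread
s[C].vectorize(y)    # the whole row (64 elements) becomes one vector add
print(tvm.lower(s, [A, B, C], simple_mode=True))
```
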
chapter_cpu_schedules/call_overhead.md (+1 -1)
@@ -96,7 +96,7 @@ exe_times = np.array([bench_workload(np_copy(n).timeit) for n in sizes])
 print('NumPy call overhead: %.1f microsecond' % (exe_times.mean()*1e6))
 ```

-The overhead of TVM is higher but in the same order of magnitude.
+The overhead of TVM is lower but in the same order of magnitude.

 ```{.python .input n=12}
 def tvm_copy(n):
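
The body of `tvm_copy` is cut off by the diff. Below is only a rough sketch (my own names and shapes, not the book's function) of how such a per-call overhead measurement could look: compile a tiny TVM copy kernel and time repeated calls on an input small enough that the call overhead dominates the computation.

```python
# Hypothetical sketch: time a compiled TVM module on a tiny input so that the
# measured cost is mostly the function-call overhead, mirroring the NumPy test.
import timeit
import numpy as np
import tvm
from tvm import te

def tiny_copy(n):
    x = te.placeholder((n,), name='x')
    y = te.compute((n,), lambda i: x[i], name='y')
    s = te.create_schedule(y.op)
    return tvm.build(s, [x, y])

n = 32
mod = tiny_copy(n)
x_nd = tvm.nd.array(np.random.normal(size=n).astype('float32'))
y_nd = tvm.nd.array(np.empty(n, dtype='float32'))

t = timeit.timeit(lambda: mod(x_nd, y_nd), number=1000)
print('TVM call overhead: %.1f microsecond' % (t / 1000 * 1e6))
```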
