
Commit ad823dc

fix typo (#51)
* fix typo
* correct calculation of the total data size consumed in broadcast add optimization
* fix typo
* fix typo

Co-authored-by: zhaiyi <>
1 parent d5b2275 commit ad823dc

4 files changed: +7 -6 lines changed

chapter_cpu_schedules/arch.md (+3 -2)
@@ -151,8 +151,9 @@ the latency to access L1 cache is less than 1 ns, the L2 cache's latency is arou
 :label:`fig_cpu_memory`

 A brief memory subsystem layout is illustrated in :numref:`fig_cpu_memory`.
-L1 and L2 caches are exclusive to each CPU core, and L3 cache is shared across the cores of the same CPU processor
-To processing on some data, a CPU will first check if the data exist at L1 cache, if not check L2 cache, if not check L3 cache, if not go to the main memory to retrieve the data and bring it all the way through L3 cache, L2 cache, and L1 cache, finally to the CPU registers.
+L1 and L2 caches are exclusive to each CPU core, and L3 cache is shared across the cores of the same CPU processor.
+
+To process on some data, a CPU will first check if the data exist at L1 cache, if not check L2 cache, if not check L3 cache, if not go to the main memory to retrieve the data and bring it all the way through L3 cache, L2 cache, and L1 cache, finally to the CPU registers.
 This looks very expensive but luckily in practice, the programs have the [data locality patterns](https://en.wikipedia.org/wiki/Locality_of_reference) which will accelerate the data retrieving procedure. There are two types of locality: temporal locality and spatial locality.
 Temporal locality means that the data we just used usually would be used in the near future so that they may be still in cache. Spatial locality means that the adjacent data of the ones we just used are likely to be used in the near future. As the system always brings a block of values to the cache each time (see the concept of [cache lines](https://en.wikipedia.org/wiki/CPU_cache#CACHE-LINES)), those adjacent data may be still in cache when referenced to.
 Leveraging the advantage brought by data locality is one of the most important performance optimization principles we will describe in detail later.
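
For reference, a small NumPy sketch (not part of this commit or the book's source) of the spatial-locality effect the paragraph above describes: summing the same number of float32 values from contiguous memory versus from a strided view that uses only one value per cache line.

```python
# Hypothetical illustration: sum 2M float32 values that sit next to each other,
# then 2M values spaced one cache line apart, and compare the timings.
import timeit
import numpy as np

x = np.random.normal(size=2**25).astype('float32')  # 128 MB, well beyond a typical L3

contiguous = x[:2**21]   # adjacent values: every 64-byte cache line is fully used
strided = x[::16]        # values 64 bytes apart: roughly one value per cache line

t_cont = timeit.timeit(lambda: contiguous.sum(), number=20)
t_strided = timeit.timeit(lambda: strided.sum(), number=20)
print('contiguous: %.4f s, strided: %.4f s' % (t_cont, t_strided))
```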

chapter_cpu_schedules/block_matmul.md (+1 -1)
@@ -43,7 +43,7 @@ In each submatrix computation, we need to write a `[tx, ty]` shape matrix, and r

 Let's implement this idea. In the following code block, we choose `tx=ty=32` and `tk=4` so that the submatrix to write has a size of `32*32*4=4KB` and the total size of the two submatrices to read is `2*32*4*4=1KB`. The three matrices together can fit into our L1 cache easily. The tiling is implemented by the `tile` primitive.

-After tiling, we merge the outer width and height axes into a single one using the `fuse` primitive, so we can parallelize it. It means that we will compute blocks in parallel. Within a block, we split the reduced axis, reorder the axes as we did in:numref:`ch_matmul_cpu`, and then vectorize the innermost axis using SIMD instructions, and unroll the second innermost axis using the `unroll` primitive, namely the inner reduction axis.
+After tiling, we merge the outer width and height axes into a single one using the `fuse` primitive, so we can parallelize it. It means that we will compute blocks in parallel. Within a block, we split the reduced axis, reorder the axes as we did in :numref:`ch_matmul_cpu`, and then vectorize the innermost axis using SIMD instructions, and unroll the second innermost axis using the `unroll` primitive, namely the inner reduction axis.

 ```{.python .input n=10}
 tx, ty, tk = 32, 32, 4 # tile sizes
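
The book's cell is cut off by the diff; the following is only a rough, self-contained sketch (my own variable names, not necessarily the book's exact code) of a schedule that follows the steps the changed paragraph describes, written against TVM's te API.

```python
# Hypothetical sketch of the tiled, fused, vectorized and unrolled matmul schedule.
import tvm
from tvm import te

n, tx, ty, tk = 1024, 32, 32, 4  # problem size and tile sizes

k = te.reduce_axis((0, n), name='k')
A = te.placeholder((n, n), name='A')
B = te.placeholder((n, n), name='B')
C = te.compute((n, n), lambda x, y: te.sum(A[x, k] * B[k, y], axis=k), name='C')

s = te.create_schedule(C.op)
# Tile the output into (tx, ty) blocks, then fuse and parallelize the outer block axes.
xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], tx, ty)
xy = s[C].fuse(xo, yo)
s[C].parallel(xy)
# Within a block: split the reduction axis, reorder, vectorize the innermost axis,
# and unroll the inner reduction axis.
ko, ki = s[C].split(k, factor=tk)
s[C].reorder(ko, xi, ki, yi)
s[C].vectorize(yi)
s[C].unroll(ki)

print(tvm.lower(s, [A, B, C], simple_mode=True))
```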

chapter_cpu_schedules/broadcast_add.md (+2 -2)
@@ -30,7 +30,7 @@ np_bcast_add = lambda s1, s2: timeit.Timer(setup='import numpy as np\n'
 exe_times = [d2ltvm.bench_workload(np_bcast_add((n, 1), (n, n)).timeit) for n in sizes]
 np_gflops = sizes * sizes / 1e9 / np.array(exe_times)
 # data size in MB
-x_axis_sizes = (sizes * sizes * 2 + sizes * sizes) * 4 / 1e6
+x_axis_sizes = (sizes * sizes * 2 + sizes * 1) * 4 / 1e6
 d2ltvm.plot_gflops(x_axis_sizes, [np_gflops], ['numpy'], xlabel='Size (MB)')
 ```
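
The corrected line reflects that the benchmark adds an `(n, 1)` array to an `(n, n)` array: per call it reads n*n values, writes n*n values, and reads only n values for the broadcast operand. A quick sanity check (not part of the commit, names are mine):

```python
# Hypothetical check of the fixed formula: total bytes moved by one broadcast
# add of an (n, 1) array and an (n, n) array, all in float32.
import numpy as np

n = 64
a = np.ones((n, 1), dtype='float32')   # broadcast operand: n values
b = np.ones((n, n), dtype='float32')   # dense operand: n*n values
c = a + b                              # output: n*n values

total_bytes = a.nbytes + b.nbytes + c.nbytes
assert total_bytes == (n * n * 2 + n * 1) * 4   # matches the new x_axis_sizes line
print('data size: %.6f MB' % (total_bytes / 1e6))
```
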
@@ -77,7 +77,7 @@ s, args = good_schedule(64)
 print(tvm.lower(s, args, simple_mode=True))
 ```

-Now the C-like pseudo code should be familiar to you. One notable difference from :numref:ch_vector_add_cpu is that we broadcast `A[x]` to a vectorized register (i.e. `x64(A[x]`) for vectorized add.
+Now the C-like pseudo code should be familiar to you. One notable difference from :numref:`ch_vector_add_cpu` is that we broadcast `A[x]` to a vectorized register (i.e. `x64(A[x]`) for vectorized add.

 Let's benchmark the good schedule.
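
A minimal, self-contained sketch (my own names, not the book's `good_schedule`) of why vectorizing the inner axis turns the scalar operand into a broadcast such as `x64(A[x])` in the lowered pseudo code:

```python
# Hypothetical example: an (n, 1) + (n, n) broadcast add scheduled so each output
# row is a single 64-lane vector operation; the scalar A[x, 0] must then be
# broadcast into a vector register, which tvm.lower prints as x64(...).
import tvm
from tvm import te

n = 64
A = te.placeholder((n, 1), name='A')
B = te.placeholder((n, n), name='B')
C = te.compute((n, n), lambda x, y: A[x, 0] + B[x, y], name='C')

s = te.create_schedule(C.op)
x, y = C.op.axis
s[C].parallel(x)     # one output row per thread
s[C].vectorize(y)    # the whole row (64 elements) becomes one vector add
print(tvm.lower(s, [A, B, C], simple_mode=True))
```
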
chapter_cpu_schedules/call_overhead.md (+1 -1)
@@ -96,7 +96,7 @@ exe_times = np.array([bench_workload(np_copy(n).timeit) for n in sizes])
 print('NumPy call overhead: %.1f microsecond' % (exe_times.mean()*1e6))
 ```

-The overhead of TVM is higher but in the same order of magnitude.
+The overhead of TVM is lower but in the same order of magnitude.

 ```{.python .input n=12}
 def tvm_copy(n):
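
The body of `tvm_copy` is cut off by the diff. Below is only a rough sketch (my own names and shapes, not the book's function) of how such a per-call overhead measurement could look: compile a tiny TVM copy kernel and time repeated calls on an input small enough that the call overhead dominates the computation.

```python
# Hypothetical sketch: time a compiled TVM module on a tiny input so that the
# measured cost is mostly the function-call overhead, mirroring the NumPy test.
import timeit
import numpy as np
import tvm
from tvm import te

def tiny_copy(n):
    x = te.placeholder((n,), name='x')
    y = te.compute((n,), lambda i: x[i], name='y')
    s = te.create_schedule(y.op)
    return tvm.build(s, [x, y])

n = 32
mod = tiny_copy(n)
x_nd = tvm.nd.array(np.random.normal(size=n).astype('float32'))
y_nd = tvm.nd.array(np.empty(n, dtype='float32'))

t = timeit.timeit(lambda: mod(x_nd, y_nd), number=1000)
print('TVM call overhead: %.1f microsecond' % (t / 1000 * 1e6))
```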
