@@ -115,13 +115,13 @@ PyTorch has no knowledge of the *algorithm* you are implementing. It knows only
of the individual operations you use to compose your algorithm. As such, PyTorch
must execute your operations individually, one after the other. Since each
individual call to the implementation (or *kernel*) of an operation, which may
- involve launch of a CUDA kernel, has a certain amount of overhead, this overhead
- may become significant across many function calls. Furthermore, the Python
- interpreter that is running our code can itself slow down our program.
+ involve the launch of a CUDA kernel, has a certain amount of overhead, this
+ overhead may become significant across many function calls. Furthermore, the
+ Python interpreter that is running our code can itself slow down our program.

A definite method of speeding things up is therefore to rewrite parts in C++ (or
CUDA) and *fuse* particular groups of operations. Fusing means combining the
- implementations of many functions into a single functions, which profits from
+ implementations of many functions into a single function, which profits from
fewer kernel launches as well as other optimizations we can perform with
increased visibility of the global flow of data.

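To make the fusion idea above concrete, here is a minimal sketch of a group of operations moved behind a single C++ entry point (the ``fused_gate`` name and the particular operations are illustrative, not part of the tutorial's LLTM example). Composing ATen calls in C++ mainly removes the per-call Python overhead; saving kernel launches as well requires the hand-written kernels discussed later in the tutorial::

    #include <torch/extension.h>

    // Illustrative fused entry point: sigmoid, tanh and the multiply are
    // reached through one extension call instead of three Python-level calls.
    torch::Tensor fused_gate(torch::Tensor x, torch::Tensor y) {
      return torch::sigmoid(x) * torch::tanh(y);
    }

    PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
      m.def("fused_gate", &fused_gate, "sigmoid(x) * tanh(y) in a single call (sketch)");
    }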
@@ -509,12 +509,12 @@ and with our new C++ version::
Forward: 349.335 us | Backward 443.523 us

We can already see a significant speedup for the forward function (more than
- 30%). For the backward function a speedup is visible, albeit not major one. The
- backward pass I wrote above was not particularly optimized and could definitely
- be improved. Also, PyTorch's automatic differentiation engine can automatically
- parallelize computation graphs, may use a more efficient flow of operations
- overall, and is also implemented in C++, so it's expected to be fast.
- Nevertheless, this is a good start.
+ 30%). For the backward function, a speedup is visible, albeit not a major one.
+ The backward pass I wrote above was not particularly optimized and could
+ definitely be improved. Also, PyTorch's automatic differentiation engine can
+ automatically parallelize computation graphs, may use a more efficient flow of
+ operations overall, and is also implemented in C++, so it's expected to be
+ fast. Nevertheless, this is a good start.

Performance on GPU Devices
**************************
@@ -571,7 +571,7 @@ And C++/ATen::

That's a great overall speedup compared to non-CUDA code. However, we can pull
even more performance out of our C++ code by writing custom CUDA kernels, which
- we'll dive into soon. Before that, let's dicuss another way of building your C++
+ we'll dive into soon. Before that, let's discuss another way of building your C++
extensions.

JIT Compiling Extensions
@@ -851,7 +851,7 @@ and ``Double``), you can use ``AT_DISPATCH_ALL_TYPES``.

Note that we perform some operations with plain ATen. These operations will
still run on the GPU, but using ATen's default implementations. This makes
- sense, because ATen will use highly optimized routines for things like matrix
+ sense because ATen will use highly optimized routines for things like matrix
multiplies (e.g. ``addmm``) or convolutions which would be much harder to
implement and improve ourselves.

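A hedged sketch of the shape such a mixed function can take (the ``mixed_forward`` name and signature are made up for illustration): the matrix multiply stays with ATen's ``addmm``, and only the element-wise work would move into a hand-written kernel::

    #include <torch/extension.h>

    torch::Tensor mixed_forward(
        torch::Tensor input,
        torch::Tensor weights,
        torch::Tensor bias) {
      // Plain ATen call: still executes on the GPU, using the library's
      // highly optimized matrix-multiply routine.
      auto gates = torch::addmm(bias, input, weights.transpose(0, 1));
      // A custom CUDA kernel would take over from here for the element-wise
      // activations; that part is omitted in this sketch.
      return gates;
    }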
@@ -903,7 +903,7 @@ You can see in the CUDA kernel that we work directly on pointers with the right
type. Indeed, working directly with high level type agnostic tensors inside cuda
kernels would be very inefficient.

- However, this comes at a cost of ease of use and readibility, especially for
+ However, this comes at a cost of ease of use and readability, especially for
highly dimensional data. In our example, we know for example that the contiguous
``gates`` tensor has 3 dimensions:

@@ -920,7 +920,7 @@ arithmetic.
gates.data<scalar_t>()[n*3*state_size + row*state_size + column]

- In addition to being verbose, this expression needs stride to be explicitely
+ In addition to being verbose, this expression needs stride to be explicitly
known, and thus passed to the kernel function within its arguments. You can see
that in the case of kernel functions accepting multiple tensors with different
sizes you will end up with a very long list of arguments.
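As an illustration of that last point, here is a sketch of how a kernel signature grows once each raw pointer needs its extents passed alongside it (contiguous inputs are assumed, so only sizes rather than separate strides appear; the kernel name and parameters are invented for the example, not taken from the tutorial)::

    template <typename scalar_t>
    __global__ void example_pointwise_kernel(
        const scalar_t* __restrict__ gates,    // [batch_size, 3, state_size]
        const scalar_t* __restrict__ old_cell, // [batch_size, state_size]
        scalar_t* __restrict__ new_cell,       // [batch_size, state_size]
        size_t state_size) {
      // One block row per batch element; threads cover the state_size columns.
      const int column = blockIdx.x * blockDim.x + threadIdx.x;
      const int n = blockIdx.y;
      if (column < state_size) {
        // Manual index arithmetic: every extent involved must be threaded
        // through as a kernel argument; each extra tensor adds more of them.
        const scalar_t g = gates[n * 3 * state_size + column];
        new_cell[n * state_size + column] =
            g + old_cell[n * state_size + column];
      }
    }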