Pointwise fusion for GPU (#15167)
* Beginning of RTC of pointwise ops

* Code generation from the given JSON

* add initial simple_partition_pass and use it for pointwise fusion

* fix the fusion; use a symbol.Copy() at the beginning of the binding function and use the names of the input nodes in the CUDA code

* Fixes

* Adding support for attribute inference for backward nodes when fusing

* keep proper input ordering for fused Op

* instantiate the indexed_graph before starting the subgraph replacement, return a new graph to reset the indexed_graph

* Fuse backward

* fix ordering of subgraph node inputs using subgraph topological ordering instead of main graph topological ordering, add tvm.patch

* exclude forward node fusion during the fusion of the nodes in the backward graph

* Dealing with fused backward nodes inferattr

* use subgraph.indexed_graph() instead of the main graph's for _FusedOpHelper nodes' node_id; invert the control_deps loop to modify the topology of the subgraph before calling its indexed_graph(); check that all nodes of the first DFSVisit are actually in the subgraph

* Adding support for other reqs in codegen

* Fix

* Cleaning

* Change the TVM submodule

* More cleaning

* Making linter happy

* Do fusion only if default context is GPU

* Fixes for tests
Add powerscalar and rpowerscalar, fix return type of zero and one
Cleaning, fixing lint
Go back to proper TVM submodule

* Fix the TVM commit

* Fix lint

* Guard fusion with MXNET_USE_CUDA

* Fix

* Fix clang-tidy

* Add erf and erfinv backward

* Gluon support for fusion

* Cleaning

* Cleaning and allow shape/type change in FusedOp

* Fixing Gluon bugs

* Fixing after rebase

* Fixing race condition and guarding against races when using NVRTC

* Cleaning and renaming FusedOp to _FusedOp

* Going easy on Windows compiler

* Disable fusion on Windows for now

* Refactor InferAttr and InferShapeAttr

* Added slice and half2 support to FusedOp

* Fix lint errors

* Added multiple types support for vector loading/storing

* add slice fusion when it's at the beginning of subgraphs

* Removed constant ndim assumption in fused op

* Fix memory alignment issue in slice for FusedOp

* Fixes

* Fix lint errors

* Do not include cuda_fp16.h

* Refactor the fused op's op lists

* Make linter happy

* Changes from review

* Fixes after rebase

* Expand FusedOp support for slice

* Fix for fp16 _zeros and _ones

* Fix

* Moving aux functions to unnamed namespace and detail namespace -> fusion namespace

* Disabling fusion if it alters topological order of inputs

* Print code only when env variable is set

* Fix

* Fix lint and 2 tests that specify the same names for multiple inputs

* Fixes from review and disabling fusion of slice with non-default step

* Add amp_cast to fusion, fixes

* Add amp_multicast and its backward to the list of support ops

* Apply wording suggestions from code review

Co-Authored-By: Aaron Markham <[email protected]>

* Apply wording suggestions from code review

Co-Authored-By: Aaron Markham <[email protected]>

* Make clearer comment

* Adding punctuation and capitalization to \brief descriptions

* Fix

* Fix

* Add backward_cast to fusion

* Adding unittests for fusion. Fix for erfinv_grad

* Adding slice ops and add_n to tests

* Fixes from review

* Setting inplace option

* Fix lint

* Storing double in half

* Retrigger CI

* Slight relaxing of the relative tolerance in the test

* Move the env variable check to the end

* Fix a race condition between InferShape and scheduled Forward

* Fix flaky test_fusion test involving fp32 erfinv op.

* Fix from review

* Added broadcast_like and slice_like to fused op

* Minor fix and cleanup

* Added negative axis support in slice_axis, temporarily disabled fusion of slice_like and broadcast_like

* Added axes support to slice_like

* Added axis support to broadcast_like

* Add fast_load_slice function to fused op code

* Added runtime switch for choosing fast and slow slice kernel

* Fix lint and warning

* Going easy on Windows compiler (again)

* Fix slice_like

* Debug broadcast_like fusion

* Fix lint

* Fix lint

* Trigger CI

* Get rid of the initializer list

* Fix backward calls with different gradient type

* avoid a cycle when adding a node specific to the inputs of a subgraph for pointwise fusion

* Fix lint

* Add namespace to the fusion implementations

* Set launch bounds on the fused kernel

* Fix NumPy tests

* Test showcasing an issue fixed in PR #16553

* Cast scalars to FP32 and perform (a*1.0/b) instead of (a/b)

Fix lint errors

Fix lint

* Fix a bug in cycle detection for inputs-only ops in pointwise fusion

* Add comments to simple_partition_pass.h file
ptrendx authored and apeforest committed Nov 6, 2019
1 parent 0cbee04 commit 51c2065
Showing 20 changed files with 3,862 additions and 216 deletions.
25 changes: 18 additions & 7 deletions docs/static_site/src/pages/api/faq/env_var.md
@@ -200,12 +200,12 @@ The following environments can be used to profile the application without changi

* MXNET_PROFILER_AUTOSTART
- Values: 0(false) or 1(true) ```(default=0)```
- Set to 1, MXNet starts the profiler automatically. The profiling result is stored into profile.json in the working directory.

* MXNET_PROFILER_MODE
- Values: 0(false) or 1(true) ```(default=0)```
- If set to '0', profiler records the events of the symbolic operators.
- If set to '1', profiler records the events of all operators.

## Interface between Python and the C API

@@ -241,14 +241,14 @@ If ctypes is used, it must be `mxnet._ctypes.ndarray.NDArrayBase`.

* MXNET_CUDA_ALLOW_TENSOR_CORE
- 0(false) or 1(true) ```(default=1)```
- If set to '0', disallows Tensor Core use in CUDA ops.
- If set to '1', allows Tensor Core use in CUDA ops.
- This variable can only be set once in a session.

* MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION
- 0(false) or 1(true) ```(default=0)```
- If set to '0', disallows implicit type conversions to Float16 to use Tensor Cores
- If set to '1', allows CUDA ops like RNN and Convolution to use TensorCores even with Float32 input data by using implicit type casting to Float16. Only has an effect if `MXNET_CUDA_ALLOW_TENSOR_CORE` is `1`.

* MXNET_CUDA_LIB_CHECKING
- 0(false) or 1(true) ```(default=1)```
@@ -328,6 +328,17 @@ If ctypes is used, it must be `mxnet._ctypes.ndarray.NDArrayBase`.
with float32.
- Model accuracies do not necessarily improve with this environment variable turned on.

* MXNET_USE_FUSION
- Values: 0(false) or 1(true) ```(default=1)```
- If this variable is set, MXNet will try fusing some of the operations (pointwise operations only for now).
- It works in Symbolic execution as well as in Gluon models hybridized with ```static_alloc=True``` option.
- Only applies to MXNet that has been compiled with CUDA (```pip install mxnet-cuXX``` or built from source with ```USE_CUDA=1```) and running on GPU.

* MXNET_FUSION_VERBOSE
- Values: 0(false) or 1(true) ```(default=0)```
- Only applies to MXNet that has been compiled with CUDA and when ```MXNET_USE_FUSION``` option is enabled.
- If this variable is set, MXNet will print the code for fused operators that it generated.

Settings for Minimum Memory Usage
---------------------------------
- Make sure ```min(MXNET_EXEC_NUM_TEMP, MXNET_GPU_WORKER_NTHREADS) = 1```
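As a quick illustration of the two new switches, here is a minimal sketch (not part of this diff) of how they are meant to behave, assuming both flags are read with `dmlc::GetEnv` the same way the executor change below reads `MXNET_USE_FUSION`; the function names are hypothetical:

```cpp
#include <iostream>
#include <string>

#include <dmlc/parameter.h>  // assumed location of dmlc::GetEnv

// Fusion is on by default; MXNET_USE_FUSION=0 turns it off.
bool PointwiseFusionEnabled() {
  return dmlc::GetEnv("MXNET_USE_FUSION", true);
}

// With MXNET_FUSION_VERBOSE=1 the generated code for each fused operator
// would be printed; the default is to stay silent.
void MaybePrintGeneratedCode(const std::string &code) {
  if (dmlc::GetEnv("MXNET_FUSION_VERBOSE", false)) {
    std::cout << code << std::endl;
  }
}
```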
79 changes: 79 additions & 0 deletions src/common/exec_utils.cc
@@ -0,0 +1,79 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

/*!
* \file exec_utils.cc
* \brief Implementation of executor util functions.
*/

#include "exec_utils.h"
#include <unordered_set>
#include <unordered_map>
#include <string>

namespace mxnet {
namespace common {

void CopyGraph(nnvm::Graph *dst, const nnvm::Graph &src, bool copy_variables) {
  using nnvm::Node;
  using nnvm::NodePtr;
  using nnvm::NodeEntry;
  std::unordered_map<Node*, NodePtr> old_new;
  // use DFSVisit to copy all the nodes
  DFSVisit(src.outputs, [&old_new, copy_variables](const NodePtr& node) {
    NodePtr np;
    if (copy_variables || !node->is_variable()) {
      np = Node::Create();
      np->attrs = node->attrs;
    } else {
      np = node;
    }
    old_new[node.get()] = std::move(np);
  });
  // connect nodes of new graph
  for (const auto &kv : old_new) {
    for (const NodeEntry& e : kv.first->inputs) {
      Node *ptr = e.node.get();
      kv.second->inputs.emplace_back(NodeEntry{old_new[ptr], e.index, e.version});
    }
    for (const NodePtr& p : kv.first->control_deps) {
      kv.second->control_deps.emplace_back(old_new[p.get()]);
    }
  }
  // set the head
  for (const NodeEntry &e : src.outputs) {
    (*dst).outputs.emplace_back(NodeEntry{old_new[e.node.get()], e.index, e.version});
  }
}

bool CheckForInputNameDuplicates(const nnvm::IndexedGraph &idx) {
  std::unordered_set<std::string> names;
  for (const auto& nid : idx.input_nodes()) {
    const std::string &name = idx[nid].source->attrs.name;
    if (names.count(name)) {
      LOG(WARNING) << "Variable name " << name << " is used more than once!";
      return false;
    }
    names.insert(name);
  }
  return true;
}

} // namespace common
} // namespace mxnet
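For orientation, a hedged sketch of how these two helpers are intended to be used together, mirroring the `GraphExecutor::InitGraph` change later in this commit; the wrapper names `BackupBeforeFusion` and `FusionIsSafe` are illustrative only:

```cpp
#include <nnvm/graph.h>

#include "exec_utils.h"  // CopyGraph, CheckForInputNameDuplicates (added by this commit)

namespace {

// Take a structural copy of the graph so an optimization pass can be rolled
// back later; Variable nodes are shared, not copied, matching the
// copy_variables=false call in GraphExecutor::InitGraph.
nnvm::Graph BackupBeforeFusion(const nnvm::Graph &g) {
  nnvm::Graph backup;
  mxnet::common::CopyGraph(&backup, g, /* copy_variables */ false);
  return backup;
}

// Fusion is only attempted when every graph input has a distinct name,
// because the executor later matches inputs by name when verifying that
// fusion did not alter their topological order.
bool FusionIsSafe(const nnvm::Graph &g) {
  return mxnet::common::CheckForInputNameDuplicates(g.indexed_graph());
}

}  // namespace
```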
19 changes: 19 additions & 0 deletions src/common/exec_utils.h
@@ -621,6 +621,25 @@ inline nnvm::Graph AssignContext(nnvm::Graph g,
return g;
}

/*!
* \brief Copy the graph, optionally leaving original Variable nodes.
*
* \param dst destination graph
* \param src source graph being copied
 * \param copy_variables whether to copy or reuse Variable nodes from the
* source graph
*/
void CopyGraph(nnvm::Graph *dst, const nnvm::Graph &src, bool copy_variables);

/*!
* \brief Check whether graph contains any duplicated names in its inputs.
*
* \param idx Indexed graph being checked
*
* \return true if there are no duplicates, false otherwise
*/
bool CheckForInputNameDuplicates(const nnvm::IndexedGraph &idx);

} // namespace common
} // namespace mxnet
#endif // MXNET_COMMON_EXEC_UTILS_H_
42 changes: 42 additions & 0 deletions src/executor/exec_pass.h
@@ -34,10 +34,34 @@
#include <vector>
#include <memory>
#include <string>
#include <utility>
#include <tuple>

namespace mxnet {
namespace exec {

template <typename Attr>
using FAccessSubgraphAttr = std::function<std::tuple<const nnvm::NodePtr,
                                                     std::vector<Attr>,
                                                     std::vector<Attr>>
                                          (const NodeAttrs& attrs)>;

using FAccessSubgraphShape = FAccessSubgraphAttr<mxnet::TShape>;
using FAccessSubgraphType = FAccessSubgraphAttr<int>;
using FAccessSubgraphStorageType = FAccessSubgraphAttr<int>;

template <typename Attr>
using FProvideSubgraphAttr = std::function<void (const NodeAttrs& attrs,
                                                 const std::vector<nnvm::NodePtr> &nodes,
                                                 const std::vector<std::vector<Attr>> &in_attrs,
                                                 const std::vector<std::vector<Attr>> &out_attrs)>;
using FProvideSubgraphShape = FProvideSubgraphAttr<mxnet::TShape>;
using FProvideSubgraphType = FProvideSubgraphAttr<int>;
using FProvideSubgraphStorageType = FProvideSubgraphAttr<int>;

using TIsFusion = bool;
using TIsFusionHelper = bool;

/*! \brief reuse graph definition */
using nnvm::Graph;

@@ -170,6 +194,24 @@ void AttachOpResources(const Graph& g,
*/
Graph DetectInplaceAddTo(Graph g);

/*!
* \brief Fuse pointwise operations in the forward pass.
*
* \param g input graph (needs to be entire graph, not just forward part)
*
* \return graph with fused pointwise operations in the forward pass
*/
Graph FusePointwiseForward(Graph&& g);

/*!
* \brief Fuse pointwise operations in the backward pass.
*
* \param g input graph (needs to be entire graph, not just forward part)
*
* \return graph with fused pointwise operations in the backward pass
*/
Graph FusePointwiseBackward(Graph&& g);

/*!
* \brief Infer shapes in the graph given the information.
* \param graph The input graph.
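A minimal sketch of how the two new passes are chained on the full graph (forward plus backward part), setting the `num_forward_outputs` graph attribute before each pass as `GraphExecutor::InitGraph` does in the diff below; the wrapper name `ApplyPointwiseFusion` is illustrative, not part of the commit:

```cpp
#include <memory>
#include <utility>

#include <nnvm/graph.h>

#include "exec_pass.h"  // FusePointwiseForward / FusePointwiseBackward (this commit)

// Illustrative only: run forward fusion, then backward fusion, on the entire
// graph. The caller is assumed to know the number of forward outputs.
nnvm::Graph ApplyPointwiseFusion(nnvm::Graph g, size_t num_forward_outputs) {
  g.attrs["num_forward_outputs"] = std::make_shared<nnvm::any>(num_forward_outputs);
  g = mxnet::exec::FusePointwiseForward(std::move(g));
  g.attrs["num_forward_outputs"] = std::make_shared<nnvm::any>(num_forward_outputs);
  g = mxnet::exec::FusePointwiseBackward(std::move(g));
  return g;
}
```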
48 changes: 44 additions & 4 deletions src/executor/graph_executor.cc
@@ -26,6 +26,7 @@
#include <nnvm/graph.h>
#include <nnvm/pass_functions.h>
#include <vector>
#include <set>
#include <algorithm>

#include "./exec_pass.h"
@@ -337,6 +338,7 @@ nnvm::Graph GraphExecutor::InitFullGraph(nnvm::Symbol symbol,
  if (!need_grad_) return g;
  for (size_t i = 0; i < g.outputs.size(); ++i) {
    NodeEntry ngrad(nnvm::Node::Create(), 0, 0);
    ngrad.node->attrs.name = "_head_grad_" + std::to_string(i);
    head_grad_entry_.emplace_back(AttrHint(ngrad, g.outputs[i]));
    head_grad_map_[ngrad.node.get()] = i;
  }
@@ -377,6 +379,7 @@ nnvm::Graph GraphExecutor::InitFullGraph(nnvm::Symbol symbol,
  for (const auto &e : g_grad.outputs) {
    g.outputs.push_back(e);
  }

  return g;
}

@@ -796,6 +799,7 @@ void GraphExecutor::Init(nnvm::Symbol symbol,
const nnvm::NodeEntryMap<NDArray>& feed_dict) {
nnvm::Graph g = InitGraph(symbol, default_ctx, ctx_map, in_arg_ctxes, arg_grad_ctxes,
aux_state_ctxes, grad_req_types);

// The following code of shape and dtype inferences and argument
// initialization is for simple_bind only. Regular bind operation
// should do this differently.
@@ -976,6 +980,7 @@ Executor* GraphExecutor::Reshape(const bool partial_shaping,
this);
return exec;
}

/*!
* \brief This function is triggered by both simple_bind
* and bind flows.
@@ -993,6 +998,41 @@ Graph GraphExecutor::InitGraph(nnvm::Symbol symbol,
// setup gradient
nnvm::Graph g = InitFullGraph(symbol, grad_req_types);

#if MXNET_USE_CUDA && !defined(_WIN32)
  if (default_ctx.dev_mask() == Context::kGPU && dmlc::GetEnv("MXNET_USE_FUSION", true)) {
    nnvm::Graph unoptimized_graph;
    common::CopyGraph(&unoptimized_graph, g, false);

    if (common::CheckForInputNameDuplicates(unoptimized_graph.indexed_graph())) {
      g.attrs["num_forward_outputs"] = std::make_shared<nnvm::any>(num_forward_outputs_);
      g = FusePointwiseForward(std::move(g));
      g.attrs["num_forward_outputs"] = std::make_shared<nnvm::any>(num_forward_outputs_);
      g = FusePointwiseBackward(std::move(g));
      // Check the topological order of inputs
      const auto &original_inputs = unoptimized_graph.indexed_graph().input_nodes();
      const auto &new_inputs = g.indexed_graph().input_nodes();
      if (original_inputs.size() != new_inputs.size()) {
        LOG(WARNING)
          << "Number of inputs after fusion does not match original number of inputs. "
          << "This is most probably a bug. Disabling fusion for this run.";
        g = unoptimized_graph;
      } else {
        for (size_t i = 0; i < new_inputs.size(); ++i) {
          if (unoptimized_graph.indexed_graph()[original_inputs[i]].source->attrs.name !=
              g.indexed_graph()[new_inputs[i]].source->attrs.name) {
            LOG(WARNING) << "Disabling fusion due to altered topological order of inputs.";
            g = unoptimized_graph;
            break;
          }
        }
      }
    } else {
      LOG(WARNING)
        << "Graph contains duplicate names for some of its inputs - fusion is NOT enabled!";
    }
  }
#endif  // MXNET_USE_CUDA

// create "device" and "context" attrs for the graph
g = AssignContext(g, default_ctx, ctx_map,
in_arg_ctxes,
@@ -1946,7 +1986,7 @@ Executor *Executor::SimpleBind(nnvm::Symbol symbol,
symbol = exec::BuildSubgraph(symbol, backend, arg_shape_map, arg_dtype_map, arg_stype_map,
default_ctx, group2ctx, &tmp_in_arg_ctxes, &tmp_arg_grad_ctxes,
&tmp_grad_req_types, &tmp_aux_state_ctxes, verbose);
exec->Init(symbol, default_ctx, group2ctx, tmp_in_arg_ctxes, tmp_arg_grad_ctxes,
exec->Init(symbol.Copy(), default_ctx, group2ctx, tmp_in_arg_ctxes, tmp_arg_grad_ctxes,
tmp_aux_state_ctxes, arg_shape_map, arg_dtype_map, arg_stype_map,
tmp_grad_req_types, shared_arg_names, &tmp_in_args, &tmp_arg_grads,
&tmp_aux_states, shared_buffer, shared_exec);
@@ -1985,7 +2025,7 @@ Executor *Executor::SimpleBind(nnvm::Symbol symbol,
}
if (!init) {
// init without subgraph
exec->Init(symbol, default_ctx, group2ctx, in_arg_ctxes, arg_grad_ctxes, aux_state_ctxes,
exec->Init(symbol.Copy(), default_ctx, group2ctx, in_arg_ctxes, arg_grad_ctxes, aux_state_ctxes,
arg_shape_map, arg_dtype_map, arg_stype_map, grad_req_types, shared_arg_names,
in_args, arg_grads, aux_states, shared_buffer, shared_exec);
}
Expand Down Expand Up @@ -2017,8 +2057,8 @@ Executor *Executor::Bind(nnvm::Symbol symbol,
verbose);
}
}
exec->Init(symbol, default_ctx, group2ctx, tmp_in_args, tmp_arg_grad_store, tmp_grad_req_type,
tmp_aux_states, reinterpret_cast<Executor*>(shared_exec));
exec->Init(symbol.Copy(), default_ctx, group2ctx, tmp_in_args, tmp_arg_grad_store,
tmp_grad_req_type, tmp_aux_states, reinterpret_cast<Executor*>(shared_exec));
return exec;
}
} // namespace mxnet