Pointwise fusion for GPU (#15167)
* Beginning of RTC of pointwise ops

* Code generation from the given JSON

* add initial simple_partition_pass and use it for pointwise fusion

* fix the fusion; use a symbol.Copy() at the beginning of the binding function and use the names of the input nodes in the CUDA code

* Fixes

* Adding support for attribute inference for backward nodes when fusing

* keep proper input ordering for fused Op

* instantiate the indexed_graph before starting the subgraph replacement, return a new graph to reset the indexed_graph

* Fuse backward

* fix ordering of subgraph node inputs using subgraph topological ordering instead of main graph topological ordering, add tvm.patch

* exclude forward node fusion during the fusion of the nodes in the backward graph

* Dealing with fused backward nodes inferattr

* use subgraph.indexed_graph() instead of the main graph's for _FusedOpHelper nodes' node_id; invert the control_deps loop to modify the topology of the subgraph before calling its indexed_graph(); check that all nodes of the first DFSVisit are actually in the subgraph

* Adding support for other reqs in codegen

* Fix

* Cleaning

* Change the TVM submodule

* More cleaning

* Making linter happy

* Do fusion only if default context is GPU

* Fixes for tests
Add powerscalar and rpowerscalar, fix return type of zero and one
Cleaning, fixing lint
Go back to proper TVM submodule

* Fix the TVM commit

* Fix lint

* Guard fusion with MXNET_USE_CUDA

* Fix

* Fix clang-tidy

* Add erf and erfinv backward

* Gluon support for fusion

* Cleaning

* Cleaning and allow shape/type change in FusedOp

* Fixing Gluon bugs

* Fixing after rebase

* Fixing race condition and guarding against races when using NVRTC

* Cleaning and renaming FusedOp to _FusedOp

* Going easy on Windows compiler

* Disable fusion on Windows for now

* Refactor InferAttr and InferShapeAttr

* Added slice and half2 support to FusedOp

* Fix lint errors

* Added multiple types support for vector loading/storing

* add slice fusion when it's at the beginning of subgraphs

* Removed constant ndim assumption in fused op

* Fix memory alignment issue in slice for FusedOp

* Fixes

* Fix lint errors

* Do not include cuda_fp16.h

* Refactor the fused op's op lists

* Make linter happy

* Changes from review

* Fixes after rebase

* Expand FusedOp support for slice

* Fix for fp16 _zeros and _ones

* Fix

* Moving aux functions to unnamed namespace and detail namespace -> fusion namespace

* Disabling fusion if it alters topological order of inputs

* Print code only when env variable is set

* Fix

* Fix lint and 2 tests that specify the same names for multiple inputs

* Fixes from review and disabling fusion of slice with non-default step

* Add amp_cast to fusion, fixes

* Add amp_multicast and its backward to the list of support ops

* Apply wording suggestions from code review

Co-Authored-By: Aaron Markham <[email protected]>

* Apply wording suggestions from code review

Co-Authored-By: Aaron Markham <[email protected]>

* Make clearer comment

* Adding punctuation and capitalization to \brief descriptions

* Fix

* Fix

* Add backward_cast to fusion

* Adding unittests for fusion. Fix for erfinv_grad

* Adding slice ops and add_n to tests

* Fixes from review

* Setting inplace option

* Fix lint

* Storing double in half

* Retrigger CI

* Slight relaxing of the relative tolerance in the test

* Move the env variable check to the end

* Fix a race condition between InferShape and scheduled Forward

* Fix flaky test_fusion test involving fp32 erfinv op.

* Fix from review

* Added broadcast_like and slice_like to fused op

* Minor fix and cleanup

* Added negative axis support in slice_axis, temporarily disabled fusion of slice_like and broadcast_like

* Added axes support to slice_like

* Added axis support to broadcast_like

* Add fast_load_slice function to fused op code

* Added runtime switch for choosing fast and slow slice kernel

* Fix lint and warning

* Going easy on Windows compiler (again)

* Fix slice_like

* Debug broadcast_like fusion

* Fix lint

* Fix lint

* Trigger CI

* Get rid of the initializer list

* Fix backward calls with different gradient type

* avoid a cycle when adding a node specific to the inputs of a subgraph for pointwise fusion

* Fix lint

* Add namespace to the fusion implementations

* Set launch bounds on the fused kernel

* Fix NumPy tests

* Test showcasing an issue fixed in PR #16553

* Cast scalars to FP32 and perform (a*1.0/b) instead of (a/b)

Fix lint errors

Fix lint

* Fix a bug in cycle detection for inputs-only ops in pointwise fusion

* Add comments to simple_partition_pass.h file
ptrendx authored and apeforest committed Nov 6, 2019
1 parent 0cbee04 commit 51c2065
Showing 20 changed files with 3,862 additions and 216 deletions.
25 changes: 18 additions & 7 deletions docs/static_site/src/pages/api/faq/env_var.md
@@ -200,12 +200,12 @@ The following environments can be used to profile the application without changi

* MXNET_PROFILER_AUTOSTART
- Values: 0(false) or 1(true) ```(default=0)```
- Set to 1, MXNet starts the profiler automatically. The profiling result is stored into profile.json in the working directory.

* MXNET_PROFILER_MODE
- Values: 0(false) or 1(true) ```(default=0)```
- If set to '0', profiler records the events of the symbolic operators.
- If set to '1', profiler records the events of all operators.

## Interface between Python and the C API

@@ -241,14 +241,14 @@ If ctypes is used, it must be `mxnet._ctypes.ndarray.NDArrayBase`.

* MXNET_CUDA_ALLOW_TENSOR_CORE
- 0(false) or 1(true) ```(default=1)```
- If set to '0', disallows Tensor Core use in CUDA ops.
- If set to '1', allows Tensor Core use in CUDA ops.
- This variable can only be set once in a session.

* MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION
- 0(false) or 1(true) ```(default=0)```
- If set to '0', disallows implicit type conversions to Float16 to use Tensor Cores
- If set to '1', allows CUDA ops like RNN and Convolution to use TensorCores even with Float32 input data by using implicit type casting to Float16. Only has an effect if `MXNET_CUDA_ALLOW_TENSOR_CORE` is `1`.

* MXNET_CUDA_LIB_CHECKING
- 0(false) or 1(true) ```(default=1)```
@@ -328,6 +328,17 @@ If ctypes is used, it must be `mxnet._ctypes.ndarray.NDArrayBase`.
with float32.
- Model accuracies do not necessarily improve with this environment variable turned on.

* MXNET_USE_FUSION
- Values: 0(false) or 1(true) ```(default=1)```
- If this variable is set, MXNet will try fusing some of the operations (pointwise operations only for now).
- It works in Symbolic execution as well as in Gluon models hybridized with ```static_alloc=True``` option.
- Only applies to MXNet that has been compiled with CUDA (```pip install mxnet-cuXX``` or built from source with ```USE_CUDA=1```) and running on GPU.

* MXNET_FUSION_VERBOSE
- Values: 0(false) or 1(true) ```(default=0)```
- Only applies to MXNet that has been compiled with CUDA and when ```MXNET_USE_FUSION``` option is enabled.
- If this variable is set, MXNet will print the code for fused operators that it generated.

Settings for Minimum Memory Usage
---------------------------------
- Make sure ```min(MXNET_EXEC_NUM_TEMP, MXNET_GPU_WORKER_NTHREADS) = 1```
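As a quick illustration of the two new switches, here is a minimal sketch (not part of this diff) of how they are meant to behave, assuming both flags are read with `dmlc::GetEnv` the same way the executor change below reads `MXNET_USE_FUSION`; the function names are hypothetical:

```cpp
#include <iostream>
#include <string>

#include <dmlc/parameter.h>  // assumed location of dmlc::GetEnv

// Fusion is on by default; MXNET_USE_FUSION=0 turns it off.
bool PointwiseFusionEnabled() {
  return dmlc::GetEnv("MXNET_USE_FUSION", true);
}

// With MXNET_FUSION_VERBOSE=1 the generated code for each fused operator
// would be printed; the default is to stay silent.
void MaybePrintGeneratedCode(const std::string &code) {
  if (dmlc::GetEnv("MXNET_FUSION_VERBOSE", false)) {
    std::cout << code << std::endl;
  }
}
```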
79 changes: 79 additions & 0 deletions src/common/exec_utils.cc
@@ -0,0 +1,79 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

/*!
* \file exec_utils.cc
* \brief Implementation of executor util functions.
*/

#include "exec_utils.h"
#include <unordered_set>
#include <unordered_map>
#include <string>

namespace mxnet {
namespace common {

void CopyGraph(nnvm::Graph *dst, const nnvm::Graph &src, bool copy_variables) {
  using nnvm::Node;
  using nnvm::NodePtr;
  using nnvm::NodeEntry;
  std::unordered_map<Node*, NodePtr> old_new;
  // use DFSVisit to copy all the nodes
  DFSVisit(src.outputs, [&old_new, copy_variables](const NodePtr& node) {
    NodePtr np;
    if (copy_variables || !node->is_variable()) {
      np = Node::Create();
      np->attrs = node->attrs;
    } else {
      np = node;
    }
    old_new[node.get()] = std::move(np);
  });
  // connect nodes of new graph
  for (const auto &kv : old_new) {
    for (const NodeEntry& e : kv.first->inputs) {
      Node *ptr = e.node.get();
      kv.second->inputs.emplace_back(NodeEntry{old_new[ptr], e.index, e.version});
    }
    for (const NodePtr& p : kv.first->control_deps) {
      kv.second->control_deps.emplace_back(old_new[p.get()]);
    }
  }
  // set the head
  for (const NodeEntry &e : src.outputs) {
    (*dst).outputs.emplace_back(NodeEntry{old_new[e.node.get()], e.index, e.version});
  }
}

bool CheckForInputNameDuplicates(const nnvm::IndexedGraph &idx) {
  std::unordered_set<std::string> names;
  for (const auto& nid : idx.input_nodes()) {
    const std::string &name = idx[nid].source->attrs.name;
    if (names.count(name)) {
      LOG(WARNING) << "Variable name " << name << " is used more than once!";
      return false;
    }
    names.insert(name);
  }
  return true;
}

} // namespace common
} // namespace mxnet
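For orientation, a hedged sketch of how these two helpers are intended to be used together, mirroring the `GraphExecutor::InitGraph` change later in this commit; the wrapper names `BackupBeforeFusion` and `FusionIsSafe` are illustrative only:

```cpp
#include <nnvm/graph.h>

#include "exec_utils.h"  // CopyGraph, CheckForInputNameDuplicates (added by this commit)

namespace {

// Take a structural copy of the graph so an optimization pass can be rolled
// back later; Variable nodes are shared, not copied, matching the
// copy_variables=false call in GraphExecutor::InitGraph.
nnvm::Graph BackupBeforeFusion(const nnvm::Graph &g) {
  nnvm::Graph backup;
  mxnet::common::CopyGraph(&backup, g, /* copy_variables */ false);
  return backup;
}

// Fusion is only attempted when every graph input has a distinct name,
// because the executor later matches inputs by name when verifying that
// fusion did not alter their topological order.
bool FusionIsSafe(const nnvm::Graph &g) {
  return mxnet::common::CheckForInputNameDuplicates(g.indexed_graph());
}

}  // namespace
```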
19 changes: 19 additions & 0 deletions src/common/exec_utils.h
@@ -621,6 +621,25 @@ inline nnvm::Graph AssignContext(nnvm::Graph g,
return g;
}

/*!
* \brief Copy the graph, optionally leaving original Variable nodes.
*
* \param dst destination graph
* \param src source graph being copied
 * \param copy_variables whether to copy or reuse Variable nodes from the
* source graph
*/
void CopyGraph(nnvm::Graph *dst, const nnvm::Graph &src, bool copy_variables);

/*!
* \brief Check whether graph contains any duplicated names in its inputs.
*
* \param idx Indexed graph being checked
*
* \return true if there are no duplicates, false otherwise
*/
bool CheckForInputNameDuplicates(const nnvm::IndexedGraph &idx);

} // namespace common
} // namespace mxnet
#endif // MXNET_COMMON_EXEC_UTILS_H_
42 changes: 42 additions & 0 deletions src/executor/exec_pass.h
@@ -34,10 +34,34 @@
#include <vector>
#include <memory>
#include <string>
#include <utility>
#include <tuple>

namespace mxnet {
namespace exec {

template <typename Attr>
using FAccessSubgraphAttr = std::function<std::tuple<const nnvm::NodePtr,
                                                     std::vector<Attr>,
                                                     std::vector<Attr>>
                                          (const NodeAttrs& attrs)>;

using FAccessSubgraphShape = FAccessSubgraphAttr<mxnet::TShape>;
using FAccessSubgraphType = FAccessSubgraphAttr<int>;
using FAccessSubgraphStorageType = FAccessSubgraphAttr<int>;

template <typename Attr>
using FProvideSubgraphAttr = std::function<void (const NodeAttrs& attrs,
                                                 const std::vector<nnvm::NodePtr> &nodes,
                                                 const std::vector<std::vector<Attr>> &in_attrs,
                                                 const std::vector<std::vector<Attr>> &out_attrs)>;
using FProvideSubgraphShape = FProvideSubgraphAttr<mxnet::TShape>;
using FProvideSubgraphType = FProvideSubgraphAttr<int>;
using FProvideSubgraphStorageType = FProvideSubgraphAttr<int>;

using TIsFusion = bool;
using TIsFusionHelper = bool;

/*! \brief reuse graph definition */
using nnvm::Graph;

@@ -170,6 +194,24 @@ void AttachOpResources(const Graph& g,
*/
Graph DetectInplaceAddTo(Graph g);

/*!
* \brief Fuse pointwise operations in the forward pass.
*
* \param g input graph (needs to be entire graph, not just forward part)
*
* \return graph with fused pointwise operations in the forward pass
*/
Graph FusePointwiseForward(Graph&& g);

/*!
* \brief Fuse pointwise operations in the backward pass.
*
* \param g input graph (needs to be entire graph, not just forward part)
*
* \return graph with fused pointwise operations in the backward pass
*/
Graph FusePointwiseBackward(Graph&& g);

/*!
* \brief Infer shapes in the graph given the information.
* \param graph The input graph.
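A minimal sketch of how the two new passes are chained on the full graph (forward plus backward part), setting the `num_forward_outputs` graph attribute before each pass as `GraphExecutor::InitGraph` does in the diff below; the wrapper name `ApplyPointwiseFusion` is illustrative, not part of the commit:

```cpp
#include <memory>
#include <utility>

#include <nnvm/graph.h>

#include "exec_pass.h"  // FusePointwiseForward / FusePointwiseBackward (this commit)

// Illustrative only: run forward fusion, then backward fusion, on the entire
// graph. The caller is assumed to know the number of forward outputs.
nnvm::Graph ApplyPointwiseFusion(nnvm::Graph g, size_t num_forward_outputs) {
  g.attrs["num_forward_outputs"] = std::make_shared<nnvm::any>(num_forward_outputs);
  g = mxnet::exec::FusePointwiseForward(std::move(g));
  g.attrs["num_forward_outputs"] = std::make_shared<nnvm::any>(num_forward_outputs);
  g = mxnet::exec::FusePointwiseBackward(std::move(g));
  return g;
}
```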
48 changes: 44 additions & 4 deletions src/executor/graph_executor.cc
@@ -26,6 +26,7 @@
#include <nnvm/graph.h>
#include <nnvm/pass_functions.h>
#include <vector>
#include <set>
#include <algorithm>

#include "./exec_pass.h"
@@ -337,6 +338,7 @@ nnvm::Graph GraphExecutor::InitFullGraph(nnvm::Symbol symbol,
  if (!need_grad_) return g;
  for (size_t i = 0; i < g.outputs.size(); ++i) {
    NodeEntry ngrad(nnvm::Node::Create(), 0, 0);
    ngrad.node->attrs.name = "_head_grad_" + std::to_string(i);
    head_grad_entry_.emplace_back(AttrHint(ngrad, g.outputs[i]));
    head_grad_map_[ngrad.node.get()] = i;
  }
@@ -377,6 +379,7 @@ nnvm::Graph GraphExecutor::InitFullGraph(nnvm::Symbol symbol,
  for (const auto &e : g_grad.outputs) {
    g.outputs.push_back(e);
  }

  return g;
}

@@ -796,6 +799,7 @@ void GraphExecutor::Init(nnvm::Symbol symbol,
const nnvm::NodeEntryMap<NDArray>& feed_dict) {
nnvm::Graph g = InitGraph(symbol, default_ctx, ctx_map, in_arg_ctxes, arg_grad_ctxes,
aux_state_ctxes, grad_req_types);

// The following code of shape and dtype inferences and argument
// initialization is for simple_bind only. Regular bind operation
// should do this differently.
@@ -976,6 +980,7 @@ Executor* GraphExecutor::Reshape(const bool partial_shaping,
this);
return exec;
}

/*!
* \brief This function is triggered by both simple_bind
* and bind flows.
@@ -993,6 +998,41 @@ Graph GraphExecutor::InitGraph(nnvm::Symbol symbol,
// setup gradient
nnvm::Graph g = InitFullGraph(symbol, grad_req_types);

#if MXNET_USE_CUDA && !defined(_WIN32)
  if (default_ctx.dev_mask() == Context::kGPU && dmlc::GetEnv("MXNET_USE_FUSION", true)) {
    nnvm::Graph unoptimized_graph;
    common::CopyGraph(&unoptimized_graph, g, false);

    if (common::CheckForInputNameDuplicates(unoptimized_graph.indexed_graph())) {
      g.attrs["num_forward_outputs"] = std::make_shared<nnvm::any>(num_forward_outputs_);
      g = FusePointwiseForward(std::move(g));
      g.attrs["num_forward_outputs"] = std::make_shared<nnvm::any>(num_forward_outputs_);
      g = FusePointwiseBackward(std::move(g));
      // Check the topological order of inputs
      const auto &original_inputs = unoptimized_graph.indexed_graph().input_nodes();
      const auto &new_inputs = g.indexed_graph().input_nodes();
      if (original_inputs.size() != new_inputs.size()) {
        LOG(WARNING)
          << "Number of inputs after fusion does not match original number of inputs. "
          << "This is most probably a bug. Disabling fusion for this run.";
        g = unoptimized_graph;
      } else {
        for (size_t i = 0; i < new_inputs.size(); ++i) {
          if (unoptimized_graph.indexed_graph()[original_inputs[i]].source->attrs.name !=
              g.indexed_graph()[new_inputs[i]].source->attrs.name) {
            LOG(WARNING) << "Disabling fusion due to altered topological order of inputs.";
            g = unoptimized_graph;
            break;
          }
        }
      }
    } else {
      LOG(WARNING)
        << "Graph contains duplicate names for some of its inputs - fusion is NOT enabled!";
    }
  }
#endif  // MXNET_USE_CUDA

// create "device" and "context" attrs for the graph
g = AssignContext(g, default_ctx, ctx_map,
in_arg_ctxes,
@@ -1946,7 +1986,7 @@ Executor *Executor::SimpleBind(nnvm::Symbol symbol,
symbol = exec::BuildSubgraph(symbol, backend, arg_shape_map, arg_dtype_map, arg_stype_map,
default_ctx, group2ctx, &tmp_in_arg_ctxes, &tmp_arg_grad_ctxes,
&tmp_grad_req_types, &tmp_aux_state_ctxes, verbose);
exec->Init(symbol, default_ctx, group2ctx, tmp_in_arg_ctxes, tmp_arg_grad_ctxes,
exec->Init(symbol.Copy(), default_ctx, group2ctx, tmp_in_arg_ctxes, tmp_arg_grad_ctxes,
tmp_aux_state_ctxes, arg_shape_map, arg_dtype_map, arg_stype_map,
tmp_grad_req_types, shared_arg_names, &tmp_in_args, &tmp_arg_grads,
&tmp_aux_states, shared_buffer, shared_exec);
@@ -1985,7 +2025,7 @@ Executor *Executor::SimpleBind(nnvm::Symbol symbol,
}
if (!init) {
// init without subgraph
exec->Init(symbol, default_ctx, group2ctx, in_arg_ctxes, arg_grad_ctxes, aux_state_ctxes,
exec->Init(symbol.Copy(), default_ctx, group2ctx, in_arg_ctxes, arg_grad_ctxes, aux_state_ctxes,
arg_shape_map, arg_dtype_map, arg_stype_map, grad_req_types, shared_arg_names,
in_args, arg_grads, aux_states, shared_buffer, shared_exec);
}
Expand Down Expand Up @@ -2017,8 +2057,8 @@ Executor *Executor::Bind(nnvm::Symbol symbol,
verbose);
}
}
exec->Init(symbol, default_ctx, group2ctx, tmp_in_args, tmp_arg_grad_store, tmp_grad_req_type,
tmp_aux_states, reinterpret_cast<Executor*>(shared_exec));
exec->Init(symbol.Copy(), default_ctx, group2ctx, tmp_in_args, tmp_arg_grad_store,
tmp_grad_req_type, tmp_aux_states, reinterpret_cast<Executor*>(shared_exec));
return exec;
}
} // namespace mxnet