[NNC] enable fusion of conv with elementwise OP (pytorch#77157)
## Pitch
Enable Conv-Eltwise fusion in NNC.

## Description
This PR adds a `FuseConvWithEltwise` pass that fuses convolution with elementwise ops in TE subgraphs. The pass inserts prepack and packed-run ops for conv2d, enabling fusion of conv2d with elementwise ops. The fused packed-run ops are implemented as external calls in NNC.
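The correctness invariant behind the fusion is simple: the fused op must compute exactly `eltwise(conv(x))`, just without materializing the intermediate. A toy pure-Python 1-D sketch (illustrative only, not the NNC implementation):

```python
# Toy 1-D "conv + relu" illustrating the invariant conv-eltwise fusion
# must preserve: the fused kernel equals relu(conv(x)) computed in two
# passes, but applies the elementwise post-op while the conv output is
# still in registers/cache instead of re-reading it from memory.

def conv1d(x, w):
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]

def relu(v):
    return [max(0.0, e) for e in v]

def conv1d_relu_fused(x, w):
    # Elementwise post-op folded into the conv loop (one pass over memory).
    k = len(w)
    return [max(0.0, sum(x[i + j] * w[j] for j in range(k)))
            for i in range(len(x) - k + 1)]

x = [1.0, -2.0, 3.0, -4.0, 5.0]
w = [0.5, -1.0]
assert conv1d_relu_fused(x, w) == relu(conv1d(x, w))
```

In the actual PR the post-op is carried as an `ideep::attr_t` attribute on the packed conv, and oneDNN applies it inside the convolution primitive.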

## Code structure
Graph rewrite pass related code is placed in:
```
torch/csrc/jit/passes/mkldnn_rewrite.h
torch/csrc/jit/passes/mkldnn_rewrite.cpp
```

NNC integration of fused conv-eltwise OP via external call is located in:
```
torch/csrc/jit/tensorexpr/kernel.cpp

torch/csrc/jit/tensorexpr/operators/conv2d.h
torch/csrc/jit/tensorexpr/operators/conv2d.cpp

torch/csrc/jit/tensorexpr/lowerings.cpp
torch/csrc/jit/tensorexpr/external_functions.cpp
```

Fused prepack OP context is in:
```
aten/src/ATen/native/mkldnn/Common.h
aten/src/ATen/native/mkldnn/RegisterMkldnnOpContextClass.cpp
aten/src/ATen/native/mkldnn/OpContext.h
aten/src/ATen/native/mkldnn/OpContext.cpp
```

Fused OP implementation is done in:
```
aten/src/ATen/native/mkldnn/ConvPrepack.h
aten/src/ATen/native/mkldnn/ConvPrepack.cpp
```

## OP benchmark for conv-relu
The performance below is measured on top of two PRs that add NHWC support: pytorch#76948 and pytorch#78238.

- Measured on Cascade Lake 8280
- Jemalloc enabled
- batch_size = 1
- Channels Last format

### Single thread:
shape | time (us), no fusion | time (us), fusion | Gain
-- | -- | -- | --
kernel=3, N=1, iC=64, H=56, W=56, oC=64, stride=1, pad=1, dilates=1, g=1 | 1706.22 | 1371.97 | 19.59%
kernel=1, N=1, iC=256, H=56, W=56, oC=512, stride=2, pad=0, dilates=1, g=1 | 2499.28 | 1571.52 | 37.12%
kernel=3, N=1, iC=256, H=56, W=56, oC=256, stride=1, pad=1, dilates=1, g=32 | 4169.52 | 2738.53 | 34.32%
kernel=3, N=1, iC=512, H=56, W=56, oC=512, stride=2, pad=1, dilates=1, g=32 | 3998.77 | 3085.85 | 22.83%
kernel=1, N=1, iC=64, H=56, W=56, oC=64, stride=1, pad=0, dilates=1, g=1 | 673.73 | 430.81 | 36.06%
kernel=1, N=1, iC=256, H=56, W=56, oC=64, stride=1, pad=0, dilates=1, g=1 | 1101.87 | 801.07 | 27.30%
kernel=1, N=1, iC=256, H=56, W=56, oC=256, stride=1, pad=0, dilates=1, g=1 | 4692.91 | 3116.13 | 33.60%
kernel=1, N=1, iC=512, H=28, W=28, oC=512, stride=1, pad=0, dilates=1, g=1 | 3310.64 | 2503.39 | 24.38%

### 4 threads:
shape | time (us), no fusion | time (us), fusion | Gain
-- | -- | -- | --
kernel=3, N=1, iC=64, H=56, W=56, oC=64, stride=1, pad=1, dilates=1, g=1 | 360.07 | 321.21 | 10.79%
kernel=1, N=1, iC=256, H=56, W=56, oC=512, stride=2, pad=0, dilates=1, g=1 | 391.49 | 323.17 | 17.45%
kernel=3, N=1, iC=256, H=56, W=56, oC=256, stride=1, pad=1, dilates=1, g=32 | 536.4 | 465.97 | 13.13%
kernel=3, N=1, iC=512, H=56, W=56, oC=512, stride=2, pad=1, dilates=1, g=32 | 674.98 | 616.32 | 8.69%
kernel=1, N=1, iC=64, H=56, W=56, oC=64, stride=1, pad=0, dilates=1, g=1 | 160.97 | 70.05 | 56.48%
kernel=1, N=1, iC=256, H=56, W=56, oC=64, stride=1, pad=0, dilates=1, g=1 | 215.81 | 182.6 | 15.39%
kernel=1, N=1, iC=256, H=56, W=56, oC=256, stride=1, pad=0, dilates=1, g=1 | 658.45 | 576.97 | 12.37%
kernel=1, N=1, iC=512, H=28, W=28, oC=512, stride=1, pad=0, dilates=1, g=1 | 702.18 | 566.39 | 19.34%

### 1 socket (28 cores):
shape | time (us), no fusion | time (us), fusion | Gain
-- | -- | -- | --
kernel=3, N=1, iC=64, H=56, W=56, oC=64, stride=1, pad=1, dilates=1, g=1 | 149.92 | 103.78 | 30.78%
kernel=1, N=1, iC=256, H=56, W=56, oC=512, stride=2, pad=0, dilates=1, g=1 | 192.76 | 110.87 | 42.48%
kernel=3, N=1, iC=256, H=56, W=56, oC=256, stride=1, pad=1, dilates=1, g=32 | 160.67 | 127.24 | 20.81%
kernel=3, N=1, iC=512, H=56, W=56, oC=512, stride=2, pad=1, dilates=1, g=32 | 212.45 | 180.55 | 15.02%
kernel=1, N=1, iC=64, H=56, W=56, oC=64, stride=1, pad=0, dilates=1, g=1 | 114.57 | 50.58 | 55.85%
kernel=1, N=1, iC=256, H=56, W=56, oC=64, stride=1, pad=0, dilates=1, g=1 | 198.64 | 70.6 | 64.46%
kernel=1, N=1, iC=256, H=56, W=56, oC=256, stride=1, pad=0, dilates=1, g=1 | 281.35 | 155.8 | 44.62%
kernel=1, N=1, iC=512, H=28, W=28, oC=512, stride=1, pad=0, dilates=1, g=1 | 262.15 | 162.94 | 37.84%

## UT
```
test/test_mkldnn_fusion.py
```
Pull Request resolved: pytorch#77157
Approved by: https://github.com/ZolotukhinM
chunyuan-w authored and pytorchmergebot committed Aug 10, 2022
1 parent 1c83ec8 commit 693a8dd
Showing 17 changed files with 1,149 additions and 5 deletions.
46 changes: 46 additions & 0 deletions aten/src/ATen/native/mkldnn/Common.h
@@ -0,0 +1,46 @@
#pragma once

#include <ATen/ATen.h>
#include <ATen/Config.h>

#if AT_MKLDNN_ENABLED()

#include <ideep/tensor.hpp>

namespace at {
namespace native {
namespace mkldnn {

struct ContextConv final {
ideep::tensor weight_packed_;
c10::optional<at::Tensor> at_bias_;
std::vector<int64_t> padding_;
std::vector<int64_t> stride_;
std::vector<int64_t> dilation_;
int64_t groups_;
ideep::attr_t attr_;

ContextConv() = delete;

ContextConv(
ideep::tensor&& weight_packed,
c10::optional<at::Tensor> at_bias,
std::vector<int64_t> padding,
std::vector<int64_t> stride,
std::vector<int64_t> dilation,
int64_t groups,
ideep::attr_t attr)
: weight_packed_(std::move(weight_packed)),
at_bias_(std::move(at_bias)),
padding_(padding),
stride_(stride),
dilation_(dilation),
groups_(groups),
attr_(attr) {}
};

} // namespace mkldnn
} // namespace native
} // namespace at

#endif // AT_MKLDNN_ENABLED()
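`ContextConv` is the prepack contract: everything the fused external call needs at run time is captured once, so weight packing is paid at prepack rather than on every forward. A rough Python analogue (illustrative only; `ideep::tensor` and `ideep::attr_t` have no real Python counterparts here):

```python
# Sketch of the ContextConv prepack contract: pack once, run many times.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ContextConvSketch:
    weight_packed: list        # stands in for the packed ideep::tensor
    bias: Optional[list]       # optional bias tensor
    padding: List[int]
    stride: List[int]
    dilation: List[int]
    groups: int
    attr: str                  # stands in for ideep::attr_t, e.g. "relu" post-op

# Built once at prepack time; every subsequent run() reuses it unchanged.
ctx = ContextConvSketch([1.0, 2.0], None, [1, 1], [1, 1], [1, 1], 1, "relu")
assert ctx.attr == "relu" and ctx.bias is None
```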
289 changes: 289 additions & 0 deletions aten/src/ATen/native/mkldnn/ConvPrepack.cpp
@@ -0,0 +1,289 @@
#include <vector>

#include <ATen/native/ConvUtils.h>
#include <ATen/native/mkldnn/Common.h>
#include <ATen/native/mkldnn/ConvPrepack.h>
#include <ATen/native/mkldnn/MKLDNNCommon.h>
#include <ATen/native/mkldnn/OpContext.h>
#include <ATen/native/utils/Factory.h>
#include <ATen/native/utils/ParamUtils.h>
#include <c10/util/irange.h>

#if AT_MKLDNN_ENABLED()

namespace at {
namespace native {
namespace mkldnn {
namespace internal {
namespace convolution {

c10::intrusive_ptr<mkldnn::ConvOpContext> createConvPrePackOpContext(
Tensor weight,
c10::optional<Tensor> bias,
std::vector<int64_t> stride,
std::vector<int64_t> padding,
std::vector<int64_t> dilation,
int64_t groups,
std::vector<int64_t> input_size,
std::string attr) {
auto it = fusion_attr_map.find(attr);
TORCH_CHECK(it != fusion_attr_map.end(), "Fusion behavior undefined.");
ideep::attr_t op_attr = it->second;

return mkldnn::MkldnnConvOpContext::create_context(
std::move(weight),
std::move(bias),
std::move(padding),
std::move(stride),
std::move(dilation),
groups,
std::move(input_size),
op_attr);
}

ContextConv create(
const Tensor& weight,
const c10::optional<Tensor>& bias,
const IntArrayRef padding,
const IntArrayRef stride,
const IntArrayRef dilation,
const int64_t groups,
const IntArrayRef input_size,
const ideep::attr_t& attr) {
auto k = weight.ndimension();
int64_t dim = k - 2;
const auto padding_expanded = expand_param_if_needed(padding, "padding", dim);
const auto stride_expanded = expand_param_if_needed(stride, "stride", dim);
const auto dilation_expanded =
expand_param_if_needed(dilation, "dilation", dim);
const auto input_size_expanded =
expand_param_if_needed(input_size, "input_size", k);

c10::impl::ExcludeDispatchKeyGuard edkg(c10::autograd_dispatch_keyset);
auto w = itensor_view_from_dense(weight);
// TODO: what if input is nhwc but w is nchw
bool is_channels_last =
weight.suggest_memory_format() == at::MemoryFormat::ChannelsLast;
ideep::tensor::desc expected_weight_desc =
ideep::convolution_forward::expected_weights_desc(
w.get_dims(),
w.get_data_type(),
{stride_expanded.begin(), stride_expanded.end()},
{padding_expanded.begin(), padding_expanded.end()},
{padding_expanded.begin(), padding_expanded.end()},
{dilation_expanded.begin(), dilation_expanded.end()},
groups,
ideep::algorithm::convolution_direct,
ideep::prop_kind::forward,
/*x_dtype*/ w.get_data_type(),
{input_size_expanded.begin(), input_size_expanded.end()},
attr,
is_channels_last);

ideep::tensor packed_weight;
packed_weight.init(expected_weight_desc);
packed_weight.feed_from(w);

return ContextConv{
std::move(packed_weight),
bias.has_value() ? c10::make_optional(*bias) : c10::nullopt,
{padding_expanded.begin(), padding_expanded.end()},
{stride_expanded.begin(), stride_expanded.end()},
{dilation_expanded.begin(), dilation_expanded.end()},
groups,
std::move(attr)};
}

void _mkldnn_convolution_out(
const ideep::tensor& x,
ideep::tensor& y,
const ideep::tensor& w,
const c10::optional<ideep::tensor>& b,
IntArrayRef padding,
IntArrayRef stride,
IntArrayRef dilation,
IntArrayRef output_sizes,
int64_t groups,
const ideep::attr_t& attr = ideep::attr_t()) {
if (b.has_value()) {
ideep::convolution_forward::compute_v2(
x,
w,
b.value(),
{output_sizes.cbegin(), output_sizes.cend()},
y,
{stride.begin(), stride.end()},
{dilation.begin(), dilation.end()},
{padding.begin(), padding.end()},
{padding.begin(), padding.end()},
groups,
ideep::scale_t(),
ideep::scale_t(),
ideep::scale_t(),
ideep::zero_point_t(),
ideep::zero_point_t(),
attr);
} else {
ideep::convolution_forward::compute_v2(
x,
w,
{output_sizes.cbegin(), output_sizes.cend()},
y,
{stride.begin(), stride.end()},
{dilation.begin(), dilation.end()},
{padding.begin(), padding.end()},
{padding.begin(), padding.end()},
groups,
ideep::scale_t(),
ideep::scale_t(),
ideep::scale_t(),
ideep::zero_point_t(),
ideep::zero_point_t(),
attr);
}
}

void mkldnn_convolution_out(
const Tensor& input,
ideep::tensor& mkldnn_output,
const ideep::tensor& mkldnn_weight,
const c10::optional<Tensor>& bias_opt,
IntArrayRef padding,
IntArrayRef stride,
IntArrayRef dilation,
IntArrayRef output_sizes,
int64_t groups,
const ideep::attr_t& attr = ideep::attr_t()) {
c10::MaybeOwned<Tensor> bias_maybe_owned =
at::borrow_from_optional_tensor(bias_opt);
const Tensor& bias = *bias_maybe_owned;

c10::impl::ExcludeDispatchKeyGuard edkg(c10::autograd_dispatch_keyset);
const ideep::tensor mkldnn_input = itensor_from_tensor(input);
c10::optional<ideep::tensor> mkldnn_bias{c10::nullopt};
if (bias.defined()) {
mkldnn_bias = itensor_from_tensor(bias);
}

_mkldnn_convolution_out(
mkldnn_input,
mkldnn_output,
mkldnn_weight,
mkldnn_bias,
padding,
stride,
dilation,
output_sizes,
groups,
attr);
}

std::vector<int64_t> get_output_sizes(
ContextConv& context,
const Tensor& input) {
const ideep::tensor& mkldnn_weight = context.weight_packed_;
IntArrayRef padding = context.padding_;
IntArrayRef stride = context.stride_;
IntArrayRef dilation = context.dilation_;

auto kernel_size = mkldnn_weight.get_dims();

std::vector<int64_t> input_size = input.sizes().vec();
return conv_output_size(input_size, kernel_size, padding, stride, dilation);
}

Tensor run(ContextConv& context, const Tensor& input) {
std::vector<int64_t> output_sizes = get_output_sizes(context, input);
auto output = at::empty(
output_sizes,
input.options().memory_format(input.suggest_memory_format()));

bool is_channels_last =
input.suggest_memory_format() == at::MemoryFormat::ChannelsLast;
ideep::tensor y;

c10::impl::ExcludeDispatchKeyGuard edkg(c10::autograd_dispatch_keyset);
ideep::tensor mkldnn_output = itensor_from_tensor(output);

if (is_channels_last) {
mkldnn_convolution_out(
input,
mkldnn_output,
context.weight_packed_,
context.at_bias_,
context.padding_,
context.stride_,
context.dilation_,
output_sizes,
context.groups_,
context.attr_);
} else {
mkldnn_convolution_out(
input,
y,
context.weight_packed_,
context.at_bias_,
context.padding_,
context.stride_,
context.dilation_,
output_sizes,
context.groups_,
context.attr_);
mkldnn_output.feed_from(y);
}
return output;
}

void run(ContextConv& context, const Tensor& input, void* output) {
std::vector<int64_t> output_sizes = get_output_sizes(context, input);

bool is_channels_last =
input.suggest_memory_format() == at::MemoryFormat::ChannelsLast;
ideep::tensor y;

ideep::tag o_tag = is_channels_last ? ideep::tag::nhwc : ideep::tag::nchw;
ideep::tensor::desc o_desc = {
output_sizes, get_mkldnn_dtype(input.scalar_type()), o_tag};
ideep::tensor mkldnn_output = {o_desc, output};

if (is_channels_last) {
mkldnn_convolution_out(
input,
mkldnn_output,
context.weight_packed_,
context.at_bias_,
context.padding_,
context.stride_,
context.dilation_,
output_sizes,
context.groups_,
context.attr_);
} else {
mkldnn_convolution_out(
input,
y,
context.weight_packed_,
context.at_bias_,
context.padding_,
context.stride_,
context.dilation_,
output_sizes,
context.groups_,
context.attr_);
mkldnn_output.feed_from(y);
}
}

Tensor conv_run(
const Tensor& input,
const c10::intrusive_ptr<mkldnn::ConvOpContext>& op_context) {
return op_context->run(input);
}

} // namespace convolution
} // namespace internal
} // namespace mkldnn
} // namespace native
} // namespace at

#endif // AT_MKLDNN_ENABLED()
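`get_output_sizes` above delegates to ATen's `conv_output_size`, which applies the standard convolution shape formula per spatial dimension. A minimal sketch of that formula (ATen's helper also carries the batch and channel dims):

```python
# Output spatial size of a convolution:
#   out = floor((in + 2*pad - dilation*(kernel - 1) - 1) / stride) + 1
def conv_out_dim(in_size, kernel, pad, stride, dilation):
    return (in_size + 2 * pad - dilation * (kernel - 1) - 1) // stride + 1

# Matches the stride-2 rows in the benchmark tables: H=56 -> H=28.
assert conv_out_dim(56, 3, 1, 2, 1) == 28
# And the stride-1, pad-1, kernel-3 rows keep H=56.
assert conv_out_dim(56, 3, 1, 1, 1) == 56
```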
49 changes: 49 additions & 0 deletions aten/src/ATen/native/mkldnn/ConvPrepack.h
@@ -0,0 +1,49 @@
#pragma once

#include <ATen/Tensor.h>
#include <ATen/native/mkldnn/Common.h>
#include <ATen/native/mkldnn/OpContext.h>

#if AT_MKLDNN_ENABLED()

namespace at {
namespace native {
namespace mkldnn {
namespace internal {
namespace convolution {

c10::intrusive_ptr<mkldnn::ConvOpContext> createConvPrePackOpContext(
Tensor weight,
c10::optional<Tensor> bias,
std::vector<int64_t> stride,
std::vector<int64_t> padding,
std::vector<int64_t> dilation,
int64_t groups,
std::vector<int64_t> input_size,
std::string attr);

Tensor conv_run(
const Tensor& input,
const c10::intrusive_ptr<mkldnn::ConvOpContext>& op_context);

ContextConv create(
const Tensor& weight,
const c10::optional<Tensor>& bias,
const IntArrayRef padding,
const IntArrayRef stride,
const IntArrayRef dilation,
const int64_t groups,
const IntArrayRef input_size,
const ideep::attr_t& attr);

Tensor run(ContextConv& context, const Tensor& input);

void run(ContextConv& context, const Tensor& input, void* output);

} // namespace convolution
} // namespace internal
} // namespace mkldnn
} // namespace native
} // namespace at

#endif // AT_MKLDNN_ENABLED()