[NNC] enable fusion of conv with elementwise OP (pytorch#77157)
## Pitch
Enable Conv-Eltwise fusion in NNC.

## Description
This PR adds a `FuseConvWithEltwise` pass that fuses convolution with elementwise ops in TE subgraphs. The pass inserts prepack and packed-run ops for conv2d, enabling fusion of conv2d with elementwise ops. The fused packed-run ops are implemented as external calls in NNC.
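The correctness invariant behind the fusion is simple: the fused op must compute exactly `eltwise(conv(x))`, just without materializing the intermediate. A toy pure-Python 1-D sketch (illustrative only, not the NNC implementation):

```python
# Toy 1-D "conv + relu" illustrating the invariant conv-eltwise fusion
# must preserve: the fused kernel equals relu(conv(x)) computed in two
# passes, but applies the elementwise post-op while the conv output is
# still in registers/cache instead of re-reading it from memory.

def conv1d(x, w):
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]

def relu(v):
    return [max(0.0, e) for e in v]

def conv1d_relu_fused(x, w):
    # Elementwise post-op folded into the conv loop (one pass over memory).
    k = len(w)
    return [max(0.0, sum(x[i + j] * w[j] for j in range(k)))
            for i in range(len(x) - k + 1)]

x = [1.0, -2.0, 3.0, -4.0, 5.0]
w = [0.5, -1.0]
assert conv1d_relu_fused(x, w) == relu(conv1d(x, w))
```

In the actual PR the post-op is carried as an `ideep::attr_t` attribute on the packed conv, and oneDNN applies it inside the convolution primitive.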

## Code structure
Graph rewrite pass related code is placed in:
```
torch/csrc/jit/passes/mkldnn_rewrite.h
torch/csrc/jit/passes/mkldnn_rewrite.cpp
```

NNC integration of fused conv-eltwise OP via external call is located in:
```
torch/csrc/jit/tensorexpr/kernel.cpp

torch/csrc/jit/tensorexpr/operators/conv2d.h
torch/csrc/jit/tensorexpr/operators/conv2d.cpp

torch/csrc/jit/tensorexpr/lowerings.cpp
torch/csrc/jit/tensorexpr/external_functions.cpp
```

Fused prepack OP context is in:
```
aten/src/ATen/native/mkldnn/Common.h
aten/src/ATen/native/mkldnn/RegisterMkldnnOpContextClass.cpp
aten/src/ATen/native/mkldnn/OpContext.h
aten/src/ATen/native/mkldnn/OpContext.cpp
```

Fused OP implementation is done in:
```
aten/src/ATen/native/mkldnn/ConvPrepack.h
aten/src/ATen/native/mkldnn/ConvPrepack.cpp
```

## OP benchmark for conv-relu
The performance below is measured on top of two PRs that add NHWC support: pytorch#76948 and pytorch#78238.

- Measured on Cascade Lake 8280
- Jemalloc enabled
- batch_size = 1
- Channels Last format

### Single thread:
shape | time (us), no fusion | time (us), fusion | Gain
-- | -- | -- | --
kernel=3, N=1, iC=64, H=56, W=56, oC=64, stride=1, pad=1, dilates=1, g=1 | 1706.22 | 1371.97 | 19.59%
kernel=1, N=1, iC=256, H=56, W=56, oC=512, stride=2, pad=0, dilates=1, g=1 | 2499.28 | 1571.52 | 37.12%
kernel=3, N=1, iC=256, H=56, W=56, oC=256, stride=1, pad=1, dilates=1, g=32 | 4169.52 | 2738.53 | 34.32%
kernel=3, N=1, iC=512, H=56, W=56, oC=512, stride=2, pad=1, dilates=1, g=32 | 3998.77 | 3085.85 | 22.83%
kernel=1, N=1, iC=64, H=56, W=56, oC=64, stride=1, pad=0, dilates=1, g=1 | 673.73 | 430.81 | 36.06%
kernel=1, N=1, iC=256, H=56, W=56, oC=64, stride=1, pad=0, dilates=1, g=1 | 1101.87 | 801.07 | 27.30%
kernel=1, N=1, iC=256, H=56, W=56, oC=256, stride=1, pad=0, dilates=1, g=1 | 4692.91 | 3116.13 | 33.60%
kernel=1, N=1, iC=512, H=28, W=28, oC=512, stride=1, pad=0, dilates=1, g=1 | 3310.64 | 2503.39 | 24.38%

### 4 threads:
shape | time (us), no fusion | time (us), fusion | Gain
-- | -- | -- | --
kernel=3, N=1, iC=64, H=56, W=56, oC=64, stride=1, pad=1, dilates=1, g=1 | 360.07 | 321.21 | 10.79%
kernel=1, N=1, iC=256, H=56, W=56, oC=512, stride=2, pad=0, dilates=1, g=1 | 391.49 | 323.17 | 17.45%
kernel=3, N=1, iC=256, H=56, W=56, oC=256, stride=1, pad=1, dilates=1, g=32 | 536.4 | 465.97 | 13.13%
kernel=3, N=1, iC=512, H=56, W=56, oC=512, stride=2, pad=1, dilates=1, g=32 | 674.98 | 616.32 | 8.69%
kernel=1, N=1, iC=64, H=56, W=56, oC=64, stride=1, pad=0, dilates=1, g=1 | 160.97 | 70.05 | 56.48%
kernel=1, N=1, iC=256, H=56, W=56, oC=64, stride=1, pad=0, dilates=1, g=1 | 215.81 | 182.6 | 15.39%
kernel=1, N=1, iC=256, H=56, W=56, oC=256, stride=1, pad=0, dilates=1, g=1 | 658.45 | 576.97 | 12.37%
kernel=1, N=1, iC=512, H=28, W=28, oC=512, stride=1, pad=0, dilates=1, g=1 | 702.18 | 566.39 | 19.34%

### 1 socket (28 cores):
shape | time (us), no fusion | time (us), fusion | Gain
-- | -- | -- | --
kernel=3, N=1, iC=64, H=56, W=56, oC=64, stride=1, pad=1, dilates=1, g=1 | 149.92 | 103.78 | 30.78%
kernel=1, N=1, iC=256, H=56, W=56, oC=512, stride=2, pad=0, dilates=1, g=1 | 192.76 | 110.87 | 42.48%
kernel=3, N=1, iC=256, H=56, W=56, oC=256, stride=1, pad=1, dilates=1, g=32 | 160.67 | 127.24 | 20.81%
kernel=3, N=1, iC=512, H=56, W=56, oC=512, stride=2, pad=1, dilates=1, g=32 | 212.45 | 180.55 | 15.02%
kernel=1, N=1, iC=64, H=56, W=56, oC=64, stride=1, pad=0, dilates=1, g=1 | 114.57 | 50.58 | 55.85%
kernel=1, N=1, iC=256, H=56, W=56, oC=64, stride=1, pad=0, dilates=1, g=1 | 198.64 | 70.6 | 64.46%
kernel=1, N=1, iC=256, H=56, W=56, oC=256, stride=1, pad=0, dilates=1, g=1 | 281.35 | 155.8 | 44.62%
kernel=1, N=1, iC=512, H=28, W=28, oC=512, stride=1, pad=0, dilates=1, g=1 | 262.15 | 162.94 | 37.84%

## UT
```
test/test_mkldnn_fusion.py
```
Pull Request resolved: pytorch#77157
Approved by: https://github.com/ZolotukhinM
chunyuan-w authored and pytorchmergebot committed Aug 10, 2022
1 parent 1c83ec8 commit 693a8dd
Showing 17 changed files with 1,149 additions and 5 deletions.
46 changes: 46 additions & 0 deletions aten/src/ATen/native/mkldnn/Common.h
@@ -0,0 +1,46 @@
#pragma once

#include <ATen/ATen.h>
#include <ATen/Config.h>

#if AT_MKLDNN_ENABLED()

#include <ideep/tensor.hpp>

namespace at {
namespace native {
namespace mkldnn {

struct ContextConv final {
ideep::tensor weight_packed_;
c10::optional<at::Tensor> at_bias_;
std::vector<int64_t> padding_;
std::vector<int64_t> stride_;
std::vector<int64_t> dilation_;
int64_t groups_;
ideep::attr_t attr_;

ContextConv() = delete;

ContextConv(
ideep::tensor&& weight_packed,
c10::optional<at::Tensor> at_bias,
std::vector<int64_t> padding,
std::vector<int64_t> stride,
std::vector<int64_t> dilation,
int64_t groups,
ideep::attr_t attr)
: weight_packed_(std::move(weight_packed)),
at_bias_(std::move(at_bias)),
padding_(padding),
stride_(stride),
dilation_(dilation),
groups_(groups),
attr_(attr) {}
};

} // namespace mkldnn
} // namespace native
} // namespace at

#endif // AT_MKLDNN_ENABLED()
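`ContextConv` is the prepack contract: everything the fused external call needs at run time is captured once, so weight packing is paid at prepack rather than on every forward. A rough Python analogue (illustrative only; `ideep::tensor` and `ideep::attr_t` have no real Python counterparts here):

```python
# Sketch of the ContextConv prepack contract: pack once, run many times.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ContextConvSketch:
    weight_packed: list        # stands in for the packed ideep::tensor
    bias: Optional[list]       # optional bias tensor
    padding: List[int]
    stride: List[int]
    dilation: List[int]
    groups: int
    attr: str                  # stands in for ideep::attr_t, e.g. "relu" post-op

# Built once at prepack time; every subsequent run() reuses it unchanged.
ctx = ContextConvSketch([1.0, 2.0], None, [1, 1], [1, 1], [1, 1], 1, "relu")
assert ctx.attr == "relu" and ctx.bias is None
```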
289 changes: 289 additions & 0 deletions aten/src/ATen/native/mkldnn/ConvPrepack.cpp
@@ -0,0 +1,289 @@
#include <vector>

#include <ATen/native/ConvUtils.h>
#include <ATen/native/mkldnn/Common.h>
#include <ATen/native/mkldnn/ConvPrepack.h>
#include <ATen/native/mkldnn/MKLDNNCommon.h>
#include <ATen/native/mkldnn/OpContext.h>
#include <ATen/native/utils/Factory.h>
#include <ATen/native/utils/ParamUtils.h>
#include <c10/util/irange.h>

#if AT_MKLDNN_ENABLED()

namespace at {
namespace native {
namespace mkldnn {
namespace internal {
namespace convolution {

c10::intrusive_ptr<mkldnn::ConvOpContext> createConvPrePackOpContext(
Tensor weight,
c10::optional<Tensor> bias,
std::vector<int64_t> stride,
std::vector<int64_t> padding,
std::vector<int64_t> dilation,
int64_t groups,
std::vector<int64_t> input_size,
std::string attr) {
auto it = fusion_attr_map.find(attr);
TORCH_CHECK(it != fusion_attr_map.end(), "Fusion behavior undefined.");
ideep::attr_t op_attr = it->second;

return mkldnn::MkldnnConvOpContext::create_context(
std::move(weight),
std::move(bias),
std::move(padding),
std::move(stride),
std::move(dilation),
groups,
std::move(input_size),
op_attr);
}

ContextConv create(
const Tensor& weight,
const c10::optional<Tensor>& bias,
const IntArrayRef padding,
const IntArrayRef stride,
const IntArrayRef dilation,
const int64_t groups,
const IntArrayRef input_size,
const ideep::attr_t& attr) {
auto k = weight.ndimension();
int64_t dim = k - 2;
const auto padding_expanded = expand_param_if_needed(padding, "padding", dim);
const auto stride_expanded = expand_param_if_needed(stride, "stride", dim);
const auto dilation_expanded =
expand_param_if_needed(dilation, "dilation", dim);
const auto input_size_expanded =
expand_param_if_needed(input_size, "input_size", k);

c10::impl::ExcludeDispatchKeyGuard edkg(c10::autograd_dispatch_keyset);
auto w = itensor_view_from_dense(weight);
// TODO: what if input is nhwc but w is nchw
bool is_channels_last =
weight.suggest_memory_format() == at::MemoryFormat::ChannelsLast;
ideep::tensor::desc expected_weight_desc =
ideep::convolution_forward::expected_weights_desc(
w.get_dims(),
w.get_data_type(),
{stride_expanded.begin(), stride_expanded.end()},
{padding_expanded.begin(), padding_expanded.end()},
{padding_expanded.begin(), padding_expanded.end()},
{dilation_expanded.begin(), dilation_expanded.end()},
groups,
ideep::algorithm::convolution_direct,
ideep::prop_kind::forward,
/*x_dtype*/ w.get_data_type(),
{input_size_expanded.begin(), input_size_expanded.end()},
attr,
is_channels_last);

ideep::tensor packed_weight;
packed_weight.init(expected_weight_desc);
packed_weight.feed_from(w);

return ContextConv{
std::move(packed_weight),
bias.has_value() ? c10::make_optional(*bias) : c10::nullopt,
{padding_expanded.begin(), padding_expanded.end()},
{stride_expanded.begin(), stride_expanded.end()},
{dilation_expanded.begin(), dilation_expanded.end()},
groups,
std::move(attr)};
}

void _mkldnn_convolution_out(
const ideep::tensor& x,
ideep::tensor& y,
const ideep::tensor& w,
const c10::optional<ideep::tensor>& b,
IntArrayRef padding,
IntArrayRef stride,
IntArrayRef dilation,
IntArrayRef output_sizes,
int64_t groups,
const ideep::attr_t& attr = ideep::attr_t()) {
if (b.has_value()) {
ideep::convolution_forward::compute_v2(
x,
w,
b.value(),
{output_sizes.cbegin(), output_sizes.cend()},
y,
{stride.begin(), stride.end()},
{dilation.begin(), dilation.end()},
{padding.begin(), padding.end()},
{padding.begin(), padding.end()},
groups,
ideep::scale_t(),
ideep::scale_t(),
ideep::scale_t(),
ideep::zero_point_t(),
ideep::zero_point_t(),
attr);
} else {
ideep::convolution_forward::compute_v2(
x,
w,
{output_sizes.cbegin(), output_sizes.cend()},
y,
{stride.begin(), stride.end()},
{dilation.begin(), dilation.end()},
{padding.begin(), padding.end()},
{padding.begin(), padding.end()},
groups,
ideep::scale_t(),
ideep::scale_t(),
ideep::scale_t(),
ideep::zero_point_t(),
ideep::zero_point_t(),
attr);
}
}

void mkldnn_convolution_out(
const Tensor& input,
ideep::tensor& mkldnn_output,
const ideep::tensor& mkldnn_weight,
const c10::optional<Tensor>& bias_opt,
IntArrayRef padding,
IntArrayRef stride,
IntArrayRef dilation,
IntArrayRef output_sizes,
int64_t groups,
const ideep::attr_t& attr = ideep::attr_t()) {
c10::MaybeOwned<Tensor> bias_maybe_owned =
at::borrow_from_optional_tensor(bias_opt);
const Tensor& bias = *bias_maybe_owned;

c10::impl::ExcludeDispatchKeyGuard edkg(c10::autograd_dispatch_keyset);
const ideep::tensor mkldnn_input = itensor_from_tensor(input);
c10::optional<ideep::tensor> mkldnn_bias{c10::nullopt};
if (bias.defined()) {
mkldnn_bias = itensor_from_tensor(bias);
}

_mkldnn_convolution_out(
mkldnn_input,
mkldnn_output,
mkldnn_weight,
mkldnn_bias,
padding,
stride,
dilation,
output_sizes,
groups,
attr);
}

std::vector<int64_t> get_output_sizes(
ContextConv& context,
const Tensor& input) {
const ideep::tensor& mkldnn_weight = context.weight_packed_;
IntArrayRef padding = context.padding_;
IntArrayRef stride = context.stride_;
IntArrayRef dilation = context.dilation_;

auto kernel_size = mkldnn_weight.get_dims();

std::vector<int64_t> input_size = input.sizes().vec();
return conv_output_size(input_size, kernel_size, padding, stride, dilation);
}

Tensor run(ContextConv& context, const Tensor& input) {
std::vector<int64_t> output_sizes = get_output_sizes(context, input);
auto output = at::empty(
output_sizes,
input.options().memory_format(input.suggest_memory_format()));

bool is_channels_last =
input.suggest_memory_format() == at::MemoryFormat::ChannelsLast;
ideep::tensor y;

c10::impl::ExcludeDispatchKeyGuard edkg(c10::autograd_dispatch_keyset);
ideep::tensor mkldnn_output = itensor_from_tensor(output);

if (is_channels_last) {
mkldnn_convolution_out(
input,
mkldnn_output,
context.weight_packed_,
context.at_bias_,
context.padding_,
context.stride_,
context.dilation_,
output_sizes,
context.groups_,
context.attr_);
} else {
mkldnn_convolution_out(
input,
y,
context.weight_packed_,
context.at_bias_,
context.padding_,
context.stride_,
context.dilation_,
output_sizes,
context.groups_,
context.attr_);
mkldnn_output.feed_from(y);
}
return output;
}

void run(ContextConv& context, const Tensor& input, void* output) {
std::vector<int64_t> output_sizes = get_output_sizes(context, input);

bool is_channels_last =
input.suggest_memory_format() == at::MemoryFormat::ChannelsLast;
ideep::tensor y;

ideep::tag o_tag = is_channels_last ? ideep::tag::nhwc : ideep::tag::nchw;
ideep::tensor::desc o_desc = {
output_sizes, get_mkldnn_dtype(input.scalar_type()), o_tag};
ideep::tensor mkldnn_output = {o_desc, output};

if (is_channels_last) {
mkldnn_convolution_out(
input,
mkldnn_output,
context.weight_packed_,
context.at_bias_,
context.padding_,
context.stride_,
context.dilation_,
output_sizes,
context.groups_,
context.attr_);
} else {
mkldnn_convolution_out(
input,
y,
context.weight_packed_,
context.at_bias_,
context.padding_,
context.stride_,
context.dilation_,
output_sizes,
context.groups_,
context.attr_);
mkldnn_output.feed_from(y);
}
}

Tensor conv_run(
const Tensor& input,
const c10::intrusive_ptr<mkldnn::ConvOpContext>& op_context) {
return op_context->run(input);
}

} // namespace convolution
} // namespace internal
} // namespace mkldnn
} // namespace native
} // namespace at

#endif // AT_MKLDNN_ENABLED()
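`get_output_sizes` above delegates to ATen's `conv_output_size`, which applies the standard convolution shape formula per spatial dimension. A minimal sketch of that formula (ATen's helper also carries the batch and channel dims):

```python
# Output spatial size of a convolution:
#   out = floor((in + 2*pad - dilation*(kernel - 1) - 1) / stride) + 1
def conv_out_dim(in_size, kernel, pad, stride, dilation):
    return (in_size + 2 * pad - dilation * (kernel - 1) - 1) // stride + 1

# Matches the stride-2 rows in the benchmark tables: H=56 -> H=28.
assert conv_out_dim(56, 3, 1, 2, 1) == 28
# And the stride-1, pad-1, kernel-3 rows keep H=56.
assert conv_out_dim(56, 3, 1, 1, 1) == 56
```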
49 changes: 49 additions & 0 deletions aten/src/ATen/native/mkldnn/ConvPrepack.h
@@ -0,0 +1,49 @@
#pragma once

#include <ATen/Tensor.h>
#include <ATen/native/mkldnn/Common.h>
#include <ATen/native/mkldnn/OpContext.h>

#if AT_MKLDNN_ENABLED()

namespace at {
namespace native {
namespace mkldnn {
namespace internal {
namespace convolution {

c10::intrusive_ptr<mkldnn::ConvOpContext> createConvPrePackOpContext(
Tensor weight,
c10::optional<Tensor> bias,
std::vector<int64_t> stride,
std::vector<int64_t> padding,
std::vector<int64_t> dilation,
int64_t groups,
std::vector<int64_t> input_size,
std::string attr);

Tensor conv_run(
const Tensor& input,
const c10::intrusive_ptr<mkldnn::ConvOpContext>& op_context);

ContextConv create(
const Tensor& weight,
const c10::optional<Tensor>& bias,
const IntArrayRef padding,
const IntArrayRef stride,
const IntArrayRef dilation,
const int64_t groups,
const IntArrayRef input_size,
const ideep::attr_t& attr);

Tensor run(ContextConv& context, const Tensor& input);

void run(ContextConv& context, const Tensor& input, void* output);

} // namespace convolution
} // namespace internal
} // namespace mkldnn
} // namespace native
} // namespace at

#endif // AT_MKLDNN_ENABLED()