expand RemoveBcastSqueeze to handle unary operations between broadcast/squeeze ops #3643

Open · wants to merge 22 commits into main

Conversation

@jjsjann123 (Collaborator) commented Dec 24, 2024

Fixes #3635

The existing RemoveBcastSqueeze optimization pass only handles consecutive broadcast/squeeze ops. This PR expands the pass to handle cases where simple unary operations separate the broadcast/squeeze ops.

e.g.

T1 = broadcast(T0)
T2 = relu(T1)
T3 = squeeze(T2)

In this PR, we update the pass so that, when we see a unary op followed by a replaceable expr, we swap the two operations, effectively pushing replaceable exprs towards the inputs, hoping they will encounter other replaceable exprs with which they can be merged.

In the example above, we'll replace T2 = relu(T1); T3 = squeeze(T2) with T2 = squeeze(T1); T3 = relu(T2). In the next iteration, we'll be able to merge the broadcast and squeeze ops together, since they are now consecutive operations.
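
For concreteness, here is a minimal test-style sketch of the pattern the pass now handles (illustrative only: it follows nvfuser's usual C++ test conventions, and exact op signatures may differ):

// Build broadcast -> relu -> squeeze, the motivating pattern above.
Fusion fusion;
FusionGuard fg(&fusion);

TensorView* tv0 = makeSymbolicTensor(2);  // [i0, i1]
fusion.addInput(tv0);

TensorView* tv1 = broadcast(tv0, {true, false, false});   // [b0, i0, i1]
TensorView* tv2 = relu(tv1);                              // unary op in between
TensorView* tv3 = squeeze(tv2, std::vector<int64_t>{0});  // back to [i0, i1]
fusion.addOutput(tv3);

// The extended pass first swaps relu and squeeze so that the broadcast and
// squeeze become adjacent, then cancels the pair, leaving the fusion
// equivalent to tv3 = relu(tv0).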

@jjsjann123 (Collaborator, Author):

!test

@jjsjann123 (Collaborator, Author):

!test

@jjsjann123 (Collaborator, Author):

I don't quite understand why the CI failure I'm seeing here doesn't show up on other PRs.

The repro does fail on main; opened #3660 for the failure.

@jjsjann123 (Collaborator, Author):

!test

@jjsjann123 (Collaborator, Author):

!test

@jjsjann123 (Collaborator, Author):

!test

@jjsjann123 jjsjann123 changed the title Preseg passes broadcast squeeze expand removing consecutive cast to handle meta operations in between Jan 3, 2025
@jjsjann123 jjsjann123 changed the title expand removing consecutive cast to handle meta operations in between expand removing consecutive cast to handle unary operations in between Jan 3, 2025
@jjsjann123 jjsjann123 changed the title expand removing consecutive cast to handle unary operations in between expand RemoveBcastSqueeze to handle unary operations between broadcast/squeeze ops Jan 3, 2025
@jjsjann123 (Collaborator, Author):

!test

@jjsjann123 jjsjann123 marked this pull request as ready for review January 3, 2025 01:16
@naoyam (Collaborator) commented Jan 3, 2025

Just in case, this one:

> In the next iteration, we'll be able to merge the broadcast and squeeze op together, since they are not consecutive operations.

You meant that the broadcast and squeeze ops are going to be removed as they are consecutive, right?

@@ -318,13 +339,79 @@ TensorView* maybeDoReplacement(TensorView* orig) {
if (!isReplaceableExpr(second)) {
return orig;
}
AxisOps second_ops = exprToAxisOps(second);
Collaborator:

Having a hard time understanding what this function (maybeDoReplacement) is doing. What is the parameter assumed to be? What is supposed to be returned?

Collaborator (Author):

I think maybeDoReplacement is trying to merge tv->first->second->orig into tv->merged->new_out when both first and second are replaceable exprs.

i.e. when we have tv->broadcast->squeeze, we might be able to just cancel the two and end up returning tv directly.

The function returns new_out after the replay. The logic here is:
if the returned pointer is different from orig, the caller considers that a replacement has happened and retries the same loop with new_out;
if the returned pointer is the same as orig, the merge failed, so the caller skips second, moves on, and pushes the inputs of second onto the stack as new candidates for orig.
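
A hypothetical sketch of that caller loop, to make the contract concrete (illustrative only; the names and exact traversal are not the actual pass code):

// Worklist of candidate outputs, seeded with the fusion's output TVs.
std::stack<TensorView*> candidates;
while (!candidates.empty()) {
  TensorView* orig = candidates.top();
  candidates.pop();
  TensorView* new_out = maybeDoReplacement(orig);
  if (new_out != orig) {
    // A replacement happened: retry the same loop with the replayed output.
    candidates.push(new_out);
  } else if (Expr* def = orig->definition()) {
    // Merge failed: skip past this expr and queue its producers as the new
    // candidates for orig.
    for (Val* inp : def->inputs()) {
      if (auto* tv = dynamic_cast<TensorView*>(inp)) {
        candidates.push(tv);
      }
    }
  }
}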

Collaborator (Author):

So the added logic here is: when we swap tv->first->second->orig into tv->replayed_second->replayed_first, we return replayed_second->output(0).

Even though we are not merging two consecutive replaceable operations, by returning replayed_second->output(0) instead of orig, we keep replayed_second as the candidate for the next iteration, effectively preventing the unary op first from blocking the merge of neighboring replaceable operations.
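
A sketch of the swap itself, using hypothetical helpers (applyAxisOps, replayUnaryOp, and replaceAllUses are illustrative names, not the actual implementation; isReplaceableExpr and exprToAxisOps appear in the diff above):

// Chain: tv -> first (unary op) -> second (broadcast/squeeze) -> orig.
if (isReplaceableExpr(second)) {
  // Replay the broadcast/squeeze on the unary op's input, pushing it
  // toward the fusion inputs.
  TensorView* replayed_second_out =
      applyAxisOps(first->input(0)->as<TensorView>(), exprToAxisOps(second));
  // Re-apply the unary op on top of the replayed output.
  TensorView* replayed_first_out = replayUnaryOp(first, replayed_second_out);
  // Route all uses of orig to the replayed result.
  replaceAllUses(orig, replayed_first_out);
  // Return the replayed broadcast/squeeze output so it remains the merge
  // candidate for the next iteration.
  return replayed_second_out;
}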

return orig;
}

// make sure we preserve the allocation domain on second->output(0)
Collaborator:

Why does the allocation domain matter?

Collaborator (Author):

Answered in the example below. I think I can add another comment here as well.


// make sure we preserve the allocation domain on second->output(0)
// initializing alloc_domain permutation of second output.
auto second_out_tv = second->output(0)->as<TensorView>();
Collaborator:

Is this second_out_tv always the same as orig?

Collaborator (Author):

Yes. I now realize that I could have just used orig instead.

@naoyam (Collaborator) commented Jan 3, 2025

I'm not against the approach of this PR, but it's much more complicated than I thought. I think if we could just remove a sequence of broadcast -> cast-to-fp32 -> squeeze -> cast-to-bf16, that would probably be enough. I suppose the cast is added because the squeeze is originally a reduction.
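
For reference, a test-style sketch of that four-op sequence, with shapes taken from the IR in #3635 (illustrative; exact op signatures may differ):

// broadcast -> cast-to-fp32 -> squeeze -> cast-to-bf16
TensorView* tv0 = makeSymbolicTensor(3, DataType::BFloat16);  // [8, 4096, 128]
fusion.addInput(tv0);
TensorView* tv1 = broadcast(tv0, {true, false, true, false, false});
TensorView* tv2 = castOp(DataType::Float, tv1);              // upcast
TensorView* tv3 = squeeze(tv2, std::vector<int64_t>{0, 2});  // trivial reduction
TensorView* tv4 = castOp(DataType::BFloat16, tv3);           // downcast
fusion.addOutput(tv4);

// A dedicated pass could match exactly this chain and replace tv4 with
// set(tv0): the broadcast/squeeze cancel, and so do the two casts.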

@jjsjann123 (Collaborator, Author):

> Just in case, this one:
>
> > In the next iteration, we'll be able to merge the broadcast and squeeze op together, since they are not consecutive operations.
>
> You meant that the broadcast and squeeze ops are going to be removed as they are consecutive, right?

Yes. Thanks for catching that. Updated.

@jjsjann123 (Collaborator, Author):

> I'm not against the approach of this PR, but it's much more complicated than I thought. I think if we could just remove a sequence of broadcast -> cast-to-fp32 -> squeeze -> cast-to-bf16, that would probably be enough. I suppose the cast is added because the squeeze is originally a reduction.

Yes, the extra cast is added because of the trivial reduction.
This PR by itself actually doesn't remove the cast ops, since the pass that removes consecutive casts runs before the remove broadcast/squeeze pass. So I extended this to the other pass as well in #3644.

The alternative is to just re-order the two passes, or do the pattern matching you suggested. But this approach feels a little more robust.

std::vector<IterDomain*> tv3_nhwc = {
tv3->axis(0), tv3->axis(2), tv3->axis(3), tv3->axis(1)};
tv3->setAllocationDomain(tv3_nhwc, true);
fusion.addOutput(tv3);
Collaborator (Author):

This is the reason why we care about the allocation domain.

i.e. tv1->relu->tv2->squeeze->tv3. Here tv3 has an allocation domain that's a permutation.
When we replace it with tv1->replayed_squeeze->tv4->replayed_relu->tv5, we need to ensure that tv5 has the same allocation domain as tv3; otherwise we change the semantics and return an output with the wrong memory format.
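
A minimal sketch of the fix, using the same setAllocationDomain API as in the test snippet above (tv5 names the replayed final output and is illustrative):

// Carry tv3's permuted (NHWC-like) allocation domain over to the replayed
// output tv5, so the fusion still produces the same memory format.
std::vector<IterDomain*> tv5_nhwc = {
    tv5->axis(0), tv5->axis(2), tv5->axis(3), tv5->axis(1)};
tv5->setAllocationDomain(tv5_nhwc, true);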

Collaborator:

I'm not saying we should ignore the allocation domain. I just don't see why having an allocation domain can interfere with this translation. Why not just keep using tv3? Or, it should also be possible to reproduce the same allocation domain with tv5.

Collaborator (Author):

I mistook what you meant in your earlier question!

> Why not just keep using tv3?

By keeping tv3, do you mean that I can have it replayed as tv1->replayed_squeeze->tv4->replayed_relu->tv3? I didn't realize that I can just re-use tv3 here without needing to create a clone of it. Let me try that...

> Or, it should also be possible to reproduce the same allocation domain with tv5.

Yes. I was just trying to keep it simple. If we want to support general transformations, I think I can just do the same replay I did in #3644: https://github.com/NVIDIA/Fuser/pull/3644/files#diff-abe2e10add90523ff6b18e1dc50da46762420e1011078ba47ab52140dc213b6fR80-R85.

FusionExecutorCache executor_cache(std::move(fusion_ptr));
auto outputs = executor_cache.runFusionWithInputs(inputs);
// validate output permutation is preserved
ASSERT_TRUE(outputs[0].is_contiguous(at::MemoryFormat::ChannelsLast));
Collaborator (Author):

Without the allocation domain update, this check would fail, which would mean the optimization pass is changing user-intended behavior.

@naoyam (Collaborator) commented Jan 3, 2025

> > I'm not against the approach of this PR, but it's much more complicated than I thought. I think if we could just remove a sequence of broadcast -> cast-to-fp32 -> squeeze -> cast-to-bf16, that would probably be enough. I suppose the cast is added because the squeeze is originally a reduction.
>
> Yes, the extra cast is added because of the trivial reduction. This PR by itself actually doesn't remove the cast ops, since the pass that removes consecutive casts runs before the remove broadcast/squeeze pass. So I extended this to the other pass as well in #3644.
>
> The alternative is to just re-order the two passes, or do the pattern matching you suggested. But this approach feels a little more robust.

It's certainly more generalized, but do we know if there's any actual case where this and #3466 would help besides the straight-line pattern of broadcast, cast, squeeze and cast? I'm just feeling it's a little over-engineered for a simple task like removing the particular pattern, if there isn't any other impact.

@jjsjann123 (Collaborator, Author):

> but do we know if there's any actual case where this and #3466 would help besides the straight-line pattern of broadcast, cast, squeeze and cast? I'm just feeling it's a little over-engineered for a simple task like removing the particular pattern, if there isn't any other impact.

If I'm hearing this correctly, the concern is the impact of the aggressive reorder? That's a hard argument for me to win, but let me give it a shot.

In the backward graph, we could encounter this squeeze + broadcast pattern pretty often and they might not naturally always cancel each other out. See the grad rule for broadcast_in_dim in thunder.

In the original issue #3635, we have this pattern:

T63_g___bfloat[bS260{1}, iS261{8}, bS262{1}, iS263{4096}, iS264{128}]
   = broadcast( T54_g___bfloat[iS221{8}, iS222{4096}, iS223{128}] )
(74)
T71_l_float[bS294{1}, iS295{8}, bS296{1}, iS297{4096}, iS298{128}]
   = __bfloat2float(T63_g___bfloat[bS260{1}, iS261{8}, bS262{1}, iS263{4096}, iS264{128}]);
(162)
T76_g_float[iS315{8}, iS316{4096}, iS317{128}]
   = squeeze( T71_l_float[bS294{1}, iS295{8}, bS296{1}, iS297{4096}, iS298{128}] )
(87)
T79_g___bfloat[iS326{8}, iS327{4096}, iS328{128}]
   = __float2bfloat(T76_g_float[iS315{8}, iS316{4096}, iS317{128}]);
(90)
T82_l___bfloat[bS337{1}, iS338{8}, iS339{4096}, iS340{128}]
   = broadcast( T79_g___bfloat[iS326{8}, iS327{4096}, iS328{128}] )
(93)

I think the really troublesome pattern here is upCast -> squeeze -> downCast, which we should replace with just a squeeze.
If we are going to do that, we'd want to apply this pattern matching before we consider merging T82 = broadcast(T79) into its producer.
So it feels more natural to add another peephole optimization that applies the pattern matching before the remove_bcast_squeeze pass.

But this might not be enough: a sum operation could contain both a trivial reduction that translates to a squeeze and a meaningful reduction op. In the same issue #3635, we actually also see this pattern.

T42_l_float[bS171{1}, iS172{8}, iS173{4}, iS174{4096}, iS175{128}]
   = __bfloat2float(T38_l___bfloat[bS153{1}, iS158{8}rf, iS159{4}rf, iS155{4096}, iS156{128}]);
T46_l_float[iS187{8}, iS188{4}, iS189{4096}, iS190{128}]
   = squeeze( T42_l_float[bS171{1}, iS172{8}, iS173{4}, iS174{4096}, iS175{128}] )
T47_l_float[iS191{8}, rS192{4}, iS193{4096}, iS194{128}]
   = reduction( T46_l_float[iS187{8}, iS188{4}, iS189{4096}, iS190{128}], op = add, initial value = float(0), allreduce = false )
T54_l___bfloat[iS221{8}, iS222{4096}, iS223{128}]
   = __float2bfloat(T47_l_float[iS191{8}, rS192{4}, iS193{4096}, iS194{128}]);

In that example, we do not have another broadcast before T38, but if there were one, we would want to be able to re-order the __bfloat2float -> squeeze so that the squeeze can merge with the meta op before the cast.
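
To illustrate that reorder for the cast case (a minimal sketch; signatures follow the nvfuser C++ ops API and may differ in detail):

// Before: upcast first, then squeeze.
TensorView* tv1 = castOp(DataType::Float, tv0);
TensorView* tv2 = squeeze(tv1, std::vector<int64_t>{0});

// After the swap: the squeeze moves toward the producer, the cast follows.
TensorView* tv1b = squeeze(tv0, std::vector<int64_t>{0});
TensorView* tv2b = castOp(DataType::Float, tv1b);

// tv2b computes the same values as tv2, but the squeeze is now adjacent to
// tv0's definition and can merge with a preceding broadcast/meta op.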

@naoyam (Collaborator) commented Jan 6, 2025

> > but do we know if there's any actual case where this and #3466 would help besides the straight-line pattern of broadcast, cast, squeeze and cast? I'm just feeling it's a little over-engineered for a simple task like removing the particular pattern, if there isn't any other impact.
>
> If I'm hearing this correctly, the concern is the impact of the aggressive reorder? That's a hard argument for me to win, but let me give it a shot.
>
> In the backward graph, we could encounter this squeeze + broadcast pattern pretty often and they might not naturally always cancel each other out. See the grad rule for broadcast_in_dim in thunder.
>
> In the original issue #3635, we have this pattern:
>
> T63_g___bfloat[bS260{1}, iS261{8}, bS262{1}, iS263{4096}, iS264{128}]
>    = broadcast( T54_g___bfloat[iS221{8}, iS222{4096}, iS223{128}] )
> (74)
> T71_l_float[bS294{1}, iS295{8}, bS296{1}, iS297{4096}, iS298{128}]
>    = __bfloat2float(T63_g___bfloat[bS260{1}, iS261{8}, bS262{1}, iS263{4096}, iS264{128}]);
> (162)
> T76_g_float[iS315{8}, iS316{4096}, iS317{128}]
>    = squeeze( T71_l_float[bS294{1}, iS295{8}, bS296{1}, iS297{4096}, iS298{128}] )
> (87)
> T79_g___bfloat[iS326{8}, iS327{4096}, iS328{128}]
>    = __float2bfloat(T76_g_float[iS315{8}, iS316{4096}, iS317{128}]);
> (90)
> T82_l___bfloat[bS337{1}, iS338{8}, iS339{4096}, iS340{128}]
>    = broadcast( T79_g___bfloat[iS326{8}, iS327{4096}, iS328{128}] )
> (93)
>
> I think the really troublesome pattern here is upCast -> squeeze -> downCast, which we should replace with just a squeeze. If we are going to do that, we'd want to apply this pattern matching before we consider merging T82 = broadcast(T79) into its producer. So it feels more natural to add another peephole optimization that applies the pattern matching before the remove_bcast_squeeze pass.

I'm just commenting from the principle of KISS. I'd just create a new pass that would detect the four-op pattern and remove them. That'd be it.

> But this might not be enough: a sum operation could contain both a trivial reduction that translates to a squeeze and a meaningful reduction op. In the same issue #3635, we actually also see this pattern.
>
> T42_l_float[bS171{1}, iS172{8}, iS173{4}, iS174{4096}, iS175{128}]
>    = __bfloat2float(T38_l___bfloat[bS153{1}, iS158{8}rf, iS159{4}rf, iS155{4096}, iS156{128}]);
> T46_l_float[iS187{8}, iS188{4}, iS189{4096}, iS190{128}]
>    = squeeze( T42_l_float[bS171{1}, iS172{8}, iS173{4}, iS174{4096}, iS175{128}] )
> T47_l_float[iS191{8}, rS192{4}, iS193{4096}, iS194{128}]
>    = reduction( T46_l_float[iS187{8}, iS188{4}, iS189{4096}, iS190{128}], op = add, initial value = float(0), allreduce = false )
> T54_l___bfloat[iS221{8}, iS222{4096}, iS223{128}]
>    = __float2bfloat(T47_l_float[iS191{8}, rS192{4}, iS193{4096}, iS194{128}]);
>
> In that example, we do not have another broadcast before T38, but if there were one, we would want to be able to re-order the __bfloat2float -> squeeze so that the squeeze can merge with the meta op before the cast.

IIUC, it doesn't seem to matter if there's both a real reduction and a squeeze. It seems what you're suggesting is that the capability of moving squeeze ops around would be helpful even without a preceding broadcast op.

Assuming my understanding is correct, I wouldn't disagree with the idea, but I am also not clear why we shouldn't just leave the squeeze op there. Does the reduction scheduler have any issue with it? If so, should we focus on fixing it rather than avoiding it? If there's no particular issue, why would the benefit of the reordering outweigh the optimization pass getting even more complicated?

Successfully merging this pull request may close these issues.

Feature request: Extend the remove broadcast + squeeze pass