
Lower distributed matmul to pipelined algorithm for fine-grained overlap: AG+GEMM layout #3606

Merged: 51 commits from overlap/lower_matmul_to_hostir into NVIDIA:main on Jan 13, 2025

Conversation

@samnordmann (Collaborator) commented Dec 18, 2024

Stacked on top of

What

Lower a MatmulOp sharded on the first inner axis into a pipelined AG+GEMM algorithm, achieving fine-grained overlap.

We introduce a new parallel type Stream to account for this scheduling.

More precisely, this patch enables lowering the fusion:

  TensorView* a = makeContigTensor(4); //[S, DIDx(D), M/(S*D), K]
  TensorView* b = makeContigTensor(2); //[K, N]
  TensorView* c = matmul(a, b); //[S, D, M/(S*D), N]

  fusion->addInput(a);
  fusion->addInput(b);
  fusion->addOutput(c);

  auto mesh = DeviceMesh::createForNumDevices(D);
  a->setDeviceMesh(mesh);
  b->setDeviceMesh(mesh);
  c->setDeviceMesh(mesh);

  a->axis(1)->parallelize(ParallelType::DIDx);   // shard A's device axis across the mesh
  c->axis(0)->parallelize(ParallelType::Stream); // pipeline the outer tile axis over streams

to the Host IR program below (obtained from the dump produced with NVFUSER_DUMP=host_ir):

%HostIrContainer { (T0_g_float[iS0{i0}, ideviceIdx.x1{i2}, iS2{i3}, iS3{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), T1_g_float[iS4{i5}, iS5{i6}] (DeviceMesh{0 1 2 3 4 5 6 7})) -> (T2_g_float[iStream6{i0}, iS7{i2}, iS8{i3}, iS9{i6}, rS10{i4}] (DeviceMesh{0 1 2 3 4 5 6 7})) :
  GetCurrentStream into Stream 0
  T3_g_float[iS11{i0}, iS12{i2}, iS13{i3}, iS14{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T3_g_float[iS11{i0}, iS12{i2}, iS13{i3}, iS14{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=( ( ( i0 * i2 ) * i3 ) * i4 ), zero_init=false, resets_to_zero=false)
  T2_g_float[iStream6{i0}, iS7{i2}, iS8{i3}, iS9{i6}, rS10{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T2_g_float[iStream6{i0}, iS7{i2}, iS8{i3}, iS9{i6}, rS10{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=( ( ( i0 * i2 ) * i3 ) * i6 ), zero_init=false, resets_to_zero=false)
  FOR i104 in iS0{i0}:
    SetCurrentStream to Stream ( i104 % numberOfStreams )
    T4_l_float[ideviceIdx.x15{i2}, iS16{i3}, iS17{i4}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = select( T0_g_float[iS0{i0}, ideviceIdx.x1{i2}, iS2{i3}, iS3{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = iS0{i0}, index = i104 )
    T5_l_float[iS18{i2}, iS19{i3}, iS20{i4}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = select( T3_g_float[iS11{i0}, iS12{i2}, iS13{i3}, iS14{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = iS11{i0}, index = i104 )
    Communication 46 (type=Allgather, team=(0 1 2 3 4 5 6 7), input=T4_l_float[ideviceIdx.x15{i2}, iS16{i3}, iS17{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), output=T5_l_float[iS18{i2}, iS19{i3}, iS20{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}))
    Wait Communication 46
    T6_l_float[iS21{i2}, iS22{i3}, iS23{i6}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = select( T2_g_float[iStream6{i0}, iS7{i2}, iS8{i3}, iS9{i6}, rS10{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = iStream6{i0}, index = i104 )
    T6_l_float[iS21{i2}, iS22{i3}, iS23{i6}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = matmul(T5_l_float[iS18{i2}, iS19{i3}, iS20{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}),
                T1_g_float[iS4{i5}, iS5{i6}] (DeviceMesh{0 1 2 3 4 5 6 7}))
    SetCurrentStream to Stream 0
    Synchronize Stream ( i104 % numberOfStreams )
} // %HostIrContainer
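To make the loop structure easier to follow, here is a rough host-side sketch of what the program above does. It is an illustration only, not the code this PR generates: `pipelined_ag_gemm` and `allgather_on_current_stream` are hypothetical names, the allgather helper stands in for the Communication/Wait pair, and the c10 stream calls mirror GetCurrentStream/SetCurrentStream.

```cpp
#include <ATen/ATen.h>
#include <c10/cuda/CUDAStream.h>

// Hypothetical helper standing in for the Allgather Communication + Wait above:
// gathers the local shard of one tile from all devices on the current stream.
at::Tensor allgather_on_current_stream(const at::Tensor& local_shard);

// a: [S, 1, M/(S*D), K], the local shard of [S, DIDx(D), M/(S*D), K]
// b: [K, N], replicated on every device
// c: [S, D, M/(S*D), N], fully allocated up front (as in the dump above)
void pipelined_ag_gemm(const at::Tensor& a, const at::Tensor& b, at::Tensor& c) {
  const auto main_stream = c10::cuda::getCurrentCUDAStream();   // GetCurrentStream
  const int64_t S = a.size(0);
  for (int64_t i = 0; i < S; ++i) {
    auto worker = c10::cuda::getStreamFromPool();               // Stream (i % numberOfStreams)
    c10::cuda::setCurrentCUDAStream(worker);                    // SetCurrentStream
    auto gathered = allgather_on_current_stream(a.select(0, i)); // [D, M/(S*D), K]
    c.select(0, i).copy_(at::matmul(gathered, b));              // tile GEMM on `worker`
    c10::cuda::setCurrentCUDAStream(main_stream);               // SetCurrentStream to Stream 0
    // main_stream then waits on `worker` via an event (Synchronize Stream ...),
    // without blocking the CPU, so the next iteration can be issued immediately.
  }
}
```

Because each tile's allgather and GEMM are issued on a different pooled stream, iteration i+1's allgather can run while iteration i's GEMM executes, which is exactly the overlap visible in the profile below.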

The Nsight profile shows that we do achieve overlap, in a way that is comparable to the ATen overlap experiments:

[Nsight profile screenshot: Screenshot 2024-12-18 at 12 08 05]

samnordmann added a commit that referenced this pull request Dec 23, 2024
# What

Make stream synchronization non-blocking from the CPU point of view

# Why

Needed for achieving overlap in 
- #3606

Before this patch:
![Screenshot 2024-12-18 at 12 08 25](https://github.com/user-attachments/assets/f5c84282-ea85-4cb8-8a60-538cd91cfa1c)

After this patch:
![Screenshot 2024-12-18 at 12 08 05](https://github.com/user-attachments/assets/25537a5d-3e33-4ff8-baf4-4f013c1ed230)


# How 

Before this patch, the host IR `Synchronize` would call `c10::synchronize()` on the CUDA stream, which blocks the CPU until stream completion. With this patch, we synchronize the current stream with a given stream through a `cudaEvent` and the `cudaStreamWaitEvent` API.
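As a rough illustration of that mechanism (not the actual nvFuser implementation), the event-based wait looks like this in plain CUDA; `streamWaitStream` is a hypothetical name:

```cpp
#include <cuda_runtime.h>

// Minimal sketch of non-blocking stream synchronization, assuming `current` is the
// active stream and `producer` is the stream whose work must complete first.
void streamWaitStream(cudaStream_t current, cudaStream_t producer) {
  cudaEvent_t event;
  cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
  cudaEventRecord(event, producer);        // mark the point in `producer` to wait for
  cudaStreamWaitEvent(current, event, 0);  // `current` waits on the GPU; the CPU keeps going
  cudaEventDestroy(event);                 // the enqueued wait keeps the event alive until it fires
}
```

Unlike `cudaStreamSynchronize`, the wait is enqueued on the GPU, so the host thread can immediately continue issuing work for the next pipeline stage.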
samnordmann added a commit that referenced this pull request Dec 23, 2024
# What

Adds the `GetCurrentStream` primitive to the Host IR stack.

# Why

Needed for
- #3606

The idea is that if we want to use multiple streams internally, we need to capture the user's stream beforehand and set it back as the active stream before returning.
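A minimal sketch of that capture-and-restore pattern, expressed with the c10 stream API rather than the Host IR primitives (`do_work_on_internal_streams` is a hypothetical placeholder):

```cpp
#include <c10/cuda/CUDAStream.h>

void do_work_on_internal_streams();  // hypothetical: launches work on pooled streams

void runWithInternalStreams() {
  // Capture the user's (currently active) stream before touching any other stream.
  const auto user_stream = c10::cuda::getCurrentCUDAStream();
  do_work_on_internal_streams();
  // Restore it as the active stream before returning control to the user.
  c10::cuda::setCurrentCUDAStream(user_stream);
}
```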
@samnordmann:

Again, I had to add a couple of additional small fixes to account for some other tests...

  is_sharded = true;

  if (alloc_id->isReduction()) {
    is_reduction_sharded = true;

@samnordmann:

is_reduction_sharded is only used here to check that there are not two DID-sharded axes. I am not convinced the check necessarily needs to be done in this function. Another option could be to modify ShardingTest.ReductionShouldNotBeSharded.

Comment on lines 28 to 31:

  if (HostIrLower::canLower(expr)) {
    continue;
  }
  if (expr->outputs().size() > 1 || expr->inputs().size() > 1) {

Reviewer (Collaborator):

> After this patch, we don't throw, we only pass.

To rephrase my previous comment, I was trying to say this is a wrong change. InsertReshardingsPass (which runs before ReorderShardedAxis) should already have decomposed each resharding expression into local expressions plus resharding expressions that can be lowered to a communication (modulo the axis order, which this pass tries to fix). All communications today take one TV and produce one TV, so there was nothing wrong with the old code erroring out when it saw a multiple-I/O resharding expression.

Therefore, I was trying to understand what triggered you to make this change. Was it to work around a limitation in InsertReshardingsPass?

> That is not correct, the stream axis is fully allocated.

(I brought this up, but it no longer matters for the current discussion around multiple I/O. Still, I'd like to point out a potential misconception so we can be on the same page in the future!)

I don't think so. A stream-parallelized IterDomain in the allocation domain (in your unit test the same as the loop and logical domains) means the allocation for that axis is sharded, similar to how nvFuser deals with TID and BID. For the allgather output, the allocation ought to be of size [1, D, M/S/D, K] and ought to be done inside the for loop. When we aggressively run each loop iteration on a different stream, the total allocation is the same as [S, D, M/S/D, K]; however, SOTA tends to limit the number of concurrent streams, so the allocation is less than that. E.g., a double-buffer approach allocates [2, D, M/S/D, K].

That said, I understand your current implementation fully allocates the allgather output outside the loop. It's valid, just suboptimal. To represent that, I'd like the loop to be stream-parallelized but the allocation to not be stream-parallelized. However, doing so today may run into problems because we don't support DID loop split, so I'm definitely OK with some temporary workarounds.
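For concreteness, a quick back-of-the-envelope comparison of the allgather-output footprint under those three strategies; the S, D, M, and K values below are made up purely for illustration:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // Hypothetical sizes, chosen only to illustrate the relative footprints.
  const int64_t S = 8, D = 8, M = 8192, K = 4096;
  const int64_t per_tile = D * (M / (S * D)) * K;  // one [1, D, M/(S*D), K] slice

  std::printf("per-iteration alloc [1, D, M/(S*D), K]: %lld elements\n",
              static_cast<long long>(per_tile));
  std::printf("double buffering    [2, D, M/(S*D), K]: %lld elements\n",
              static_cast<long long>(2 * per_tile));
  std::printf("full preallocation  [S, D, M/(S*D), K]: %lld elements\n",
              static_cast<long long>(S * per_tile));
  return 0;
}
```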

@samnordmann:

The CI is finally green! But @wujingyue, I'll wait for your final word before merging.

@@ -100,7 +100,7 @@ std::pair<std::vector<IterDomain*>, std::vector<IterDomain*>> getShardingChanges
 bool isSharded(const TensorView* tv) {
   bool is_sharded = false;
   for (IterDomain* alloc_id : tv->getMaybeAllocationDomain()) {
-    if (!alloc_id->isDeviceDim()) {
+    if (!alloc_id->isDeviceDim() || alloc_id->isReduction()) {

Reviewer (Collaborator):

Incidental note: `.iter_type(IterType::Reduction)`, which you pointed out, is correct. resetSchedulingParams() resets the parallel type to Serial, so it'll be r instead of rDID.

(Same diff context as above.)

Reviewer (Collaborator):

> Another more subtle case occurs from InsertReshardingsPass

This is related to the isInnerResharding change in this PR, so I'll comment over there.

(Same diff context as above.)

Reviewer (Collaborator):

> we should assert that this case is not encountered

Yep, it's unfortunately one of the many places in nvFuser where a contract is not fully verified, and PRs are welcome. In the meantime, how about moving `id->parallelize(ParallelType::Serial);` to shardAllLike? It has been the biggest source of rDID.

@wujingyue (Collaborator) left a comment:

LGTM :shipit:

@samnordmann merged commit 33366f9 into NVIDIA:main on Jan 13, 2025 (40 of 41 checks passed).