Host IR: add `GetCurrentStream` #3605

samnordmann · 2024-12-18T08:34:52Z

What

Adds the `GetCurrentStream` primitive to the Host IR stack.

Why

Needed for:

Lower distributed matmul to pipelined algorithm for fine-grained overlap: AG+GEMM layout #3606

The idea is that if we want to use multiple streams internally, we first need to capture the user's stream, and then set it back as the active stream when returning.
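The capture-and-restore pattern described above can be sketched as follows. This is a minimal illustration, not the nvFuser implementation: `StreamId`, `activeStream`, `getCurrentStream`, `setCurrentStream`, `UserStreamGuard`, and `runOnInternalStreams` are all hypothetical names, and plain integers stand in for real CUDA streams.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a CUDA stream handle; not an nvFuser type.
using StreamId = int;

// Hypothetical thread-local "active stream", mimicking how a runtime
// tracks the stream on which new work is enqueued.
StreamId& activeStream() {
  thread_local StreamId stream = 0;
  return stream;
}

StreamId getCurrentStream() { return activeStream(); }
void setCurrentStream(StreamId s) { activeStream() = s; }

// RAII guard: captures the caller's stream on construction and
// makes it the active stream again on destruction.
class UserStreamGuard {
 public:
  UserStreamGuard() : saved_(getCurrentStream()) {}
  ~UserStreamGuard() { setCurrentStream(saved_); }
  UserStreamGuard(const UserStreamGuard&) = delete;
  UserStreamGuard& operator=(const UserStreamGuard&) = delete;

 private:
  StreamId saved_;
};

// Switches through several internal streams and records which ones were
// active. The caller's stream is restored when the guard goes out of scope.
std::vector<StreamId> runOnInternalStreams(const std::vector<StreamId>& pool) {
  UserStreamGuard guard;
  std::vector<StreamId> used;
  for (StreamId s : pool) {
    setCurrentStream(s);
    used.push_back(getCurrentStream());
  }
  return used;
}
```

In this sketch, a caller that was on stream 7 before calling `runOnInternalStreams({1, 2, 3})` is back on stream 7 afterwards, even though the function switched streams internally.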

samnordmann · 2024-12-18T08:37:47Z

!test

samnordmann · 2024-12-18T09:28:40Z

!test

csrc/host_ir/host_ir.cpp

csrc/host_ir/executor.cpp

…ent_stream

samnordmann · 2024-12-23T11:51:03Z

!test

…lap: AG+GEMM layout (#3606)

Stacked on top of
- [x] #3608
- [x] #3605

# What

Lower a MatmulOp sharded on the first inner axis into a pipelined AG+GEMM algorithm achieving fine grained overlap.

We introduce a new parallel type `Stream` to account for this scheduling.

More precisely, this patch enables lowering the fusion:

```
TensorView* a = makeContigTensor(4); //[S, DIDx(D), M/(S*d), K]
TensorView* b = makeContigTensor(2); //[K, N]
TensorView* c = matmul(a, b); //[S, D, M/(S*D), N]

fusion->addInput(a);
fusion->addInput(b);
fusion->addOutput(c);

auto mesh = DeviceMesh::createForNumDevices(D);
a->setDeviceMesh(mesh);
b->setDeviceMesh(mesh);
c->setDeviceMesh(mesh);

a->axis(1)->parallelize(ParallelType::DIDx);
c->axis(0)->parallelize(ParallelType::Stream);
```

to the Host Ir program (obtained from dump, using `NVFUSER_DUMP=host_ir`)

```
%HostIrContainer { (T0_g_float[iS0{i0}, ideviceIdx.x1{i2}, iS2{i3}, iS3{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), T1_g_float[iS4{i5}, iS5{i6}] (DeviceMesh{0 1 2 3 4 5 6 7})) -> (T2_g_float[iStream6{i0}, iS7{i2}, iS8{i3}, iS9{i6}, rS10{i4}] (DeviceMesh{0 1 2 3 4 5 6 7})) :
  GetCurrentStream into Stream 0
  T3_g_float[iS11{i0}, iS12{i2}, iS13{i3}, iS14{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T3_g_float[iS11{i0}, iS12{i2}, iS13{i3}, iS14{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=( ( ( i0 * i2 ) * i3 ) * i4 ), zero_init=false, resets_to_zero=false)
  T2_g_float[iStream6{i0}, iS7{i2}, iS8{i3}, iS9{i6}, rS10{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T2_g_float[iStream6{i0}, iS7{i2}, iS8{i3}, iS9{i6}, rS10{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=( ( ( i0 * i2 ) * i3 ) * i6 ), zero_init=false, resets_to_zero=false)
  FOR i104 in iS0{i0}:
    SetCurrentStream to Stream ( i104 % numberOfStreams )
    T4_l_float[ideviceIdx.x15{i2}, iS16{i3}, iS17{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}) = select( T0_g_float[iS0{i0}, ideviceIdx.x1{i2}, iS2{i3}, iS3{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = iS0{i0}, index = i104 )
    T5_l_float[iS18{i2}, iS19{i3}, iS20{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}) = select( T3_g_float[iS11{i0}, iS12{i2}, iS13{i3}, iS14{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = iS11{i0}, index = i104 )
    Communication 46 (type=Allgather, team=(0 1 2 3 4 5 6 7), input=T4_l_float[ideviceIdx.x15{i2}, iS16{i3}, iS17{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), output=T5_l_float[iS18{i2}, iS19{i3}, iS20{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}))
    Wait Communication 46
    T6_l_float[iS21{i2}, iS22{i3}, iS23{i6}] (DeviceMesh{0 1 2 3 4 5 6 7}) = select( T2_g_float[iStream6{i0}, iS7{i2}, iS8{i3}, iS9{i6}, rS10{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = iStream6{i0}, index = i104 )
    T6_l_float[iS21{i2}, iS22{i3}, iS23{i6}] (DeviceMesh{0 1 2 3 4 5 6 7}) = matmul(T5_l_float[iS18{i2}, iS19{i3}, iS20{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), T1_g_float[iS4{i5}, iS5{i6}] (DeviceMesh{0 1 2 3 4 5 6 7}))
    SetCurrentStream to Stream 0
    Synchronize Stream ( i104 % numberOfStreams )
} // %HostIrContainer
```

The nsight profile shows that we do achieve overlap, in a way that is comparable to the Aten overlap experiments

![Screenshot 2024-12-18 at 12 08 05](https://github.com/user-attachments/assets/75e37822-a78d-49e6-a644-4fb99c40e945)
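The loop in the Host IR dump picks a stream for each iteration with `i104 % numberOfStreams`. A small sketch of that round-robin rule is given below. `assignStreams` is a hypothetical helper written for illustration only, and it assumes `numberOfStreams` is the size of a fixed pool of worker streams, which the dump does not state explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: maps each loop iteration to a worker-stream index
// using the same modulo rule as the dump (i % numberOfStreams).
std::vector<std::size_t> assignStreams(
    std::size_t numTiles,
    std::size_t numberOfStreams) {
  assert(numberOfStreams > 0);  // the modulo is undefined for an empty pool
  std::vector<std::size_t> assignment;
  assignment.reserve(numTiles);
  for (std::size_t i = 0; i < numTiles; ++i) {
    assignment.push_back(i % numberOfStreams);
  }
  return assignment;
}
```

With 5 tiles and 3 streams, iterations 0 through 4 map to streams 0, 1, 2, 0, 1, so consecutive tiles land on different streams and their allgather and matmul work can overlap.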

Host IR: add GetCurrentStream

38721fe

samnordmann requested a review from wujingyue December 18, 2024 08:38

samnordmann added 2 commits December 18, 2024 00:46

lint

c4ca266

lint

ed4440a

samnordmann mentioned this pull request Dec 18, 2024

Lower distributed matmul to pipelined algorithm for fine-grained overlap: AG+GEMM layout #3606

Merged

2 tasks

samnordmann requested a review from nsarka December 18, 2024 11:38

wujingyue approved these changes Dec 19, 2024


samnordmann added 2 commits December 23, 2024 03:04

minor review

741202b

Merge branch 'main' of github.com:NVIDIA/Fuser into host_irs/get_curr…

353c03c

…ent_stream

samnordmann merged commit 99fb12b into NVIDIA:main Dec 23, 2024
32 of 34 checks passed

xwang233 mentioned this pull request Jan 10, 2025

Lower distributed matmul to pipelined algorithm for fine-grained overlap: AG+GEMM layout #3695

Closed

2 tasks
