Is DeepGEMM directly applicable to backward in training? #10
Oh, so we must write a quantization kernel that produces the correct
So we need an FP8 GEMM with 128x1 LHS scaling and 1x128 RHS scaling?
We provide this library mainly for inference, so it only supports DGRAD, not WGRAD. In my understanding, WGRAD support needs more than a GEMM kernel; it also needs some fused utility kernels (e.g. transposing fused with casting, fused with SwiGLU, fused with MoE layout). We want this library to stay clean, so we didn't open-source them. We may release the WGRAD kernel later; we will discuss it internally :)
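To make that concrete, here is a minimal sketch in plain PyTorch (no FP8; the shapes and the forward convention Y = X @ W.T are my assumptions, not DeepGEMM's API) of the two backward GEMMs, annotated with why DGRAD lines up with the inference-style scaling while WGRAD would want the transposing/casting utilities mentioned above:

```python
import torch

M, N, K = 256, 512, 1024      # tokens, out-features, in-features (illustrative sizes)
X  = torch.randn(M, K)        # forward activations, 1x128-scaled along K in the FP8 scheme
W  = torch.randn(N, K)        # weights, 128x128-block-scaled
dY = torch.randn(M, N)        # gradient w.r.t. Y = X @ W.T

# DGRAD: dX = dY @ W reduces over N. dY plays the activation role (fresh
# 1x128 scales along N) and W keeps its square 128x128 block scales, so the
# inference-style GEMM layout still applies.
dX = dY @ W                   # [M, K]

# WGRAD: dW = dY.T @ X reduces over M, the token dimension. Both operands are
# activation-like, but their forward 1x128 scales run along the feature dims,
# not along M -- hence the need for fused transpose+cast (requantization)
# kernels before this GEMM.
dW = dY.t() @ X               # [N, K]
```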
Thank you, Chenggang!
@LyricZhao, could you hint at the approach you are taking for now? Since it seems infeasible to perform a (128, 1) x (1, 128) block-quantized GEMM (because Hopper tensor cores do not support scales), I am guessing you are using requantization of some kind. Do you use (1, 128) x (128, 128) blocks for WGRAD, or (1, 128) x (128, 1)?
The (128, 128) block scale is only used for the weight matrix and won't be used for activations.
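To make the two scale layouts concrete, here is an illustrative quantization sketch in PyTorch. The helper names and the 448.0 divisor (the max normal value of float8_e4m3fn) are my own choices; this is not DeepGEMM's quantization code, just one way to produce scale tensors with the shapes discussed above.

```python
import torch

def quant_activation_1x128(x: torch.Tensor):
    """One scale per (1, 128) group along the inner (reduction) dim."""
    M, K = x.shape
    g = x.view(M, K // 128, 128)
    scale = g.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12) / 448.0  # 448 = e4m3fn max
    q = (g / scale).to(torch.float8_e4m3fn)
    return q.view(M, K), scale.view(M, K // 128)          # scales: [M, K/128]

def quant_weight_128x128(w: torch.Tensor):
    """One scale per (128, 128) block of the weight matrix."""
    N, K = w.shape
    b = w.view(N // 128, 128, K // 128, 128).permute(0, 2, 1, 3)
    scale = b.abs().amax(dim=(-1, -2), keepdim=True).clamp_min(1e-12) / 448.0
    q = (b / scale).permute(0, 2, 1, 3).reshape(N, K).to(torch.float8_e4m3fn)
    return q, scale.view(N // 128, K // 128)              # scales: [N/128, K/128]

x = torch.randn(4096, 1024)          # activations: [tokens, hidden]
w = torch.randn(512, 1024)           # weights: [out-features, hidden]
xq, xs = quant_activation_1x128(x)   # xs.shape == (4096, 8)
wq, ws = quant_weight_128x128(w)     # ws.shape == (4, 8)
```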
The backward of a GEMM is two GEMMs, but I wonder whether I need to take any special care with the range of the gradients?
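On the range question: this is not an answer about which FP8 format to use, but a small plain-PyTorch illustration (the 1x128 grouping and the float8_e4m3fn round-trip are my assumptions) that per-group scaling makes the quantization error relative to each 128-element group, so the gradients' absolute magnitude matters less than their within-group dynamic range:

```python
import torch

def quant_dequant_1x128(x: torch.Tensor) -> torch.Tensor:
    """Round-trip an [M, K] tensor through float8_e4m3fn with one scale per (1, 128) group."""
    M, K = x.shape
    g = x.view(M, K // 128, 128)
    scale = g.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12) / 448.0  # 448 = e4m3fn max
    return ((g / scale).to(torch.float8_e4m3fn).to(torch.float32) * scale).view(M, K)

# Gradient tensors with very different absolute magnitudes see roughly the same
# relative quantization error, because each group is rescaled to the FP8 range.
for mag in (1.0, 1e-3, 1e-6):
    g = torch.randn(256, 1024) * mag
    rel = (quant_dequant_1x128(g) - g).abs().mean() / g.abs().mean()
    print(f"gradient magnitude {mag:g}: mean relative error {rel:.2e}")
```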