Is DeepGEMM directly applicable to backward in training? #10
Oh, so we must write a quantization kernel that produces the correct
So we need an FP8 GEMM with 128x1 LHS scaling and 1x128 RHS scaling?
We provide this library mainly for inference, so it only supports DGRAD, not WGRAD. In my understanding, WGRAD support needs more than a GEMM kernel; it also needs some fused utility kernels (e.g. transposing fused with casting, fused with SwiGLU, fused with MoE layout). We want this library to stay clean, so we didn't open-source them. We may release the WGRAD kernel later; we will discuss it internally :)
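To make that concrete, here is a minimal sketch in plain PyTorch (no FP8; the shapes and the forward convention Y = X @ W.T are my assumptions, not DeepGEMM's API) of the two backward GEMMs, annotated with why DGRAD lines up with the inference-style scaling while WGRAD would want the transposing/casting utilities mentioned above:

```python
import torch

M, N, K = 256, 512, 1024      # tokens, out-features, in-features (illustrative sizes)
X  = torch.randn(M, K)        # forward activations, 1x128-scaled along K in the FP8 scheme
W  = torch.randn(N, K)        # weights, 128x128-block-scaled
dY = torch.randn(M, N)        # gradient w.r.t. Y = X @ W.T

# DGRAD: dX = dY @ W reduces over N. dY plays the activation role (fresh
# 1x128 scales along N) and W keeps its square 128x128 block scales, so the
# inference-style GEMM layout still applies.
dX = dY @ W                   # [M, K]

# WGRAD: dW = dY.T @ X reduces over M, the token dimension. Both operands are
# activation-like, but their forward 1x128 scales run along the feature dims,
# not along M -- hence the need for fused transpose+cast (requantization)
# kernels before this GEMM.
dW = dY.t() @ X               # [N, K]
```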
Thank you, Chenggang!
@LyricZhao, could you hint at the approach you are taking for now? Since it seems infeasible to perform a (128, 1) x (1, 128) block-quantized GEMM (because Hopper tensor cores do not support scales), I am guessing you are using requantization of some kind. Do you use (1, 128) x (128, 128) blocks for WGRAD, or (1, 128) x (128, 1)?
The (128, 128) block scale is only used for the weight matrix and won't be used for activations.
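To make the two scale layouts concrete, here is an illustrative quantization sketch in PyTorch. The helper names and the 448.0 divisor (the max normal value of float8_e4m3fn) are my own choices; this is not DeepGEMM's quantization code, just one way to produce scale tensors with the shapes discussed above.

```python
import torch

def quant_activation_1x128(x: torch.Tensor):
    """One scale per (1, 128) group along the inner (reduction) dim."""
    M, K = x.shape
    g = x.view(M, K // 128, 128)
    scale = g.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12) / 448.0  # 448 = e4m3fn max
    q = (g / scale).to(torch.float8_e4m3fn)
    return q.view(M, K), scale.view(M, K // 128)          # scales: [M, K/128]

def quant_weight_128x128(w: torch.Tensor):
    """One scale per (128, 128) block of the weight matrix."""
    N, K = w.shape
    b = w.view(N // 128, 128, K // 128, 128).permute(0, 2, 1, 3)
    scale = b.abs().amax(dim=(-1, -2), keepdim=True).clamp_min(1e-12) / 448.0
    q = (b / scale).permute(0, 2, 1, 3).reshape(N, K).to(torch.float8_e4m3fn)
    return q, scale.view(N // 128, K // 128)              # scales: [N/128, K/128]

x = torch.randn(4096, 1024)          # activations: [tokens, hidden]
w = torch.randn(512, 1024)           # weights: [out-features, hidden]
xq, xs = quant_activation_1x128(x)   # xs.shape == (4096, 8)
wq, ws = quant_weight_128x128(w)     # ws.shape == (4, 8)
```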
The backward of a GEMM is two GEMMs, but I wonder whether I need to take any special care with the range of the gradients?
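On the range question: this is not an answer about which FP8 format to use, but a small plain-PyTorch illustration (the 1x128 grouping and the float8_e4m3fn round-trip are my assumptions) that per-group scaling makes the quantization error relative to each 128-element group, so the gradients' absolute magnitude matters less than their within-group dynamic range:

```python
import torch

def quant_dequant_1x128(x: torch.Tensor) -> torch.Tensor:
    """Round-trip an [M, K] tensor through float8_e4m3fn with one scale per (1, 128) group."""
    M, K = x.shape
    g = x.view(M, K // 128, 128)
    scale = g.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12) / 448.0  # 448 = e4m3fn max
    return ((g / scale).to(torch.float8_e4m3fn).to(torch.float32) * scale).view(M, K)

# Gradient tensors with very different absolute magnitudes see roughly the same
# relative quantization error, because each group is rescaled to the FP8 range.
for mag in (1.0, 1e-3, 1e-6):
    g = torch.randn(256, 1024) * mag
    rel = (quant_dequant_1x128(g) - g).abs().mean() / g.abs().mean()
    print(f"gradient magnitude {mag:g}: mean relative error {rel:.2e}")
```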