Both layouts of Grouped GEMMs need to be aligned to the GEMM M block size? #15

The m_grouped_gemm_fp8_fp8_bf16_nt_contiguous function documentation says that the M axis must be aligned to the GEMM M block size, which means the M of each group must be a multiple of 128? There is no such note for m_grouped_gemm_fp8_fp8_bf16_nt_masked, but I tested several M values (8, 16, 64) that don't meet that requirement in test/test_core.py via test_m_grouped_gemm_masked and found they can't pass the correctness validation. So does the masked-layout grouped GEMM also have this limitation implicitly?
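For concreteness, here is a minimal sketch of what the 128-row alignment means for the contiguous layout: each group's M is rounded up to the M block size before the groups are concatenated along M. The block size of 128 comes from the discussion above; the per-group row counts are made up, and the helper name in the comment is my understanding of the repo's public API, not something quoted in this issue.

```python
# Minimal sketch (plain Python): round each group's M up to the GEMM M block
# size required by the contiguous grouped-GEMM layout.
group_ms = [8, 16, 64, 300]   # hypothetical per-group row counts
m_block = 128                 # the alignment discussed above; reportedly also exposed
                              # as deep_gemm.get_m_alignment_for_contiguous_layout()

padded_ms = [(m + m_block - 1) // m_block * m_block for m in group_ms]
print(padded_ms)        # [128, 128, 128, 384]
print(sum(padded_ms))   # total rows the concatenated LHS / output must provide
```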
Comments
Yes.

No limitation for the masked kernel; could you please share your modified test script?

@LyricZhao Hi, thank you for your reply, and I appreciate your awesome work.
I just modified the two lines shown in the screenshots and then ran the test. Test machine environment: NVIDIA H800, CUDA 12.4, torch 2.6.0+cu124.

It looks fine now. Are these limitations necessary to achieve good performance? Can we remove them to support any M?

Not necessary, they are TMA limitations: with 2D TMA the per-group M has to stay aligned. A possible solution is to use 3D TMA.
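To make the masked path concrete, below is a hedged sketch of calling the masked grouped GEMM with per-group row counts that are not multiples of 128. The tensor shapes, the per-128-channel FP8 scale layout, and the ((lhs, lhs_scales), (rhs, rhs_scales), out, masked_m, expected_m) calling convention are my assumptions based on the repo's tests, not something stated in this thread, so treat it as illustrative only.

```python
import torch
import deep_gemm  # assumed import name

# Hypothetical sizes: 4 groups (experts), up to 128 rows each, K = 7168, N = 4096.
num_groups, m_max, k, n = 4, 128, 7168, 4096

# Valid rows per group -- deliberately NOT multiples of 128, per the discussion above.
masked_m = torch.tensor([8, 16, 64, 100], dtype=torch.int32, device='cuda')

# FP8 operands with per-128-channel FP32 scales (layout assumed, dummy values).
lhs = torch.randn(num_groups, m_max, k, device='cuda').to(torch.float8_e4m3fn)
lhs_scales = torch.ones(num_groups, m_max, k // 128, device='cuda', dtype=torch.float32)
rhs = torch.randn(num_groups, n, k, device='cuda').to(torch.float8_e4m3fn)
rhs_scales = torch.ones(num_groups, n // 128, k // 128, device='cuda', dtype=torch.float32)
out = torch.empty(num_groups, m_max, n, device='cuda', dtype=torch.bfloat16)

# Call signature assumed from the repository's README/tests; only the first
# masked_m[g] rows of each group in `out` are expected to hold valid results.
# The last argument is the expected (typical) valid M used for scheduling.
deep_gemm.m_grouped_gemm_fp8_fp8_bf16_nt_masked(
    (lhs, lhs_scales), (rhs, rhs_scales), out, masked_m, 64)
```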