
up matrix is not involved? #16

Open
yejunguo opened this issue Feb 24, 2025 · 6 comments

@yejunguo

yejunguo commented Feb 24, 2025

Hi,

Very roughly, MLA compresses the input into a latent tensor via the DOWN matrix, caches the latent tensor, and then converts the latent tensor back to 'normal' QKVs via the UP matrices before SDPA.

It looks like FlashMLA does not accept the UP matrices among its parameters, so the inputs to FlashMLA are 'normal' (MHA/GQA/MQA) QKVs?

IMHO, FlashMLA would be expected to accept the MLA caches, the UP matrices, etc., and perform the possible matrix absorption together with SDPA.
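For concreteness, here is a minimal PyTorch sketch of the naive decode path described above (cache the latent, decompress via the UP matrices, then run ordinary SDPA). The weight names W_UK/W_UV and all shapes are illustrative, loosely following DeepSeek-V2 notation; this is not FlashMLA's API.

```python
# "Naive" MLA decode path: decompress the latent KV cache back to per-head
# K/V via the UP matrices, then run ordinary SDPA on MHA-style tensors.
# Names and shapes are illustrative, not FlashMLA's API.
import torch
import torch.nn.functional as F

bs, n_heads, d_head, d_latent, seq = 2, 16, 128, 512, 1024

W_UK = torch.randn(n_heads, d_latent, d_head)   # UP matrix: latent -> key
W_UV = torch.randn(n_heads, d_latent, d_head)   # UP matrix: latent -> value

q    = torch.randn(bs, n_heads, 1, d_head)      # one decode-step query
c_kv = torch.randn(bs, seq, d_latent)           # cached latent tensor

# Decompress the latent cache into per-head K/V before attention
k = torch.einsum('btl,hld->bhtd', c_kv, W_UK)   # [bs, n_heads, seq, d_head]
v = torch.einsum('btl,hld->bhtd', c_kv, W_UV)

out = F.scaled_dot_product_attention(q, k, v)   # [bs, n_heads, 1, d_head]
```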

@MacavityT

Same question: without the UP matrix as an input, how can we use the MLA computation tricks from the DeepSeek-V2 paper?

@YLGH

YLGH commented Feb 24, 2025

For MLA, the q absorb and o absorb steps can be done separately from the attention, e.g.
q: [bs, num_q_heads, 128 (head dim)] -> q: [bs, num_q_heads, 512 (latent dim)], concatenated with q_rope: [bs, num_q_heads, 64].
The output of MLA will be [bs, num_q_heads, 512], which can then be down-projected independently.
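To make those shapes concrete, here is a small PyTorch sketch of this weight-absorbed decode path, with plain einsums standing in for the attention kernel. The names (W_UK, W_UV, c_kv, k_rope) are illustrative assumptions, and FlashMLA itself would replace the score/softmax/output lines.

```python
# Weight-absorbed MLA decode: fold W_UK into the query before attention,
# attend directly against the 512-dim latent cache (plus the 64-dim RoPE
# part), keep the output in latent space, and apply W_UV (foldable into
# W_O) outside the kernel. Illustrative shapes, not FlashMLA's API.
import torch

bs, n_heads, d_head, d_rope, d_latent, seq = 2, 16, 128, 64, 512, 1024

W_UK = torch.randn(n_heads, d_latent, d_head)   # latent -> key (per head)
W_UV = torch.randn(n_heads, d_latent, d_head)   # latent -> value (per head)

q_nope = torch.randn(bs, n_heads, 1, d_head)    # [bs, num_q_heads, 128]
q_rope = torch.randn(bs, n_heads, 1, d_rope)    # [bs, num_q_heads, 64]
c_kv   = torch.randn(bs, seq, d_latent)         # latent KV cache
k_rope = torch.randn(bs, seq, d_rope)           # shared RoPE key part

# q absorb: fold W_UK into the query -> [bs, num_q_heads, 1, 512]
q_abs = torch.einsum('bhqd,hld->bhql', q_nope, W_UK)

# Scores against the latent cache itself (no decompression of K)
scores = (torch.einsum('bhql,btl->bhqt', q_abs, c_kv)
          + torch.einsum('bhqr,btr->bhqt', q_rope, k_rope)) / (d_head + d_rope) ** 0.5
attn = scores.softmax(dim=-1)

# Output stays in the latent dim: [bs, num_q_heads, 1, 512]
o_latent = torch.einsum('bhqt,btl->bhql', attn, c_kv)

# o absorb: apply W_UV (foldable into W_O) outside the attention kernel
out = torch.einsum('bhql,hld->bhqd', o_latent, W_UV)
```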

@MacavityT

> For MLA, the q absorb and o absorb steps can be done separately from the attention, e.g. q: [bs, num_q_heads, 128 (head dim)] -> q: [bs, num_q_heads, 512 (latent dim)], concatenated with q_rope: [bs, num_q_heads, 64]. The output of MLA will be [bs, num_q_heads, 512], which can then be down-projected independently.

But this only seems to work for models that adopted MLA during training. If we want to apply FlashMLA to models like the LLaMA series to accelerate inference decoding, it doesn't seem to make sense.

@hhding

hhding commented Feb 24, 2025

> For MLA, the q absorb and o absorb steps can be done separately from the attention, e.g. q: [bs, num_q_heads, 128 (head dim)] -> q: [bs, num_q_heads, 512 (latent dim)], concatenated with q_rope: [bs, num_q_heads, 64]. The output of MLA will be [bs, num_q_heads, 512], which can then be down-projected independently.
>
> But this only seems to work for models that adopted MLA during training. If we want to apply FlashMLA to models like the LLaMA series to accelerate inference decoding, it doesn't seem to make sense.

You can convert MHA to MLA: https://arxiv.org/abs/2502.14837 (Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs)

@Thomas-MMJ

You can also convert GQA to MLA:

https://github.com/fxmeng/TransMLA
https://arxiv.org/abs/2502.07864

@yejunguo
Author

> For MLA, the q absorb and o absorb steps can be done separately from the attention, e.g. q: [bs, num_q_heads, 128 (head dim)] -> q: [bs, num_q_heads, 512 (latent dim)], concatenated with q_rope: [bs, num_q_heads, 64]. The output of MLA will be [bs, num_q_heads, 512], which can then be down-projected independently.

Agreed that the DOWN matrix is independent of FlashMLA, but it looks like my concern about the inputs of FlashMLA (the MLA caches and the UP matrices) still stands.
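As a sanity check on why the attention kernel itself never needs the UP matrix explicitly: scoring a query against a decompressed key equals scoring the W_UK-absorbed query against the latent directly, so the UP matrices can be applied entirely outside the kernel. A tiny numerical check (names and shapes illustrative):

```python
# Absorption identity: q . (W_UK^T c) == (W_UK q) . c, so W_UK can be folded
# into the query (and W_UV into the output projection) outside the kernel,
# which then only ever sees the latent cache. Illustrative shapes.
import torch

d_head, d_latent = 128, 512
q    = torch.randn(d_head)
c_kv = torch.randn(d_latent)          # one cached latent vector
W_UK = torch.randn(d_latent, d_head)  # UP matrix: latent -> key

score_decompressed = q @ (c_kv @ W_UK)   # decompress K first, then dot with q
score_absorbed     = (W_UK @ q) @ c_kv   # absorb W_UK into q, dot with latent

assert torch.allclose(score_decompressed, score_absorbed, atol=1e-3)
```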
