
Memory Optimization with Liger Kernel Shows Limited Effect on Larger Models (More than 7B) #517

Open
dyyoungg opened this issue Jan 8, 2025 · 1 comment



dyyoungg commented Jan 8, 2025

I have been using the Liger Kernel to replace standard operators when training the Qwen2.5 model series with the DeepSpeed ZeRO-3 strategy.
It significantly reduces memory usage on a 7B model (about 36%); however, it shows only limited memory savings (about 6%) on a 14B model.
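
For reference, I apply the kernels roughly like this before loading the model (a minimal sketch, not my exact script; the flag names follow my understanding of the Liger Kernel API, and Qwen2.5 shares the Qwen2 architecture):

```python
from transformers import AutoModelForCausalLM
from liger_kernel.transformers import apply_liger_kernel_to_qwen2

# Monkey-patch the Qwen2-family modules: RMSNorm, RoPE, SwiGLU, and the
# fused linear cross-entropy that avoids materializing the full logits tensor.
apply_liger_kernel_to_qwen2(
    rope=True,
    rms_norm=True,
    swiglu=True,
    fused_linear_cross_entropy=True,
)

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B")
```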

Questions:

  1. Are there known limitations in Liger Kernel optimizations for larger models like 14B?
  2. Is there any recommended configuration or parameter adjustment to improve memory efficiency for larger models?
@DandinPower (Contributor) commented:

Hi @dyyoungg,
I’m curious about this issue and wanted to share some insights based on my past experience.

In a similar scenario, I ran into a memory spike during the optimizer step while using the PyTorch Adam optimizer with its default setting `foreach=True`, because it requires a temporary copy of the model weights. I wonder if your situation involves a similar memory peak that isn't solely caused by the cross-entropy logits peak during training.
Even if the cross-entropy logits peak is mitigated by enabling the Liger Kernel, other memory bottlenecks might still contribute to the overall peak memory usage, which could explain why the reduction isn't significant.
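
If a similar optimizer-step spike is in play, one cheap experiment is to disable the multi-tensor update path (a minimal sketch; the `Linear` model is a stand-in for your actual network, and note that under ZeRO-3 DeepSpeed may construct its own fused optimizer, in which case this flag doesn't apply):

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the real model

# foreach=True (the CUDA default) batches updates into multi-tensor ops
# and allocates sizable temporary buffers during the optimizer step;
# foreach=False trades some step speed for a lower memory peak.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-5,
    foreach=False,
)
```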

Could you provide more details about your setup? I can then try to reproduce it and run an analysis; I've also put a small peak-memory measurement sketch after this list. Specifically:

  1. How many GPUs are you using? (ZeRO-3 partitions memory differently depending on the number of GPUs, which can significantly impact results.)
  2. What is the micro-batch size per GPU and the context length? (These parameters greatly affect activation memory usage.)
  3. Are you using other memory-efficient techniques, such as gradient checkpointing?
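
As mentioned above, a quick way to see where the peak actually lands is to reset and read the CUDA allocator counters around a single step (a minimal sketch; `train_step` and `batch` are placeholders for your training loop):

```python
import torch

torch.cuda.reset_peak_memory_stats()

train_step(batch)  # placeholder: one forward + backward + optimizer step

# max_memory_allocated reports the high-water mark since the last reset,
# so this captures whichever phase of the step owned the peak.
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```

Comparing that number with and without the Liger Kernel (and with `foreach=False`) should tell us whether the logits peak or the optimizer step dominates on the 14B model.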
