
Memory Optimization with Liger Kernel Shows Limited Effect on Larger Models (More than 7B) #517

Open
dyyoungg opened this issue Jan 8, 2025 · 1 comment



dyyoungg commented Jan 8, 2025

I have been using the Liger Kernel to replace standard operators when training the Qwen2.5 model series with the DeepSpeed ZeRO-3 strategy.
It significantly reduces memory usage on a 7B model (about 36%); however, it shows only limited memory savings (about 6%) on a 14B model.
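
For reference, I apply the kernels roughly like this before loading the model (a minimal sketch, not my exact script; the flag names follow my understanding of the Liger Kernel API, and Qwen2.5 shares the Qwen2 architecture):

```python
from transformers import AutoModelForCausalLM
from liger_kernel.transformers import apply_liger_kernel_to_qwen2

# Monkey-patch the Qwen2-family modules: RMSNorm, RoPE, SwiGLU, and the
# fused linear cross-entropy that avoids materializing the full logits tensor.
apply_liger_kernel_to_qwen2(
    rope=True,
    rms_norm=True,
    swiglu=True,
    fused_linear_cross_entropy=True,
)

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B")
```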

Questions:

  1. Are there known limitations in Liger Kernel optimizations for larger models like 14B?
  2. Is there any recommended configuration or parameter adjustment to improve memory efficiency for larger models?
@DandinPower (Contributor) commented:

Hi @dyyoungg,
I’m curious about this issue and wanted to share some insights based on my past experience.

In a similar scenario, I ran into a memory spike during the optimizer step while using the PyTorch Adam optimizer with its default setting `foreach=True`, because it requires a temporary copy of the model weights. I wonder if your situation involves a similar memory peak that isn't solely caused by the cross-entropy logits peak during training.
Even if the cross-entropy logits peak is mitigated by enabling the Liger Kernel, other memory bottlenecks might still contribute to the overall peak memory usage, which could explain why the reduction isn't significant.
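
If a similar optimizer-step spike is in play, one cheap experiment is to disable the multi-tensor update path (a minimal sketch; the `Linear` model is a stand-in for your actual network, and note that under ZeRO-3 DeepSpeed may construct its own fused optimizer, in which case this flag doesn't apply):

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the real model

# foreach=True (the CUDA default) batches updates into multi-tensor ops
# and allocates sizable temporary buffers during the optimizer step;
# foreach=False trades some step speed for a lower memory peak.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-5,
    foreach=False,
)
```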

Could you provide more details about your setup? I can then try to reproduce it and run an analysis; I've also put a small peak-memory measurement sketch after this list. Specifically:

  1. How many GPUs are you using? (ZeRO-3 partitions memory differently depending on the number of GPUs, which can significantly impact results.)
  2. What is the micro-batch size per GPU and the context length? (These parameters greatly affect activation memory usage.)
  3. Are you using other memory-efficient techniques, such as gradient checkpointing?
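
As mentioned above, a quick way to see where the peak actually lands is to reset and read the CUDA allocator counters around a single step (a minimal sketch; `train_step` and `batch` are placeholders for your training loop):

```python
import torch

torch.cuda.reset_peak_memory_stats()

train_step(batch)  # placeholder: one forward + backward + optimizer step

# max_memory_allocated reports the high-water mark since the last reset,
# so this captures whichever phase of the step owned the peak.
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```

Comparing that number with and without the Liger Kernel (and with `foreach=False`) should tell us whether the logits peak or the optimizer step dominates on the 14B model.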
