I have been using the Liger Kernel to replace the standard operators when training the Qwen2.5 model series with the DeepSpeed ZeRO-3 strategy.
It significantly reduces memory usage on the 7B model (about 36%); however, it shows only limited memory savings (about 6%) on the 14B model.
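For reference, here is a simplified sketch of how the Liger kernels are typically applied to Qwen2/Qwen2.5 (my actual training script has more around it, and the exact flags may differ):

```python
import torch
from transformers import AutoModelForCausalLM
from liger_kernel.transformers import apply_liger_kernel_to_qwen2

# Patch the Qwen2/Qwen2.5 modules in transformers with Liger's Triton kernels
# before the model is instantiated.
apply_liger_kernel_to_qwen2(
    rms_norm=True,
    swiglu=True,
    fused_linear_cross_entropy=True,  # avoids materializing the full logits tensor
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-14B",  # same setup used for Qwen2.5-7B
    torch_dtype=torch.bfloat16,
)
```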
Questions:
Are there known limitations in Liger Kernel optimizations for larger models like 14B?
Is there any recommended configuration or parameter adjustment to improve memory efficiency for larger models?
Hi @dyyoungg,
I’m curious about this issue and wanted to share some insights based on my past experience.
In a similar scenario, I encountered a memory spike during the optimizer step while using the PyTorch Adam optimizer with the default setting foreach=True. This happened because it required a copy of the model weights. I wonder if your situation involves a similar memory peak that isn’t solely caused by the cross-entropy logits peak during training.
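In my case, switching off the multi-tensor path looked roughly like this (a minimal sketch with torch.optim.AdamW and a toy module standing in for the real model):

```python
import torch
import torch.nn as nn

# Toy stand-in for the real model, just to show the optimizer flag.
model = nn.Linear(4096, 4096)

# foreach=True (the default when all tensors are on CUDA) takes the
# multi-tensor path, which can add a temporary peak during optimizer.step().
# foreach=False trades some speed for a flatter memory profile.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, foreach=False)
```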
Even if the cross-entropy logits peak is mitigated by enabling the Liger Kernel, other memory bottlenecks might still be contributing to the overall peak memory usage, which could explain why the reduction isn’t significant.
Could you provide more details about your setup? I can then try to reproduce it and perform an analysis. Specifically:
How many GPUs are you using? (ZeRO-3 partitions memory differently depending on the number of GPUs, which can significantly impact results.)
What is the micro-batch size per GPU and the context length? (These parameters greatly affect activation memory usage.)
Are you using other memory-efficient techniques, such as gradient checkpointing? (I've sketched below the kind of settings I'm asking about.)
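For reference, a hypothetical example of the settings I mean (illustration only, not your actual config):

```python
# Hypothetical DeepSpeed config, just to illustrate the knobs I'm asking about.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # micro-batch size per GPU
    "gradient_accumulation_steps": 8,
    "zero_optimization": {"stage": 3},     # ZeRO-3 partitioning across the GPUs
    "bf16": {"enabled": True},
}

# With a Hugging Face model, activation (gradient) checkpointing is usually
# enabled before wrapping with DeepSpeed, e.g.:
#   model.gradient_checkpointing_enable()
```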