
[BUG] Using fp16 uses more memory than using fp32 #1349

Open
eliird opened this issue Jan 8, 2025 · 3 comments

eliird commented Jan 8, 2025

Describe the bug
Using fp16 or bf16 uses more memory than using fp32

To Reproduce
Here are the training parameters I am using to train the model. When I comment out --fp16, the memory usage decreases.
My setup is 8×H100.

GPT_MODEL_ARGS=(
    --num-layers 32
    --hidden-size 4096
    --num-attention-heads 32
    --seq-length 4096
    --no-position-embedding
    --no-masked-softmax-fusion
    --use-rotary-position-embeddings
    --max-position-embeddings 8192
    --attention-dropout 0
    --hidden-dropout 0
    --normalization RMSNorm
    --ffn-hidden-size 14336
    --num-query-groups 8
    --swiglu
    --group-query-attention
    --tokenizer-type HuggingFaceTokenizer
    # --untie-embeddings-and-output-weights
    --position-embedding-type rope
    --disable-bias-linear
    --tokenizer-model $TOKENIZER_SAVE_PATH
)

TRAINING_ARGS=(
    --micro-batch-size $MICRO_BATCH_SIZE
    --global-batch-size $GLOBAL_BATCH_SIZE
    --train-iters 500000
    --weight-decay 0.1
    --adam-beta1 0.9
    --adam-beta2 0.95
    --init-method-std 0.006
    --clip-grad 1.0
    --fp16 # disabling this parameter should use fp32, and it reduces memory usage.
    --lr 6.0e-5
    --lr-decay-style cosine
    --min-lr 6.0e-6
    --lr-warmup-fraction .001
    --lr-decay-iters 430000
    --optimizer sgd
    --empty-unused-memory-level 2
    --recompute-granularity "full"
    --recompute-method uniform
    --recompute-num-layers 1
    --transformer-impl "transformer_engine"

)

MODEL_PARALLEL_ARGS=(
    --tensor-model-parallel-size 8
    --pipeline-model-parallel-size 1
    --sequence-parallel
)

DATA_ARGS=(
    --data-path $DATA_PATH
    --split 949,50,1
)

EVAL_AND_LOGGING_ARGS=(
    --log-interval 10
    --save-interval 10000
    --eval-interval 1000
    --save $CHECKPOINT_SAVE_PATH
    # --load $CHECKPOINT_LOAD_PATH
    --eval-iters 10
    --tensorboard-dir $TENSORBOARD_LOGS_PATH
    --log-throughput
)

python pretrain_gpt.py \
    ${GPT_MODEL_ARGS[@]} \
    ${TRAINING_ARGS[@]} \
    ${MODEL_PARALLEL_ARGS[@]} \
    ${DATA_ARGS[@]} \
    ${EVAL_AND_LOGGING_ARGS[@]}

Expected behavior
FP16 should use less memory than FP32.

Stack trace/logs
FP16 memory usage: [screenshot]

FP32 memory usage: [screenshot]

Environment (please complete the following information):

  • Megatron-LM commit ID 1ce944c
  • PyTorch version 2.4
  • CUDA version 12.1
  • NCCL version
eliird changed the title from "[BUG]" to "[BUG] Using fp16 uses more memory than using fp32" on Jan 8, 2025

eliird commented Jan 8, 2025

I tried looking at the internal code for loading the model, and it seems the model is moved to the GPU first and then converted to fp16. Wouldn't that consume more memory while the model is being loaded? It probably has nothing to do with the steady-state memory usage, but still...

Megatron-LM/megatron/training/training.py line 535

# GPU allocation.
for model_module in model:
    model_module.cuda(torch.cuda.current_device())

# Fp16 conversion.
if args.fp16 or args.bf16:
    model = [Float16Module(model_module, args) for model_module in model]
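
For what it's worth, the transient effect of this ordering can be reproduced with a small standalone snippet. This is only a toy stand-in, not Megatron's actual code path, and the layer count and sizes below are arbitrary; it just illustrates that the full fp32 weights sit on the GPU before the half-precision conversion frees them.

import torch
import torch.nn as nn

# Toy stand-in for the real model: a stack of large fp32 linear layers.
model = nn.Sequential(*[nn.Linear(4096, 4096, bias=False) for _ in range(8)])

# Same order as training.py: move to the GPU first, convert afterwards.
model.cuda(torch.cuda.current_device())
fp32_mib = torch.cuda.memory_allocated() / 2**20   # weights resident in fp32

model.half()                                        # stand-in for the Float16Module wrapping
fp16_mib = torch.cuda.memory_allocated() / 2**20    # fp32 copies freed after conversion

print(f"after .cuda(): {fp32_mib:.0f} MiB, after .half(): {fp16_mib:.0f} MiB")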


eliird commented Jan 8, 2025

I am still trying to look through the code, but the main difference is that the fp16 optimizer has parameter groups with both fp32 and fp16 parameters, so duplicate memory is probably being used somewhere. I will try to investigate a bit more, but some feedback on this would be appreciated, especially if someone can confirm that their memory usage also increases with fp16.
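
As a rough sanity check of that theory, below is a back-of-the-envelope estimate of the static per-parameter state (weights, gradients, optimizer copies) under both settings, assuming plain SGD with no momentum. The byte counts are my assumptions about the usual mixed-precision bookkeeping (fp16 weights and grads plus fp32 master weights and fp32 main grads), not numbers read out of Megatron, and the parameter count is only an approximation for the 32-layer / 4096-hidden config above; activations and workspaces are ignored.

# Assumed per-parameter bytes; not measured from Megatron-LM.
n_params = 8.0e9  # rough total for the model config in this issue

fp32_bytes = n_params * (4 + 4)           # fp32 weights + fp32 grads
fp16_bytes = n_params * (2 + 2 + 4 + 4)   # fp16 weights + fp16 grads
                                          # + fp32 master weights + fp32 main grads

print(f"fp32 run: {fp32_bytes / 2**30:.1f} GiB")   # ~59.6 GiB
print(f"fp16 run: {fp16_bytes / 2**30:.1f} GiB")   # ~89.4 GiB

If that is indeed the bookkeeping --fp16 triggers with a plain SGD optimizer, the fp32 master copies alone would more than offset the savings from the half-precision weights, which would be consistent with the screenshots above.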


eliird commented Jan 8, 2025

Maybe the cause of the increased memory is the parameter being detached and cloned in the initialization of the FP16Optimizer class. I am adding the snippet of the code below; it is probably better to refer to the full code. I will do some profiling later.

main_param = param.detach().clone().float()
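
A simplified, self-contained sketch of that pattern (not the actual FP16Optimizer code; the shape is arbitrary) shows the duplication directly: for each fp16 parameter, a detached fp32 clone is allocated on the same device, so both precisions are resident at once.

import torch

# One toy fp16 parameter; the shape is arbitrary.
param = torch.nn.Parameter(
    torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
)

# The master-copy pattern quoted above: a detached fp32 clone of the fp16 param.
main_param = param.detach().clone().float()

# Roughly 32 MiB for the fp16 tensor plus 64 MiB for its fp32 master copy.
print(f"{torch.cuda.memory_allocated() / 2**20:.0f} MiB allocated")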
