Ray OOM #74

aivolcano opened this issue Feb 16, 2025 · 0 comments

I do not know how to fix this Ray OOM issue. It occurs after epoch 0, step 20 (20 is the evaluation interval). Could you help me figure out how to fix it? I show my hyperparameters and the error output below. I am running on a single machine with 8 H100 80GB GPUs and 1600 GB of memory.

Below is the error output:

FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html . [repeated 5x across cluster]
(WorkerDict pid=3549567)   warnings.warn( [repeated 5x across cluster]
(main_task pid=3545219) /hpctmp/e1143641/TinyZero/verl/trainer/ppo/ray_trainer.py:446: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
(main_task pid=3545219)   reward_tensor = torch.tensor(reward_metrics['reward']) # self.val_reward_fn(test_batch)
Error executing job with overrides: ['algorithm.adv_estimator=grpo', 'data.train_files=data/scierc/train_simply_prompt.parquet', 'data.val_files=data/scierc/test_simply_prompt.parquet', 'data.train_batch_size=32', 'data.val_batch_size=32', 'data.max_prompt_length=512', 'data.max_response_length=2048', 'actor_rollout_ref.model.path=hf_models/qwen2.5-7b-instruct', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.model.use_remove_padding=True', 'actor_rollout_ref.actor.ppo_mini_batch_size=4', 'actor_rollout_ref.actor.ppo_micro_batch_size=4', 'actor_rollout_ref.actor.use_kl_loss=True', 'actor_rollout_ref.actor.kl_loss_coef=0.001', 'actor_rollout_ref.actor.kl_loss_type=low_var_kl', 'actor_rollout_ref.model.enable_gradient_checkpointing=True', 'actor_rollout_ref.actor.fsdp_config.param_offload=True', 'actor_rollout_ref.actor.fsdp_config.grad_offload=False', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=False', 'actor_rollout_ref.rollout.log_prob_micro_batch_size=4', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.8', 'actor_rollout_ref.rollout.n=2', 'actor_rollout_ref.ref.log_prob_micro_batch_size=4', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.critic_warmup=0', 'trainer.logger=wandb', 'trainer.project_name=TinyZero', 'trainer.experiment_name=qwen2.5_7b_grpo_training_rollingout_4_zero', 'trainer.n_gpus_per_node=8', 'trainer.nnodes=1', 'trainer.save_freq=800', 'trainer.test_freq=20', 'trainer.total_epochs=15']
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/hpctmp/e1143641/TinyZero/verl/trainer/main_ppo.py", line 320, in <module>
    main()
  File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/hpctmp/e1143641/TinyZero/verl/trainer/main_ppo.py", line 228, in main
    ray.get(main_task.remote(config))
  File "/home/svu/e1143641/.local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/svu/e1143641/.local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/svu/e1143641/.local/lib/python3.10/site-packages/ray/_private/worker.py", line 2772, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/home/svu/e1143641/.local/lib/python3.10/site-packages/ray/_private/worker.py", line 921, in get_objects
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 192.168.12.104, ID: 760ca79c45fce78ca830e235a13d4d933bd855c576cef5b1f020970e) where the task (task ID: f8089076beec7eecbbdbe086695caee07220fd6301000000, name=main_task, pid=3545219, memory used=0.54GB) was running was 1925.14GB / 2015.37GB (0.955227), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: fc55ea58525a03ea6dab2368ae889ea6ee0c1ae76f90c00d5d9563af) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 192.168.12.104`. To see the logs of the worker, use `ray logs worker-fc55ea58525a03ea6dab2368ae889ea6ee0c1ae76f90c00d5d9563af*out -ip 192.168.12.104. Top 10 memory users:
PID	MEM(GB)	COMMAND
3549568	119.13	ray::WorkerDict.actor_rollout_generate_sequences
3549565	118.93	ray::WorkerDict.actor_rollout_generate_sequences
3549567	117.66	ray::WorkerDict.actor_rollout_generate_sequences
3549564	117.53	ray::WorkerDict.actor_rollout_generate_sequences
3549566	117.14	ray::WorkerDict.actor_rollout_generate_sequences
3549562	115.74	ray::WorkerDict.actor_rollout_generate_sequences
3549563	113.32	ray::WorkerDict.actor_rollout_generate_sequences
3546151	112.60	ray::WorkerDict.actor_rollout_generate_sequences
3532996	0.68	/home/svu/e1143641/.local/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/va...
3545219	0.54	ray::main_task
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
INFO:    Terminating squashfuse_ll after timeout
INFO:    Timeouts can be caused by a running background process
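
The end of the error message suggests adjusting Ray's memory monitor. A minimal sketch of what I could try when launching, based only on that hint (I have not verified that it addresses the underlying memory growth, and the 0.98 value is just a guess):

# Raise the kill threshold from the default 0.95, or disable the monitor entirely.
# Both variables have to be set in the environment before Ray starts.
export RAY_memory_usage_threshold=0.98
# export RAY_memory_monitor_refresh_ms=0   # disables worker killing; risks a hard system OOM instead
python3 -m verl.trainer.main_ppo ...       # same overrides as below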

Below are my hyperparameters:

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=data/scierc/train_simply_prompt.parquet \
    data.val_files=data/scierc/test_simply_prompt.parquet \
    data.train_batch_size=32 \
    data.val_batch_size=32 \
    data.max_prompt_length=512 \
    data.max_response_length=2048 \
    actor_rollout_ref.model.path=hf_models/qwen2.5-7b-instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=4 \
    actor_rollout_ref.actor.ppo_micro_batch_size=4 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.grad_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size=4 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
    actor_rollout_ref.rollout.n=2 \
    actor_rollout_ref.ref.log_prob_micro_batch_size=4 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.critic_warmup=0 \
    trainer.logger='wandb' \
    trainer.project_name='TinyZero' \
    trainer.experiment_name='qwen2.5_7b_grpo_training_rollingout_4_zero' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_freq=800 \
    trainer.test_freq=20 \
    trainer.total_epochs=15
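
If it helps with debugging, I can also log host memory while the job runs to see whether the growth happens during rollout generation or during the validation at step 20. A minimal sketch of the loop I would run alongside the training command (host_mem.log is just an arbitrary file name):

# Record host memory usage every 30 seconds in the background.
while true; do
    date
    free -g
    sleep 30
done > host_mem.log 2>&1 &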