I do not know how to fix this Ray OOM issue. It occurs right after step 20 of epoch 0 (20 is the evaluation interval). Could you help me figure out how to fix this? My hyper-parameters and the error info are shown below. I am running on a single machine with 8 H100 80GB GPUs and 1600GB of memory.

Below is my error info:
FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict . Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html [repeated 5x across cluster]
(WorkerDict pid=3549567) warnings.warn( [repeated 5x across cluster]
(main_task pid=3545219) /hpctmp/e1143641/TinyZero/verl/trainer/ppo/ray_trainer.py:446: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
(main_task pid=3545219) reward_tensor = torch.tensor(reward_metrics['reward'])  # self.val_reward_fn(test_batch)
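(Side note: I think this UserWarning is unrelated to the OOM. For reference, the pattern the warning recommends would look something like the sketch below; that `reward_metrics['reward']` is already a tensor is my assumption, not something I verified in ray_trainer.py.)

```python
import torch

# Hypothetical example data: I'm assuming reward_metrics['reward'] is already a tensor.
reward_metrics = {'reward': torch.rand(32)}

# Current line in ray_trainer.py (triggers the warning when the source is a tensor):
# reward_tensor = torch.tensor(reward_metrics['reward'])

# Pattern the warning recommends instead:
reward_tensor = reward_metrics['reward'].clone().detach()
```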
Error executing job with overrides: ['algorithm.adv_estimator=grpo', 'data.train_files=data/scierc/train_simply_prompt.parquet', 'data.val_files=data/scierc/test_simply_prompt.parquet', 'data.train_batch_size=32', 'data.val_batch_size=32', 'data.max_prompt_length=512', 'data.max_response_length=2048', 'actor_rollout_ref.model.path=hf_models/qwen2.5-7b-instruct', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.model.use_remove_padding=True', 'actor_rollout_ref.actor.ppo_mini_batch_size=4', 'actor_rollout_ref.actor.ppo_micro_batch_size=4', 'actor_rollout_ref.actor.use_kl_loss=True', 'actor_rollout_ref.actor.kl_loss_coef=0.001', 'actor_rollout_ref.actor.kl_loss_type=low_var_kl', 'actor_rollout_ref.model.enable_gradient_checkpointing=True', 'actor_rollout_ref.actor.fsdp_config.param_offload=True', 'actor_rollout_ref.actor.fsdp_config.grad_offload=False', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=False', 'actor_rollout_ref.rollout.log_prob_micro_batch_size=4', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.8', 'actor_rollout_ref.rollout.n=2', 'actor_rollout_ref.ref.log_prob_micro_batch_size=4', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.critic_warmup=0', 'trainer.logger=wandb', 'trainer.project_name=TinyZero', 'trainer.experiment_name=qwen2.5_7b_grpo_training_rollingout_4_zero', 'trainer.n_gpus_per_node=8', 'trainer.nnodes=1', 'trainer.save_freq=800', 'trainer.test_freq=20', 'trainer.total_epochs=15']
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/hpctmp/e1143641/TinyZero/verl/trainer/main_ppo.py", line 320, in<module>main()
File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/hpctmp/e1143641/TinyZero/verl/trainer/main_ppo.py", line 228, in main
ray.get(main_task.remote(config))
File "/home/svu/e1143641/.local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/home/svu/e1143641/.local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/home/svu/e1143641/.local/lib/python3.10/site-packages/ray/_private/worker.py", line 2772, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/home/svu/e1143641/.local/lib/python3.10/site-packages/ray/_private/worker.py", line 921, in get_objects
raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 192.168.12.104, ID: 760ca79c45fce78ca830e235a13d4d933bd855c576cef5b1f020970e) where the task (task ID: f8089076beec7eecbbdbe086695caee07220fd6301000000, name=main_task, pid=3545219, memory used=0.54GB) was running was 1925.14GB / 2015.37GB (0.955227), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: fc55ea58525a03ea6dab2368ae889ea6ee0c1ae76f90c00d5d9563af) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 192.168.12.104`. To see the logs of the worker, use `ray logs worker-fc55ea58525a03ea6dab2368ae889ea6ee0c1ae76f90c00d5d9563af*out -ip 192.168.12.104`.
Top 10 memory users:
PID      MEM(GB)  COMMAND
3549568  119.13   ray::WorkerDict.actor_rollout_generate_sequences
3549565  118.93   ray::WorkerDict.actor_rollout_generate_sequences
3549567  117.66   ray::WorkerDict.actor_rollout_generate_sequences
3549564  117.53   ray::WorkerDict.actor_rollout_generate_sequences
3549566  117.14   ray::WorkerDict.actor_rollout_generate_sequences
3549562  115.74   ray::WorkerDict.actor_rollout_generate_sequences
3549563  113.32   ray::WorkerDict.actor_rollout_generate_sequences
3546151  112.60   ray::WorkerDict.actor_rollout_generate_sequences
3532996  0.68     /home/svu/e1143641/.local/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/va...
3545219  0.54     ray::main_task
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
INFO: Terminating squashfuse_ll after timeout
INFO: Timeouts can be caused by a running background process
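As a possible stopgap, the message above says the kill threshold can be raised (or the memory monitor disabled) via environment variables set before Ray starts. An untested sketch of that, assuming Ray is launched from the training process itself rather than attached to an already-running cluster:

```python
import os
import ray

# Per the error message above: raise the kill threshold from the default 0.95,
# or set RAY_memory_monitor_refresh_ms to 0 to disable worker killing entirely.
# These must be in the environment before the Ray runtime starts.
os.environ["RAY_memory_usage_threshold"] = "0.98"
# os.environ["RAY_memory_monitor_refresh_ms"] = "0"

ray.init()  # the locally started cluster inherits these settings
```

I realize this only changes when Ray kills workers; if the rollout workers really do need that much host memory, reducing their footprint is probably the real fix.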
My hyper-parameters are the Hydra overrides shown at the top of the error output above.
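To save readers from parsing the one-line Hydra dump, here are the memory-relevant overrides restated as a plain Python dict (my own summary, not a file from the repo):

```python
# Restated from the "Error executing job with overrides" line above.
memory_relevant_overrides = {
    "data.train_batch_size": 32,
    "data.max_prompt_length": 512,
    "data.max_response_length": 2048,
    "actor_rollout_ref.actor.ppo_mini_batch_size": 4,
    "actor_rollout_ref.actor.ppo_micro_batch_size": 4,
    "actor_rollout_ref.actor.fsdp_config.param_offload": True,  # parameters offloaded to host RAM
    "actor_rollout_ref.ref.fsdp_config.param_offload": True,
    "actor_rollout_ref.rollout.tensor_model_parallel_size": 2,
    "actor_rollout_ref.rollout.gpu_memory_utilization": 0.8,
    "actor_rollout_ref.rollout.n": 2,
    "trainer.n_gpus_per_node": 8,
    "trainer.test_freq": 20,  # the run dies at this first evaluation step
}
```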