Ray OOM #74

aivolcano opened this issue Feb 16, 2025 · 0 comments

I do not know how to fix this Ray OOM issue. It occurs after epoch 0, step 20 (20 is the evaluation interval). Could you help me figure out how to fix it? I show my hyperparameters and the error output below. I am running on a single machine with 8 H100 80GB GPUs and 1600 GB of memory.

Below is the error output:

FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html . [repeated 5x across cluster]
(WorkerDict pid=3549567)   warnings.warn( [repeated 5x across cluster]
(main_task pid=3545219) /hpctmp/e1143641/TinyZero/verl/trainer/ppo/ray_trainer.py:446: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
(main_task pid=3545219)   reward_tensor = torch.tensor(reward_metrics['reward']) # self.val_reward_fn(test_batch)
Error executing job with overrides: ['algorithm.adv_estimator=grpo', 'data.train_files=data/scierc/train_simply_prompt.parquet', 'data.val_files=data/scierc/test_simply_prompt.parquet', 'data.train_batch_size=32', 'data.val_batch_size=32', 'data.max_prompt_length=512', 'data.max_response_length=2048', 'actor_rollout_ref.model.path=hf_models/qwen2.5-7b-instruct', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.model.use_remove_padding=True', 'actor_rollout_ref.actor.ppo_mini_batch_size=4', 'actor_rollout_ref.actor.ppo_micro_batch_size=4', 'actor_rollout_ref.actor.use_kl_loss=True', 'actor_rollout_ref.actor.kl_loss_coef=0.001', 'actor_rollout_ref.actor.kl_loss_type=low_var_kl', 'actor_rollout_ref.model.enable_gradient_checkpointing=True', 'actor_rollout_ref.actor.fsdp_config.param_offload=True', 'actor_rollout_ref.actor.fsdp_config.grad_offload=False', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=False', 'actor_rollout_ref.rollout.log_prob_micro_batch_size=4', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.8', 'actor_rollout_ref.rollout.n=2', 'actor_rollout_ref.ref.log_prob_micro_batch_size=4', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.critic_warmup=0', 'trainer.logger=wandb', 'trainer.project_name=TinyZero', 'trainer.experiment_name=qwen2.5_7b_grpo_training_rollingout_4_zero', 'trainer.n_gpus_per_node=8', 'trainer.nnodes=1', 'trainer.save_freq=800', 'trainer.test_freq=20', 'trainer.total_epochs=15']
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/hpctmp/e1143641/TinyZero/verl/trainer/main_ppo.py", line 320, in <module>
    main()
  File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/svu/e1143641/.local/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/hpctmp/e1143641/TinyZero/verl/trainer/main_ppo.py", line 228, in main
    ray.get(main_task.remote(config))
  File "/home/svu/e1143641/.local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/svu/e1143641/.local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/svu/e1143641/.local/lib/python3.10/site-packages/ray/_private/worker.py", line 2772, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/home/svu/e1143641/.local/lib/python3.10/site-packages/ray/_private/worker.py", line 921, in get_objects
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 192.168.12.104, ID: 760ca79c45fce78ca830e235a13d4d933bd855c576cef5b1f020970e) where the task (task ID: f8089076beec7eecbbdbe086695caee07220fd6301000000, name=main_task, pid=3545219, memory used=0.54GB) was running was 1925.14GB / 2015.37GB (0.955227), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: fc55ea58525a03ea6dab2368ae889ea6ee0c1ae76f90c00d5d9563af) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 192.168.12.104`. To see the logs of the worker, use `ray logs worker-fc55ea58525a03ea6dab2368ae889ea6ee0c1ae76f90c00d5d9563af*out -ip 192.168.12.104. Top 10 memory users:
PID	MEM(GB)	COMMAND
3549568	119.13	ray::WorkerDict.actor_rollout_generate_sequences
3549565	118.93	ray::WorkerDict.actor_rollout_generate_sequences
3549567	117.66	ray::WorkerDict.actor_rollout_generate_sequences
3549564	117.53	ray::WorkerDict.actor_rollout_generate_sequences
3549566	117.14	ray::WorkerDict.actor_rollout_generate_sequences
3549562	115.74	ray::WorkerDict.actor_rollout_generate_sequences
3549563	113.32	ray::WorkerDict.actor_rollout_generate_sequences
3546151	112.60	ray::WorkerDict.actor_rollout_generate_sequences
3532996	0.68	/home/svu/e1143641/.local/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/va...
3545219	0.54	ray::main_task
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
INFO:    Terminating squashfuse_ll after timeout
INFO:    Timeouts can be caused by a running background process
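
The end of the error message suggests adjusting Ray's memory monitor. A minimal sketch of what I could try when launching, based only on that hint (I have not verified that it addresses the underlying memory growth, and the 0.98 value is just a guess):

# Raise the kill threshold from the default 0.95, or disable the monitor entirely.
# Both variables have to be set in the environment before Ray starts.
export RAY_memory_usage_threshold=0.98
# export RAY_memory_monitor_refresh_ms=0   # disables worker killing; risks a hard system OOM instead
python3 -m verl.trainer.main_ppo ...       # same overrides as below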

Below are my hyperparameters:

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=data/scierc/train_simply_prompt.parquet \
    data.val_files=data/scierc/test_simply_prompt.parquet \
    data.train_batch_size=32 \
    data.val_batch_size=32 \
    data.max_prompt_length=512 \
    data.max_response_length=2048 \
    actor_rollout_ref.model.path=hf_models/qwen2.5-7b-instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=4 \
    actor_rollout_ref.actor.ppo_micro_batch_size=4 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.grad_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size=4 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
    actor_rollout_ref.rollout.n=2 \
    actor_rollout_ref.ref.log_prob_micro_batch_size=4 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.critic_warmup=0 \
    trainer.logger='wandb' \
    trainer.project_name='TinyZero' \
    trainer.experiment_name='qwen2.5_7b_grpo_training_rollingout_4_zero' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_freq=800 \
    trainer.test_freq=20 \
    trainer.total_epochs=15
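
If it helps with debugging, I can also log host memory while the job runs to see whether the growth happens during rollout generation or during the validation at step 20. A minimal sketch of the loop I would run alongside the training command (host_mem.log is just an arbitrary file name):

# Record host memory usage every 30 seconds in the background.
while true; do
    date
    free -g
    sleep 30
done > host_mem.log 2>&1 &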