running a3fe on CPU #16

Closed
stanwlodek opened this issue Dec 9, 2024 · 3 comments

Comments

@stanwlodek

Hi, I am Stan Wlodek. Dr Julien Michel encouraged me to ask here if I ran into problems running a3fe on CPU. I am trying to reproduce the results for your simplest system, T4L, using ligand.sdf and protein.pdb as input. My run_somd.sh looks like this:

#!/bin/bash
#SBATCH -o somd-array-cpu-%A.%a.out
#SBATCH -n 8
lam=$1
echo "lambda is: " $lam
srun somd-freenrg -C somd.cfg -l $lam -p CPU

The run seems to hang at the setup stage, logging the following message every minute:

INFO - 2024-12-09 16:28:32,240 - Leg (type = BOUND)_3 - Waiting for job Job (virtual_job_id = 0, slurm_job_id= 17), status = JobStatus.QUEUED to complete

Do you have any suggestions as to what I am doing wrong?
Thanks

@fjclark
Collaborator

fjclark commented Dec 10, 2024

Hi Stan. Thanks for posting this.

I assume that you don't have access to GPUs, but I would strongly recommend running on GPUs if at all possible. The calculations will be slow to the point of being unusable if run on CPUs only.

  • A first guess is that the calculation has reached the slow MD heating and equilibration stages (shown by e.g. INFO - 2024-12-10 11:06:29,816 - Leg (type = BOUND)_1 - Heating and equilibrating. Submitting through SLURM...), and is simply taking a very long time to run on CPUs rather than GPUs. To rule this out, could you please check whether the stalled SLURM job has completed and left the queue (e.g. squeue no longer shows job ID 17 from your example above as running)? If the calculations are simply running slowly on CPUs, I can share a script to drastically reduce the simulation times during setup and production, which might make the T4L calculation more manageable.
  • If squeue shows that the problematic job has finished (but a3fe hasn't realised), please check your version of a3fe with conda list | grep a3fe. Versions before 0.2.1 assumed that any job shown by squeue was queued or running, even if its status was COMPLETE. If you are on one of these older versions, please upgrade to version 0.2.1.
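
The two checks above could be scripted roughly as follows. This is only a sketch, not part of a3fe: the helper names (extract_job_id, classify_state, version_lt) are hypothetical, the log format is taken from the message quoted in this issue, and the state names are the ones squeue prints for its %T field. version_lt relies on GNU sort -V.

```shell
#!/bin/bash
# Hypothetical helper: pull the SLURM job ID out of an a3fe "Waiting for job" log line.
# Assumes the "slurm_job_id= 17" format seen in the log message quoted in this issue.
extract_job_id() {
    printf '%s\n' "$1" | sed -n 's/.*slurm_job_id= *\([0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: classify a state string as printed by `squeue -h -o %T`.
classify_state() {
    case "$1" in
        PENDING|CONFIGURING|RUNNING|COMPLETING|SUSPENDED) echo "active" ;;
        "") echo "not-in-queue" ;;   # squeue printed nothing for this job
        *) echo "finished" ;;        # e.g. COMPLETED, FAILED, CANCELLED, TIMEOUT
    esac
}

# Hypothetical helper: succeeds if version $1 is strictly older than version $2.
version_lt() {
    [ "$1" != "$2" ] && [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Query SLURM only if squeue is actually available on this machine.
if command -v squeue >/dev/null 2>&1; then
    job_id="${1:-17}"   # defaults to the job ID from the example above
    state="$(squeue -h -j "$job_id" -o %T 2>/dev/null | head -n 1)"
    echo "job $job_id: ${state:-<none>} -> $(classify_state "$state")"
fi
```

If the job is reported as finished or not in the queue while a3fe keeps waiting, the version check applies. Assuming the usual conda list column layout (name, then version), something like version_lt "$(conda list | awk '$1 == "a3fe" {print $2}')" 0.2.1 would tell you whether an upgrade is needed.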

Thanks,
Finlay

@fjclark
Collaborator

fjclark commented Jan 19, 2025

Hi Stan, I'm closing this due to inactivity and because it seems likely that the cause is long run times on CPU, rather than a problem with the software. Please feel free to reopen if this isn't the case, or if you have any more questions. Thanks.

@fjclark fjclark closed this as completed Jan 19, 2025
@stanwlodek
Author

stanwlodek commented Jan 19, 2025 via email
