
SLURM jobs tracking issue #14

Closed
avnikonenko opened this issue Dec 5, 2024 · 7 comments · Fixed by #15

avnikonenko commented Dec 5, 2024

Hello!
Thank you for the great tool!
I have an issue with tracking jobs submitted via SLURM. I created the input dir, copied all of the provided example input files from a3fe/a3fe/data/example_run_dir into it, and saved the code from the documentation into calc.py:

import a3fe as a3
calc = a3.Calculation(ensemble_size=5)
calc.setup()
calc.get_optimal_lam_vals()
calc.run(adaptive=False, runtime = 5) # Run non-adaptively for 5 ns per replicate
calc.wait()
calc.set_equilibration_time(1) # Discard the first ns of simulation time
calc.analyse()
calc.save()
Relevant package versions:

a3fe                      0.2.0                    pypi_0    pypi
gromacs                   2024.4          mpi_openmpi_cuda_he6b8466_0    conda-forge
cat input/run_somd.sh
#SBATCH -o somd-array-gpu-%A.%a.out
#SBATCH --job-name a3fe
#SBATCH --partition qgpu
#SBATCH --nodes 1
#SBATCH --gpus 1

source activate a3fe
lam=$1
echo "lambda is: " $lam

srun somd-freenrg -C somd.cfg -l $lam -p CUDA

I run the script from inside a SLURM interactive job (maybe that is the source of the issue, but I cannot run it otherwise).

python calc.py 
INFO:numexpr.utils:Note: detected 128 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
INFO:numexpr.utils:Note: NumExpr detected 128 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
INFO:numexpr.utils:NumExpr defaulting to 16 threads.
INFO - 2024-12-05 18:17:58,239 - Calculation_0 - Found all required input files for preparation stage parameterised
INFO - 2024-12-05 18:17:58,241 - Calculation_0 - Modifying/ creating legs
INFO - 2024-12-05 18:17:58,241 - Calculation_0 - Setting up bound leg...
Loading previous Leg. Any arguments will be overwritten...
Setting up logging...
INFO - 2024-12-05 18:17:58,243 - Leg (type = BOUND)_2 - Setting up leg...
INFO - 2024-12-05 18:17:58,245 - Leg (type = BOUND)_2 - Creating stage input directories...
INFO - 2024-12-05 18:17:58,246 - Leg (type = BOUND)_2 - Solvating input structure. Submitting through SLURM...
INFO:VirtualQueue:Job (virtual_job_id = 0, slurm_job_id= None), status = JobStatus.QUEUED submitted
INFO - 2024-12-05 18:17:58,278 - Leg (type = BOUND)_2 - Submitted job Job (virtual_job_id = 0, slurm_job_id= 1970700), status = JobStatus.QUEUED
INFO - 2024-12-05 18:17:58,290 - Leg (type = BOUND)_2 - Waiting for job Job (virtual_job_id = 0, slurm_job_id= 1970700), status = JobStatus.QUEUED to complete
INFO - 2024-12-05 18:18:29,447 - Leg (type = BOUND)_2 - Waiting for job Job (virtual_job_id = 0, slurm_job_id= 1970700), status = JobStatus.QUEUED to complete
INFO - 2024-12-05 18:18:59,462 - Leg (type = BOUND)_2 - Waiting for job Job (virtual_job_id = 0, slurm_job_id= 1970700), status = JobStatus.QUEUED to complete
INFO - 2024-12-05 18:19:29,474 - Leg (type = BOUND)_2 - Waiting for job Job (virtual_job_id = 0, slurm_job_id= 1970700), status = JobStatus.QUEUED to complete
INFO - 2024-12-05 18:19:59,489 - Leg (type = BOUND)_2 - Waiting for job Job (virtual_job_id = 0, slurm_job_id= 1970700), status = JobStatus.QUEUED to complete
INFO - 2024-12-05 18:20:29,502 - Leg (type = BOUND)_2 - Waiting for job Job (virtual_job_id = 0, slurm_job_id= 1970700), status = JobStatus.QUEUED to complete
INFO - 2024-12-05 18:20:59,517 - Leg (type = BOUND)_2 - Waiting for job Job (virtual_job_id = 0, slurm_job_id= 1970700), status = JobStatus.QUEUED to complete
INFO - 2024-12-05 18:21:29,529 - Leg (type = BOUND)_2 - Waiting for job Job (virtual_job_id = 0, slurm_job_id= 1970700), status = JobStatus.QUEUED to complete
INFO - 2024-12-05 18:21:59,570 - Leg (type = BOUND)_2 - Waiting for job Job (virtual_job_id = 0, slurm_job_id= 1970700), status = JobStatus.QUEUED to complete
INFO - 2024-12-05 18:22:29,583 - Leg (type = BOUND)_2 - Waiting for job Job (virtual_job_id = 0, slurm_job_id= 1970700), status = JobStatus.QUEUED to complete
INFO - 2024-12-05 18:22:59,599 - Leg (type = BOUND)_2 - Waiting for job Job (virtual_job_id = 0, slurm_job_id= 1970700), status = JobStatus.QUEUED to complete
INFO - 2024-12-05 18:23:29,611 - Leg (type = BOUND)_2 - Waiting for job Job (virtual_job_id = 0, slurm_job_id= 1970700), status = JobStatus.QUEUED to complete
INFO - 2024-12-05 18:23:59,626 - Leg (type = BOUND)_2 - Waiting for job Job (virtual_job_id = 0, slurm_job_id= 1970700), status = JobStatus.QUEUED to complete

And here is the problem: the SLURM job with ID 1970700 has finished, but the script is still waiting for it and doesn't stop until I kill it.
The output of SLURM job 1970700:

 cat input/somd-array-gpu-1970700.4294967294.out 
INFO:numexpr.utils:Note: detected 128 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
INFO:numexpr.utils:Note: NumExpr detected 128 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
INFO:numexpr.utils:NumExpr defaulting to 16 threads.
Loading parameterised system...
Determining optimal rhombic dodecahedral box...
Excluding 3 waters that are over 10 A from the protein
Solvating system with tip3p water and 0.15 M NaCl...
Saving solvated system

Could you help me please?


fjclark commented Dec 5, 2024

Hi!

Thanks for the detailed bug report, and sorry it isn't working.

I've tried to reproduce this on our cluster by running calc.py from an interactive SLURM job, but it's working for me.

The code is getting stuck here. To help debug this, could you please:

  • Go to a3fe/a3fe/run/_virtual_queue.py and uncomment these lines. Please also add the line:
self._logger.info(f"Running slurm job ids: {running_slurm_job_ids}")

here.

  • Reinstall a3fe with python -m pip install --no-deps -e .
  • Rerun your calculation, making sure to delete the bound directory, Calculation.log, and Calculation.pkl beforehand
  • Wait until your calculation stalls again, then paste the output of bound/virtual_queue.log here. Hopefully that will make the issue clearer.

Thanks!

avnikonenko (Author) commented:

Thank you for the response!

INFO - 2024-12-05 23:53:45,698 - Calculation_0 - Found all required input files for preparation stage parameterised
INFO - 2024-12-05 23:53:45,701 - Calculation_0 - Modifying/ creating legs
INFO - 2024-12-05 23:53:45,701 - Calculation_0 - Setting up bound leg...
INFO - 2024-12-05 23:53:45,703 - Leg (type = BOUND)_1 - Found all required input files for preparation stage solvated
INFO - 2024-12-05 23:53:45,706 - Leg (type = BOUND)_1 - Setting up leg...
INFO - 2024-12-05 23:53:45,710 - Leg (type = BOUND)_1 - Creating stage input directories...
INFO - 2024-12-05 23:53:45,714 - Leg (type = BOUND)_1 - Minimising input structure. Submitting through SLURM...
INFO - 2024-12-05 23:53:45,718 - VirtualQueue - Job (virtual_job_id = 0, slurm_job_id= None), status = JobStatus.QUEUED submitted
INFO - 2024-12-05 23:53:45,736 - VirtualQueue - Running slurm job ids: [1973245]
INFO - 2024-12-05 23:53:45,758 - VirtualQueue - Queue updated
INFO - 2024-12-05 23:53:45,758 - VirtualQueue - Slurm queue slurm job ids: [1973312]
INFO - 2024-12-05 23:53:45,758 - VirtualQueue - Slurm queue virtual job ids: [0]
INFO - 2024-12-05 23:53:45,758 - VirtualQueue - Pre-queue virtual job ids: []
INFO - 2024-12-05 23:53:45,758 - Leg (type = BOUND)_1 - Submitted job Job (virtual_job_id = 0, slurm_job_id= 1973312), status = JobStatus.QUEUED
INFO - 2024-12-05 23:53:45,774 - VirtualQueue - Running slurm job ids: [1973245, 1973312]
INFO - 2024-12-05 23:53:45,774 - VirtualQueue - Queue updated
INFO - 2024-12-05 23:53:45,774 - VirtualQueue - Slurm queue slurm job ids: [1973312]
INFO - 2024-12-05 23:53:45,774 - VirtualQueue - Slurm queue virtual job ids: [0]
INFO - 2024-12-05 23:53:45,774 - VirtualQueue - Pre-queue virtual job ids: []
INFO - 2024-12-05 23:53:45,774 - Leg (type = BOUND)_1 - Waiting for job Job (virtual_job_id = 0, slurm_job_id= 1973312), status = JobStatus.QUEUED to complete
INFO - 2024-12-05 23:54:15,791 - VirtualQueue - Running slurm job ids: [1973245, 1973312]
INFO - 2024-12-05 23:54:15,792 - VirtualQueue - Queue updated
INFO - 2024-12-05 23:54:15,792 - VirtualQueue - Slurm queue slurm job ids: [1973312]
INFO - 2024-12-05 23:54:15,792 - VirtualQueue - Slurm queue virtual job ids: [0]
INFO - 2024-12-05 23:54:15,792 - VirtualQueue - Pre-queue virtual job ids: []
INFO - 2024-12-05 23:54:15,792 - Leg (type = BOUND)_1 - Waiting for job Job (virtual_job_id = 0, slurm_job_id= 1973312), status = JobStatus.QUEUED to complete
INFO - 2024-12-05 23:54:45,807 - VirtualQueue - Running slurm job ids: [1973245, 1973312]
INFO - 2024-12-05 23:54:45,808 - VirtualQueue - Queue updated
INFO - 2024-12-05 23:54:45,808 - VirtualQueue - Slurm queue slurm job ids: [1973312]
INFO - 2024-12-05 23:54:45,808 - VirtualQueue - Slurm queue virtual job ids: [0]
INFO - 2024-12-05 23:54:45,808 - VirtualQueue - Pre-queue virtual job ids: []
INFO - 2024-12-05 23:54:45,808 - Leg (type = BOUND)_1 - Waiting for job Job (virtual_job_id = 0, slurm_job_id= 1973312), status = JobStatus.QUEUED to complete
INFO - 2024-12-05 23:55:15,824 - VirtualQueue - Running slurm job ids: [1973245, 1973312]
INFO - 2024-12-05 23:55:15,825 - VirtualQueue - Queue updated
INFO - 2024-12-05 23:55:15,825 - VirtualQueue - Slurm queue slurm job ids: [1973312]
INFO - 2024-12-05 23:55:15,825 - VirtualQueue - Slurm queue virtual job ids: [0]
INFO - 2024-12-05 23:55:15,825 - VirtualQueue - Pre-queue virtual job ids: []
INFO - 2024-12-05 23:55:15,825 - Leg (type = BOUND)_1 - Waiting for job Job (virtual_job_id = 0, slurm_job_id= 1973312), status = JobStatus.QUEUED to complete
INFO - 2024-12-05 23:55:45,842 - VirtualQueue - Running slurm job ids: [1973245, 1973312]
INFO - 2024-12-05 23:55:45,842 - VirtualQueue - Queue updated
INFO - 2024-12-05 23:55:45,842 - VirtualQueue - Slurm queue slurm job ids: [1973312]
INFO - 2024-12-05 23:55:45,842 - VirtualQueue - Slurm queue virtual job ids: [0]
INFO - 2024-12-05 23:55:45,842 - VirtualQueue - Pre-queue virtual job ids: []
INFO - 2024-12-05 23:55:45,842 - Leg (type = BOUND)_1 - Waiting for job Job (virtual_job_id = 0, slurm_job_id= 1973312), status = JobStatus.QUEUED to complete
INFO - 2024-12-05 23:56:15,859 - VirtualQueue - Running slurm job ids: [1973245, 1973312]
INFO - 2024-12-05 23:56:15,860 - VirtualQueue - Queue updated
INFO - 2024-12-05 23:56:15,860 - VirtualQueue - Slurm queue slurm job ids: [1973312]
INFO - 2024-12-05 23:56:15,860 - VirtualQueue - Slurm queue virtual job ids: [0]
INFO - 2024-12-05 23:56:15,860 - VirtualQueue - Pre-queue virtual job ids: []
INFO - 2024-12-05 23:56:15,860 - Leg (type = BOUND)_1 - Waiting for job Job (virtual_job_id = 0, slurm_job_id= 1973312), status = JobStatus.QUEUED to complete
INFO - 2024-12-05 23:56:45,876 - VirtualQueue - Running slurm job ids: [1973245, 1973312]
INFO - 2024-12-05 23:56:45,876 - VirtualQueue - Queue updated
INFO - 2024-12-05 23:56:45,876 - VirtualQueue - Slurm queue slurm job ids: [1973312]
INFO - 2024-12-05 23:56:45,876 - VirtualQueue - Slurm queue virtual job ids: [0]
INFO - 2024-12-05 23:56:45,876 - VirtualQueue - Pre-queue virtual job ids: []
INFO - 2024-12-05 23:56:45,876 - Leg (type = BOUND)_1 - Waiting for job Job (virtual_job_id = 0, slurm_job_id= 1973312), status = JobStatus.QUEUED to complete
INFO - 2024-12-05 23:57:15,893 - VirtualQueue - Running slurm job ids: [1973245, 1973312]
INFO - 2024-12-05 23:57:15,894 - VirtualQueue - Queue updated
INFO - 2024-12-05 23:57:15,894 - VirtualQueue - Slurm queue slurm job ids: [1973312]
INFO - 2024-12-05 23:57:15,894 - VirtualQueue - Slurm queue virtual job ids: [0]
INFO - 2024-12-05 23:57:15,894 - VirtualQueue - Pre-queue virtual job ids: []
INFO - 2024-12-05 23:57:15,894 - Leg (type = BOUND)_1 - Waiting for job Job (virtual_job_id = 0, slurm_job_id= 1973312), status = JobStatus.QUEUED to complete

Job 1973245 is the interactive job on the node running the script.
Around 23:55 the job finished, but the script continued to run.


fjclark commented Dec 6, 2024

Thanks.

As the job finished before 23:57, but the output of _read_slurm_queue still contained 1973245 and 1973312 at 23:57:15 (INFO - 2024-12-05 23:57:15,893 - VirtualQueue - Running slurm job ids: [1973245, 1973312]), the issue seems to be that your jobs remain visible in squeue once they're complete, but my code naively assumes that any job visible in the slurm queue is queued or running.

Is this right? I assume the completed job remains in the queue for a while with COMPLETED status.

At the moment, I'm naively using

  commands = [
                ["squeue", "-h", "-u", _getpass.getuser()],
                ["awk", "{print $1}"],
                ["grep", "-v", "-E", "'\\[|_'"],
                ["paste", "-s", "-d,", "-"],
            ]

to read all jobs in the slurm queue, which are assumed to be running.
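For reference, the pipeline above is equivalent to the shell one-liner squeue -h -u $USER | awk '{print $1}' | grep -v -E '\[|_' | paste -s -d, -. A minimal pure-Python sketch of the same filtering logic (a hypothetical helper, not the actual a3fe implementation) makes the failure mode clear: every job still visible in the queue ends up in the "running" list, regardless of its state:

```python
def parse_squeue_output(squeue_text: str) -> list[int]:
    """Mimic the awk/grep/paste pipeline: take the first column of each
    squeue line, drop job-array entries (containing '[' or '_'), and
    return the remaining job ids as ints.

    Note: no state filtering, so COMPLETED jobs that linger in the
    queue are still treated as running.
    """
    job_ids = []
    for line in squeue_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        job_id = fields[0]
        if "[" in job_id or "_" in job_id:
            continue  # skip job arrays, as grep -v -E '\[|_' does
        job_ids.append(int(job_id))
    return job_ids


# One running (R) job and one completed (CD) job that is still visible:
sample = "1973245 qgpu interact user R 1:02:03\n1973312 qgpu a3fe user CD 0:05:00\n"
print(parse_squeue_output(sample))  # -> [1973245, 1973312]: the CD job is wrongly included
```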

I'll come up with a more robust way of doing this which accounts for completed jobs.


fjclark commented Dec 6, 2024

If that's the case, it should be fixed in this branch: https://github.com/michellab/a3fe/tree/bugfix-robust-slurm-queue-read . I've simply updated the command to

            # Only read running, pending, suspended, and completing jobs (R, PD, S, CG).
            commands = [
                ["squeue", "-h", "-u", _getpass.getuser(), "-t", "R,PD,S,CG"],
                ["awk", "{print $1}"],
                ["grep", "-v", "-E", "'\\[|_'"],
                ["paste", "-s", "-d,", "-"],
            ]
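The -t R,PD,S,CG flag makes squeue itself do the state filtering. If you instead filtered in Python (a hypothetical alternative, assuming squeue is called with -h -o "%i %t" so each line holds a job id and a short state code), the equivalent logic would be:

```python
# Short squeue state codes considered active: running, pending,
# suspended, and completing.
ACTIVE_STATES = {"R", "PD", "S", "CG"}


def active_job_ids(squeue_text: str) -> list[int]:
    """Keep only jobs whose short state code is active; completed (CD),
    failed (F), etc. are dropped even if they linger in the queue."""
    ids = []
    for line in squeue_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        job_id, state = fields[0], fields[1]
        if "[" in job_id or "_" in job_id:
            continue  # skip job-array entries
        if state in ACTIVE_STATES:
            ids.append(int(job_id))
    return ids


# The completed (CD) job is now correctly excluded:
print(active_job_ids("1973245 R\n1973312 CD\n"))  # -> [1973245]
```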

If this is the case, could you please pull the latest changes, check out the bugfix-robust-slurm-queue-read branch, reinstall with pip, and check whether this fixes your issue?

Thanks!

avnikonenko (Author) commented:

Yes, that's right: finished jobs on our system stay in the queue for some time.
It works!
Thank you very much!


fjclark commented Dec 6, 2024

Brilliant, glad it works! No problem.


fjclark commented Dec 6, 2024

(Reopening so I can formally close this by merging in the bugfix-robust-slurm-queue-read branch)
