Deploy and test SLURM setup #1

Open · noamross opened this issue Jul 5, 2023 · 11 comments
Labels: test prototype test

Comments

@noamross commented Jul 5, 2023

No description provided.

@noamross (Author) commented Jul 5, 2023

@espirado (Contributor)
@n8layman @collinschwantes I have updated the README for testing purposes. Could you test it and open issues with your findings and suggestions?

@n8layman commented Sep 6, 2023

From my conversation with @espirado today, I won't be able to fork the repo and test the container on my local machine due to incompatibilities with the ARM architecture. We're working on setting up a VM so I can test the SLURM workflow remotely.

@espirado (Contributor)
Since the code change to deploy the SLURM-based environment is large, I will be creating a separate repository for the new code, split out from eha-server.

@espirado (Contributor)
Deployed the first attempt on Aegypti and got the containers to run, but due to hardware incompatibility with the GPU and most drivers, I had to remove GPU support. I hope to have usable access tomorrow.

@espirado (Contributor) commented Dec 4, 2023

@n8layman, can we schedule a test session for SLURM? I have it set up on Aegypti. Can you first try to access it via ssh -p 22022 [email protected] (your normal username and password)? This will give you access to the controller. The web interface will be up once DNS resolves to the domain name in a few days; for now we can test via CLI access.

@espirado added the test prototype test label on Dec 4, 2023
@n8layman commented Dec 4, 2023

Sounds great. I can access the controller using ssh as above.

@espirado (Contributor) commented Dec 6, 2023

We successfully ran a test on the base SLURM environment and it worked as expected. We will proceed with creating a repo of code examples that integrate various R/Python workflows with SLURM, covering different types of workloads and efficient cluster usage. @n8layman will also assist in coming up with examples that we can use for M3.
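To make the eventual repo concrete, here is a minimal sketch of one such R-to-SLURM integration, using the slurmR package (also raised in the review notes below). The partition name, job count, and workload are illustrative assumptions, not our actual cluster configuration:

```r
# Minimal sketch, assuming slurmR is installed on the controller.
# The "standard" partition and the job sizing below are placeholders.
library(slurmR)

# Fan 100 independent simulations out across 4 single-CPU SLURM array
# jobs, then block until they finish and collect the results.
ans <- Slurm_lapply(
  1:100,
  function(i) mean(rnorm(1e6)),
  njobs      = 4,                            # number of SLURM array jobs
  mc.cores   = 1,                            # CPUs per job
  plan       = "collect",                    # submit, wait, and collect
  sbatch_opt = list(partition = "standard")  # hypothetical partition name
)
```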

@noamross (Author) commented Dec 6, 2023

Excellent!

@espirado (Contributor)
Notes from review

  1. Are the workers exclusive, i.e., will they reserve that amount of hardware? Can you exceed the number of CPUs on the machine?
  2. Can you dynamically set the resources used? Yes.
  3. How does prioritization change the run? A SLURM admin can configure a partition to have a priority value; a job submitted to a higher-priority partition wins, so user groups can be prioritized.
  4. How is authentication handled?
  • Standard partition, bigger than a laptop (8 cores and 16 GB of RAM): run one R script on the standard partition (standard resources and priority).
  • Can you route errors to a specific place?
  • Pull an example from slurmR?
  • Are defaults for each parameter set by the controller? Yes.
  • Likely smaller: 64 GB of RAM and 32 cores.
  • Want to make sure this will work with targets.
  • Using the targets SLURM backend, targets are sent to jobs.
  • Have 100 single-CPU targets.
  • These targets need to go to the nodes (see the sketch after these notes).

All issues raised will be addressed in the next merge request, along with documentation for both the workflow examples and the infrastructure deployment.
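As a sketch of the 100-single-CPU-targets case above, here is one way the targets integration could look, assuming a recent targets release with crew support and the crew.cluster SLURM backend. The controller name, worker count, and resource arguments are illustrative assumptions and may differ across crew.cluster versions:

```r
# _targets.R — minimal sketch, assuming targets with crew support and
# the crew.cluster package; all settings below are placeholders.
library(targets)

tar_option_set(
  controller = crew.cluster::crew_controller_slurm(
    name                           = "slurm-workers",  # hypothetical name
    workers                        = 25,  # up to 25 concurrent SLURM jobs
    seconds_idle                   = 60,  # scale idle workers back down
    slurm_cpus_per_task            = 1,   # each target needs one CPU
    slurm_memory_gigabytes_per_cpu = 2
  )
)

list(
  # 100 independent single-CPU targets, fanned out across the nodes
  tar_target(index, seq_len(100)),
  tar_target(result, sqrt(index), pattern = map(index))
)
```

With this in place, tar_make() dispatches each branch of result to a SLURM worker rather than running it locally, which is the behavior we want to verify.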

@espirado (Contributor)
Added all nodes (Prospero, Sycorax, Aegypti). Tests for targets are working.
