The AI4City lab has bought Linux servers to be used in scientific research, mainly by the staff and PhD students.
Ask the admin to create an account for you.
Account creation and maintenance: Juran, juranzhang at hkust-gz.edu.cn
--- Make sure you are connected to the campus WiFi or VPN ---
SSH into the server as {your account}@10.120.17.95:
ssh {your account}@10.120.17.95
Key in the initial password that the admin shared with you.
Alternatively, you can access the server through a GUI. Simply install Microsoft Remote Desktop (Mac) or open Remote Desktop Connection (Windows), set the PC/Computer Name to 10.120.17.95:3389, and key in your username and password.
Reset the password:
passwd {your account}
To check CPU and memory usage and GPU status:
top
nvidia-smi
The server has CUDA drivers installed, but for deep learning work you need to set up your own virtual environments with the appropriate packages (some codebases require specific versions of CUDA and other libraries). This is assumed to be common knowledge and is expected to be done by the user (aided by their thesis supervisor). Beginners are advised to stick to common, ready-made solutions, e.g. PyTorch stable releases.
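As an illustration only (the environment name and versions below are placeholders; check the official PyTorch installation instructions for the build that matches the server's CUDA driver), a typical conda-based setup looks like this:
# create and activate an isolated environment (name and Python version are placeholders)
conda create -n myenv python=3.10
conda activate myenv
# install a stable PyTorch release; pick the build matching the installed CUDA driver
pip install torch torchvision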
To copy a file from the server to your laptop, use scp:
scp {your account}@10.120.17.95:/home/{your account}/foobar.txt /some/local/directory
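To copy in the other direction (from your laptop to the server), swap the source and destination; the paths here are just placeholders:
scp /some/local/directory/foobar.txt {your account}@10.120.17.95:/home/{your account}/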
Alternatively, install and use Syncthing on your laptop (TODO: install syncthing on server). This enables you to version your work and share it with coworkers.
Slurm commands can be confusing at first. If you check the Slurm documentation you can easily be overwhelmed, as there are hundreds of commands and options. Below we go through the key commands and options that most people need for their daily tasks.
In the repo, we've included a simple_torch.py script that does some simple GPU computing.
Usually, we can run
python3 simple_torch.py
to do the job.
However, to share the GPU resources more fairly, jobs need to be submitted through Slurm.
TODO: prevent all non-slurm GPU usage. Issue #1
To run simple_torch.py with Slurm, create a batch script. In our sample, the script is named gpu_tester.sh.
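The actual gpu_tester.sh lives in the repo; a minimal batch script along these lines would request one GPU and run the Python script (the job name and resource values below are assumptions, not the repo's exact contents):
#!/bin/bash
#SBATCH --job-name=gpu_tester   # name shown in squeue
#SBATCH --gres=gpu:1            # request one GPU
#SBATCH --output=slurm-%j.out   # output file, %j is replaced with the job ID
python3 simple_torch.py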
To submit the job to Slurm:
sbatch gpu_tester.sh
Notice that a GPU is now allocated when you check nvidia-smi.
To check the CPU, memory, and GPU (GRES) resources available on a node:
sinfo -o "%10c %20m %30G"
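To list the same information per node rather than per partition, sinfo also has a node-oriented mode that can be combined with the same format string:
sinfo -N -o "%10N %10c %20m %30G"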
The script output is written to the folder you submitted from, e.g. slurm-36.out.
To check what's queued in Slurm, run squeue
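To list only your own jobs, squeue can be filtered by user:
squeue -u {your account}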
If you run sbatch multiple times, you will see multiple GPUs in use.
To cancel a run, simply run scancel {jobId}
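For example, if squeue shows a job with ID 36 (the job that produced slurm-36.out above), you could cancel it with scancel 36. To cancel all of your own jobs at once:
scancel -u {your account}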
There are plenty of guides covering more Slurm commands, e.g. https://svante.mit.edu/use_slurm.html
The commands below are useful when sinfo does not work or does not respond.
sinfo -R
shows the reason a node is down, drained, or failing.
This command runs the Slurm controller (slurmctld) in the foreground for debugging:
sudo slurmctld -D
This command runs the Slurm node daemon (slurmd) in the foreground for debugging. Note that in our setup both the daemon and the controller reside on the same node/server:
sudo slurmd -D
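If neither daemon starts cleanly, checking the service status can also help. This assumes Slurm is managed by systemd on the server (adjust if it is not):
sudo systemctl status slurmctld
sudo systemctl status slurmd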
Sample output of sinfo when everything is up:
(base) bld@bld:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 idle bld
Sample output of sinfo when some tasks are running:
(base) bld@bld:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 mix bld
This command returns a drained node (e.g. drained because it ran short of resources or hit an error) back to the idle state so it can accept jobs again:
sudo scontrol update nodename=bld state=idle
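Before resetting a drained node, you can check why it was drained; scontrol show node prints the node's state and its Reason field:
scontrol show node bld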