To track and add new data inputs (e.g., data/raw):
dvc add data/raw
dvc push
Note: Only necessary if you want to track new data inputs that are not already declared in the dvc.yaml file as outputs of a stage.
Info: Files added with DVC must be Git-ignored; dvc add creates the required .gitignore entries automatically. Git only tracks small reference files with a .dvc suffix (e.g. data/raw.dvc). Make sure you commit and push these .dvc files (together with the updated .gitignore) to the Git remote.
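For the data/raw example above, committing and pushing the references could look like this (paths follow the example; the commit message is arbitrary):
git add data/raw.dvc data/.gitignore
git commit -m "Track data/raw with DVC"
git push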
As your requirements change, always update requirements.txt with pinned versions:
pip freeze > requirements.txt
Docker images are automatically rebuilt and pushed to Docker Hub by the GitHub workflow whenever requirements.txt, the Dockerfile, or docker_image.yml is updated and pushed to GitHub. If you trigger an image build, make sure it has completed and the image has been pushed to Docker Hub before proceeding.
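For example, a rebuild can be triggered by committing and pushing the updated requirements file (the commit message is just an example):
git add requirements.txt
git commit -m "Update pinned dependencies"
git push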
Note: For the free docker/build-push-action, there is a 14 GB storage limit for free public repositories on GitHub runners (see About GitHub runners). The Docker image must therefore not exceed this size.
On the HPC cluster, the Docker image is automatically pulled and converted to a Singularity image when slurm_job.sh is run and no image is found within the repository.
sbatch slurm_job.sh
If you want to force an update of the Singularity image, pass the --rebuild-container flag when submitting the job, or delete the existing image on the cluster:
sbatch slurm_job.sh --rebuild-container
To run the entire pipeline locally, either natively or within a Docker container, execute one of the following commands:
# natively
./exp_workflow.sh
# with Docker
docker run --rm \
--mount type=bind,source="$(pwd)",target=/home/app \
--mount type=volume,source=ssh-config,target=/root/.ssh \
<your_image_name> \
/home/app/exp_workflow.sh
Log into the High-Performance Computing (HPC) cluster using your SSH config and key and navigate to your repository (an example SSH config entry is sketched below the commands). Substitute the appropriate names for the placeholders <username> and <repository>:
ssh hpc
cd /scratch/<username>/<repository>
git pull # optionally pull the latest changes you have done locally
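The ssh hpc command assumes a matching Host entry in your local ~/.ssh/config; a minimal sketch with placeholder values (adapt host name, user, and key path to your cluster):
Host hpc
    HostName <hpc_login_node>
    User <username>
    IdentityFile ~/.ssh/<private_key>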
Launch pipeline jobs either individually or in parallel. To launch multiple trainings at once with parameter grids or predefined parameter sets, modify multi_submission.py:
# submit a single Slurm job:
sbatch slurm_job.sh # optional args for slurm_job.sh
# submit multiple Slurm jobs at once:
venv/bin/python multi_submission.py # optional args for slurm_job.sh
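For example, assuming multi_submission.py forwards its command-line arguments to each slurm_job.sh submission (as the comment above indicates), you could force a container rebuild for every submitted job:
venv/bin/python multi_submission.py --rebuild-container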
Check the status of all jobs associated with your user account:
squeue -u <user_name>
Monitor SLURM logs in real time:
cd logs/slurm
tail -f slurm-<slurm_job_id>.out
To cancel a single job using its SLURM job ID, or all jobs of a user:
# Per job id
scancel <slurm_job_id>
# Per user
scancel -u <user_name>
Use the sync_logs.sh script to sync the remote logs into the logs/ directory on your local machine every 30 seconds:
./sync_logs.sh
Then open a new terminal and launch TensorBoard to monitor experiments:
tensorboard --logdir=logs/tensorboard
Access TensorBoard via your browser at:
localhost:6006
Tip: You can also view TensorBoard logs in VSCode using the official extension.
To start TensorBoard remotely on the SSH Host and access it in your browser:
tensorboard --logdir=<TUSTU_TENSORBOARD_HOST_DIR>/<TUSTU_PROJECT_NAME>/logs/tensorboard --path_prefix=/tb1
Note: For an overview of all DVC experiments, start TensorBoard on the collected logs folder tensorboard/, where all experiments are organized in subdirectories.
Access TensorBoard via your browser at:
<your_domain>/tb1
If exp_workflow.sh did not run through all steps, the temporary subdirectory in tmp/ at the root of the repository will not be deleted. If, for example, dvc exp push origin failed, you can cd into the subdirectory in tmp/ and manually try to push the experiment again:
cd tmp/<experiment_subdirectory>
dvc exp push origin
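If the manual push succeeds, you can then remove the leftover temporary copy (see also the cleanup command at the end of this section):
cd ../..
rm -rf tmp/<experiment_subdirectory>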
Each time we run the pipeline, DVC creates a new experiment. These are saved as custom Git references that can be retrieved and applied to your workspace. These references do not appear in the Git log, but are stored in the .git/refs/exps directory and can be pushed to the remote Git repository. This is done automatically at the end of exp_workflow.sh with dvc exp push origin. All outputs and dependencies are stored in the .dvc/cache directory and pushed to the remote DVC storage when the experiment is pushed. Since we create a new temporary copy of the repository for each pipeline run (and delete it at the end), the experiments will not automatically appear in the main repository.
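If you want to inspect these references directly, standard Git ref tooling works on the refs/exps namespace mentioned above, for example:
git for-each-ref refs/exps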
To retrieve, view, and apply an experiment, do the following (either locally or on the HPC cluster):
# Get all experiments from remote
dvc exp pull origin
# List experiments
dvc exp show
# Apply a specific experiment
dvc exp apply <dvc_exp_name>
Note: By default, experiments are tied to the specific Git commit they were run from. Therefore, dvc exp pull origin and dvc exp show only work for experiments associated with the currently checked-out commit. To pull and show experiments from a different commit or from all commits, you can use specific flags as outlined in the DVC documentation.
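For instance, dvc exp show accepts an --all-commits (-A) flag to list experiments across all commits; check the DVC documentation for the corresponding options of dvc exp pull:
dvc exp show --all-commits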
Tip: You can also get the Git ref hash of an experiment from dvc exp show and run git diff against it.
To clean up the repository copies left behind by failed experiments, run this command from the root of your repository:
rm -rf tmp/
For information on cleaning up the DVC cache, refer to the DVC Documentation.
Note: Be careful with this, as we are using a shared cache between parallel experiment runs.