To track and add new data inputs (e.g., data/raw):
dvc add data/raw
dvc push
Note: Only necessary if you want to track new data inputs that are not already declared in the dvc.yaml file as outputs of a stage.
Info: Files added with DVC must be Git-ignored; dvc add creates the required .gitignore entries automatically. Git only tracks small reference files with a .dvc suffix (e.g. data/raw.dvc). Make sure you commit and push these .dvc files (together with the updated .gitignore) to the Git remote.
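For the data/raw example above, committing and pushing the references could look like this (paths follow the example; the commit message is arbitrary):
git add data/raw.dvc data/.gitignore
git commit -m "Track data/raw with DVC"
git push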
As your requirements change, always update requirements.txt with pinned versions:
pip freeze > requirements.txt
Docker images are automatically rebuilt and pushed to Docker Hub by the GitHub workflow whenever requirements.txt, the Dockerfile, or docker_image.yml is updated and pushed to GitHub. If you trigger an image build, make sure it has completed and the image has been pushed to Docker Hub before proceeding.
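For example, a rebuild can be triggered by committing and pushing the updated requirements file (the commit message is just an example):
git add requirements.txt
git commit -m "Update pinned dependencies"
git push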
Note: For the free docker/build-push-action, there is a 14 GB storage limit for free public repositories on GitHub runners (see About GitHub runners). The Docker image must therefore not exceed this size.
On the HPC cluster, the Docker image is automatically pulled and converted to a Singularity image when slurm_job.sh is run and no image is found within the repository.
sbatch slurm_job.sh
If you want to force an update of the Singularity image, pass the --rebuild-container flag when submitting the job, or delete the existing image on the cluster:
sbatch slurm_job.sh --rebuild-container
To run the entire pipeline locally, either natively or within a Docker container, execute one of the following commands:
# natively
./exp_workflow.sh
# with Docker
docker run --rm \
--mount type=bind,source="$(pwd)",target=/home/app \
--mount type=volume,source=ssh-config,target=/root/.ssh \
<your_image_name> \
/home/app/exp_workflow.sh
Log into the High-Performance Computing (HPC) cluster using your SSH config and key and navigate to your repository (an example SSH config entry is sketched below the commands). Substitute the appropriate names for the placeholders <username> and <repository>:
ssh hpc
cd /scratch/<username>/<repository>
git pull # optionally pull the latest changes you have done locally
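The ssh hpc command assumes a matching Host entry in your local ~/.ssh/config; a minimal sketch with placeholder values (adapt host name, user, and key path to your cluster):
Host hpc
    HostName <hpc_login_node>
    User <username>
    IdentityFile ~/.ssh/<private_key>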
Launch pipeline jobs either individually or in parallel. To launch multiple trainings at once with parameter grids or predefined parameter sets, modify multi_submission.py:
# submit a single Slurm job:
sbatch slurm_job.sh # optional args for slurm_job.sh
# submit multiple Slurm jobs at once:
venv/bin/python multi_submission.py # optional args for slurm_job.sh
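For example, assuming multi_submission.py forwards its command-line arguments to each slurm_job.sh submission (as the comment above indicates), you could force a container rebuild for every submitted job:
venv/bin/python multi_submission.py --rebuild-container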
Check the status of all jobs associated with your user account:
squeue -u <user_name>
Monitor SLURM logs in real time:
cd logs/slurm
tail -f slurm-<slurm_job_id>.out
To cancel a single job using its SLURM job ID, or all jobs of a user:
# Per job id
scancel <slurm_job_id>
# Per user
scancel -u <user_name>
Use the sync_logs.sh script to sync the remote logs into the logs/ directory on your local machine every 30 seconds:
./sync_logs.sh
Then open a new terminal and launch TensorBoard to monitor experiments:
tensorboard --logdir=logs/tensorboard
Access TensorBoard via your browser at:
localhost:6006
Tip: You can also view TensorBoard logs in VSCode using the official extension.
To start TensorBoard remotely on the SSH Host and access it in your browser:
tensorboard --logdir=<TUSTU_TENSORBOARD_HOST_DIR>/<TUSTU_PROJECT_NAME>/logs/tensorboard --path_prefix=/tb1
Note: For an overview of all DVC experiments, start TensorBoard on the collected logs folder tensorboard/, where all experiments are organized in subdirectories.
Access TensorBoard via your browser at:
<your_domain>/tb1
If exp_workflow.sh did not run through all steps, the temporary subdirectory in tmp/ at the root of the repository will not be deleted. If, for example, dvc exp push origin failed, you can cd into the subdirectory in tmp/ and manually try to push the experiment again:
cd tmp/<experiment_subdirectory>
dvc exp push origin
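If the manual push succeeds, you can then remove the leftover temporary copy (see also the cleanup command at the end of this section):
cd ../..
rm -rf tmp/<experiment_subdirectory>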
Each time we run the pipeline, DVC creates a new experiment. These are saved as custom Git references that can be retrieved and applied to your workspace. These references do not appear in the Git log, but are stored in the .git/refs/exps directory and can be pushed to the remote Git repository. This is done automatically at the end of exp_workflow.sh with dvc exp push origin. All outputs and dependencies are stored in the .dvc/cache directory and pushed to the remote DVC storage when the experiment is pushed. Since we create a new temporary copy of the repository for each pipeline run (and delete it at the end), the experiments will not automatically appear in the main repository.
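If you want to inspect these references directly, standard Git ref tooling works on the refs/exps namespace mentioned above, for example:
git for-each-ref refs/exps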
To retrieve, view, and apply an experiment, do the following (either locally or on the HPC cluster):
# Get all experiments from remote
dvc exp pull origin
# List experiments
dvc exp show
# Apply a specific experiment
dvc exp apply <dvc_exp_name>
Note: By default, experiments are tied to the specific Git commit they were run from. Therefore, dvc exp pull origin and dvc exp show only work for experiments associated with the currently checked-out commit. To pull and show experiments from a different commit or from all commits, you can use specific flags as outlined in the DVC documentation.
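For instance, dvc exp show accepts an --all-commits (-A) flag to list experiments across all commits; check the DVC documentation for the corresponding options of dvc exp pull:
dvc exp show --all-commits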
Tip: You can also get the Git ref hash of an experiment from dvc exp show and run git diff against it.
To clean up the repository copies left behind by failed experiments, run this command from the root of your repository:
rm -rf tmp/
For information on cleaning up the DVC cache, refer to the DVC Documentation.
Note: Be careful with this, as we are using a shared cache between parallel experiment runs.