cbg-ethz · DrYak · Oct 1, 2022 · Dec 12, 2021 · Dec 13, 2021 · Dec 14, 2021
diff --git a/docs/example_HIV_data/README.md b/docs/example_HIV_data/README.md
@@ -0,0 +1,20 @@
+The samples were taken from the publication Abrahams et al. (2019), Science translational medicine 11.513 (DOI: 10.1126/scitranslmed.aaw5589).
+
+We downloaded the following multiplexed HIV Illumina MiSeq data sets from the Sequence Read Archive (SRA): SRR9588830, SRR9588828, SRR9588844 and SRR9588785. They were taken from HIV-1 positive patients at different time points post-infection.
+
+To download the data, we used `sra-tools`.
+
+```bash
+mkdir -p samples/CAP188/4/
+cd samples/CAP188/4/
+fastq-dump -O raw_data --split-e  SRR9588828
+```
+
+Using the `--split-e` option, we download the reads separated into forward and reverse reads. More information on `fastq-dump` is available at https://edwards.flinders.edu.au/fastq-dump/
+
+We aligned the reads to the HIV strain HXB2, retrieved the reads covering the region HXB2:2453-3356, and subsampled them so that the sample is small enough to run on a laptop.
+
+```
+samtools view -b REF_aln.bam "HXB2:2453-3356" > output_region.bam
+samtools view -s 0.10 -b output_region.bam > output_region_subsample.bam
+```
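The subsampling step in the README above (`samtools view -s 0.10`) keeps roughly 10% of the read pairs, and it treats both mates of a pair the same way. The sketch below illustrates that idea in Python by deciding from a hash of the read name. It is an illustration only: `keep_read` and `subsample` are hypothetical helpers, and SHA-256 is a stand-in for whatever hash samtools uses internally.

```python
import hashlib


def keep_read(read_name: str, fraction: float, seed: int = 0) -> bool:
    """Decide whether to keep a read, based only on its name and a seed.

    Both mates of a pair share the same name, so they share the same fate.
    """
    digest = hashlib.sha256(f"{seed}:{read_name}".encode()).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    value = int.from_bytes(digest[:8], "big") / 2**64
    return value < fraction


def subsample(read_names, fraction, seed=0):
    """Return the read names that are kept at the requested fraction."""
    return [name for name in read_names if keep_read(name, fraction, seed)]


if __name__ == "__main__":
    names = [f"read{i}" for i in range(10000)]  # hypothetical read names
    kept = subsample(names, 0.10)
    print(f"kept {len(kept)} of {len(names)} read names")
```

Because the decision depends only on the name and the seed, repeating the call with the same arguments selects the same reads.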
diff --git a/docs/example_HIV_data/samples/CAP188/30/raw_data/readsEnv_R1.fastq b/docs/example_HIV_data/samples/CAP188/30/raw_data/readsEnv_R1.fastq
diff --git a/docs/example_HIV_data/samples/CAP188/30/raw_data/readsEnv_R2.fastq b/docs/example_HIV_data/samples/CAP188/30/raw_data/readsEnv_R2.fastq
diff --git a/docs/example_HIV_data/samples/CAP188/4/raw_data/readsEnv_R1.fastq b/docs/example_HIV_data/samples/CAP188/4/raw_data/readsEnv_R1.fastq
diff --git a/docs/example_HIV_data/samples/CAP188/4/raw_data/readsEnv_R2.fastq b/docs/example_HIV_data/samples/CAP188/4/raw_data/readsEnv_R2.fastq
diff --git a/docs/example_HIV_data/samples/CAP217/4390/raw_data/readsEnv_R1.fastq b/docs/example_HIV_data/samples/CAP217/4390/raw_data/readsEnv_R1.fastq
diff --git a/docs/example_HIV_data/samples/CAP217/4390/raw_data/readsEnv_R2.fastq b/docs/example_HIV_data/samples/CAP217/4390/raw_data/readsEnv_R2.fastq
diff --git a/docs/tutorial.md b/docs/tutorial.md
@@ -0,0 +1,209 @@
+---
+jupyter:
+  jupytext:
+    cell_metadata_filter: -all
+    formats: ipynb,md
+    text_representation:
+      extension: .md
+      format_name: markdown
+      format_version: '1.3'
+      jupytext_version: 1.13.1
+  kernelspec:
+    display_name: Python 3
+    language: python
+    name: python3
+---
+
+
+# V-pipe Tutorial
+
+V-pipe is a workflow designed for the analysis of next generation sequencing (NGS) data from viral pathogens. It produces a number of results in a curated format (e.g., consensus sequences, SNV calls, local/global haplotypes). V-pipe is written using the Snakemake workflow management system.
+
+## Requirements
+
+V-pipe is optimized for Linux and macOS systems. We therefore recommend that Windows users install WSL2, which is not a full virtual machine but rather a way to run Windows and Linux cooperatively at the same time.
+
+
+## Organizing Data
+
+V-pipe takes raw data in FASTQ format as input and, depending on the user-defined configuration, outputs consensus sequences, SNV calls and local/global haplotypes.
+
+V-pipe expects the input samples to be organized in a two-level hierarchy:
+
+* At the first level, input files are grouped by samples (e.g., patients or biological replicates of an experiment).
+* At the second level, different datasets belonging to the same sample (e.g., from different sample dates) are distinguished.
+* Inside the second-level directory, the sub-directory `raw_data` holds the sequencing data in FASTQ format (optionally compressed with GZip).
+* Paired-end reads need to be in split files with suffixes `_R1` and `_R2`.
+
+```
+samples
+|───patient1
+│   └───date1
+│       └───raw_data
+│           |───reads_R1.fastq
+│           └───reads_R2.fastq
+└───patient2
+    |───date1
+    |   └───raw_data
+    |       |───reads_R1.fastq
+    |       └───reads_R2.fastq
+    └───date2
+        └───raw_data
+            |───reads_R1.fastq
+            └───reads_R2.fastq
+```
+
+## Preparing a small dataset
+
+In the directory `example_HIV_data` you will find a small test dataset that you can run on your workstation or laptop.
+The files have the following structure:
+
+```
+samples
+|───CAP217
+│   └───4390
+│       └───raw_data
+│           |───reads_R1.fastq
+│           └───reads_R2.fastq
+└───CAP188
+    |───4
+    |   └───raw_data
+    |       |───reads_R1.fastq
+    |       └───reads_R2.fastq
+    └───30
+        └───raw_data
+            |───reads_R1.fastq
+            └───reads_R2.fastq
+```
+
+## Install V-pipe
+
+V-pipe uses the [Bioconda](https://bioconda.github.io/) bioinformatics software repository for all its pipeline components. The pipeline itself is implemented using [Snakemake](https://snakemake.readthedocs.io/en/stable/).
+
+For advanced users: If you are fluent with these tools, you can:
+
+* directly download and install [bioconda](https://bioconda.github.io/user/install.html) and [snakemake](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#installation-via-conda),
+* specify your V-pipe configuration, and start using V-pipe
+
+Use `--use-conda` to [automatically download and install](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#integrated-package-management) any further pipeline dependencies. Please refer to the documentation for additional instructions.
+
+In this tutorial you will learn how to set up a workflow for the example dataset.
+
+To deploy V-pipe, you can use the installation script with the following parameters:
+
+```bash
+curl -O 'https://raw.githubusercontent.com/cbg-ethz/V-pipe/master/utils/quick_install.sh'
+bash quick_install.sh -p testing -w work
+```
+
+Note that
+
+* using `-p` specifies the subdirectory where to download and install snakemake and V-pipe
+* using `-w` will create a working directory and populate it. It will copy over the references and the default `config/config.yaml`, and create a handy `vpipe` short-cut script to invoke `snakemake`.
+
+
+If you get `zsh: permission denied: ./quick_install.sh`, run `chmod +x quick_install.sh` to grant the necessary permissions.
+
+Tip: To create and populate other new working directories, you can call `init_project.sh` from within the new directory:
+
+```console
+mkdir -p working_2
+cd working_2
+../V-pipe/init_project.sh
+```
+
+
+## Preparation
+
+Move the `samples` directory you created in the step "Preparing a small dataset" to this working directory. You can display the directory structure with `tree samples` or `find samples`.
+
+```bash
+mv ./samples ./testing/work/resources/
+```
+
+### Reference
+If you have a reference sequence that you would like to use for read mapping and alignment, save it as `resources/reference/ref.fasta`. In our case, however, we will use the reference sequence HXB2 already provided by V-pipe in `V-pipe/resources/hiv/HXB2.fasta`.
+
+### Preparing V-pipe's configuration
+
+In the `work` directory you can find the file `config.yaml`. This is where the V-pipe configuration should be specified. See [here](https://github.com/cbg-ethz/V-pipe/tree/master/config#readme) for the documentation of the configuration. In this tutorial we are building our own configuration, therefore `virus_base_config` will remain empty. Since we are working with HIV-1, V-pipe provides meta information that will be used for visualisation (`metainfo_file` and `gff_directory`).
+
+```yaml
+general:
+    virus_base_config: ''
+    aligner: "bwa"
+    snv_caller: "shorah"
+    haplotype_reconstruction: "haploclique"
+
+input:
+    reference: "{VPIPE_BASEDIR}/../resources/hiv/HXB2.fasta"
+    metainfo_file: "{VPIPE_BASEDIR}/../resources/hiv/metainfo.yaml"
+    gff_directory: "{VPIPE_BASEDIR}/../resources/hiv/gffs/"
+    datadir: "{VPIPE_BASEDIR}/../../work/resources/samples"
+    read_length: 301
+    samples_file: samples.tsv
+    paired: true
+
+snv:
+    consensus: false
+
+output:
+    snv: true
+    local: true
+    global: true
+    visualization: true
+    QA: false
+    diversity: true
+```
+
+Note: YAML files use spaces for indentation. You can use 2 or 4 spaces, but no tabs. Online YAML validators can help if your YAML file is wrongly formatted.
+
+## Running V-pipe
+
+
+Before running, check what will be executed:
+
+```bash
+cd ./testing/work/
+./vpipe --dryrun
+```
+
+As this is your first run of V-pipe, it will also generate the sample collection table. Check `samples.tsv` in your editor.
+
+Note that the samples you have downloaded have reads of length 301 only. V-pipe’s default parameters are optimized for reads of length 250. To adapt to the read length, add a third column in the tab-separated file as follows:
+
+```bash
+cat ./testing/work/samples.tsv
+CAP217	4390	301
+CAP188	4	301
+CAP188	30	301
+```
+
+Always check the content of the `samples.tsv` file.
+
+If you did not use the correct directory structure, this file might end up empty or some entries might be missing.
+You can safely delete it and re-run with the option `--dryrun` to regenerate it.
+
+Finally, we can run the V-pipe analysis (the necessary dependencies will be downloaded and installed in conda environments managed by snakemake):
+
+```bash
+cd ./testing/work/
+./vpipe -p --cores 2
+```
+
+
+## Output
+
+The Wiki contains an overview of the output files. The output of the SNV calling step is aggregated in a standard [VCF](https://en.wikipedia.org/wiki/Variant_Call_Format) file, located in `samples/{hierarchy}/variants/SNVs/snvs.vcf`. You can open it with your favorite VCF tools for visualisation or downstream processing. It is also available in a tabular format in `samples/{hierarchy}/variants/SNVs/snvs.csv`.
+
+
+## Swapping components
+
+The default configuration uses ShoRAH to call the SNVs and to reconstruct the local (windowed) haplotypes.
+
+Components of the pipeline can be swapped simply by changing the `config.yaml` file. For example, to call SNVs using lofreq instead of ShoRAH, use:
+
+```yaml
+general:
+  snv_caller: lofreq
+```
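The tutorial above relies on the two-level `samples/<sample>/<dataset>/raw_data` layout and on a three-column `samples.tsv`. V-pipe generates that table itself on the first run, so the following is only a minimal sketch for checking a directory tree by hand. `collect_samples` and `to_tsv` are hypothetical helper names, and matching mates by the substrings `_R1` and `_R2` is a simplifying assumption.

```python
from pathlib import Path


def collect_samples(samples_dir, read_length=None):
    """List [sample, dataset(, read_length)] rows for a two-level hierarchy."""
    rows = []
    for raw_data in sorted(Path(samples_dir).glob("*/*/raw_data")):
        if not raw_data.is_dir():
            continue
        names = [p.name for p in raw_data.iterdir()]
        has_r1 = any("_R1" in n for n in names)
        has_r2 = any("_R2" in n for n in names)
        if not (has_r1 and has_r2):
            continue  # paired-end data needs both mate files
        dataset = raw_data.parent
        row = [dataset.parent.name, dataset.name]
        if read_length is not None:
            row.append(str(read_length))
        rows.append(row)
    return rows


def to_tsv(rows):
    """Format rows as tab-separated lines, one dataset per line."""
    return "".join("\t".join(row) + "\n" for row in rows)


if __name__ == "__main__":
    # Path taken from the tutorial; adjust it to your own working directory.
    print(to_tsv(collect_samples("testing/work/resources/samples", 301)), end="")
```

If a dataset is missing from the output, its directory probably does not follow the expected layout or lacks one of the two mate files.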
diff --git a/resources/auxiliary_workflows/benchmark/README.md b/resources/auxiliary_workflows/benchmark/README.md
@@ -0,0 +1,25 @@
+# Quasispecies Reconstruction Benchmark
+
+Benchmark quasispecies assembly methods at the level of both local and global haplotypes.
+
+## Usage
+
+To run the workflow, execute the following:
+
+```bash
+# locally (remove docker part if on linux)
+docker run --rm -v $PWD:/foo --workdir=/foo snakemake/snakemake:stable snakemake -prj1 --use-conda
+
+# on cluster
+./run_workflow.sh
+```
+
+## Adding new methods
+
+To run a new method/tool as part of the benchmark workflow, add a script to `resources/method_definitions/`.
+Each script must be classified as either `local` (produces a VCF file) or `global` (produces a FASTA file) by adding `# GROUP: local` or `# GROUP: global` respectively.
+Method dependencies can be specified as comments.
+Conda packages can be added by writing `# CONDA: <package name> = <version>`.
+Analogously, PIP packages can be added by writing `# PIP: <package name>`.
+Multiple packages can be added by repeating these lines.
+A conda environment will then be dynamically generated (when running Snakemake with `--use-conda`).
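The annotation scheme described in the README above (`# GROUP:`, `# CONDA:`, `# PIP:`) can be pictured with a small parser. This is a sketch under assumptions, not the workflow's actual implementation: the function names, the channel list, and the example package versions are all hypothetical.

```python
import re

# Matches lines such as "# CONDA: lofreq = 2.1.5"
HEADER = re.compile(r"^#\s*(GROUP|CONDA|PIP):\s*(.+?)\s*$")


def parse_method_header(script_text):
    """Collect GROUP, CONDA and PIP annotations from a method script."""
    info = {"group": None, "conda": [], "pip": []}
    for line in script_text.splitlines():
        match = HEADER.match(line)
        if match is None:
            continue
        key, value = match.groups()
        if key == "GROUP":
            info["group"] = value
        elif key == "CONDA":
            # "<package name> = <version>" becomes "package=version"
            info["conda"].append(re.sub(r"\s*=\s*", "=", value))
        else:
            info["pip"].append(value)
    if info["group"] not in ("local", "global"):
        raise ValueError("script must declare '# GROUP: local' or '# GROUP: global'")
    return info


def conda_environment(info):
    """Build a dictionary that mirrors a conda environment file."""
    deps = list(info["conda"])
    if info["pip"]:
        deps += ["pip", {"pip": list(info["pip"])}]
    # The channel list is an assumption made for this sketch.
    return {"channels": ["conda-forge", "bioconda"], "dependencies": deps}


if __name__ == "__main__":
    example = "# GROUP: local\n# CONDA: lofreq = 2.1.5\n# PIP: pandas\n"
    print(conda_environment(parse_method_header(example)))
```

Repeating a `# CONDA:` or `# PIP:` line simply appends another entry, which matches the "multiple packages" rule stated in the README.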
diff --git a/resources/auxiliary_workflows/benchmark/config/config.yaml b/resources/auxiliary_workflows/benchmark/config/config.yaml
@@ -0,0 +1,13 @@
+# `null` will execute all methods
+# list of strings will execute selected methods
+method_list: [haploclique, quasirecomb, predicthaplo, haploconduct, cliquesnv]
+
+replicate_count: 10
+
+haplotype_generation: distance  # distance or mutation_rate
+
+params_path: config/params.csv
+
+master_seq_path: null
+# if None, then generate MasterSequence by drawing bases
+# uniformly at random using the user-provided genome length
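The comment on `master_seq_path` above says that, without a path, the master sequence is drawn uniformly at random from the bases for the given genome length. A minimal sketch of that idea follows. The function name and the `seed` parameter are hypothetical and not taken from the workflow code.

```python
import random


def generate_master_sequence(genome_size, seed=None):
    """Draw each base independently and uniformly from A, C, G and T."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return "".join(rng.choices("ACGT", k=genome_size))


if __name__ == "__main__":
    sequence = generate_master_sequence(1000, seed=42)
    print(len(sequence), sequence[:20])
```

Using a dedicated `random.Random` instance keeps the draw independent of any other code that touches the global random state.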
diff --git a/resources/auxiliary_workflows/benchmark/config/params.csv b/resources/auxiliary_workflows/benchmark/config/params.csv
@@ -0,0 +1,10 @@
+# for distance mode:
+# haplos = n_group1,n_group2,d_group12,d_group1,d_group2,freq_dist,freq_param
+# for mutation mode:
+# haplos = mutation_rate,insertion_rate,deletion_rate,haplotype_pattern
+# parameters are separated with "@"
+seq_tech,seq_mode,seq_mode_param,read_length,genome_size,coverage,haplos
+illumina,shotgun,,240,1000,100,5@5@20@10@6@[email protected]
+# illumina,amplicon,400:100,200,1000,100,5@5@20@10@6@dirichlet@1:1:1:1:1:1:1:1:1:1
+# illumina,shotgun,,240,1000,100,0.1@0@[email protected]:0.4
+# illumina,amplicon,400:100,200,1000,100,0.1@0@[email protected]:0.4
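The header comments of `params.csv` above describe how the `haplos` column packs several parameters into one "@"-separated string. The sketch below shows one way to read such a table and unpack that column. It is an illustration, not the workflow's parser: `read_params` and `split_haplos` are hypothetical names, and the example row is the commented-out Dirichlet row from the file with its leading `#` removed.

```python
import csv
import io

# Field order as given in the header comments of params.csv
DISTANCE_FIELDS = ["n_group1", "n_group2", "d_group12", "d_group1",
                   "d_group2", "freq_dist", "freq_param"]
MUTATION_FIELDS = ["mutation_rate", "insertion_rate", "deletion_rate",
                   "haplotype_pattern"]


def read_params(text):
    """Read the parameter table, skipping blank lines and '#' comments."""
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.lstrip().startswith("#")]
    return list(csv.DictReader(io.StringIO("\n".join(lines))))


def split_haplos(haplos, mode):
    """Unpack the '@'-separated haplos string into named fields."""
    if mode == "distance":
        names = DISTANCE_FIELDS
    elif mode == "mutation_rate":
        names = MUTATION_FIELDS
    else:
        raise ValueError(f"unknown haplotype_generation mode: {mode}")
    values = haplos.split("@")
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} fields, got {len(values)}")
    return dict(zip(names, values))


if __name__ == "__main__":
    table = (
        "# parameters are separated with \"@\"\n"
        "seq_tech,seq_mode,seq_mode_param,read_length,genome_size,coverage,haplos\n"
        "illumina,amplicon,400:100,200,1000,100,"
        "5@5@20@10@6@dirichlet@1:1:1:1:1:1:1:1:1:1\n"
    )
    row = read_params(table)[0]
    print(split_haplos(row["haplos"], "distance"))
```

In this row the Dirichlet parameter has ten entries, matching the 5 + 5 haplotypes of the two groups.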
diff --git a/...auxiliary_workflows/benchmark/resources/local_haplotype_setup/config_amplicon/config.yaml b/...auxiliary_workflows/benchmark/resources/local_haplotype_setup/config_amplicon/config.yaml
@@ -0,0 +1,5 @@
+method_list: [lofreq_local_haplo, shorah_default, shorah_mfa_qualities_unique_uniform, shorah_mfa_s1_a0.000001_relaxConv_uniform]
+replicate_count: 1
+haplotype_generation: mutation_rate
+params_path: config_amplicon/params.csv
+master_seq_path: /Users/lfuhrmann/Documents/Projects/V-pipe/resources/hiv/HXB2.fasta
diff --git a/.../auxiliary_workflows/benchmark/resources/local_haplotype_setup/config_amplicon/params.csv b/.../auxiliary_workflows/benchmark/resources/local_haplotype_setup/config_amplicon/params.csv
@@ -0,0 +1,7 @@
+seq_tech,seq_mode,seq_mode_param,read_length,genome_size,coverage,haplos
+illumina,amplicon,400:100,200,9720,1000,0.1@0@[email protected]:0.2:0.05:0.05
+#illumina,amplicon,400:100,200,9720,1000,0.01@0@[email protected]:0.2:0.05:0.05
+#illumina,amplicon,400:100,200,9720,1000,0.001@0@[email protected]:0.2:0.05:0.05
+#illumina,amplicon,400:10,200,9720,1000,0.1@0@[email protected]:0.2:0.05:0.05
+#illumina,amplicon,400:10,200,9720,1000,0.01@0@[email protected]:0.2:0.05:0.05
+#illumina,amplicon,400:10,200,9720,1000,0.001@0@[email protected]:0.2:0.05:0.05
diff --git a/...auxiliary_workflows/benchmark/resources/local_haplotype_setup/config_distance/config.yaml b/...auxiliary_workflows/benchmark/resources/local_haplotype_setup/config_distance/config.yaml
@@ -0,0 +1,5 @@
+method_list: [lofreq_local_haplo, shorah_default_amplicon, shorah_mfa_qualities_unique, shorah_mfa_s1_a0.000001_relaxConv]
+replicate_count: 1
+haplotype_generation: distance
+params_path: config_distance/params.csv
+master_seq_path: null
diff --git a/.../auxiliary_workflows/benchmark/resources/local_haplotype_setup/config_distance/params.csv b/.../auxiliary_workflows/benchmark/resources/local_haplotype_setup/config_distance/params.csv
@@ -0,0 +1,5 @@
+seq_tech,seq_mode,seq_mode_param,read_length,genome_size,coverage,haplos
+illumina,shotgun,,201,201,100,5@5@20@5@10@[email protected]
+illumina,shotgun,,201,201,1000,5@5@20@5@10@[email protected]
+illumina,shotgun,,201,201,100,5@20@10@5@5@[email protected]
+illumina,shotgun,,201,201,1000,5@20@10@5@5@[email protected]
diff --git a/...uxiliary_workflows/benchmark/resources/local_haplotype_setup/config_longreads/config.yaml b/...uxiliary_workflows/benchmark/resources/local_haplotype_setup/config_longreads/config.yaml
@@ -0,0 +1,5 @@
+method_list: [lofreq_local_haplo, shorah_default, shorah_mfa_qualities_unique_uniform, shorah_mfa_s1_a0.000001_relaxConv_uniform, shorah_mfa_stick_qualities_unique_uniform, shorah_mfa_qualities_unique, shorah_mfa_s1_a0.000001_relaxConv]
+replicate_count: 1
+haplotype_generation: distance
+params_path: config_longreads/params.csv
+master_seq_path: ../../hiv/HXB2.fasta
diff --git a/...auxiliary_workflows/benchmark/resources/local_haplotype_setup/config_longreads/params.csv b/...auxiliary_workflows/benchmark/resources/local_haplotype_setup/config_longreads/params.csv
@@ -0,0 +1,4 @@
+seq_tech,seq_mode,seq_mode_param,read_length,genome_size,coverage,haplos
+pacbio,shotgun,,9720,9720,500,5@5@200@50@100@[email protected]
+pacbio,shotgun,,9720,9720,1000,5@5@200@50@100@[email protected]
+pacbio,shotgun,,9720,9720,2000,5@5@200@50@100@[email protected]
diff --git a/...liary_workflows/benchmark/resources/local_haplotype_setup/config_mutationrate/config.yaml b/...liary_workflows/benchmark/resources/local_haplotype_setup/config_mutationrate/config.yaml
@@ -0,0 +1,5 @@
+method_list: [lofreq_local_haplo, shorah_default_amplicon, shorah_mfa_qualities_unique, shorah_mfa_s1_a0.000001_relaxConv]
+replicate_count: 1
+haplotype_generation: mutation_rate
+params_path: config_mutationrate/params.csv
+master_seq_path: null
diff --git a/...iliary_workflows/benchmark/resources/local_haplotype_setup/config_mutationrate/params.csv b/...iliary_workflows/benchmark/resources/local_haplotype_setup/config_mutationrate/params.csv
@@ -0,0 +1,5 @@
+seq_tech,seq_mode,seq_mode_param,read_length,genome_size,coverage,haplos
+illumina,shotgun,,201,201,100,0.1@0@[email protected]:0.2:0.05:0.05
+illumina,shotgun,,201,201,100,0.01@0@[email protected]:0.2:0.05:0.05
+illumina,shotgun,,201,201,1000,0.1@0@[email protected]:0.2:0.05:0.05
+illumina,shotgun,,201,201,1000,0.01@0@[email protected]:0.2:0.05:0.05
diff --git a/...auxiliary_workflows/benchmark/resources/local_haplotype_setup/config_realdata/config.yaml b/...auxiliary_workflows/benchmark/resources/local_haplotype_setup/config_realdata/config.yaml
@@ -0,0 +1,5 @@
+method_list: [lofreq_local_haplo, shorah_default, shorah_mfa_qualities_unique_uniform, shorah_mfa_s1_a0.000001_relaxConv_uniform]
+replicate_count: 1
+haplotype_generation: null
+params_path: config_realdata/params.csv
+master_seq_path: null
diff --git a/.../auxiliary_workflows/benchmark/resources/local_haplotype_setup/config_realdata/params.csv b/.../auxiliary_workflows/benchmark/resources/local_haplotype_setup/config_realdata/params.csv
@@ -0,0 +1,5 @@
+seq_tech,seq_mode,seq_mode_param,read_length,genome_size,coverage,haplos
+#illumina,real_data,5-virus-mix@1,,,,
+#illumina,real_data,[email protected],,,,
+illumina,real_data,[email protected],,,,
+illumina,real_data,[email protected],,,,
diff --git a/...workflows/benchmark/resources/local_haplotype_setup/config_realdata_SARS_CoV2/config.yaml b/...workflows/benchmark/resources/local_haplotype_setup/config_realdata_SARS_CoV2/config.yaml
@@ -0,0 +1,11 @@
+# `null` will execute all methods
+# list of strings will execute selected methods
+method_list: [lofreq_local_haplo, shorah_mfa_qualities_unique_uniform, shorah_mfa_s1_a0.000001_relaxConv_uniform]
+
+replicate_count: 1
+
+haplotype_generation: null  # distance or mutation_rate
+
+params_path: config_realdata_CoV2/params.csv
+
+master_seq_path: null
diff --git a/..._workflows/benchmark/resources/local_haplotype_setup/config_realdata_SARS_CoV2/params.csv b/..._workflows/benchmark/resources/local_haplotype_setup/config_realdata_SARS_CoV2/params.csv
@@ -0,0 +1,5 @@
+seq_tech,seq_mode,seq_mode_param,read_length,genome_size,coverage,haplos
+illumina,amplicon,real_data,2-SARS-CoV-2-mix,C2_Wild_10_03,0.25,
+illumina,amplicon,real_data,2-SARS-CoV-2-mix,E1_Wild_50_02,0.25,
+illumina,amplicon,real_data,2-SARS-CoV-2-mix,G1_Wild_100_01,0.25,
+illumina,amplicon,real_data,2-SARS-CoV-2-mix,H1_Wild_100_02,0.25,
diff --git a/resources/auxiliary_workflows/benchmark/resources/local_haplotype_setup/lsf.yaml b/resources/auxiliary_workflows/benchmark/resources/local_haplotype_setup/lsf.yaml
@@ -0,0 +1,2 @@
+__default__:
+        - "-R \"select[model==EPYC_7H12]\""
diff --git a/resources/auxiliary_workflows/benchmark/resources/local_haplotype_setup/run_workflow.sh b/resources/auxiliary_workflows/benchmark/resources/local_haplotype_setup/run_workflow.sh
@@ -0,0 +1,16 @@
+#!/usr/bin/env bash
+
+bsub \
+  -N \
+  -R 'rusage[mem=5000]' \
+  -W 120:00 \
+  -oo snake.out -eo snake.err \
+snakemake \
+  --profile lsf \
+  --rerun-incomplete \
+  -pr \
+  --cores 200 \
+  --use-conda \
+  --latency-wait 30 \
+  --show-failed-logs \
+  "$@"