Feature tutorial - pull new tutorial with example data to master branch #131

Merged
merged 271 commits into from
Oct 1, 2022
dc2cfd4
Add runtime overview plots
kpj Dec 12, 2021
7295342
Dynamically generate conda envs for each job based on job params
kpj Dec 13, 2021
2399205
Refactor helper function definitions
kpj Dec 14, 2021
0288bea
Make dynamically generated conda envs compatible with official Snakem…
kpj Dec 14, 2021
ec4f44c
Support PIP packages in dynamically generated conda envs
kpj Dec 14, 2021
5b72452
Explain dynamic conda env generation in README.md
kpj Dec 14, 2021
22adf55
Format shorah_default.py
kpj Dec 15, 2021
a9df076
save performance measures also as csv
Jan 7, 2022
e83f2a1
Remove obsolete file
kpj Jan 9, 2022
38f1727
Set Python version of dynamic conda envs to 3.9
kpj Jan 19, 2022
cd0f31b
compute performance: fixed corner cases
Feb 4, 2022
42fb2db
compute performance: another fix - woopsie
Feb 4, 2022
b5ba386
inclusion of new inference method for shorah
Feb 6, 2022
94f9e6d
Set boost and htslib for dynamic conda envs
Feb 10, 2022
d2f05c9
Print method list to stderr
kpj Feb 11, 2022
01e5db6
Merge branch 'feature-benchmark' of github.com:cbg-ethz/V-pipe into f…
kpj Feb 11, 2022
411ad04
Revert "Set boost and htslib for dynamic conda envs"
kpj Feb 11, 2022
149023c
Add boost and htslib dependencies to method scripts
kpj Feb 11, 2022
2be095f
Format method scripts
kpj Feb 11, 2022
b0c638a
[Update] update to new shorah_canary version
Apr 7, 2022
bafa08a
[Refactor] seperate rules for haplotype generation and read simulation
Apr 8, 2022
218a0f8
[Added] Option to simulate haplotype population based on pairwise dis…
Apr 11, 2022
0ca9860
[Added] Option to simulate haplotype population based on pairwise dis…
Apr 11, 2022
f989d2a
[Refactor] shorter column names
Apr 11, 2022
d2c8799
[Added] amplicon simulation scheme
Apr 11, 2022
7ce4244
[Reformatted] with black
Apr 11, 2022
922d3dd
[Added] amplicon simulation scheme
Apr 11, 2022
a654f3b
[delete] reads are simulated in amplicon_simulation.py or shotgun_sim…
Apr 11, 2022
3cc740c
[Added] amplicon simulation
Apr 11, 2022
e548c8d
[Added] amplicon simulation
Apr 11, 2022
3096962
[Fix] Catch case where tp == 0
Apr 12, 2022
5b40f8a
[Added] amplicon simulation with specification of amplicon length and…
Apr 12, 2022
668babb
[Added] lofreq as method
Apr 12, 2022
14b38e7
Format code
kpj Apr 12, 2022
31f6382
Flexible seq_mode
Apr 12, 2022
b15ce78
Flexible seq_mode
Apr 12, 2022
25277e4
[Fix] 0/1-bug as insert.bed is 1-based
Apr 13, 2022
5e1c884
Format Snakefile
kpj Apr 14, 2022
2a2bb12
Set default replicate count back to 10
kpj Apr 14, 2022
8fade76
Support global haplotype reconstruction methods in benchmark
kpj Apr 14, 2022
babaf74
Mention method grouping in readme
kpj Apr 14, 2022
a92f744
Format some rules
kpj Apr 20, 2022
bd3fda3
Add PredictHaplo script draft
kpj Apr 21, 2022
972cb41
Allow commenting out entries in params.csv
kpj Apr 21, 2022
5b87483
Add cliquesnv method
kpj Apr 21, 2022
335f448
Keep header when converting BAM to SAM for PredictHaplo
kpj Apr 21, 2022
3c2c31f
Add quasirecomb method
kpj Apr 21, 2022
88f4081
Add missing samtools dependencies
kpj Apr 21, 2022
f28ffc2
Mention docker in readme to allow linux-only tools on macOS
kpj Apr 21, 2022
bc3f933
Add haploconduct (savage) method
kpj Apr 21, 2022
edfc24a
Add pehaplo method (draft due to crash)
kpj Apr 21, 2022
b89a250
Index BAM as part of workflow
kpj Apr 21, 2022
49a35ef
Use subprocess.run with check=True to raise Python exception when too…
kpj Apr 21, 2022
acec29a
Add abayesqr method (draft)
kpj Apr 21, 2022
ffb6f41
Add HCC conda channel
kpj Apr 21, 2022
8d35012
Improve handling of rule which produces different output files based …
kpj Apr 27, 2022
9217a1e
Let tools use unmerged (paired) BAM in amplicon mode
kpj May 3, 2022
86b9bf8
Run predicthaplo with '--min_overlap_factor 0.1'
kpj May 3, 2022
4719cc7
Implement experimental result aggregation for predicthaplo
kpj May 3, 2022
931305b
Save all ground truth haplotypes in single fasta file
kpj May 3, 2022
0e7d939
Make individual haplotype caching simpler
kpj May 3, 2022
63583d8
Provide ground truth haplotypes to global performance evaluation
kpj May 3, 2022
5f66b62
Fix haploclique being unable to parse reads by converting X/= to M in…
kpj May 4, 2022
f6f848d
Run metaquast in global performance evaluation
kpj May 4, 2022
6726a54
Retain header when converting SAM to BAM
kpj May 4, 2022
635fbd2
Retain haplotype frequency information
kpj May 5, 2022
03e1812
Add MDS of global haplotypes
kpj May 5, 2022
b0c25c8
Parse haplotype frequency as float
kpj May 5, 2022
554daf1
Improve reference renaming in shotgun simulation
kpj May 6, 2022
d08abc1
Automatically create coverage plots from BAM
kpj May 6, 2022
06e71b7
Run 'samtools merge' with -f
kpj May 6, 2022
06fb9a4
Run metaquast with --unique-mapping
kpj May 6, 2022
dd6d11d
Use -f instead of -c in read simulation
kpj May 6, 2022
ba7ffaf
Fix yaml formatting
kpj Jun 8, 2022
a9a987b
Store frequency in ground truth
kpj Jun 22, 2022
6a4c265
shortening columns names in params.csv
Jun 22, 2022
66e62c5
[Added] feature computing diversity measures for generated haplotype …
Jun 22, 2022
dc92462
Add multi-steup workflow
kpj Jun 23, 2022
e104ed6
adapted params.csv such that it matches with config.yaml
Jun 23, 2022
3947119
parameters for distance simulation
Jun 23, 2022
9f36453
parameters for mutation_rate simulation
Jun 23, 2022
4234d3f
[mutation_rate mode] allow numerical inaccuracy for sum of frequency …
Jun 23, 2022
1462662
added description
Jun 23, 2022
8c99131
new strucutre of params.csv such that column names are the same for d…
Jun 23, 2022
68f217f
Fix propagation of script path
kpj Jun 23, 2022
a16e5be
Improve paramspace separator
kpj Jun 23, 2022
9faede4
Fix filename parsing
kpj Jun 23, 2022
0d47ad5
Add missing fix
kpj Jun 23, 2022
009e800
Improve overview plot
kpj Jun 23, 2022
0ebdffb
Make frequency extraction more flexible
kpj Jun 23, 2022
254c512
[outdated] outdated methods in resources
Jun 24, 2022
9583213
in case of multi_setup the paths are one directory longer
Jun 24, 2022
347e69e
Improve sam reference name replacement
kpj Jun 27, 2022
d3d9d51
Merge branch 'feature-benchmark' of github.com:cbg-ethz/V-pipe into f…
kpj Jun 27, 2022
5c3aa6c
workflow for local hapltype reconstruction
Jun 27, 2022
3382021
Merge branch 'feature-benchmark' of github.com:cbg-ethz/V-pipe into f…
kpj Jun 27, 2022
a40529b
Add regresshaplo script
kpj Jun 27, 2022
10c3f9e
Add regresshaplo script
kpj Jun 27, 2022
d566a3e
Fix filename parsing
kpj Jun 27, 2022
7f92089
Improve catplot
kpj Jun 27, 2022
1558eb9
Set method field of ground truth dataframe
kpj Jun 27, 2022
f7d2dab
Update PredictHaplo script
kpj Jun 27, 2022
ae7f434
run with conda
Jun 27, 2022
d1361d3
added resources for methods
Jun 27, 2022
8b07620
Add missing make dependency to predicthaplo script
kpj Jun 27, 2022
5f7d276
Fix frequency field of quasirecomb
kpj Jun 27, 2022
030cd90
Parse haplostats in global evaluation
kpj Jun 28, 2022
a1a0c6b
Fix typo
kpj Jun 28, 2022
145cc1c
Add PR measures
kpj Jun 28, 2022
4a16bc2
Add diversity based plot
kpj Jun 28, 2022
4bcc2e5
Plot PR summaries for multiple diversity measures
kpj Jun 29, 2022
f46ffcd
Reorder plot execution
kpj Jun 29, 2022
d5a6703
Merge branch 'feature-benchmark' of https://github.com/cbg-ethz/V-pip…
Jun 29, 2022
51243ae
Select working method subset
kpj Jun 29, 2022
f0b03f8
Start adding resource requirements
kpj Jun 29, 2022
2ffa5ca
Formatting
kpj Jun 29, 2022
49ab69e
Code formatting
kpj Jun 29, 2022
ff23b4d
Start using RNG seeds
kpj Jun 29, 2022
669a89d
Fix comment
kpj Jun 29, 2022
70588c3
Improve plots
kpj Jun 29, 2022
cb69c6a
Fix CliqueSNV frequency field
kpj Jun 29, 2022
3077931
Fix workflow runners
kpj Jul 1, 2022
79fdbcf
merge stuff
Jul 1, 2022
07cdc9c
[add] methods for local_haplotype_setup
Jul 1, 2022
fd98ba3
add lsf.yaml for cluster running on same node
Jul 1, 2022
b6316da
Give shotgun_simulation more resources
kpj Jul 3, 2022
dac9977
Allow proper handling of crashed and timed out method runs
kpj Jul 3, 2022
097447c
Save performance results to file
kpj Jul 3, 2022
c3e84e1
Merge branch 'feature-benchmark' of github.com:cbg-ethz/V-pipe into f…
kpj Jul 3, 2022
63d6ba4
Fix performance input file enumeration
kpj Jul 3, 2022
96261da
Set method column to categorical
kpj Jul 3, 2022
eebe82e
Fix point clipping in limited plots
kpj Jul 3, 2022
05a8710
Add 5-virus-mix as real data benchmark case
kpj Jul 4, 2022
7c390c9
Decrease method timeout duration
kpj Jul 4, 2022
de1d094
Fix filename splitting in sub workflows
kpj Jul 5, 2022
a085ce1
5-virus-mix is indeed Illumina
kpj Jul 5, 2022
16351af
Make filename parsing more robust
kpj Jul 5, 2022
08020f0
Increase cluster resources for rule performance_measures_global
kpj Jul 5, 2022
c69ef6e
Give cliquesnv more resources
kpj Jul 5, 2022
e2cbccf
Fix haploclique preprocessing crash when CIGAR string is None
kpj Jul 5, 2022
7fc53f8
Run methods with more threads
kpj Jul 5, 2022
b534f20
Fix crash in rule haplotypes_stats when input ground truth is empty
kpj Jul 5, 2022
8d0554a
Subsample large haplotype numbers before MDS embedding
kpj Jul 6, 2022
6c2ea77
Improve MDS plots
kpj Jul 7, 2022
17da9b5
Fix PR computation and visualization
kpj Jul 7, 2022
5cede1c
Add more progress bars
kpj Jul 7, 2022
c3345d1
Small improvement
kpj Jul 7, 2022
4f58cdc
Use LRU cache in distance computation
kpj Jul 7, 2022
1433c0e
Fix PR calculation
kpj Jul 7, 2022
8ea6364
Fix PR plot
kpj Jul 7, 2022
b7c1b12
Prevent label overlap
kpj Jul 7, 2022
4c67dd7
Add distinct distance cases
kpj Jul 7, 2022
c52ba24
Save MDS results
kpj Jul 8, 2022
83a7fe6
Add automatic split_num estimation for haploconduct (savage)
kpj Jul 11, 2022
7ad7ea9
Fix call to pysam.depth
kpj Jul 11, 2022
9bf114c
Minor improvement
kpj Jul 11, 2022
9dd5fe1
Enable reference mode and set threads for HaploConduct
kpj Jul 11, 2022
cc84fef
Add dummy frequencies to HaploConduct results
kpj Jul 12, 2022
f3668a8
Create non-haploclique MDS plots
kpj Jul 12, 2022
d58da53
Create benchmark plots for global methods
kpj Jul 12, 2022
3b9bd7d
Disable quast profiling
kpj Jul 12, 2022
3b6749b
Make subsetted MDS plot more general
kpj Jul 12, 2022
63a26e9
Improve execution of QuasiRecomb
kpj Jul 12, 2022
800005b
Subsample PR input to prevent very long runtimes
kpj Jul 13, 2022
6d01da7
[long reads simulation] download hmm model for pbsim2
Jul 13, 2022
ec261a5
Make HaploConduct use biopython version which is compatible with Pyth…
kpj Jul 14, 2022
e35aad9
Merge branch 'feature-benchmark' of github.com:cbg-ethz/V-pipe into f…
kpj Jul 14, 2022
ad07e6c
[updated] relevant shorah versions
Jul 14, 2022
35be1be
merge branch 'feature-benchmark' of https://github.com/cbg-ethz/V-pip…
Jul 14, 2022
1f6af7f
[updated] relevant shorah versions
Jul 14, 2022
1fa6c9e
[fix] download of pbsim2-model for local_hapltoype_setup
Jul 14, 2022
3c15bca
[fix] download of pbsim2-model for local set up
Jul 14, 2022
656a173
[update] to run in local_haplo_setup
Jul 14, 2022
9be0344
[Add] rule collecting haplotype stats into one file
Jul 15, 2022
1caaee8
Update method lists
kpj Jul 16, 2022
d900f36
Benchmark real data with subsampling
kpj Jul 16, 2022
6e9fcad
Fix conda env path of rule 'provide_real_data'
kpj Jul 16, 2022
e376d28
Subsample bam file with deterministic seed
kpj Jul 16, 2022
364cf22
Fix CliqueSNV crash when no haplotypes were found
kpj Jul 17, 2022
33bf7b9
Remove haploclique from real_data benchmark because it uses too much …
kpj Jul 17, 2022
34b74b0
Add 5VM strain frequencies to ground truth fasta file
kpj Jul 17, 2022
c76dee8
Fix crash for missing diversity columns
kpj Jul 17, 2022
4ccc049
Update multi_setup output
kpj Jul 17, 2022
6931cf4
Improve rule 'download_pbsim2_model'
kpj Jul 17, 2022
396e7e2
Remove deletion markers in CliqueSNV results
kpj Jul 17, 2022
617a5fe
[Added] option to have a user-provided master sequence
Jul 18, 2022
e984a5b
Improve quast plots
kpj Jul 18, 2022
cbf88a3
[Added] option to have a user-provided master sequence
Jul 18, 2022
e646e7e
Fix quast execution
kpj Jul 18, 2022
56f602e
Re-enable quast performance measures
kpj Jul 18, 2022
692cab3
Make cluster resource allocation more flexible
kpj Jul 18, 2022
c49da20
Merge branch 'feature-benchmark' of github.com:cbg-ethz/V-pipe into f…
kpj Jul 18, 2022
a07d41d
Set config.master_seq_path to null by default
kpj Jul 18, 2022
b3959f8
Fix formatting warning
kpj Jul 18, 2022
26c4ca5
Add disabled auto-restart on cluster errors
kpj Jul 18, 2022
147525b
Remove deletion markers from PredictHaplo output
kpj Jul 18, 2022
b327625
Refactor parameter formatting
kpj Jul 18, 2022
70eca06
[local_haplto_setup] method definition shorah
Jul 18, 2022
d70fd50
Only keep varying parameters as x-axis labels
kpj Jul 18, 2022
f42d65d
Sort parameters using natsort
kpj Jul 18, 2022
6ac02c0
More formatting
kpj Jul 18, 2022
5e92d1d
Merge branch 'feature-benchmark' of github.com:cbg-ethz/V-pipe into f…
kpj Jul 18, 2022
c26c0d5
Set default edit distance to 0.01
kpj Jul 18, 2022
b79ab63
Re-enable 'distance_varyparams' subworkflow
kpj Jul 18, 2022
6936d08
Add 'master_seq_path' to subworkflow configs
kpj Jul 18, 2022
d3f52bc
Temporary formatting (maybe delete later for easier merging)
kpj Jul 18, 2022
483bd0d
[local_haplo_setup] correction output file of shorah_default
Jul 18, 2022
d40b8c6
[Added] option to have a user-provided master sequence
Jul 18, 2022
8d325b4
[local_haplo_setup] added real_data benchmark
Jul 18, 2022
2000c18
[local_haplo_setup] added real_data benchmark
Jul 18, 2022
de57e3d
Merge branch 'feature-benchmark' of https://github.com/cbg-ethz/V-pip…
Jul 18, 2022
b958c6a
formatting
Jul 18, 2022
b0a6ce9
Set label rotation to 45 degrees
kpj Jul 18, 2022
ed76b50
Parallelize PR computation
kpj Jul 18, 2022
3de9319
Parallelize MDS computation
kpj Jul 18, 2022
90e2e64
Set performance rule thread count to 10
kpj Jul 18, 2022
9b6f52f
Save performance CSVs earlier
kpj Jul 18, 2022
5a8e403
Skip MetaQUAST execution for empty contig files
kpj Jul 19, 2022
f25cbb3
Fix TP computation
kpj Jul 19, 2022
b16fa91
Improve quast plots
kpj Jul 19, 2022
bf71c63
Improve sorting of parameter strings
kpj Jul 19, 2022
793ba95
[update] benchmark settings for local_haplo_setup
Jul 20, 2022
daad684
[update]methods for local_haplo_setup
Jul 20, 2022
f71fb03
[added] lofreq to the benchmark methods
Jul 20, 2022
0bdd055
[local_haplotype_setup] config correction
Jul 20, 2022
f7fb958
[local_haplotype_setup] correction
Jul 20, 2022
ff01b2e
correct version
Jul 20, 2022
a8aeca7
some updates
Jul 20, 2022
27a8d95
[local_haplotype_setup] added run status
Jul 20, 2022
65be0ed
[local_haplotype_setup] correction config
Jul 21, 2022
53627be
[local_haplotype_setup] correction
Jul 21, 2022
6be24cf
[local_haplotype_setup] change of alpha in shorah method
Jul 21, 2022
0d3d862
[local_haplotype_setup] higher memory for realdata setup
Jul 21, 2022
33b6722
[local_haplotype_setup] realdata config
Jul 21, 2022
89d38f9
[local_haplotype_setup] increase resources for run method
Jul 22, 2022
0ac299b
[local_haplotype_setup] config_realdata -- subsampling
Jul 25, 2022
759eca5
[local_haplotype_setup] update methods in config files
Jul 25, 2022
d249e59
[local_haplotype_setup] config for long reads
Jul 25, 2022
dbc6413
[local_haplotype_setup] added amplicon illumina simulaation, adapted …
Jul 25, 2022
7e4bfa2
[local_haplotype_setup] running full benchmark for all configs
Jul 25, 2022
89980b1
Use conda function to dynamically generate conda envs
kpj Jul 26, 2022
7e67f8f
Integration of cowwid amplicon benchmarking data
Jul 26, 2022
fa3b74e
[local_haplotype_setup] add benchmarking on real data SARS-CoV-2 2-st…
Jul 26, 2022
a9acef0
initial steps for general tutorial
Aug 3, 2022
3027079
[updated]
Aug 22, 2022
948bb72
[update+add] added example HIV data, updated tutorial
Sep 18, 2022
ea7a1cd
smallcorrections
Sep 18, 2022
c205d53
smallcorrections
Sep 18, 2022
b1d7638
smallcorrections
Sep 18, 2022
62bad5a
Revert "Temporary formatting (maybe delete later for easier merging)"
DrYak Oct 1, 2022
20 changes: 20 additions & 0 deletions docs/example_HIV_data/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
The samples were taken from the publication Abrahams et al. (2019), Science translational medicine 11.513 (DOI: 10.1126/scitranslmed.aaw5589).

We downloaded the following multiplexed Illumina MiSeq HIV data from the Sequence Read Archive (SRA): SRR9588830, SRR9588828, SRR9588844 and SRR9588785. They were taken from HIV-1-positive patients at different time points post-infection.

To download the data, we used `sra-tools`.

```bash
mkdir -p samples/CAP188/4/
cd samples/CAP188/4/
fastq-dump -O raw_data --split-e SRR9588828
```

Using the `--split-e` option, the reads are downloaded separated into forward and reverse reads. More information on `fastq-dump` can be found here: https://edwards.flinders.edu.au/fastq-dump/

We aligned the reads to the HIV strain HXB2, retrieved the reads covering the region HXB2:2453-3356, and subsampled them further to obtain a sample small enough to run on a laptop.

```
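# region queries need an indexed BAM (e.g. samtools index REF_aln.bam); -b writes BAM instead of SAM
samtools view -b -h REF_aln.bam "HXB2:2453-3356" > output_region.bam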
samtools view -s 0.10 -b output_region.bam > output_region_subsample.bam
```
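
The idea behind the `-s 0.10` subsampling step can be illustrated with a small Python sketch. This is not samtools' actual implementation: the hash function, the `keep_read` helper and the read names are assumptions made for illustration. It only shows why deciding per read name from a seeded hash is reproducible and keeps both mates of a pair together.

```python
import hashlib


def keep_read(read_name: str, seed: int = 0, fraction: float = 0.10) -> bool:
    """Decide deterministically whether a read (or read pair) is kept.

    Hypothetical helper: maps seed + read name to a number in [0, 1)
    and keeps the read if that number falls below `fraction`.
    """
    digest = hashlib.sha256(f"{seed}:{read_name}".encode()).digest()
    value = int.from_bytes(digest[:8], "big") / 2**64
    return value < fraction


# Made-up read names standing in for the records of output_region.bam.
read_names = [f"read{i}" for i in range(10_000)]
kept = [name for name in read_names if keep_read(name)]
print(f"kept {len(kept)} of {len(read_names)} reads")
```

Because the decision depends only on the seed and the read name, re-running the sketch selects the same reads, and two mates sharing a name are either both kept or both dropped.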
6,556 changes: 6,556 additions & 0 deletions docs/example_HIV_data/samples/CAP188/30/raw_data/readsEnv_R1.fastq

Large diffs are not rendered by default.

6,740 changes: 6,740 additions & 0 deletions docs/example_HIV_data/samples/CAP188/30/raw_data/readsEnv_R2.fastq


9,256 changes: 9,256 additions & 0 deletions docs/example_HIV_data/samples/CAP188/4/raw_data/readsEnv_R1.fastq


9,400 changes: 9,400 additions & 0 deletions docs/example_HIV_data/samples/CAP188/4/raw_data/readsEnv_R2.fastq


13,784 changes: 13,784 additions & 0 deletions docs/example_HIV_data/samples/CAP217/4390/raw_data/readsEnv_R1.fastq


18,752 changes: 18,752 additions & 0 deletions docs/example_HIV_data/samples/CAP217/4390/raw_data/readsEnv_R2.fastq


209 changes: 209 additions & 0 deletions docs/tutorial.md
@@ -0,0 +1,209 @@
---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    formats: ipynb,md
    text_representation:
      extension: .md
      format_name: markdown
      format_version: '1.3'
      jupytext_version: 1.13.1
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
---


# V-pipe Tutorial

V-pipe is a workflow designed for the analysis of next-generation sequencing (NGS) data from viral pathogens. It produces a number of results in a curated format (e.g., consensus sequences, SNV calls, local/global haplotypes). V-pipe is written using the Snakemake workflow management system.

## Requirements

V-pipe is optimized for Linux and macOS systems. We therefore recommend that Windows users install WSL2, which is not a full virtual machine but rather a way to run Windows and Linux side by side.


## Organizing Data

V-pipe takes raw sequencing data in FASTQ format as input and, depending on the user-defined configuration, outputs consensus sequences, SNV calls and local/global haplotypes.

V-pipe expects the input samples to be organized in a two-level hierarchy:

At the first level, input files are grouped by samples (e.g., patients or biological replicates of an experiment).
At the second level, different datasets belonging to the same sample (e.g., from different sampling dates) are distinguished.
Inside the second-level directory, the sub-directory `raw_data` holds the sequencing data in FASTQ format (optionally compressed with gzip).
Paired-end reads need to be in split files with suffixes `_R1` and `_R2`.

```
samples
├───patient1
│   └───date1
│       └───raw_data
│           ├───reads_R1.fastq
│           └───reads_R2.fastq
└───patient2
    ├───date1
    │   └───raw_data
    │       ├───reads_R1.fastq
    │       └───reads_R2.fastq
    └───date2
        └───raw_data
            ├───reads_R1.fastq
            └───reads_R2.fastq
```
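
Before running the pipeline, it can help to check that a `samples` directory really follows this two-level layout. The following is a minimal sketch using only the Python standard library; the helper name `find_datasets` is hypothetical and not part of V-pipe, and the example tree is created in a temporary directory so that the snippet runs anywhere.

```python
from pathlib import Path
import tempfile


def find_datasets(samples_dir: Path) -> list[tuple[str, str, list[str]]]:
    """Return (sample, dataset, FASTQ file names) for each raw_data directory."""
    datasets = []
    for raw_data in sorted(samples_dir.glob("*/*/raw_data")):
        fastqs = sorted(
            p.name
            for p in raw_data.iterdir()
            if p.name.endswith((".fastq", ".fastq.gz"))
        )
        dataset_dir = raw_data.parent
        datasets.append((dataset_dir.parent.name, dataset_dir.name, fastqs))
    return datasets


# Build a throw-away example tree mirroring the layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "samples"
    layout = [("patient1", "date1"), ("patient2", "date1"), ("patient2", "date2")]
    for sample, date in layout:
        raw = root / sample / date / "raw_data"
        raw.mkdir(parents=True)
        for mate in ("R1", "R2"):
            (raw / f"reads_{mate}.fastq").touch()
    found = find_datasets(root)

for sample, date, fastqs in found:
    print(sample, date, fastqs)
```

A dataset that is missing from the printed list, or that lists fewer than two files for paired-end data, points to a directory that does not match the expected hierarchy.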

## Preparing a small dataset

In the directory `example_HIV_data` you will find a small test dataset that you can run on your workstation or laptop.
The files have the following structure:

```
samples
├───CAP217
│   └───4390
│       └───raw_data
│           ├───readsEnv_R1.fastq
│           └───readsEnv_R2.fastq
└───CAP188
    ├───4
    │   └───raw_data
    │       ├───readsEnv_R1.fastq
    │       └───readsEnv_R2.fastq
    └───30
        └───raw_data
            ├───readsEnv_R1.fastq
            └───readsEnv_R2.fastq
```

## Install V-pipe

V-pipe uses the [Bioconda](https://bioconda.github.io/) bioinformatics software repository for all its pipeline components. The pipeline itself is implemented using [Snakemake](https://snakemake.readthedocs.io/en/stable/).

For advanced users: If you are fluent with these tools, you can:

* directly download and install [bioconda](https://bioconda.github.io/user/install.html) and [snakemake](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#installation-via-conda),
* specify your V-pipe configuration, and start using V-pipe

Use `--use-conda` to [automatically download and install](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#integrated-package-management) any further pipeline dependencies. Please refer to the documentation for additional instructions.

In this tutorial you will learn how to set up a workflow for the example dataset.

To deploy V-pipe, you can use the installation script with the following parameters:

```bash
curl -O 'https://raw.githubusercontent.com/cbg-ethz/V-pipe/master/utils/quick_install.sh'
bash quick_install.sh -p testing -w work
```

Note that

* using `-p` specifies the subdirectory where to download and install snakemake and V-pipe
* using `-w` will create a working directory and populate it. It will copy over the references and the default `config/config.yaml`, and create a handy `vpipe` short-cut script to invoke `snakemake`.


If you get `zsh: permission denied: ./quick_install.sh`, run `chmod +x quick_install.sh` to grant the necessary permissions.

Tip: To create and populate other new working directories, you can call `init_project.sh` from within the new directory:

```console
mkdir -p working_2
cd working_2
../V-pipe/init_project.sh
```


## Preparation

Move the samples directory you created in the step "Preparing a small dataset" into the `resources` subdirectory of this working directory. You can display the directory structure with `tree samples` or `find samples`.

```bash
mv ./samples ./testing/work/resources/
```

### Reference
If you have a reference sequence that you would like to use for read mapping and alignment, add it as `resources/reference/ref.fasta`. In our case, however, we will use the reference sequence HXB2 already provided by V-pipe in `V-pipe/resources/hiv/HXB2.fasta`.

### Preparing V-pipe's configuration

In the `work` directory you can find the file `config.yaml`. This is where the V-pipe configuration should be specified. See [here](https://github.com/cbg-ethz/V-pipe/tree/master/config#readme) for the documentation of the configuration. In this tutorial we are building our own configuration, so `virus_base_config` remains empty. Since we are working with HIV-1, V-pipe provides meta information that is used for visualisation (`metainfo_file` and `gff_directory`).

```yaml
general:
  virus_base_config: ''
  aligner: "bwa"
  snv_caller: "shorah"
  haplotype_reconstruction: "haploclique"

input:
  reference: "{VPIPE_BASEDIR}/../resources/hiv/HXB2.fasta"
  metainfo_file: "{VPIPE_BASEDIR}/../resources/hiv/metainfo.yaml"
  gff_directory: "{VPIPE_BASEDIR}/../resources/hiv/gffs/"
  datadir: "{VPIPE_BASEDIR}/../../work/resources/samples"
  read_length: 301
  samples_file: samples.tsv
  paired: true

snv:
  consensus: false

output:
  snv: true
  local: true
  global: true
  visualization: true
  QA: false
  diversity: true
```

Note: YAML files use spaces for indentation. You can use 2 or 4 spaces, but no tabs. There are also online YAML validators that you might want to use if your YAML file is wrongly formatted.
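
A quick local check for the most common mistake, tab characters in the indentation, can be sketched in a few lines of Python. The helper `find_tab_indented_lines` is hypothetical and only looks at leading whitespace; it does not validate the YAML itself.

```python
def find_tab_indented_lines(text: str) -> list[int]:
    """Return 1-based numbers of lines whose leading whitespace contains a tab."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            bad.append(number)
    return bad


good_config = 'general:\n  aligner: "bwa"\n'
bad_config = 'general:\n\taligner: "bwa"\n'

print(find_tab_indented_lines(good_config))  # → []
print(find_tab_indented_lines(bad_config))   # → [2]
```

To check a real file, the same function could be applied to the text read from `config.yaml`.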

## Running V-pipe


Before running the analysis, check what will be executed:

```bash
cd ./testing/work/
./vpipe --dryrun
```

As this is your first run of V-pipe, it will also generate the sample collection table. Check `samples.tsv` in your editor.

Note that the samples you have downloaded have reads of length 301 only. V-pipe’s default parameters are optimized for reads of length 250. To adapt to the read length, add a third column in the tab-separated file as follows:

```bash
cat ./testing/work/samples.tsv
CAP217 4390 301
CAP188 4 301
CAP188 30 301
```
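
For more than a handful of samples, the third column can also be added programmatically. The sketch below assumes a tab-separated file with the sample and dataset names in the first two columns, as shown above; the function name `add_read_length` is hypothetical, and the example works on an in-memory string rather than on the real `samples.tsv`.

```python
import csv
import io


def add_read_length(tsv_text: str, read_length: int) -> str:
    """Append (or overwrite) a third column holding the read length."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue  # skip empty lines
        writer.writerow(row[:2] + [str(read_length)])
    return out.getvalue()


original = "CAP217\t4390\nCAP188\t4\nCAP188\t30\n"
print(add_read_length(original, 301), end="")
```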

Always check the content of the `samples.tsv` file.

If you did not use the correct directory structure, this file might end up empty or some entries might be missing.
You can safely delete it and re-run with the option `--dryrun` to regenerate it.

Finally, we can run the V-pipe analysis (the necessary dependencies will be downloaded and installed in conda environments managed by snakemake):

```bash
cd ./testing/work/
./vpipe -p --cores 2
```


## Output

The Wiki contains an overview of the output files. The output of the SNV calling step is aggregated in a standard [VCF](https://en.wikipedia.org/wiki/Variant_Call_Format) file, located in `samples/{hierarchy}/variants/SNVs/snvs.vcf`. You can open it with your favorite VCF tools for visualisation or downstream processing. It is also available in a tabular format in `samples/{hierarchy}/variants/SNVs/snvs.csv`.
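
If no dedicated VCF tool is at hand, the fixed columns of a VCF file can be read with plain Python. The following is a minimal sketch that relies only on the eight mandatory VCF columns; the example records are made up, and the helper `read_vcf_records` is hypothetical. For a real run, the lines would come from the `snvs.vcf` file instead.

```python
def read_vcf_records(lines):
    """Yield one dict per VCF data line using the eight fixed VCF columns."""
    columns = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip meta-information and header lines
        record = dict(zip(columns, line.split("\t")))
        record["POS"] = int(record["POS"])
        yield record


# Made-up example content, not actual pipeline output.
example_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "HXB2\t2500\t.\tA\tG\t.\tPASS\t.\n",
    "HXB2\t2710\t.\tC\tT\t.\tPASS\t.\n",
]
records = list(read_vcf_records(example_vcf))
print([(r["POS"], r["REF"], r["ALT"]) for r in records])
```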


## Swapping components

The default configuration uses ShoRAH to call the SNVs and to reconstruct the local (windowed) haplotypes.

Components of the pipeline can be swapped simply by changing the `config.yaml` file. For example, to call SNVs using lofreq instead of ShoRAH, use:

```yaml
general:
  snv_caller: lofreq
```
25 changes: 25 additions & 0 deletions resources/auxiliary_workflows/benchmark/README.md
@@ -0,0 +1,25 @@
# Quasispecies Reconstruction Benchmark

Benchmark quasispecies assembly methods at the level of both local and global haplotypes.

## Usage

To run the workflow, execute the following:

```bash
# locally (remove docker part if on linux)
docker run --rm -v $PWD:/foo --workdir=/foo snakemake/snakemake:stable snakemake -prj1 --use-conda

# on cluster
./run_workflow.sh
```

## Adding new methods

To run a new method/tool as part of the benchmark workflow, add a script to `resources/method_definitions/`.
Each script must be classified as either `local` (produces a VCF file) or `global` (produces a FASTA file) by adding `# GROUP: local` or `# GROUP: global` respectively.
Method dependencies can be specified as comments.
Conda packages can be added by writing `# CONDA: <package name> = <version>`.
Analogously, PIP packages can be added by writing `# PIP: <package name>`.
Multiple packages can be added by repeating these lines.
A conda environment will then be dynamically generated (when running Snakemake with `--use-conda`).
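
How such comment declarations could be turned into an environment description is sketched below. This is not the workflow's actual implementation: the helper names, the channel list and the example package versions are assumptions made for illustration. Only the `# GROUP:`, `# CONDA:` and `# PIP:` comment syntax is taken from the description above.

```python
import re


def parse_method_header(script_text: str) -> dict:
    """Collect GROUP, CONDA and PIP declarations from comment lines."""
    info = {"group": None, "conda": [], "pip": []}
    for line in script_text.splitlines():
        match = re.match(r"#\s*(GROUP|CONDA|PIP):\s*(.+)", line.strip())
        if not match:
            continue
        key, value = match.group(1), match.group(2).strip()
        if key == "GROUP":
            info["group"] = value
        elif key == "CONDA":
            # "name = version" becomes conda's "name=version" spec
            info["conda"].append(re.sub(r"\s*=\s*", "=", value))
        else:
            info["pip"].append(value)
    return info


def to_conda_env(info: dict) -> dict:
    """Build a dict mirroring the layout of a conda environment file."""
    dependencies = list(info["conda"])
    if info["pip"]:
        dependencies += ["pip", {"pip": list(info["pip"])}]
    # The channel list is an assumption, not taken from the workflow.
    return {"channels": ["conda-forge", "bioconda"], "dependencies": dependencies}


# Hypothetical method script header; package names and versions are made up.
example_script = """\
# GROUP: global
# CONDA: samtools = 1.15
# PIP: biopython
print("method body would go here")
"""
header = parse_method_header(example_script)
print(header)
print(to_conda_env(header))
```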
13 changes: 13 additions & 0 deletions resources/auxiliary_workflows/benchmark/config/config.yaml
@@ -0,0 +1,13 @@
# `null` will execute all methods
# list of strings will execute selected methods
method_list: [haploclique, quasirecomb, predicthaplo, haploconduct, cliquesnv]

replicate_count: 10

haplotype_generation: distance # distance or mutation_rate

params_path: config/params.csv

master_seq_path: null
# if null, then generate MasterSequence by drawing bases
# uniformly at random using the user-provided genome length
10 changes: 10 additions & 0 deletions resources/auxiliary_workflows/benchmark/config/params.csv
@@ -0,0 +1,10 @@
# for distance mode:
# haplos = n_group1,n_group2,d_group12,d_group1,d_group2,freq_dist,freq_param
# for mutation mode:
# haplos = mutation_rate,insertion_rate,deletion_rate,haplotype_pattern
# parameters are separated with "@"
seq_tech,seq_mode,seq_mode_param,read_length,genome_size,coverage,haplos
illumina,shotgun,,240,1000,100,5@5@20@10@6@[email protected]
# illumina,amplicon,400:100,200,1000,100,5@5@20@10@6@dirichlet@1:1:1:1:1:1:1:1:1:1
# illumina,shotgun,,240,1000,100,0.1@0@[email protected]:0.4
# illumina,amplicon,400:100,200,1000,100,0.1@0@[email protected]:0.4
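
The layout of `params.csv` (rows commented out with `#`, and a `haplos` column whose values are joined with `@`) can be read with the Python standard library as sketched below. This is an illustration under stated assumptions, not the benchmark workflow's own parser: the helper names are hypothetical, and the field names for distance mode are taken from the comment at the top of the file.

```python
import csv

# Field order for distance mode, as listed in the file's header comment.
DISTANCE_FIELDS = [
    "n_group1", "n_group2", "d_group12", "d_group1", "d_group2",
    "freq_dist", "freq_param",
]


def read_params(lines):
    """Parse params.csv content, ignoring empty rows and rows starting with '#'."""
    active = [
        line for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    ]
    return list(csv.DictReader(active))


def split_haplos(haplos: str) -> dict:
    """Split a distance-mode `haplos` value on '@' into named fields."""
    return dict(zip(DISTANCE_FIELDS, haplos.split("@")))


example = [
    '# parameters are separated with "@"',
    "seq_tech,seq_mode,seq_mode_param,read_length,genome_size,coverage,haplos",
    "illumina,amplicon,400:100,200,1000,100,5@5@20@10@6@dirichlet@1:1:1:1:1:1:1:1:1:1",
    "# illumina,shotgun,,240,1000,100,this@row@is@commented@out",
]
rows = read_params(example)
print(rows[0]["seq_mode"], split_haplos(rows[0]["haplos"]))
```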
@@ -0,0 +1,5 @@
method_list: [lofreq_local_haplo, shorah_default, shorah_mfa_qualities_unique_uniform, shorah_mfa_s1_a0.000001_relaxConv_uniform]
replicate_count: 1
haplotype_generation: mutation_rate
params_path: config_amplicon/params.csv
master_seq_path: /Users/lfuhrmann/Documents/Projects/V-pipe/resources/hiv/HXB2.fasta
@@ -0,0 +1,7 @@
seq_tech,seq_mode,seq_mode_param,read_length,genome_size,coverage,haplos
illumina,amplicon,400:100,200,9720,1000,0.1@0@[email protected]:0.2:0.05:0.05
#illumina,amplicon,400:100,200,9720,1000,0.01@0@[email protected]:0.2:0.05:0.05
#illumina,amplicon,400:100,200,9720,1000,0.001@0@[email protected]:0.2:0.05:0.05
#illumina,amplicon,400:10,200,9720,1000,0.1@0@[email protected]:0.2:0.05:0.05
#illumina,amplicon,400:10,200,9720,1000,0.01@0@[email protected]:0.2:0.05:0.05
#illumina,amplicon,400:10,200,9720,1000,0.001@0@[email protected]:0.2:0.05:0.05
@@ -0,0 +1,5 @@
method_list: [lofreq_local_haplo, shorah_default_amplicon, shorah_mfa_qualities_unique, shorah_mfa_s1_a0.000001_relaxConv]
replicate_count: 1
haplotype_generation: distance
params_path: config_distance/params.csv
master_seq_path: null
@@ -0,0 +1,5 @@
seq_tech,seq_mode,seq_mode_param,read_length,genome_size,coverage,haplos
illumina,shotgun,,201,201,100,5@5@20@5@10@[email protected]
illumina,shotgun,,201,201,1000,5@5@20@5@10@[email protected]
illumina,shotgun,,201,201,100,5@20@10@5@5@[email protected]
illumina,shotgun,,201,201,1000,5@20@10@5@5@[email protected]
@@ -0,0 +1,5 @@
method_list: [lofreq_local_haplo, shorah_default, shorah_mfa_qualities_unique_uniform, shorah_mfa_s1_a0.000001_relaxConv_uniform, shorah_mfa_stick_qualities_unique_uniform, shorah_mfa_qualities_unique, shorah_mfa_s1_a0.000001_relaxConv]
replicate_count: 1
haplotype_generation: distance
params_path: config_longreads/params.csv
master_seq_path: ../../hiv/HXB2.fasta
@@ -0,0 +1,4 @@
seq_tech,seq_mode,seq_mode_param,read_length,genome_size,coverage,haplos
pacbio,shotgun,,9720,9720,500,5@5@200@50@100@[email protected]
pacbio,shotgun,,9720,9720,1000,5@5@200@50@100@[email protected]
pacbio,shotgun,,9720,9720,2000,5@5@200@50@100@[email protected]
@@ -0,0 +1,5 @@
method_list: [lofreq_local_haplo, shorah_default_amplicon, shorah_mfa_qualities_unique, shorah_mfa_s1_a0.000001_relaxConv]
replicate_count: 1
haplotype_generation: mutation_rate
params_path: config_mutationrate/params.csv
master_seq_path: null
@@ -0,0 +1,5 @@
seq_tech,seq_mode,seq_mode_param,read_length,genome_size,coverage,haplos
illumina,shotgun,,201,201,100,0.1@0@[email protected]:0.2:0.05:0.05
illumina,shotgun,,201,201,100,0.01@0@[email protected]:0.2:0.05:0.05
illumina,shotgun,,201,201,1000,0.1@0@[email protected]:0.2:0.05:0.05
illumina,shotgun,,201,201,1000,0.01@0@[email protected]:0.2:0.05:0.05
@@ -0,0 +1,5 @@
method_list: [lofreq_local_haplo, shorah_default, shorah_mfa_qualities_unique_uniform, shorah_mfa_s1_a0.000001_relaxConv_uniform]
replicate_count: 1
haplotype_generation: null
params_path: config_realdata/params.csv
master_seq_path: null
@@ -0,0 +1,5 @@
seq_tech,seq_mode,seq_mode_param,read_length,genome_size,coverage,haplos
#illumina,real_data,5-virus-mix@1,,,,
#illumina,real_data,[email protected],,,,
illumina,real_data,[email protected],,,,
illumina,real_data,[email protected],,,,
@@ -0,0 +1,11 @@
# `null` will execute all methods
# list of strings will execute selected methods
method_list: [lofreq_local_haplo, shorah_mfa_qualities_unique_uniform, shorah_mfa_s1_a0.000001_relaxConv_uniform]

replicate_count: 1

haplotype_generation: null # distance or mutation_rate

params_path: config_realdata_CoV2/params.csv

master_seq_path: null
@@ -0,0 +1,5 @@
seq_tech,seq_mode,seq_mode_param,read_length,genome_size,coverage,haplos
illumina,amplicon,real_data,2-SARS-CoV-2-mix,C2_Wild_10_03,0.25,
illumina,amplicon,real_data,2-SARS-CoV-2-mix,E1_Wild_50_02,0.25,
illumina,amplicon,real_data,2-SARS-CoV-2-mix,G1_Wild_100_01,0.25,
illumina,amplicon,real_data,2-SARS-CoV-2-mix,H1_Wild_100_02,0.25,
@@ -0,0 +1,2 @@
__default__:
  - "-R \"select[model==EPYC_7H12]\""
@@ -0,0 +1,16 @@
#!/usr/bin/env bash

bsub \
  -N \
  -R 'rusage[mem=5000]' \
  -W 120:00 \
  -oo snake.out -eo snake.err \
  snakemake \
    --profile lsf \
    --rerun-incomplete \
    -pr \
    --cores 200 \
    --use-conda \
    --latency-wait 30 \
    --show-failed-logs \
    "$@"