UPDATED for R9 refactor Updated 12/19/16 - Working to integrate new matrichor base caller, and multi-contig reference sequences
SignalAlign is a hidden Markov model (HMM) software package for aligning the ionic current signal from the Oxford Nanopore Technologies (ONT) MinION to a reference sequence and inferring properties of the reference sequence.
- Recursively clone this repo
git clone --recursive https://github.com/ArtRand/signalAlign.git
- Make a virtual environment in the directory
virtualenv venv
- Activate it
. venv/bin/activate
- Pip install the requirements
pip install -r requirements.txt
- run
make
- Test the installation by running
make test
- All of the programs can be found in the
/bin/
directory - If python can't find the modules in the repo, add the directory to the PYTHONPATH like so:
export PYTHONPATH="$PYTHONPATH:$(pwd)"
- Install
cd bin
- Run:
./runSignalAlign -d /path/to/minionReads/ -r /path/to/reference.fasta -o /path/to/output/ 2> /dev/null
- If you want to try the test E. coli files run:
./runSignalAlign -d ../tests/minion_test_reads/ecoli/ -r ../tests/test_sequences/E.coli_K12.fasta -o ../tests/ 2> ../tests/test_run.err
- runSignalAlign
- Aligns ionic current events from a directory of basecalled MinION reads (.fast5) to a reference sequence. With appropriate inputs, characters in the reference sequence can be flagged as ambiguous, meaning multiple bases will be probabilistically aligned to a position. Right now, the program the program supports aligning cytosine variants (5-mC and 5-hmC) and adenine variants (6-mA) to a position.
- trainModels
- Trains the transitions and/or emissions of the HMM. Uses a directory of basecalled reads and a reference sequence to learn transition parameters to the model. Once enough assignments have been accumulated it will (optionally) rebuild a hierarchical Dirichlet process (HDP) from these assignments.
- hdp_pipeline
- Recommended method for building HDP models. Uses a directory of alignments to make a new HDP model, also see the pipeline examples for detailed instructions.
Required
- A directory of MinION reads,
-d
, (*.fast5) that have been basecalled. Currently, the following versions of Metrichor basecalled files are supported: 1.15.0, 1.19.0, 1.20.0, 1.22.2, 1.22.4. If you have a more recent or unsupported version open an issue and I'll modify the program (or feel free to implement it yourself and issue a pull request!). - A reference sequence,
-r
, in FASTA format. - Output location,
-o
, a path to use as working directory. A new directory will be made here, so the program won't pollute this directory.
Optional
- A file containing trained HMM transitions parameters
-T
(template HMM)-C
(complement HMM). - A file containing a HDP model or other emissions (normal distributions) parameters
-tH
(template HDP)-cH
(complement HDP). - Target regions file. Only reads that map to these regions will follow on to event-alignment
-q
- Ambiguity positions file,
-p
flags positions to be aligned to multiple bases (variant calling).
--in_tempalte_hmm
,-T
template HMM parameters file--in_complement_hmm
,-C
complement HMM parameters file--template_hdp
,-tH
template HDP model file--complement_hdp
,-cH
complement HDP model file--degenerate
,-x
nucleotide options for degenerate or ambiguous positions.variant
= {A,C,G,T}cytosine2
= {CE}cytosine3
= {CEO}adenosine
= {AI}. n.b. E = 5-methylcytosine, O = 5-hydroxymethylcytosine, I = 6-methyladenine--stateMachineType
,-smt
HMM to use. Options:threeState
andthreeStateHdp
. Default:threeState
.--file_of_files
,-fofn
a file containing the absolute path to files to align with, one file path per line--threshold
,-t
. Minimum posterior match probability threshold (matches below this threshold will not be tabulated). Default: 0.01.--diagonalExpansion
,-e
Mumber of diagonals to expand around each anchor, Default: 50.--constraintTrim
,-m
Amount to remove from an anchor constraint. Default: 14.--target_regions
,-q
File containing target regions to align to, if the read doesn't get mapped to a region in this file the signal-level alignment will not precede. The format isstart \t end \t kmer
wherestart
andend
are the genomic coordinates and thekmer
is the sequence at those coordinates. Thekmer
is mostly historical, so it doesn't have to be perfect. See below for details.--ambiguity_positions
,p
file containing positions to flag as ambiguous see below for details.--jobs, -j
Number of jobs to run concurrently, Default: 4.--nb_files
,-n
Maximum number of files to align, will randomly select if you point to a directory/fofn with more files than this so if you want to make sure you use all of the files set a large number (or count the lines in your fofn). Default: 500.--ambig_char
,-X
in development, will be for specifying specific subsets of ambiguous positions leave as default--output_format
,-f
format of output (this is different than the summary that goes tostdout
). Options are:full
,variantValler
, andassignments
, see below for description.--output_location
,-o
place to put results, it will make a new working directory in this location--error_correct
, not implemented, yet.
There are three output formats. full
, variantCaller
, and assignments
. Each read that aligns to the target regions (if you specified that option) will come as a separate file.
full
has the following tab-separated-format:
Contig | Reference Index | Reference k-mer | Read File | Strand | Event Index | Event Mean | Event Noise | Event Duration | Aligned k-mer | Scaled Mean Current | Scaled Noise | Posterior Probability | Descaled Event Mean | Model (ONT) Mean | Path k-mer |
---|
variantCaller
has the following tab-separated format:
Event Index | Reference Position | Base | Posterior Probability | Strand | Forward Mapped | Read File |
---|
finally, assignments
has the following tab-separated format:
k-mer | Read File | Descaled Event Mean | Posterior Probability |
---|
Required
- A directory of MinION reads,
-d
, (*.fast5) that have been basecalled. - A reference sequence,
-r
, in FASTA format. - Output location,
-o
, a path to use as working directory. A new directory will be made here, so the program won't pollute this directory.
Optional
- A file containing trained HMM transitions parameters
-T
(template HMM)-C
(complement HMM). - A file containing a HDP model or other emissions (normal distributions) parameters
-tH
(template HDP)-cH
(complement HDP).
--stateMachineType
,-smt
HMM to use. Options:threeState
andthreeStateHdp
. Default:threeState
.--file_of_files
,-fofn
a file containing the absolute path to files to align with, one file path per line--iterations, -i
number of iterations to perform. Default: 10.--train_amount, -a
batch size in bases. Default: 15 kb (15000)--emissions
flag, train emissions. Default: false (see note below about training HDP emissions)--transitions
flag, train transitions--threshold, -t
minimum posterior match probability threshold (matches below this threshold will not be tabulated). Default: 0.01. If you're just training the transitions (recommended) this won't matter.--diagonalExpansion, -e
number of diagonals to expand around each anchor, Default: 50.--constraintTrim, -m
amount to remove from an anchor constraint. Default: 14.--in_tempalte_hmm
,-T
template HMM parameters file--in_complement_hmm
,-C
complement HMM parameters file--template_hdp
,-tH
template HDP model file--complement_hdp
,-cH
complement HDP model file--jobs, -j
Number of jobs to run concurrently, Default: 4.--test
used for internal CI test--ambiguity_positions
,p
file containing positions to flag as ambiguous see below for details.--ambig_char
,-X
base to substitute in at positions indicated in positions file this is used for supervised learning, we change the base at the positions for signal-level alignment. example-X C E
would mean substitute the positions in the first batch of files withC
and substitute the second batch withE
. This way we can train the model on multiple base modifications at once. HDP training options It's recommended to use the pipeline described below instead of these options.--verbose
flag, verbose Gibbs sampling output, default: False--samples, -s
number of samples to collect during Gibbs sampling--thinning, -th
number of samples to thin (pick everyx
samples)--min_assignments
Do not initiate Gibbs sampling unless this many assignments have been accumulated
Trained models, within the working /tempFiles_expectations/
will have:
template_trained.hmm
template HMM with trained parameterscomplement_trained.hmm
complement HMM with trained parameters If HDPs were used, a copy of the input HDP will also be here, they have.nhdp
suffix.
- Recommended use a build alignment. There are scripts in the /signalAlign/scripts folder and examples in the HDP_models repo to do this.
- Alternatively, you can use alignments in
full
format. In the case of using plain alignments,-C
,-mC
, and-hmC
will globally change all of the cytosine bases to methyl or hydroxy-methyl cytosines. You may specify just-C
if you don't want to label methylated cytosines. This can be used for canonical base training.
--number_of_assignments
,-n
number of k-mer/event mean assignments to collect for each new label (C, mC, hmC).--build_alignment
a concatenated (multi read) alignment in thefull
format to extract assignments from. This is the recommended way to use this program.--threshold, -t
only take assignments (from the alignment) with posterior probability >= to this--hdp_type
HDP type Options:Prior
,Fixed
,twoWay
Default:Prior
.twoWay
is a Prior-type model that uses a gamma-distribution over the hyperparameters (concentration parameters) of the HDP.--template_model, -tM
Template lookup table for priors on k-mers--complement_model, -cM
Template lookup table for priors on k-mers-B
,-M
,-L
Base, middle, and leaf concentration parameters, respectively. Default: 1-Ba
,-Bb,
,-Ma
,-Mb
,-La
,-Lb
Base alpha(a
)/beta(b
) parameters for gamma distribution prior over concentration parameters. Default: 1.--samples, -s
number of samples to collect during Gibbs sampling. Default: 10000.--thinning, -th
number of samples to thin (pick everyx
samples). Default: 100.--verbose
flag, verbose Gibbs sampling output, default: False.--grid_start
sampling grid start (in pA). Default: 30 pA.--grid_end
sampling grid end (in pA). Default: 90 pA.--grid_length
number of nodes in the sampling grid. Default 1200.--out, -o
output location
Trained HDP models. Script will make all of the models for 3-way classification and the singleLevel and multiset models for 2-way classification.
Example command:
./hdp_pipeline --build_alignment=../../HDP_models/ecoli_models/buildAlignment_PCR_andLabeled_b1.tsv \
> -tM ../models/testModel_template.model \
> -cM ../models/testModel_complement.model \
> -Ba 1 -Bb 1 -Ma 1 -Mb 1 -La 1 -Lb 1 -s 15000 --verbose --hdp_type twoWay \
> -o ./hdp/
2> ./hdp/a.err
signalAlign uses two files to map methylation in a reference sequence. First it uses a target file that specifies positions in the reference, only reads that map to regions specified in the file will be aligned on the signal-level. The target file has format: start \t end \t sequence
. An example can be seen here:/signalAlign/tests/test_regions/test_sites_bal_1.tgt
. The second file is a label file that tells signalAlign which bases to flag as ambiguous. This file has the format X \t position \t ... \n
one line for both the template and complement strand. An example is at /signalAlign/tests/test_regions/test_labels_bal_1.tsv
.
An example command that would produce methylation call probabilities in E. coli (n.b. to run this example, download the HDP models from the HDP_models repo)
./runSignalAlign \
> -d ../tests/minion_test_reads/ecoli/ \
> -r ../tests/test_sequences/E.coli_K12.fasta \
> -T ../../HDP_models/ecoli_r7.3_models/template_trained.hmm \
> -C ../../HDP_models/ecoli_r7.3_models/complement_trained.hmm \
> -tH ../../HDP_models/ecoli_r7.3_models/template.multisetPrior2.nhdp \
> -cH ../../HDP_models/ecoli_r7.3_models/complement.multisetPrior2.nhdp \
> -x cytosine2 \
> -smt=threeStateHdp \
> -q ../tests/test_regions/test_sites_bal_1.tgt \
> -p ../tests/test_regions/test_labels_bal_1.tsv \
> -f variantCaller \
> -o ../../ \
> 2> ../../a.err
Pipeline scripts that correctly run all of these programs can be found at the signalAlign_notebook repo. Specifically, BuildModels.py
takes as input the pcr-amplified reads, genomic reads, a reference, and will run the full EM pipeline on all GATC
or CCWGG
positions in your reference. There are, of course, other options too.