Single Nucleotide Polymorphism finder

Given a reference nucleotide sequence $R$ and a long read $r$, this algorithm finds - if it does exist - the position $i$ on $r$ such that $r$ matches a substring of $R$, except for $r_i$.

This solution was built on top of the concept of Spaced Seed.

Compilation

Install unordered_dense;
compile ntHash by setting as prefix the root of the repository (see ntHash docs for further information);

run the following:

mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ../
make

Usage

./spacedSeeds run <reference FASTA> <read FASTA>

There are some optional arguments which can be used to set custom parameters:

--k sets the length of the spaced seeds in the second phase of the algorithm
--firstK sets the length of the exact k-mers which will be extracted during the first part of the algorithm
--bloomFilterThreshold sets the probability of the bloom filter of encountering a false positive
--firstThreshold sets the threshold under which the areas are not taken into account for the second part of the algorithm.

How does it work

The algorithm is designed to follow these steps:

Approximate identification of possible areas of match. This step is crucial for the algorithm performance, since it allows to exclude almost all of $R$ from the next step, improving drastically the speed on real data, requiring much less memory and time. It makes use of a bloom filter containing informations on the existence or not of the $R$ exact k-mers. This step requires $\mathcal{O}(|R|)$ time. In more detail, it proceeds as follows:
1. $R$ is split into subsequences of length $\left\lceil \frac{|r|}{2} \right\rceil$
2. A bloom filter containing the exact k-mers $K_r$ of $r$ is constructed
3. For each partition $p$ of $R$ with exact k-mers $K_p$, the metric $\textrm{sim}(b, r) = \frac{| K_p \cap K_r |}{|K_p|}$ is computed and it is decided whether to consider it or not a candidate partition by using a threshold $\tau$.
Matching between $r$ and the candidate areas of $R$. At this moment, each candidate partition is considered until a match is found. More specifically, the algorithm works as follows:
1. The hashes of the spaced seeds of $r$ are extracted. Not all the possible seeds are requested to be saved since they would represent redundant information, hence only the ones having the "do not care" symbol are considered for the extraction, except for the last $k$ ones, which are needed to complete the set of possible match positions.
2. An index for $r$ is constructed
3. For each partition $i$ of $R$ with similarity $\geq \tau$, an index representing $i$ and the adjacent positions is constructed and used to find the match position by means of the seed hashes contained in both the indices of $i$ and $r$. Note that only the spaced seeds in common with $r$ are saved in the $i$ index. Moreover, the match visits the seeds from the least frequent in $i$ and adjacents to the most frequent.

Overall structure of the indices used during the matching phase.

Benchmarks

Benchmarks on reads with at least one match. Reference extracted from the GRCh38 genome.

Benchmarks on reads with no matches. Reference extracted from the GRCh38 genome.

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
bloom @ 7309d71		bloom @ 7309d71
img		img
ntHash @ 4e8a736		ntHash @ 4e8a736
src		src
.gitignore		.gitignore
.gitmodules		.gitmodules
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Single Nucleotide Polymorphism finder

Compilation

Usage

How does it work

Benchmarks

Credits

About

Releases

Packages

Languages

License

gabrielecanesi/DNA-SNP-finder

Folders and files

Latest commit

History

Repository files navigation

Single Nucleotide Polymorphism finder

Compilation

Usage

How does it work

Benchmarks

Credits

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages