-
Notifications
You must be signed in to change notification settings - Fork 9
9. Creating QIIME compatible reference databases
Download links are available at https://docs.qiime2.org/2023.9/data-resources
wget https://data.qiime2.org/2023.9/common/silva-138-99-seqs.qza
wget https://data.qiime2.org/2023.9/common/silva-138-99-tax.qza
qiime tools export --input-path silva-138-99-seqs.qza --output-path silva138.99.seq
qiime tools export --input-path silva-138-99-tax.qza --output-path silva138.99.tax
Format of silva138.99.tax/taxonomy.tsv
:
Feature ID Taxon
CP013078.2406498.2408039 d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Burkholderiales; f__Alcaligenaceae; g__Bordetella; s__Bordetella_pertussis
CP015924.1224168.1225721 d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Salmonella; s__Salmonella_enterica
month="Jan2025"
cat mitofish.12S."$month".tsv | awk -F "\t" '{OFS="#"}{print $1,$11}' | grep -v "Accession" | grep -v "Sequence" | sed "s/^/>/" | tr "#" "\n" >mitofish.12S."$month".fasta
Some mitochondrial genomes contain more than 1 copy of the 12S rRNA gene. However, RESCRIPt does not process FASTA files with duplicated sequence headers. We use a python script fasta_dedup.py
to deduplicate the header lines of mitofish.12S."$month".fasta
by adding a numeric label to each duplicated header (e.g. _1). fasta_dedup.py
is in the db.scripts folder of the mitohelper repository
#!/usr/bin/env python
import os, glob
from os import path
import re
seen=set()
month=str("Jan2025")
with open("mitofish.12S."+month+".fasta", 'r') as fasta:
outfile=str("mitofish.12S."+month+"_NR.fasta")
output=open(outfile,'w')
for line in fasta.readlines():
if re.search(">",line):
header=line.rsplit("\n")
seqheader=header[0]
count=1
if seqheader in seen:
output.write(seqheader+"_"+str(count))
output.write("\n")
count=count+1
if seqheader not in seen:
seen.add(seqheader)
output.write(seqheader)
output.write("\n")
else:
seq=line.rsplit("\n")
output.write(seq[0])
output.write("\n")
fasta.close()
output.close()
Example output with truncated sequences
>ON185661
CAAGAGGTTTGGTCCTAGCCTTAGTATTAATT
>ON185661_1
GGTCCTAGCCTTAGTATTAATTGTAACTAGAAT
Perform a quick check on the distribution of sequence lengths, ambiguous bases, and homopolymers using Mothur.
mothur > summary.seqs(fasta=mitofish.12S.Jan2025_NR.fasta,processors=8)
Using 8 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 52 52 0 3 1
2.5%-tile: 1 166 166 0 4 1524
25%-tile: 1 198 198 0 5 15236
Median: 1 625 625 0 6 30472
75%-tile: 1 950 950 0 6 45707
97.5%-tile: 1 1638 1638 0 9 59419
Maximum: 1 12294 12294 677 677 60942
Mean: 1 680 680 0 5
# of Seqs: 60942
It took 1 secs to summarize 60942 sequences.
Output File Names:
mitofish.12S.Jan2025_NR.summary
After deduplicating the fasta file, you can use an automated shell script
process_qiime2.sh
to create QIIME-compatible reference databasesprocess_qiime2.sh
is in the db.scripts folder of the mitohelper repository
- Extract list of accession numbers (feature IDs) from deduplicated fasta file
echo "Feature ID" >12S.featureID
grep ">" mitofish.12S."$month"_NR.fasta | tr -d ">" >>12S.featureID
- Get list of taxonomic names at each level from domain through species
cut -f6 mitofish.12S."$month".tsv | sed "s/^/d__Eukaryota; p__Chordata; c__/" | sed "s/$/;/" >12S.class
cut -f7 mitofish.12S."$month".tsv | sed "s/^/o__/" | sed "s/$/;/" >12S.order
cut -f8 mitofish.12S."$month".tsv | sed "s/^/f__/" | sed "s/$/;/" >12S.family
cut -f9 mitofish.12S."$month".tsv | sed "s/^/g__/" | sed "s/$/;/" >12S.genus
cut -f10 mitofish.12S."$month".tsv | sed "s/^/s__/" >12S.species
# Combine all taxonomic name columns
paste -d " " 12S.class 12S.order 12S.family 12S.genus 12S.species | sed "s/.*f__Family.*$/Taxon/" >12S.taxon
# Merge feature ID and taxon information
paste -d "\t" 12S.featureID 12S.taxon >12S.taxonomy.tsv
- Format of
12S.taxonomy.tsv
:
Feature ID Taxon
AB006953 d__Eukaryota; p__Chordata; c__Actinopteri; o__Cypriniformes; f__Cyprinidae; g__Carassius; s__Carassius langsdorfii
AB015962 d__Eukaryota; p__Chordata; c__Chondrichthyes; o__Carcharhiniformes; f__Triakidae; g__Mustelus; s__Mustelus manazo
Import sequences:
qiime tools import --type 'FeatureData[Sequence]' --input-path mitofish.12S.$month.fasta --output-path 12S-seqs.qza
Import taxonomy:
qiime tools import --type 'FeatureData[Taxonomy]' --input-path 12S.taxonomy.tsv --output-path 12S-tax.qza
For this, we use RESCRIPt, which is a python package and QIIME 2 plugin for formatting, managing, and manipulating sequence reference databases.
- First, remove sequences with >=5 ambiguous bases and homopolymers>=8bp long:
qiime rescript cull-seqs --i-sequences 12S-seqs.qza --o-clean-sequences 12S-seqs-cleaned.qza
- We skip the sequence length filtering step and dereplicate cleaned sequences in
uniq
mode to retain identical sequence records with differing taxonomies:
qiime rescript dereplicate --i-sequences 12S-seqs-cleaned.qza --i-taxa 12S-tax.qza --p-mode 'uniq' --o-dereplicated-sequences 12S-seqs-derep-uniq.qza --o-dereplicated-taxa 12S-tax-derep-uniq.qza
- Export cleaned and dereplicated 12S data:
qiime tools export --input-path 12S-seqs-derep-uniq.qza --output-path 12S.seq
qiime tools export --input-path 12S-tax-derep-uniq.qza --output-path 12S.tax
- Concatenate 12S+SILVA (16S & 18S) data:
head -1 12S.tax/taxonomy.tsv >tax.header
cat 12S.tax/taxonomy.tsv silva138.99.tax/taxonomy.tsv | grep -v "Feature ID" >SSU.taxonomy
cat tax.header SSU.taxonomy >SSU.taxonomy.tsv
cat 12S.seq/dna-sequences.fasta silva138.99.seq/dna-sequences.fasta >SSU.fasta
- Import 12S+SILVA (16S & 18S) data into QIIME 2
qiime tools import --type 'FeatureData[Sequence]' --input-path SSU.fasta --output-path 12S-16S-18S-seqs.qza
qiime tools import --type 'FeatureData[Taxonomy]' --input-path SSU.taxonomy.tsv --output-path 12S-16S-18S-tax.qza