Skip to content

9. Creating QIIME compatible reference databases

shenjean edited this page Jan 14, 2025 · 55 revisions

Download QIIME-compatible SILVA NR99 full-length sequences and taxonomy

Download links are available at https://docs.qiime2.org/2023.9/data-resources

wget https://data.qiime2.org/2023.9/common/silva-138-99-seqs.qza
wget https://data.qiime2.org/2023.9/common/silva-138-99-tax.qza
qiime tools export --input-path silva-138-99-seqs.qza --output-path silva138.99.seq
qiime tools export --input-path silva-138-99-tax.qza --output-path silva138.99.tax

Format of silva138.99.tax/taxonomy.tsv:

Feature ID	Taxon
CP013078.2406498.2408039	d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Burkholderiales; f__Alcaligenaceae; g__Bordetella; s__Bordetella_pertussis
CP015924.1224168.1225721	d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Salmonella; s__Salmonella_enterica

Generate FASTA file of 12S rRNA gene sequences from tab-separated reference file

month="Jan2025"
cat mitofish.12S."$month".tsv | awk -F "\t" '{OFS="#"}{print $1,$11}' | grep -v "Accession" | grep -v "Sequence" | sed "s/^/>/" | tr "#" "\n" >mitofish.12S."$month".fasta

Some mitochondrial genomes contain more than 1 copy of the 12S rRNA gene. However, RESCRIPt does not process FASTA files with duplicated sequence headers. We use a python script fasta_dedup.py to deduplicate the header lines of mitofish.12S."$month".fasta by adding a numeric label to each duplicated header (e.g. _1). fasta_dedup.py is in the db.scripts folder of the mitohelper repository

#!/usr/bin/env python

import os, glob
from os import path
import re

seen=set()
month=str("Jan2025")

with open("mitofish.12S."+month+".fasta", 'r') as fasta:
        outfile=str("mitofish.12S."+month+"_NR.fasta")
        output=open(outfile,'w')
        for line in fasta.readlines():
                if re.search(">",line):
                        header=line.rsplit("\n")
                        seqheader=header[0]
                        count=1

                        if seqheader in seen:
                                output.write(seqheader+"_"+str(count))  
                                output.write("\n")
                                count=count+1
        
                        if seqheader not in seen:
                                seen.add(seqheader)
                                output.write(seqheader)
                                output.write("\n")
                else:
                        seq=line.rsplit("\n")
                        output.write(seq[0])
                        output.write("\n")
        fasta.close()
        output.close()

Example output with truncated sequences

>ON185661
CAAGAGGTTTGGTCCTAGCCTTAGTATTAATT
>ON185661_1
GGTCCTAGCCTTAGTATTAATTGTAACTAGAAT

Perform a quick check on the distribution of sequence lengths, ambiguous bases, and homopolymers using Mothur.

mothur > summary.seqs(fasta=mitofish.12S.Jan2025_NR.fasta,processors=8)

Using 8 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	52	52	0	3	1
2.5%-tile:	1	166	166	0	4	1524
25%-tile:	1	198	198	0	5	15236
Median: 	1	625	625	0	6	30472
75%-tile:	1	950	950	0	6	45707
97.5%-tile:	1	1638	1638	0	9	59419
Maximum:	1	12294	12294	677	677	60942
Mean:	1	680	680	0	5
# of Seqs:	60942

It took 1 secs to summarize 60942 sequences.

Output File Names:
mitofish.12S.Jan2025_NR.summary

After deduplicating the fasta file, you can use an automated shell script process_qiime2.sh to create QIIME-compatible reference databases process_qiime2.sh is in the db.scripts folder of the mitohelper repository

Create 12S taxonomy table in SILVA format

  • Extract list of accession numbers (feature IDs) from deduplicated fasta file
echo "Feature ID" >12S.featureID
grep ">" mitofish.12S."$month"_NR.fasta | tr -d ">" >>12S.featureID
  • Get list of taxonomic names at each level from domain through species
cut -f6 mitofish.12S."$month".tsv | sed "s/^/d__Eukaryota; p__Chordata; c__/" | sed "s/$/;/" >12S.class
cut -f7 mitofish.12S."$month".tsv | sed "s/^/o__/" | sed "s/$/;/" >12S.order
cut -f8 mitofish.12S."$month".tsv | sed "s/^/f__/" | sed "s/$/;/" >12S.family
cut -f9 mitofish.12S."$month".tsv | sed "s/^/g__/" | sed "s/$/;/" >12S.genus
cut -f10 mitofish.12S."$month".tsv | sed "s/^/s__/" >12S.species

# Combine all taxonomic name columns
paste -d " " 12S.class 12S.order 12S.family 12S.genus 12S.species | sed "s/.*f__Family.*$/Taxon/" >12S.taxon
# Merge feature ID and taxon information
paste -d "\t" 12S.featureID 12S.taxon >12S.taxonomy.tsv
  • Format of 12S.taxonomy.tsv:
Feature ID	Taxon
AB006953	d__Eukaryota; p__Chordata; c__Actinopteri; o__Cypriniformes; f__Cyprinidae; g__Carassius; s__Carassius langsdorfii
AB015962	d__Eukaryota; p__Chordata; c__Chondrichthyes; o__Carcharhiniformes; f__Triakidae; g__Mustelus; s__Mustelus manazo

Import 12S sequences and taxonomy into QIIME 2

Import sequences:

qiime tools import --type 'FeatureData[Sequence]' --input-path mitofish.12S.$month.fasta --output-path 12S-seqs.qza

Import taxonomy:

qiime tools import --type 'FeatureData[Taxonomy]' --input-path 12S.taxonomy.tsv --output-path 12S-tax.qza

Dereplicate 12S sequences and taxonomy

For this, we use RESCRIPt, which is a python package and QIIME 2 plugin for formatting, managing, and manipulating sequence reference databases.

  • First, remove sequences with >=5 ambiguous bases and homopolymers>=8bp long:
qiime rescript cull-seqs --i-sequences 12S-seqs.qza --o-clean-sequences 12S-seqs-cleaned.qza
  • We skip the sequence length filtering step and dereplicate cleaned sequences in uniq mode to retain identical sequence records with differing taxonomies:
qiime rescript dereplicate --i-sequences 12S-seqs-cleaned.qza --i-taxa 12S-tax.qza --p-mode 'uniq' --o-dereplicated-sequences 12S-seqs-derep-uniq.qza --o-dereplicated-taxa 12S-tax-derep-uniq.qza

Combine dereplicated 12S data with SILVA data

  • Export cleaned and dereplicated 12S data:
qiime tools export --input-path 12S-seqs-derep-uniq.qza --output-path 12S.seq
qiime tools export --input-path 12S-tax-derep-uniq.qza --output-path 12S.tax
  • Concatenate 12S+SILVA (16S & 18S) data:
head -1 12S.tax/taxonomy.tsv >tax.header
cat 12S.tax/taxonomy.tsv silva138.99.tax/taxonomy.tsv | grep -v "Feature ID" >SSU.taxonomy
cat tax.header SSU.taxonomy >SSU.taxonomy.tsv
cat 12S.seq/dna-sequences.fasta silva138.99.seq/dna-sequences.fasta >SSU.fasta
  • Import 12S+SILVA (16S & 18S) data into QIIME 2
qiime tools import --type 'FeatureData[Sequence]' --input-path SSU.fasta --output-path 12S-16S-18S-seqs.qza
qiime tools import --type 'FeatureData[Taxonomy]' --input-path SSU.taxonomy.tsv --output-path 12S-16S-18S-tax.qza