Skip to content

npcooley/AAClassification

Repository files navigation

Accurate Annotation of Protein Sequences with IDTAXA

Table of contents:

Introduction

This repository contains scripts for reproducing figures in the manuscript Accurate Annotation of Protein Sequences with IDTAXA. For the sake of reproducibility, these scripts do not assume any particular parallelization schemes or schedulers. Processes are run sequentially within scripts to make them as broadly accessible as possible. These scripts should work in both Linux or Mac environments, and have not been tested on Windows.

All work for this manuscript was performed either on a 2019 MacBook Pro with a 2.6 GHz 6-core i7 processor with 64GB RAM, or run on the open science grid requesting nodes with at most 6 CPUs, 32GB of RAM and 4GB of disk.

These scripts are designed to work with input files that are present on zenodo. Outputs of these scripts used in the construction of figures are also present on zenodo.

Requirements

These scripts have a few minimal requirements:

  1. R 4.0.0 or greater
  2. DECIPHER 2.19.0 or greater
  3. BLAST 2.10.1 or greater
  4. HMMER 3.3.1 or greater
  5. R must have access to both HMMER and BLAST
  6. gridExtra is required to plot tables in the ErrorPlot.R script
  7. plotrix is required to plot figures in the CompareConf.R script

Usage

5 RScripts are provided in this repo, they should be run directly through the Rscript command in terminal and as long as R has access to HMMER and BLAST when called upon in this way, and the required packages are installed, they should run to completion. For reference the approximate run times of the three main scripts on a 2019 Macbook Pro with 6 cores and 64 GB of RAM in R 4.1.0, with HMMER 3.3.1 and BLAST 2.10.1, on the training data provided are:

  1. ICrossVal.R ~ 13 hours
  2. BCrossVal.R ~ for AAs 12-18 hours, for NTs 2-10 minutes (sw_traceback is used for AAs)
  3. HCrossVal.R ~ 48 hours+ - a large portion of this is model construction

These times are meant to serve only as a guidepost, your own mileage may vary.

The script ErrorPlot.R will take in results generated by any of these scripts and plot them in a manner similar to what is present in the manuscript. ErrorPlot.R will load in any RData file provided to it, but makes a few choices about the legends and colors for plotting based on what objects are present in the files, and the naming scheme of files.

These scripts were written to be minimally reproducible, they are not efficient computing. They do not assume anything beyond base R, DECIPHER v 2.19.0 (or greater), HMMER, and BLAST. All of these scripts can be sped up considerably by any number of parallelization regimes.

The cross validation (CV) scripts require an input with three objects:

> load(file = "<yourtrainingdata.RDATA>", verbose = TRUE)
Loading objects:
  Holdouts
  TaxVectors
  Seqs

Holdouts is a list of lists of integers, each integer represents a sequence to be used as a holdout. Holdouts were selected by randomly selecting a representative sequence to be held out from randomly selected labels within the taxonomy. This means that each label within the taxonomy can have at max 1 holdout. Holdouts were attempted to be forced to be as balanced as possible between singletons and non-singletons, but singletons at the KO level are rare in the KEGG database. A target of 1000 non-singletons and 1000 singletons was the goal for each CV, though at the KO level there are significantly fewer singleton holdouts per CV. Each lower level list is a full holdout set for cross validation, each upper level list represents all CVs for a given level of the assigned taxonomy of the sequences described in Seqs. In the provided data this taxonomy is:

  1. KEGG Orthology Group
  2. KEGG Taxonomy level 1
  3. KEGG Taxonomy level 2
  4. KEGG Taxonomy level 3
  5. KEGG Taxonomy level 4

Sequences that had incomplete taxonomic information supplied had the lowest supplied level replicated downward to enforce a square taxonomy. Because of the uneven distribution of labels, holdouts at the KO level are dominated by non-singletons and almost every singleton is included in one of the ten CVs. As singletons become more common down the taxonomic levels, the distribution of singleton / non-singleton eventually becomes 1:1 in each CV at 1000 sequences each, for 2000 holdout sequences per CV. Only the first and fifth taxonomy levels are presented within the manuscript.

> str(Holdouts[1])
List of 1
 $ :List of 10
  ..$ : int [1:1049] 2045 2406 3723 9374 15295 15410 15875 16192 17910 18895 ...
  ..$ : int [1:1049] 1898 2712 8640 10875 12091 13436 14336 14782 19256 21305 ...
  ..$ : int [1:1049] 742 1180 6153 6563 8558 8875 8959 9611 11743 13073 ...
  ..$ : int [1:1049] 164 696 1353 2107 3518 4409 4977 6200 7192 8297 ...
  ..$ : int [1:1049] 89 910 3607 5609 7284 7569 10100 11274 13751 14549 ...
  ..$ : int [1:1049] 1475 4173 6113 8002 12129 17409 17854 21543 24655 25440 ...
  ..$ : int [1:1049] 585 2511 3010 4267 6679 9075 10545 12898 13251 13519 ...
  ..$ : int [1:1049] 1286 3448 4587 5350 6371 7726 8384 8424 9276 10165 ...
  ..$ : int [1:1049] 852 3357 4339 5174 6851 8750 10309 10715 10952 11485 ...
  ..$ : int [1:1049] 3103 3262 5038 9571 11890 16417 18044 18635 19204 23481 ...

TaxVectors is a list of character vectors providing the taxonomic assignments for each sequence in Seqs at each taxonomic level. Technically it is redundant as TaxVectors[[1]] is a character vector of substrings of TaxVectors[[2]], etc, but the vectors are long enough that having them explicitly prebuilt is useful.

> str(TaxVectors)
List of 5
 $ : chr [1:1672354] "Root; K00844" "Root; K00844" "Root; K00844" "Root; K00844" ...
 $ : chr [1:1672354] "Root; K00844; Eukaryotes" "Root; K00844; Eukaryotes" "Root; K00844; Eukaryotes" "Root; K00844; Eukaryotes" ...
 $ : chr [1:1672354] "Root; K00844; Eukaryotes; Arthropods" "Root; K00844; Eukaryotes; Animals" "Root; K00844; Eukaryotes; Animals" "Root; K00844; Eukaryotes; Animals" ...
 $ : chr [1:1672354] "Root; K00844; Eukaryotes; Arthropods; Insects" "Root; K00844; Eukaryotes; Animals; Arthropods" "Root; K00844; Eukaryotes; Animals; Vertebrates" "Root; K00844; Eukaryotes; Animals; Arthropods" ...
 $ : chr [1:1672354] "Root; K00844; Eukaryotes; Arthropods; Insects; Insects" "Root; K00844; Eukaryotes; Animals; Arthropods; Chelicerates" "Root; K00844; Eukaryotes; Animals; Vertebrates; Reptiles" "Root; K00844; Eukaryotes; Animals; Arthropods; Insects" ...

Seqs is an AAStringSet or DNAStringSet of sequences, each with a unique name to use as a test set for these methods. Both AAs and NTs were used in this manuscript and are supplied for transparency. The supplied sequences were webscraped from KEGG along with all of their relevant metadata.

> Seqs
AAStringSet object of length 1672354:
          width seq                                                                 names               
      [1]   471 MVVPAEHALHEAIQVSPLILSDDVRRQKIENR...GRKFKLIHAEDGSGKGAALVSAIAHRLKKRLQ K00844.aec:105150471
      [2]   454 MSYSIELSKIEAITRNLVLTNDQLEKVASLLL...HPEYKFDLMLSEDGSGRGAALVAAVAVRQPTR K00844.tut:107367269
      [3]   889 MRLSDDVLLDVMKRFQAEMVKGLGRDTNPTAT...CDVTFLLSEDGSGKGAALITAVAKRMHSAAEH K00844.gja:107114311
      [4]   459 MQKYDPETDFPQAWKILKPFMLSDDALVTIRN...FEIIASDDSPGVGAAIVAGLTSTIKQKKIFIT K00844.dnv:115565483
      [5]   518 MEEENQAKRLFDLYEEYFFLSNLKLLELVDDF...ESGYLGSIHFYESDDGSGRGAAILASTTVNTH K00844.cpv:cgd6_3800
      ...   ... ...
[1672350]   315 MPKQTKHAFFGEPAGVADGDWFPSHRALYEAG...PGHAIDVTYVRQHRSRWASRDSAPDSPAGAHA K07454.strc:AA958...
[1672351]   239 MCGKEFETPIIKPKACKKFIADFGNTFIAAEK...FEGKKLRFIELKPSMFALEVRWRLFCRQKSQR K07454.csg:Cylst_...
[1672352]   259 MYTQKILQQKISNITIWRKGDNRAPHKPLLLL...FDMGAISIDDNMKLLVSSGVNGNQMVEQVFKL K07454.pmr:PMI3507
[1672353]   260 MQYFYAYHGPKNTQDFNYRNGYGVGQEWKIRQ...ILSESARLEPRYKDIHDTELRVPLERVNLSVK K07454.ler:GNG29_...
[1672354]   296 MDDNAWLERVARINQHRGGGERAPHKPLLILY...VRAPQLGQPRIADEHSAWHRDQVFRAPERVAG K07454.abai:IMCC2...
> head(names(Seqs))
[1] "K00844.aec:105150471" "K00844.tut:107367269" "K00844.gja:107114311" "K00844.dnv:115565483"
[5] "K00844.cpv:cgd6_3800" "K00844.onl:100693640"

Users may replicate the work performed in the manuscript either using the supplied original data or from the supplied cross validation results with only the plotting script.

ICrossVal.R

The script ICrossVal.R takes in training data that consists of sequences, taxonomies, and holdout designations and saves off an RData file that contains the objects IPerf which details the performance of IDTaxa on the training set, and IRes the raw result of IDTAXA's cross validations.

The script requires four trailing arguments, though the fourth is ignored when the input data contains nucleotide sequences.

$ Rscript </path/to/script/ICrossVal.R> <Input.RData> <integer, 1-5> <YourWellNamedOutput.RData> <integer, 1-3>

Like it's three counterparts ICrossVal.R expects input data to be in the specific format detailed above. The training data used in the manuscript is present in a zenodo repository, but if users can replicate this format they could certainly supply their own. The second argument refers to the construction of the input training data and for the supplied data used in the manuscript can be 1-5. The third argument is simply the name users wish to assign to their output file. The fourth argument allows users to select the alphabet argument that IDTAXA uses during training. When the input sequences are detected as a DNAStringSet this argument is ignored. It can also be left out completely, in which case the default reduced alphabet is used. The produced RData file contains the objects IRes - an object of class Taxa and Test the describes IDTAXA's performance on the tested sequences. IPerf which is a list of length 4 that describe the classification rate (first list position), the over classification error rate (second list position), the miss classification error rate (third list position), and classification and error rates at user specified confidences (fourth list position) for the sequences present in IRes.

BCrossVal.R

This script takes in the same three major arguments and assumes others if they are not supplied. Extra arguments are explained in the comments of the script. It has a similar usage. Similar to ICrossVal.R, BCrossVal.R saves off an RData file with a name supplied by the user, it will contain the objects Res - a data.frame of the raw BLAST results, PRes and ERes - objects of class Taxa and Test that represent of the BBHs within the database to the test sequences by PID and e-value (respectively). This object class is the same class of object produced by IDTAXA, and is the conversion is done to make performance comparisons trivial. Also contains the objects EPerf and PPerf which are lists of length 4 that describe the classification rate (first list position), the over classification error rate (second list position), the miss classification error rate (third list position), and classification and error rates at user specified confidences (fourth list position).

$ Rscript </path/to/script/BCrossVal.R> <Input.RData> <int, 1-5> <YourWellNamedOutput.RData>

HCrossVal.R

This script takes in the same three major arguments and assumes others if they are not supplied. Extra arguments are explained in the comments of the script. It has a similar usage and the RData file that is produced contains the objects Res, HRes, and HPerf, that are similar to the objects previously described with the exception of Res being a matrix instead HMMER's raw results as opposed to a data.frame.

$ Rscript </path/to/script/HCrossVal.R> <Input.RData> <int, 1-5> <YourWellNamedOutput.RData>

ErrorPlot.R

This script takes in results generated by the previous three scripts and plots them. It has a unique argument syntax that is not particularly user friendly, but does work. Users supply the filepaths to results generated by the preceding scripts and seperate those results with SEP where they wish to delineate a new panel. not including any SEP's will lead to all supplied results being plotted in a single panel. The script is modestly smart, but not written to accommodate an adversarial user. It can generate a 1, 2, 3, or 4 panel figure, and will attempt to provide a legend to each panel with names derived from the input file names. It will additionally attempt to provide tables describing the Over-classification and Miss-classification rates present in the supplied data fixed at 60% confidence for each method, and fixed at 60% of non-singletons classified for each method. The script expects IDTAXA results to have names in the form of METHOD_TaxLevel_AdditionalDescription.RData, and HMMER and BLAST results to have names in the form of METHOD_TaxLevel.RDATA. Legend names and table entries will be derived and organized from these names. Data that are grouped together will be placed in the same panel.

$ Rscript </path/to/script/ErrorPlot.R> <ResultFile_A_1.RDATA> <SEP> <ResultFile_A_2.RData>

The cross validation scripts take in data with a specific format, and return data with a specific format. Users can either run these with the supplied data used in the manuscript, or generate their own input data to test. Some manual manipulation will be needed to recreate the IDTAXA results for the tested literature alphabet, but that shouldn't be out of the scope of skills of anyone attempting to recreate this work.

The plotting script returns plots similar to, but not exact replicas of those present in the manuscript. Example outputs of this plotting script are provided below:

Plotting over-classification and miss-classification rates for tested classifiers:

Error Rates Plotted

Table of error rates for classifiers when classifiers are 60% confident:

Error Rates when classifiers are 60% Confident

Table of error rates for classifiers when classifiers have classified 60% of classifiable sequences:

Error Rates when classifiers have classified 60% of classifiable sequences

These output files have hard coded output names as ErrorPlot.pdf, ErrByClass60.pdf, and ErrByConf60.pdf respectively, and will be placed in R's working directory. They are linked in this readme as .jpegs because it's easier that way.

Annotation Comparison

The 5th Rscript in this repository generates a plot comparing annotations derived from BLAST and IDTAXA on any fasta file of protein sequences. This is the basis for figure 3 in the associated manuscript. Similar to the scripts described above, R must have access to BLAST. The script takes in 3 arguments in the format:

$ Rscript </path/to/script/CompareConf.R> <20200923_AATrainingData.RData> <myfastafile.faa> <KEGGorganism_code>

This script should hypothetically work with any protein fasta requested from the NCBI FTP site, or that users have locally, but this script wasn't designed to be exceptionally robust. Brettanomyces bruxellensis was chosen for it's relatively recent addition as a complete assembly to RefSeq, and the relative scarcity of KEGG sequences containing the B. bruxellensis organism code in the training set. The script uses this code to remove those associated sequences from the training set, attempting to ensure that all sequences compared, are compared without themselves being present in the training data.

Run under the conditions:

$ Rscript ~/path/to/CompareConf.R ~/path/to/20200923_AATraining.RData https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/011/074/885/GCF_011074885.1_ASM1107488v2/GCF_011074885.1_ASM1107488v2_protein.faa.gz bbrx

It will produce the following plots, which are present in figure 3 of the manuscript:

Compare annotations in a recently submitted complete eukaryotic genome in PID space

and:

Compare annotations in a recently submitted complete eukaryotic genome in EVL space

These plots have hardcoded output names and will be placed into R's working directory alongside an RData file containing the results that were used to generate these figures. Respectively, these plots will be named <providedtaxcode>_PIDComparison.pdf, <providedtaxcode>_EVLComparison.pdf, and the data will be named <providedtaxcode>_Result.RData.

Reporting Bugs and Errors

Please feel free to contact me with any concerns, questions, or bugs at [email protected], or submit issues here on Github.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages