Make minor edits #26

Open
wants to merge 1 commit into
base: main
85 changes: 44 additions & 41 deletions submission/manuscript.Rmd
@@ -61,7 +61,7 @@ University of Michigan, Ann Arbor, Michigan, USA
${^*}$ Current Affiliation:

${^\#}$ Current Affiliation: Bristol Myers Squibb, Summit, New Jersey,
USA 
USA

$\dagger$ To whom correspondence should be addressed:
[pschloss\@umich.edu](mailto:pschloss@umich.edu)
@@ -79,27 +79,27 @@ $\dagger$ To whom correspondence should be addressed:
Machine learning classification of disease based on the gut microbiome
often relies on clustering 16S rRNA gene sequences into operational
taxonomic units (OTUs) to quantify microbial composition. The standard
approach to clustering sequences into OTUs leverages the similarity of
*de novo* approach to clustering sequences into OTUs leverages the similarity of
the sequences to each other rather than to a reference database. The
abundance of each OTU is used to train a classification model. However,
OTU assignments depend on the sequences in the data set and therefore
can change if new data are added. This lack of stability complicates
can change if new data are added. This lack of consistency complicates
classification because in order to use the model to classify additional
samples, the OTUs have to be reclustered to include the new sequences
and the model retrained with the new OTU clusters. A new reference-based
clustering algorithm, called OptiFit, addresses this issue by fitting
new sequences into existing OTUs. While OptiFit is proven to produce
high quality OTU clusters, it is unclear whether this method for fitting
new sequence data into existing OTUs will impact the performance of
new sequences into existing OTUs. While OptiFit has been shown to produce
high quality OTU clusters, it is unknown whether this method will impact
the performance of
classification models. We used OptiFit to cluster additional data into
existing OTU clusters and quantified model performance in classifying a
data set containing samples from patients with and without colonic
screen relevant neoplasias (SRN). We compared the performance of this
model to the standard procedure of clustering all the data together. We
model to the standard procedure of *de novo* clustering all the data together. We
found that both approaches performed equally well in classifying SRNs.
Moving forward, when OTUs are used in classification problems, OptiFit
can be used to avoid the need to retrain models using reclustered
sequences when classifying new samples.
can streamline the process of classifying new samples by avoiding the need to
retrain models using reclustered sequences.

## Importance

@@ -112,7 +112,7 @@ generated from new patients seeking a diagnosis, then it would be
necessary to reassign sequences to OTUs and retrain the classification
model. Yet there is a desire to have a single, validated model that can
be deployed. To overcome this obstacle, we applied the OptiFit
clustering algorithm which fits new sequence data to existing OTUs
clustering algorithm which fits new sequence data to existing OTUs,
allowing the reuse of a consistent model. A random forest machine
learning model deployed using OptiFit performed as well as the
traditional reassignment and retraining approach. This result indicates
@@ -126,59 +126,59 @@ classification of diseases, including colorectal cancer [@baxter2016;
@zackular2014]. Amplicon sequencing of the 16S rRNA gene is a reliable
tool for assessing the taxonomic composition of microbial communities,
which is the input to these models. Analysis of 16S rRNA gene sequence
data generally relies on clustering of sequences based on similarity
data generally relies on clustering sequences based on similarity
into operational taxonomic units (OTUs). The process of OTU clustering
can either be reference-based or *de novo*. The quality of OTUs
generated with reference-based clustering is generally poor compared to
those generated with *de novo* clustering @westcott2015. While *de novo*
clustering produces high-quality OTU clusters where sequences are
accurately grouped based on similarity thresholds, the resulting OTU
clusters depend on the data in the data set and the addition of new data
could change the overall OTU clusters. The unstable nature of OTU
could change the overall OTU clusters. The inconsistent nature of *de novo* OTU
clustering complicates deployment of machine learning models since
integration of additional data requires reclustering all the data and
retraining of the model. The ability to integrate new data into a
retraining the model. The ability to integrate new data into a
validated model without reclustering and retraining could allow for
deployment of a single model that new data can be continually added to.
Recently Sovacool *et al* introduced OptiFit: a method for fitting new
deployment of a single model that new data can be continually tested against.
Recently, Sovacool *et al* introduced OptiFit: a method for fitting new
sequence data into existing OTUs @sovacool2022. While OptiFit is proven
to effectively fit new sequence data to existing OTU clusters, it is
unknown if the use of OptiFit will have an impact on classification.
Here we tested the ability of OptiFit to cluster new sequence data into
unknown if the use of OptiFit will have an impact on classification performance.
Here, we tested the ability of OptiFit to cluster new sequence data into
existing OTU clusters for the purpose of classification of disease based
on gut microbiome composition.

We compared two approaches, one using all of the data to generate OTU
clusters and the other generating OTU clusters with a portion of the
We compared two approaches, one using all of the data to generate *de novo* OTU
clusters, and the other generating *de novo* OTU clusters with a portion of the
data and then fitting the remaining sequence data to the existing OTUs
using OptiFit. In the first approach, all of the 16S rRNA sequence data
was *de novo* clustered into OTUs with the OptiClust algorithm in mothur
@westcott2017. The resulting abundance data was then split into training
and testing sets, where the training set was used to tune
hyperparameters and ultimately train the model. The testing set was then
classified with the model and the performance of the model was
quantified (Figure 1A). However, with this methodology we would have to
quantified (Figure 1A). However, with this methodology, we would have to
regenerate the OTU clusters and retrain the model if we wanted to
classify additional samples. The OptiFit algorithm @sovacool2022
addresses this problem by enabling new sequences to be clustered into
existing OTUs. The OptiFit workflow is similar to the OptiClust workflow
existing OTUs. The OptiFit workflow is similar to the OptiClust workflow,
where the data was clustered into OTUs and used to tune hyperparameters
and ultimately train the model. Then, we used OptiFit to fit sequence
data of samples not part of the original data set into the existing OTUs
data of samples not part of the original data set into the existing OTUs,
and used the same model to classify the samples (Figure 1B). To test how
the model performance compares between these two methodologies, we used
a publicly available data set of 16S rRNA gene sequences from stool
samples of healthy subjects as well as subjects with SRN consisting of
advanced adenoma and carcinoma @baxter2016. The data set was randomly
split into an 80% train set and 20% test set. For the standard OptiClust
workflow, the training and test sets were *de novo* clustered together
into OTUs then the resulting abundance table was split into the training
into OTUs, then the resulting abundance table was split into the training
and testing set. For the OptiFit workflow, the train set was clustered
*de novo* into OTUs and the remaining test set was fit to the OTU
*de novo* into OTUs, and the remaining test set was fit to the OTU
clusters using the OptiFit algorithm. For both workflows, the abundance
table of the train set was used to tune hyperparameters and train a
random forest model to classify SRN. The test set was classified as
either control or SRN using the trained models.To account for variation
either control or SRN using the trained models. To account for variation
depending on the split of the data, the data set was randomly split 100
times and the process repeated for each of the 100 data splits. By
comparing the model performance of classifying the samples in the test
@@ -213,7 +213,7 @@ OTUs, we expected the MCC scores produced by the OptiClust and OptiFit
workflows to be similar. Since the data was only clustered once in the
OptiClust workflow there was only one MCC score while the OptiFit
workflow produced an MCC score for the OTU clusters from each data
split. Overall the MCC scores were similar between OptiClust (MCC =
split. Overall, the MCC scores were similar between OptiClust (MCC =
`r round(opticlust_mcc,digits=3)`) and OptiFit (average MCC =
`r round(optifit_avg_mcc,digits=3)`). This indicated that OptiFit
performed as well as OptiClust when integrating new sequences into the
@@ -240,7 +240,7 @@ pvals <- read_csv("../results/tables/pvalues.csv",col_types = cols(p_value = col

After verifying that the quality of the OTUs was consistent between
OptiClust and OptiFit, we examined the model performance for classifying
samples in the held out test data set. To quantify model performance we
samples in the held out test data set. To quantify model performance, we
used the OTU relative abundances from the training data from the
OptiClust and OptiFit workflows to train a model to predict SRNs. Using
the predicted and actual diagnosis classification, we calculated the
@@ -268,10 +268,10 @@ We tested the ability of OptiFit to integrate new data into existing
OTUs for the purpose of machine learning classification using OTU
relative abundance. A potential problem with using OptiFit is that any
sequences in the new data that do not map to the existing OTU clusters
will be discarded resulting in a possible loss of information. However,
will be discarded, resulting in a possible loss of information. However,
we demonstrated that OptiFit can be used to fit new sequence data into
existing OTU clusters and perform equally well in predicting SRN
compared to clustering all of the sequence data together. The ability to
existing OTU clusters and performs equally well in predicting SRN
compared to *de novo* clustering all of the sequence data together. The ability to
integrate data from new samples into existing OTUs enables the
deployment of a single machine learning model. These results are based
on a single data set and disease. Further analysis is needed to
@@ -287,7 +287,7 @@ stool samples was downloaded from NCBI Sequence Read Archive (accession
no. SRP062005) [@edgar2011; @baxter2016]. This data set contains stool
samples from a total of 490 subjects. For this analysis, samples from
subjects identified in the metadata as normal, high risk normal, or
adenoma were categorized as "normal" while samples from subjects
adenoma were categorized as "normal", while samples from subjects
identified as advanced adenoma or carcinoma were categorized as "screen
relevant neoplasia" (SRN). The resulting data set consisted of 261
normal samples and 229 SRN samples.
@@ -321,23 +321,26 @@ the test set an average of `r n_test_train %>% pull(avg_test)` times
(SD=`r format_decimal(n_test_train %>% pull(sd_test),digits = 1)`).

The data was processed through two workflows. First, the standard
workflow using the OptiClust algorithm @westcott2017. In this pathway,
workflow using the OptiClust algorithm @westcott2017. In this workflow,
all of the data was clustered together with OptiClust to generate OTUs
and the resulting abundance tables were split into the training and
testing sets. In the second workflow, the preprocessed data was split
into the training and testing sets. The training set was clustered into
OTUs, then the test set was fit to the OTUs of the training set using
the OptiFit algorithm @sovacool2022. The OptiFit algorithm was run with
method open so that any sequences that did not map to the existing OTU
clusters would form new OTUs. For both pathways, the shared files were
the OptiFit algorithm @sovacool2022. The OptiFit algorithm was run with the open
method so that any sequences that did not map to the existing OTU
clusters would form new OTUs.
<!-- Did you then remove the columns corresponding to the additional OTUs?
Typically you'd want to run the closed method for this use-case, right? -->
Comment on lines +330 to +334
Member Author


Typically for this use-case you'd want the closed method so your shared file will have the same OTUs as the reference. Did you then remove those columns before ML?
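
For reference, the Machine Learning paragraph in this diff states that any OTUs in the test set that were not in the training set were removed. A minimal Python sketch of that kind of filtering is below, assuming the shared file has already been parsed into a plain dictionary of per-sample OTU counts. The function name `restrict_to_reference`, the dictionary layout, and the OTU labels are hypothetical and only illustrate the idea; this is not mothur's or mikropml's API, and it may not match how the manuscript's pipeline actually does it.

```python
def restrict_to_reference(shared, reference_otus):
    """Keep only the reference OTU columns for every sample.

    shared: dict mapping sample name -> dict of {OTU label: read count}
            (hypothetical in-memory form of a shared file).
    reference_otus: ordered list of OTU labels from the training set.

    Returns (filtered, dropped_fraction), where filtered has exactly the
    reference OTUs for each sample (missing ones filled with 0) and
    dropped_fraction is the share of each sample's reads that fell in
    OTUs outside the reference.
    """
    filtered = {}
    dropped_fraction = {}
    for sample, counts in shared.items():
        # Reference columns only, in reference order; absent OTUs count as 0.
        kept = {otu: counts.get(otu, 0) for otu in reference_otus}
        total = sum(counts.values())
        kept_total = sum(kept.values())
        filtered[sample] = kept
        dropped_fraction[sample] = (total - kept_total) / total if total else 0.0
    return filtered, dropped_fraction


# Toy data: "Otu900" stands in for a new OTU formed by the open method.
reference = ["Otu001", "Otu002", "Otu003"]
test_shared = {
    "sample1": {"Otu001": 6, "Otu003": 2, "Otu900": 2},
    "sample2": {"Otu002": 5},
}
filtered, dropped = restrict_to_reference(test_shared, reference)
```

Reporting the dropped fraction per sample would also show how much information is lost when the extra open-method OTUs are discarded, which is the concern raised in the Discussion.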

For both workflows, the shared files were
sub-sampled to 10,000 reads per sample.

***Machine Learning.*** Machine learning using Random Forest was
conducted with the R package mikropml (v 1.2.0) @topçuoglu2021 to
predict the diagnosis (SRN or normal) for the samples in the test set
for each data split. The training set was preprocessed to normalize OTU
counts (scale/center), collapse correlated OTUs, and remove OTUs with
zero-variance. The preprocessing from the training set was then applied
counts (scale and center), collapse correlated OTUs, and remove OTUs with
zero variance. The preprocessing from the training set was then applied
to the test set. Any OTUs in the test set that were not in the training
set were removed. P values comparing model performance were calculated
as previously described @topçuoglu2020. The averaged ROC curves were
@@ -387,8 +390,8 @@ This work was supported through a grant from the NIH (R01CA215574).
was clustered into OTUs using the OptiClust algorithm in mothur. The
data was then split into two sets where 80% of the samples were assigned
to the training set and 20% to the testing set. The training set was
preprocessed with mikropml to normalize values (scale/center), collapse
correlated features, and remove features with zero-variance. Using
preprocessed with mikropml to normalize values (scale and center), collapse
correlated features, and remove features with zero variance. Using
mikropml, the training set was split into train and validate sets to
compare results using different hyperparameter settings. The highest
performing hyperparameter setting was then used to train the model with
@@ -405,12 +408,12 @@ sets where 80% of the samples were assigned to the training set and 20%
to the testing set. The training set was then clustered into OTUs using
the OptiClust algorithm in mothur. The resulting abundance data was
preprocessed with mikropml to normalize values (scale/center), collapse
correlated features, and remove features with zero-variance. Using
correlated features, and remove features with zero variance. Using
mikropml, the training set was split into train and validate sets to
compare results using different hyperparameter settings. The highest
performing hyperparameter setting was then used to train the model with
the full training set. The OptiFit algorithm in mothur was used to
cluster the left out testing data set using the OTUs of the training set
cluster the held out testing data set using the OTUs of the training set
as a reference. The preprocessing scale from the training set was
applied to the test data set, then the trained model was used to
classify the samples in the test set. Based on the actual classification