diff --git a/docs/2_method/3_mip.md b/docs/2_method/3_mip.md
index cb43952..ea1b628 100644
--- a/docs/2_method/3_mip.md
+++ b/docs/2_method/3_mip.md
@@ -37,5 +37,5 @@ After solving this integer program, the non-zero diagonal entries of $$A$$ repre

 ![Example output from the clustering process, where an entry of 1 indicates that time series $$j$$ belongs to cluster with centroid i.](cluster_matrix_formation4.svg)

-Finding global optimality can increase the computation time, depending on the number of time series within the dataset and the DTW distances. Therefore, there is also a built-in option to cluster using k-medoids, described in [k-Medoids Clustering](link to that). The k-medoids method is often quicker as it is an iterative approach, however it is subject to getting stuck in local optima. The results in the next section show the timing and memory performance of both MIP clustering and k-medoids clustering using *DTW-C++* compared to other packages.
+Finding global optimality can increase the computation time, depending on the number of time series within the dataset and the DTW distances. Therefore, there is also a built-in option to cluster using k-medoids. The k-medoids method is often quicker as it is an iterative approach; however, it is subject to getting stuck in local optima. The results in the next section show the timing and memory performance of both MIP clustering and k-medoids clustering using *DTW-C++* compared to other packages.
diff --git a/joss/paper.bib b/joss/paper.bib
index d3fee47..6bcdb62 100644
--- a/joss/paper.bib
+++ b/joss/paper.bib
@@ -150,3 +150,17 @@ @misc{kumtepeli2023fast
 archivePrefix={arXiv},
 primaryClass={eess.SP}
 }
+
+@article{ROUSSEEUW198753,
+title = {Silhouettes: A graphical aid to the interpretation and validation of cluster analysis},
+journal = {Journal of Computational and Applied Mathematics},
+volume = {20},
+pages = {53-65},
+year = {1987},
+issn = {0377-0427},
+doi = {10.1016/0377-0427(87)90125-7},
+url = {https://www.sciencedirect.com/science/article/pii/0377042787901257},
+author = {Peter J. Rousseeuw},
+keywords = {Graphical display, cluster analysis, clustering validity, classification},
+abstract = {A new graphical display is proposed for partitioning techniques. Each cluster is represented by a so-called silhouette, which is based on the comparison of its tightness and separation. This silhouette shows which objects lie well within their cluster, and which ones are merely somewhere in between clusters. The entire clustering is displayed by combining the silhouettes into a single plot, allowing an appreciation of the relative quality of the clusters and an overview of the data configuration. The average silhouette width provides an evaluation of clustering validity, and might be used to select an ‘appropriate’ number of clusters.}
+}
diff --git a/joss/paper.md b/joss/paper.md
index 697aa6d..8bc53d9 100644
--- a/joss/paper.md
+++ b/joss/paper.md
@@ -5,7 +5,7 @@ tags:
 - Dynamic time warping
 - Clustering
 - k-medoids
- - Mixed integer programming
+ - Integer programming
 - Dynamic programming
 - Time series
 authors:
@@ -27,34 +27,34 @@ bibliography: paper.bib

 # Summary

-Time-series data analysis is of interest in a huge number of different applications, from finding patterns of energy consumption, to detecting brain activity, to discovering stock price trends. Unsupervised learning methods can help analysts unlock patterns in data, and an important example method is clustering.
However, clustering of time series data can be computationally expensive for large datasets. We present an approach for computationally efficient dynamic time warping (DTW) and clustering of time-series data. The method frames the dynamic warping of time series datasets as an optimisation problem solved using dynamic programming, and then clusters time series data by solving a second optimisation problem using integer programming. There is also an option to use k-medoids clustering for increased speed, when a certificate for global optimality is not essential. The increased speed of our approach is due to task-level parallelisation and memory efficiency improvements. The approach was tested using the UCR Time Series Archive, and was found to be on average 33% faster than the next fastest option when using the same clustering method. This increases to 64% faster when considering only larger datasets (with more than 1000 time series). The integer programming clustering is most effective on small numbers of longer time series, because the DTW computation is faster than other approaches, but the clustering problem becomes increasingly computationally expensive as the number of time series increases.
+Time-series data analysis is of interest in a huge number of different applications, from finding patterns of energy consumption to detecting brain activity or discovering stock price trends. Unsupervised learning methods can help analysts unlock patterns in data, and a key example of this is clustering. However, clustering of time series data can be computationally expensive for large datasets. We present an approach for computationally efficient dynamic time warping (DTW) and clustering of time-series data. The method frames the dynamic warping of time series datasets as an optimisation problem solved using dynamic programming, and then clusters time series data by solving a second optimisation problem using integer programming. There is also an option to use k-medoids clustering when a certificate for global optimality is not essential. The increased speed of our approach is due to task-level parallelisation and memory efficiency improvements. The method was tested using the UCR Time Series Archive, and was found to be on average 33% faster than the next fastest option when using the same clustering approach. This increases to 64% faster when considering only larger datasets (with more than 1000 time series). The integer programming clustering is most effective on small numbers of longer time series, because the DTW computation is faster than other approaches, but the clustering problem becomes increasingly computationally expensive as the number of time series increases.

 # Statement of need

-The target audience for this software is very broad---basically, anyone interested in analysing time series data. Clustering of time series data is of interest across a very broad range of applications, from energy to finance and medicine. However, as data availability increases, so does the complexity of the clustering problem. Most time series clustering algorithms depend on dimension reduction or feature extraction techniques to enable scaling to large datasets, but this can induce bias in the clustering [@Aghabozorgi2015]. Dynamic time warping [@Sakoe1978] is a well-known technique for manipulating time series to enable comparisons between datasets, using local warping (stretching or compressing along the time axis) of the elements within each time series to find an optimal alignment between series.
This emphasises the similarity of the shapes of the respective time series rather than the exact alignment of specific features. Unfortunately, DTW does not scale well in computational speed as the length and number of time series to be compared increases---the computational complexity grows quadratically with the total number of data points. This is a barrier to DTW being widely implemented in large-scale time series clustering [@Rajabi2020]. In response, `DTW-C++` was written to handle large time series datasets efficiently, directly processing the raw data rather than first extracting features or reduced-dimension data.
+The target audience for this software is very broad, since clustering of time series data is relevant in many applications from energy to finance and medicine. However, as data availability increases, so does the complexity of the clustering problem. Most time series clustering algorithms depend on dimension reduction or feature extraction techniques to enable scaling to large datasets, but this can induce bias in the clustering [@Aghabozorgi2015]. Dynamic time warping [@Sakoe1978] is a well-known technique for manipulating time series to enable comparisons between datasets, using local warping (stretching or compressing along the time axis) of the elements within each time series to find an optimal alignment between series. This emphasises the similarity of the shapes of the respective time series rather than the exact alignment of specific features. Unfortunately, DTW does not scale well in computational speed as the length and number of time series to be compared increases---the computational complexity grows quadratically with the total number of data points. This is a barrier to DTW being widely implemented in large-scale time series clustering [@Rajabi2020]. In response, `DTW-C++` was written to handle large time series efficiently, directly processing the raw data rather than first extracting features.

 In contrast to existing tools available for time series clustering using DTW, such as `DTAIDistance` [@meert2020wannesm] and `TSlearn` [@Tavenard2020], `DTW-C++` offers significant improvements in speed and memory use, enabling larger datasets to be clustered. This is achieved by

 1. task-level parallelisation, where multiple pairwise comparisons between time series can be evaluated simultaneously, and
 2. improved memory management---since the clustering algorithm only needs the final distance computed between pairwise time series, the DTW distance computation stores only the most recent previous vector, rather than the entire warping matrix (see the sketch below).

-In addition, `DTW-C++` offers the option of clustering using a new algorithm (described below) based on integer programming. The advantage of this over k-based methods is that it guarantees finding a global optimal solution in most cases, and in the rare event that the global optimum cannot be found, the gap between the best solution found, and the global optimum is given.
+In addition, `DTW-C++` offers the option of clustering using a new algorithm (described below) based on integer programming. The advantage of this over k-based methods is that it guarantees finding a globally optimal solution in most cases, and in the rare event that the global optimum cannot be found, the gap between the best solution and the global optimum is given.
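
To make the memory-management point (item 2 above) concrete, here is a minimal C++ sketch of the two-vector dynamic programming recursion that is defined formally in the Mathematical background section. The function name and layout are illustrative assumptions, not the actual `DTW-C++` API; squared differences are assumed as the local cost, matching the recurrence for $c_{i,j}$ below.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Two-row DTW: keeps only the previous row of the cumulative cost
// matrix, so memory use is O(m) rather than O(nm). Illustrative only.
double dtwDistance(const std::vector<double>& x, const std::vector<double>& y) {
    const std::size_t n = x.size(), m = y.size();
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<double> prev(m + 1, inf), curr(m + 1, inf);
    prev[0] = 0.0;  // empty prefixes align at zero cost
    for (std::size_t i = 1; i <= n; ++i) {
        curr[0] = inf;  // a non-empty x-prefix cannot match an empty y-prefix
        for (std::size_t j = 1; j <= m; ++j) {
            const double d = (x[i - 1] - y[j - 1]) * (x[i - 1] - y[j - 1]);
            // c_{i,j} = d + min(c_{i-1,j-1}, c_{i-1,j}, c_{i,j-1})
            curr[j] = d + std::min({prev[j - 1], prev[j], curr[j - 1]});
        }
        std::swap(prev, curr);  // the current row becomes the previous row
    }
    return prev[m];  // c_{n,m}: the DTW distance between x and y
}
```

Restricting the inner loop to indices with $|i-j|$ inside a fixed band would recover the Sakoe-Chiba option mentioned later in the paper.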
 # Current ``DTW-C++`` functionality

-The current functionality of the software is as follows:
+The current functionality of the software is:

-* Calculate DTW pairwise distances between time series, using a vector based approach, to reduce memory use. There is also the option to use a Sakoe-Chiba band to restrict warping in the DTW distance calculation [@Sakoe1978]. This speeds up the computation time, as well as being a useful constraint for some time series clustering scenarios (e.g., if an event must occur within a certain time window to be considered similar).
+* Calculate DTW pairwise distances between all pairs of time series in a set, using a vector-based approach to reduce memory use. There is also the option to use a Sakoe-Chiba band to restrict warping in the DTW distance calculation [@Sakoe1978]. This speeds up the computation time, as well as being a useful constraint for some clustering scenarios (e.g., if an event must occur within a certain time window to be considered similar).
 * Produce a distance matrix containing all pairwise comparisons between each time series in the dataset.
 * Split all time series into a predefined number of clusters, with a representative centroid time series for each cluster. This can be done using integer programming or k-medoids clustering, depending on user choice.
 * Output the clustering cost, which is the sum of distances between every time series within each cluster and its cluster centroid.
-* Find the silhouette score and elbow score for the clusters in order to aid the user decision on how many clusters, $k$, to include.
+* Find the silhouette score and elbow score for the clusters to aid the user's decision on how many clusters, $k$, to include. The silhouette score is defined by the difference between the mean intra-cluster distance and the mean nearest-cluster distance, divided by the maximum of these two distances [@ROUSSEEUW198753]. This considers both the similarity of a time series to its own cluster as well as its dissimilarity from other clusters. The elbow score is based on the cost of the clustering exercise, which sums together the distance between each time series and its centroid. Therefore, the similarity of a time series to its own cluster is considered, but not its dissimilarity from other clusters.

 # Mathematical background

 ## Dynamic time warping

-Consider a time series to be a vector of some arbitrary length. Consider that we have $p$ such vectors in total, each possibly differing in length. To find a subset of $k$ clusters within the set of $p$ vectors using integer programming formulation, we must first make $\frac{1}{2} {p \choose 2}$ pairwise comparisons between all vectors within the total set and find the `similarity' between each pair. In this case, the similarity is defined as the DTW distance. Consider two time series $x$ and $y$ of differing lengths $n$ and $m$ respectively,
+Consider a time series to be a vector of arbitrary length. Suppose that we have $p$ such vectors in total, each possibly differing in length. To find a subset of $k$ clusters within the set of $p$ vectors, we must first make ${p \choose 2}$ pairwise comparisons between all vectors within the total set and find the `similarity' between each pair. In this case, the similarity is defined as the DTW distance. Consider two time series $x$ and $y$ of differing lengths $n$ and $m$ respectively,

 $$
 x=(x_1, x_2, ..., x_n)
@@ -63,16 +63,15 @@ $$
 y=(y_1, y_2, ..., y_m).
 $$

-The DTW distance is the sum of the Euclidean distance between each point and its matched point(s) in the other vector, as shown in \autoref{fig:warping_signals}. The following constraints must be met:
+The DTW distance is the sum of the Euclidean distance between each point and its matched point(s) in the other vector, as shown in \autoref{fig:warping_signals}. To find the DTW distance, the following constraints must be met:

 1. The first and last elements of each series must be matched.
-2. Only unidirectional forward movement through relative time is allowed, i.e., if $x_1$ is mapped to $y_2$ then $x_2$ may not be mapped to
-   $y_1$ (monotonicity).
+2. Only unidirectional forward movement through relative time is allowed, i.e., if $x_1$ is mapped to $y_2$ then $x_2$ may not be mapped to $y_1$ (monotonicity).
 3. Each point is mapped to at least one other point, i.e., there are no jumps in time (continuity).

-![Two time series with DTW pairwise alignment between each element, showing one-to-many mapping properties of DTW (a). Cost matrix $C$ for the two time series, showing the warping path and final DTW cost at $C_{14,13}$ (b). \label{fig:warping_signals}](../media/warping_path-imageonline.co-merged.png)
+![(a) Two time series with DTW pairwise alignment between each point, showing the one-to-many mapping properties of DTW. (b) Cost matrix $C$ for the two time series, showing the warping path and final DTW cost at $c_{14,13}$. \label{fig:warping_signals}](../media/warping_merged_cropped.pdf)

-Finding the optimal warping arrangement is an optimisation problem that can be solved using dynamic programming, which splits the problem into easier sub-problems and solves them recursively, storing intermediate solutions until the final solution is reached. To understand the memory-efficient method used in ``DTW-C++``, it is useful to first examine the full-cost matrix solution, as follows. For each pairwise comparison, an $n$ by $m$ matrix $C^{n\times m}$ is calculated, where each element represents the cumulative cost between series up to the points $x_i$ and $y_j$:
+Finding the optimal warping arrangement is an optimisation problem that can be solved using dynamic programming, which splits the problem into easier sub-problems and solves them recursively, storing intermediate solutions until the final solution is reached. To understand the memory-efficient method used in ``DTW-C++``, it is useful to first examine the full cost matrix solution, as follows. For each pairwise comparison, an $n$ by $m$ matrix $C^{n\times m}$ is calculated, where each element represents the cumulative cost between series up to the points $x_i$ and $y_j$:

 \begin{equation}
 \label{c}
@@ -83,53 +82,50 @@ Finding the optimal warping arrangement is an optimisation problem that can be s
 \end{cases}
 \end{equation}

-The final element $c_{n,m}$ is then the total cost, $C_{x,y}$, which provides the comparison metric between the two series $x$ and $y$. \autoref{fig:warping_signals} shows an example of this cost matrix $C$ and the warping path through it.
+The final element in the matrix $c_{n,m}$ is then the total cost, and this provides the metric for comparing the two series $x$ and $y$. \autoref{fig:warping_signals} shows an example of this cost matrix $C$ and the warping path through it.

 ## Clustering

-For the clustering problem, only this final cost for each pairwise comparison is required; the actual warping path (or mapping of each point in one time series to the other) is superfluous for k-medoids clustering.
The memory complexity of the cost matrix $C$ is $O(nm)$, so as the length of the time series increases, the memory required increases greatly. Therefore, significant reductions in memory can be made by not storing the entire $C$ matrix. When the warping path is not required, only a vector containing the previous row for the current step of the dynamic programming sub-problem is required (i.e., the previous three values $c_{i-1,j-1}$, $c_{i-1,j}$, $c_{i,j-1}$), as indicated in \autoref{c}.
+For the clustering algorithm, only the final cost for each pairwise comparison is required; the actual warping path (i.e., the mapping between time series) is superfluous. The memory complexity of the cost matrix $C$ is $\mathcal{O}(nm)$, so as the length of the time series grows, the memory required greatly increases. Therefore, significant reductions in memory use can be achieved by not storing the entire cost matrix. Since the warping path is not required, we only need to store a vector containing the previous row relating to the current step of the dynamic programming sub-problem (i.e., the previous three values $c_{i-1,j-1}$, $c_{i-1,j}$, $c_{i,j-1}$), as indicated in \autoref{c}.

-The DTW distance $C_{x,y}$ is found for each pairwise comparison. As shown in \ref{fig:c_to_d}, pairwise distances are then stored in a separate symmetric matrix, $D^{p\times p}$, where $p$ is the total number of time series in the clustering exercise. In other words, the element $d_{i,j}$ gives the distance between time series $i$ and $j$.
+We now introduce the notation $d_{x,y}=c_{n,m}$ to denote the final (scalar) cost of the pairwise comparison between time series $x$ and $y$, given by the final element of their cost matrix. To cluster several time series, this cost is first computed for every pair of time series in the set. As shown in \autoref{fig:c_to_d}, all of the pairwise distances are then stored in a separate symmetric matrix, $D^{p\times p}$, where $p$ is the total number of time series in the clustering exercise. In other words, the element $d_{i,j}$ in this matrix gives the cost between time series $i$ and $j$.

-![The DTW costs of all the pairwise comparisons between time series in the dataset are combined to make a distance matrix $D$. \label{fig:c_to_d}](../media/distance_matrix_formation_vector.pdf)
+![The individual DTW costs from each pairwise comparison between time series in the dataset are all combined to form a distance matrix $D$. \label{fig:c_to_d}](../media/c_to_d_cropped.pdf)

-Using this matrix, $D$, the time series can be split into $k$ separate clusters with integer programming. The problem formulation begins with a binary square matrix $A^{p\times p}$, where $A_{ij}=1$ if time series $j$ is a member of the $i$th cluster centroid, and 0 otherwise, as shown in \autoref{fig:A_matrix}.
+Using this distance matrix, $D$, the full set of time series can be split into $k$ separate clusters with integer programming. The problem formulation begins by considering a binary square matrix $A^{p\times p}$, where $A_{ij}=1$ if time series $j$ is a member of the cluster whose centroid is time series $i$, and 0 otherwise, as shown in \autoref{fig:A_matrix}.

-![Example output from the clustering process, where an entry of 1 indicates that time series $j$ belongs to cluster with centroid $i$.
\label{fig:A_matrix}](../media/cluster_matrix_formation4.svg){ width=70% }
+![Example clustering matrix, where an entry of 1 indicates that time series $j$ belongs to the cluster with centroid $i$. \label{fig:A_matrix}](../media/clustering_cropped.pdf){ width=70% }

-As each centroid has to be in its own cluster, non-zero diagonal entries in $A$ represent centroids. In summary, the following constraints apply:
+As each centroid has to be in its own cluster, non-zero diagonal entries in $A$ represent centroids. Our objective is to find $A$, and this may be formulated as an optimisation problem

-1. Only $k$ series can be centroids,
+\begin{equation}
+    A^\star = \underset{A}{\rm argmin} \sum_i \sum_j D_{ij} \times A_{ij},
+\end{equation}
+subject to the following constraints:
+
+1. Only $k$ series can be centroids,
 $$
 \sum_{i=1}^p A_{ii}=k.
 $$

-2. Each time series must be in one and only one cluster,
-
+2. Each time series must be a member of one and only one cluster,
 $$
 \sum_{i=1}^pA_{ij}=1 \quad \forall j \in [1,p].
 $$

 3. In any row, there can only be non-zero entries if the corresponding diagonal entry is non-zero, so a time series can only be in a cluster where the row corresponds to a centroid time series,
-
 $$
 A_{ij} \le A_{ii} \quad \forall i,j \in [1,p].
 $$

-The optimisation problem to solve, subject to the above constraints, is
-
-\begin{equation}
- A^\star = \min_{A} \sum_i \sum_j D_{ij} \times A_{ij}.
-\end{equation}
-
-This integer program is solved using Gurobi [@gurobi] or HiGHS [@Huangfu2018]. After solving this integer program, the non-zero diagonal entries of $A$ represent the centroids, and the non-zero elements in the corresponding columns in $A$ represent the members of that cluster. In the example in \autoref{fig:A_matrix}, the clusters are time series 1, **2**, 5 and 3, **4** with the bold time series being the centroids.
+This integer program is solved in `DTW-C++` using Gurobi [@gurobi] or HiGHS [@Huangfu2018]. Once solved, the non-zero diagonal entries of $A$ represent the centroids, and the non-zero elements in the corresponding columns in $A$ represent the members of that cluster. In the example in \autoref{fig:A_matrix}, the clusters are time series {1, **2**, 5} and {3, **4**}, with the boldface entries indicating the centroids.

-Finding global optimality can increase the computation time, depending on the number of time series within the dataset and the DTW distances. Therefore, there is also a built-in option to cluster using k-medoids, as used in other packages such as \texttt{DTAIDistance} [@meert2020wannesm]. The k-medoids method is often quicker as it is an iterative approach, however it is subject to getting stuck in local optima. The results in the next section show the timing and memory performance of both integer programming clustering and k-medoids clustering using \texttt{DTW-C++} compared to other packages.
+Finding a globally optimal solution with this method can result in increased computation times depending on the number of time series within the dataset and the DTW distances. Therefore, there is also a built-in option to cluster using k-medoids, as used in other packages such as \texttt{DTAIDistance} [@meert2020wannesm]. The k-medoids method is often quicker as it is an iterative approach; however, it is subject to getting stuck in local optima. The results in the next section show the timing and memory performance of both integer programming clustering and k-medoids clustering using \texttt{DTW-C++} compared to other packages.
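
For illustration, the following is a minimal sketch of this assignment formulation using the Gurobi C++ API (one of the two solvers named above). The function name, signature, and dense layout of $D$ are our own illustrative choices, not the `DTW-C++` interface; a HiGHS-based version would be analogous.

```cpp
#include "gurobi_c++.h"
#include <vector>

// Sketch: split p time series into k clusters, given a precomputed
// symmetric p x p DTW distance matrix D. Returns, for each series j,
// the index of its cluster centroid. Hypothetical helper, not DTW-C++.
std::vector<int> ipCluster(const std::vector<std::vector<double>>& D, int k) {
    const int p = static_cast<int>(D.size());
    GRBEnv env;
    GRBModel model(env);

    // A[i][j] = 1 if series j belongs to the cluster whose centroid is i;
    // the objective coefficient D[i][j] makes the (minimised) objective
    // sum_i sum_j D_ij * A_ij.
    std::vector<std::vector<GRBVar>> A(p, std::vector<GRBVar>(p));
    for (int i = 0; i < p; ++i)
        for (int j = 0; j < p; ++j)
            A[i][j] = model.addVar(0.0, 1.0, D[i][j], GRB_BINARY);

    // Constraint 1: exactly k centroids (non-zero diagonal entries).
    GRBLinExpr diag = 0;
    for (int i = 0; i < p; ++i) diag += A[i][i];
    model.addConstr(diag == k);

    // Constraint 2: each series belongs to exactly one cluster.
    for (int j = 0; j < p; ++j) {
        GRBLinExpr col = 0;
        for (int i = 0; i < p; ++i) col += A[i][j];
        model.addConstr(col == 1);
    }

    // Constraint 3: a series may only join a row whose diagonal is a centroid.
    for (int i = 0; i < p; ++i)
        for (int j = 0; j < p; ++j)
            model.addConstr(A[i][j] <= A[i][i]);

    model.optimize();

    // Read back each series' centroid from the binary solution.
    std::vector<int> centroidOf(p, -1);
    for (int i = 0; i < p; ++i)
        for (int j = 0; j < p; ++j)
            if (A[i][j].get(GRB_DoubleAttr_X) > 0.5) centroidOf[j] = i;
    return centroidOf;
}
```

Note that constraint 3 alone contributes $p^2$ rows, which is one reason the integer program grows expensive as the number of series increases, as discussed in the next section.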
 # Comparison

-We compared our approach with two other DTW clustering packages, \texttt{DTAIDistance} [@meert2020wannesm] and \texttt{TSlearn} [@Tavenard2020]. The datasets used for the comparison are from the UCR Time Series Classification Archive [@Dau2018], and consist of 128 time series datasets with up to 16,800 data series of lengths up to 2,844. The full results can be found in the Appendix. Benchmarking against \texttt{TSlearn} was stopped after the first 22 datasets because the results were consistently over 20 times slower than \texttt{DTW-C++}. \autoref{tab:small_table} shows the results for datasets downselected to have a number of time series ($N$) greater than 100 and a length of each time series greater than 500 points. This is because \texttt{DTW-C++} is aimed at larger datasets where the speed improvements are more relevant.
+We compared our approach with two other DTW clustering packages, \texttt{DTAIDistance} [@meert2020wannesm] and \texttt{TSlearn} [@Tavenard2020], using data from the UCR Time Series Classification Archive [@Dau2018], which consists of 128 time series datasets with up to 16,800 data series of lengths up to 2,844. Benchmarking against \texttt{TSlearn} was stopped after the first 22 datasets because the results were consistently over 20 times slower than \texttt{DTW-C++}. \autoref{tab:small_table} shows the results for datasets downselected to have the number of time series, $N$, greater than 100, and the length of each time series greater than 500 points. This is because \texttt{DTW-C++} is aimed at larger datasets where the speed improvements are more relevant.

 \begin{table}[]
 \resizebox{\textwidth}{!}{%
@@ -168,24 +164,24 @@ SmallKitchenAppliances & 375 & 720 & 41.7 & \textbf{23.8} & 30.
 StarLightCurves & 8236 & 1024 & N/A & \textbf{18551.7} & 27558.1 & 33 \\
 UWaveGestureLibraryAll & 3582 & 945 & N/A & \textbf{1194.6} & 4436.9 & 73
 \end{tabular}}
-\caption{Computational time comparison of \texttt{DTW-C++} using integer programming and k-medoids, vs.\ \texttt{DTAIDistance}, and \texttt{TSlearn}, on datasets in the UCR Time Series Classification Archive where $N>100$ and $L>500$. The fastest result for each dataset is bold.}
+\caption{Computational time comparison between \texttt{DTW-C++} (using integer programming and k-medoids), \texttt{DTAIDistance}, and \texttt{TSlearn}, on datasets in the UCR Time Series Classification Archive where $N>100$ and $L>500$. The fastest result for each dataset is in bold type.}
 \label{tab:small_table}
 \end{table}

-As can be seen in these results, \texttt{DTW-C++} is the fastest package for 90\% of the datasets, and all 13 datasets where \texttt{DTAIDistance} was faster were cases where the entire clustering process was completed in 1.06 seconds or less. Across the whole collection of datasets, \texttt{DTW-C++} was on average 32% faster. When looking at larger datasets with $N > 1000$, \texttt{DTW-C++} is on average 65% faster. In all, apart from 2 of the 115 cases where \texttt{DTW-C++} is the fastest, it uses the k-medoids algorithm. This is however to be expected as the latter is an iterative clustering method and therefore does not compute all DTW distances. \autoref{fig:k_med} clearly shows the increasing superiority of \texttt{DTW-C++} as the number of time series increases. In this comparison, both algorithms use k-medoids, so the speed improvement is due to faster dynamic time warping.
+\texttt{DTW-C++} is the fastest package for 90\% of the datasets, and all 13 datasets where \texttt{DTAIDistance} was faster were cases where the entire clustering process was completed in 1.06 seconds or less. Across the whole collection of datasets, \texttt{DTW-C++} was on average 32% faster. When looking at larger datasets, with $N > 1000$, \texttt{DTW-C++} is on average 65% faster. In all but 2 of the 115 cases where \texttt{DTW-C++} is the fastest, the k-medoids algorithm was used for clustering. \autoref{fig:k_med} shows the growing speed advantage of \texttt{DTW-C++} as the number of time series increases. In this comparison, both algorithms used k-medoids, so the speed improvement is due to the faster dynamic time warping method in \texttt{DTW-C++}.

-\texttt{DTW-C++} IP was on average 16 times slower than \texttt{DTAIDistance} over all samples and as the number of time series increases, integer programming clustering becomes increasingly slower, as demonstrated in Fig. 5. This is to be expected because the computational complexity of the integer programming clustering optimisation increases significantly as the number of time series in the clustering problem increases. However, as the length of the time series increases, the performance of integer programming converges to the speed of \texttt{DTAIDistance}, while finding global optimality. This confirms the improved performance of DTW in \texttt{DTW-C++}. Therefore, the integer programming approach is recommended for occasions when the time series to be clustered are very long, but the number of time series is smaller.
+With respect to clustering, \texttt{DTW-C++} with integer programming was on average 16 times slower than \texttt{DTAIDistance} over all samples, and as the number of time series increases, integer programming clustering becomes increasingly slower (\autoref{fig:speed_IP}). This is to be expected because the computational complexity of the integer programming optimisation increases significantly as the number of time series in the clustering problem increases. However, as the length of each time series increases, the performance of integer programming converges to the speed of \texttt{DTAIDistance}, and the former finds globally optimal results. Therefore, the integer programming approach is recommended for occasions when the individual time series to be clustered are very long, but the number of individual time series is small (e.g., fewer than 1000).

-The comparison of all datasets in the UCR Time Series Classification Archive can be found in reference [@kumtepeli2023fast].
+The performance comparison on all datasets in the UCR Time Series Classification Archive and any updated benchmarking tests can be found in the repository.

-![\texttt{DTW-C++} k-medoids clustering becomes increasingly faster compared to \texttt{DTAIDistance} as the number of time series increases. \label{fig:k_med}](../media/k_med_speed_nn.pdf){ width=80% }
+![\texttt{DTW-C++} with k-medoids clustering becomes increasingly faster compared to \texttt{DTAIDistance} as the number of time series increases.
\label{fig:k_med}](../media/k_med_speed_nn.pdf){ width=80% }

-![Change in computational time of \texttt{DTW-C++} using IP DTW clustering compared to \texttt{DTAIDistance} as the number of time series in the datasets to be clustered increases and the length of time series in the datasets increases.](../media/ip_speed.pdf){ width=80%}
+![Change in computational time of \texttt{DTW-C++} using integer programming clustering compared with \texttt{DTAIDistance} as the number and length of the time series in the datasets to be clustered increase. \label{fig:speed_IP}](../media/ip_speed.pdf){ width=80% }

 # Acknowledgements

-We gratefully acknowledge the contributions by [Battery Intelligence Lab](https://howey.eng.ox.ac.uk) members, and thank BBOXX for project funding and access to data. This work was also funded by the UKRI PFER Energy Superhub Oxford demonstrator and the ``Data-driven exploration of the carbon emissions impact of grid energy storage deployment and dispatch'' project (EP/W027321/1).
+We are grateful for discussions of this topic with [Battery Intelligence Lab](https://howey.eng.ox.ac.uk) members, and thank BBOXX for project funding and access to data. This work was also funded by the UKRI PFER Energy Superhub Oxford demonstrator and the ``Data-driven exploration of the carbon emissions impact of grid energy storage deployment and dispatch'' project (EP/W027321/1).

 # References
diff --git a/media/c_to_d_cropped.pdf b/media/c_to_d_cropped.pdf
new file mode 100644
index 0000000..9584447
Binary files /dev/null and b/media/c_to_d_cropped.pdf differ
diff --git a/media/clustering_cropped.pdf b/media/clustering_cropped.pdf
new file mode 100644
index 0000000..dddc017
Binary files /dev/null and b/media/clustering_cropped.pdf differ
diff --git a/media/warping_merged_cropped.pdf b/media/warping_merged_cropped.pdf
new file mode 100644
index 0000000..fda92b4
Binary files /dev/null and b/media/warping_merged_cropped.pdf differ