Commit
Update paper.md
Dave's latest edits, up to around the section 'clustering'
davidhowey authored Apr 25, 2024
1 parent 58d9b7e commit 3486818
Showing 1 changed file with 10 additions and 11 deletions.
bibliography: paper.bib

# Summary

Time-series data analysis is of interest in a huge number of different applications, from finding patterns of energy consumption to detecting brain activity or discovering stock price trends. Unsupervised learning methods can help analysts unlock patterns in data, and a key example of this is clustering. However, clustering of time series data can be computationally expensive for large datasets. We present an approach for computationally efficient dynamic time warping (DTW) and clustering of time-series data. The method frames the dynamic warping of time series datasets as an optimisation problem solved using dynamic programming, and then clusters time series data by solving a second optimisation problem using integer programming. There is also an option to use k-medoids clustering when a certificate for global optimality is not essential. The increased speed of our approach is due to task-level parallelisation and memory efficiency improvements. The method was tested using the UCR Time Series Archive, and was found to be on average 33% faster than the next fastest option when using the same clustering approach. This increases to 64% faster when considering only larger datasets (with more than 1000 time series). The integer programming clustering is most effective on small numbers of longer time series, because the DTW computation is faster than other approaches, but the clustering problem becomes increasingly computationally expensive as the number of time series increases.

# Statement of need

The target audience for this software is very broad, since clustering of time series data is relevant in many applications from energy to finance and medicine. However, as data availability increases, so does the complexity of the clustering problem. Most time series clustering algorithms depend on dimension reduction or feature extraction techniques to enable scaling to large datasets, but this can induce bias in the clustering [@Aghabozorgi2015]. Dynamic time warping [@Sakoe1978] is a well-known technique for manipulating time series to enable comparisons between datasets, using local warping (stretching or compressing along the time axis) of the elements within each time series to find an optimal alignment between series. This emphasises the similarity of the shapes of the respective time series rather than the exact alignment of specific features. Unfortunately, DTW does not scale well in computational speed as the length and number of time series to be compared increases---the computational complexity grows quadratically with the total number of data points. This is a barrier to DTW being widely implemented in large-scale time series clustering [@Rajabi2020]. In response, `DTW-C++` was written to handle large time series efficiently, directly processing the raw data rather than first extracting features.

In contrast to existing tools available for time series clustering using DTW, such as `DTAIDistance` [@meert2020wannesm] and `TSlearn` [@Tavenard2020], `DTW-C++` offers significant improvements in speed and memory use, enabling larger datasets to be clustered. This is achieved by

In addition, `DTW-C++` offers the option of clustering using a new algorithm (de

The current functionality of the software is:

* Calculate DTW pairwise distances between all pairs of time series in a set, using a vector-based approach to reduce memory use. There is also the option to use a Sakoe-Chiba band to restrict warping in the DTW distance calculation [@Sakoe1978]. This speeds up the computation time, as well as being a useful constraint for some clustering scenarios (e.g., if an event must occur within a certain time window to be considered similar).
* Produce a distance matrix containing all pairwise comparisons between each time series in the dataset.
* Split all time series into a predefined number of clusters, with a representative centroid time series for each cluster. This can be done using integer programming or k-medoids clustering, depending on user choice.
* Output the clustering cost, which is the sum of distances between every time series within each cluster and its cluster centroid.
* Find the silhouette score and elbow score for the clusters to aid the user's decision on how many clusters, $k$, to include. The silhouette score is defined by the difference between the mean intra-cluster distance and the mean nearest-cluster distance, divided by the maximum of these two distances [@ROUSSEEUW198753]; a minimal sketch of this calculation is given after this list. This considers both the similarity of a time series to its own cluster as well as its dissimilarity from other clusters. The elbow score is based on the cost of the clustering exercise, which sums the distance between each time series and its centroid. Therefore the similarity of a time series to its own cluster is considered, but not its dissimilarity from other clusters.
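
To illustrate the silhouette definition above, a minimal sketch follows (in C++, matching the language of the library, but not the `DTW-C++` API; the function name `mean_silhouette` and its arguments are assumptions for this example). It computes the mean silhouette score from a precomputed pairwise distance matrix and cluster labels:

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Sketch only: mean silhouette score over all p time series, given a
// precomputed p-by-p DTW distance matrix D and cluster labels in 0..k-1
// (k >= 2 assumed). For each series, a is the mean distance to its own
// cluster and b the mean distance to the nearest other cluster; the
// silhouette is (b - a) / max(a, b).
double mean_silhouette(const std::vector<std::vector<double>>& D,
                       const std::vector<int>& label, int k) {
    const std::size_t p = D.size();
    double total = 0.0;
    for (std::size_t i = 0; i < p; ++i) {
        std::vector<double> sum(static_cast<std::size_t>(k), 0.0);
        std::vector<int> count(static_cast<std::size_t>(k), 0);
        for (std::size_t j = 0; j < p; ++j) {
            if (j == i) continue;
            sum[label[j]] += D[i][j];
            ++count[label[j]];
        }
        if (count[label[i]] == 0) continue; // singleton cluster: score 0
        const double a = sum[label[i]] / count[label[i]];
        double b = std::numeric_limits<double>::infinity();
        for (int c = 0; c < k; ++c)
            if (c != label[i] && count[c] > 0) b = std::min(b, sum[c] / count[c]);
        const double denom = std::max(a, b);
        total += denom > 0.0 ? (b - a) / denom : 0.0;
    }
    return total / static_cast<double>(p);
}
```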

# Mathematical background

For two time series $x$ and $y$ of lengths $n$ and $m$ respectively,

$$
x=(x_1, x_2, ..., x_n),
$$

$$
y=(y_1, y_2, ..., y_m).
$$

The DTW distance is the sum of the Euclidean distance between each point and its matched point(s) in the other vector, as shown in \autoref{fig:warping_signals}. To find the DTW distance, the following constraints must be met:

1. The first and last elements of each series must be matched.
2. Only unidirectional forward movement through relative time is allowed, i.e., if $x_1$ is mapped to $y_2$ then $x_2$ may not be mapped to $y_1$ (monotonicity).
3. Each point is mapped to at least one other point, i.e., there are no jumps in time (continuity).

![(a) Two time series with DTW pairwise alignment between each point, showing the one-to-many mapping properties of DTW. (b) Cost matrix $C$ for the two time series, showing the warping path and final DTW cost at $C_{13,12}$. \label{fig:warping_signals}](../media/warping_path-imageonline.co-merged.png)

Finding the optimal warping arrangement is an optimisation problem that can be solved using dynamic programming, which splits the problem into easier sub-problems and solves them recursively, storing intermediate solutions until the final solution is reached. To understand the memory-efficient method used in ``DTW-C++``, it is useful to first examine the full cost matrix solution, as follows. For each pairwise comparison, an $n$ by $m$ matrix $C^{n\times m}$ is calculated, where each element represents the cumulative cost between series up to the points $x_i$ and $y_j$:

\begin{equation}
\label{c}
c_{i,j} = (x_i-y_j)^2 + \min
\begin{cases}
c_{i-1,j-1}\\
c_{i-1,j}\\
c_{i,j-1}
\end{cases}
\end{equation}

The final element in the matrix, $c_{n,m}$, is then the total cost, $C_{x,y}$, and this provides the metric for comparing the two series $x$ and $y$. \autoref{fig:warping_signals} shows an example of this cost matrix $C$ and the warping path through it.
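
To make the recurrence concrete, the following minimal sketch (an illustration of \autoref{c} only, not the `DTW-C++` implementation; the function name `dtw_full_matrix` is invented for this example) builds the full cost matrix and returns the total cost:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch only: full-matrix DTW cost between series x (length n) and y
// (length m), following the recurrence above. Entry c[i][j] holds the
// cumulative cost up to points x_i and y_j; the three-way minimum enforces
// the monotonicity and continuity constraints, and the first row/column
// handle the boundary condition that the series' endpoints are matched.
double dtw_full_matrix(const std::vector<double>& x, const std::vector<double>& y) {
    const std::size_t n = x.size(), m = y.size();
    std::vector<std::vector<double>> c(n, std::vector<double>(m, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < m; ++j) {
            const double d = (x[i] - y[j]) * (x[i] - y[j]); // squared distance
            if (i == 0 && j == 0)
                c[i][j] = d;
            else if (i == 0)
                c[i][j] = d + c[i][j - 1];
            else if (j == 0)
                c[i][j] = d + c[i - 1][j];
            else
                c[i][j] = d + std::min({c[i - 1][j - 1], c[i - 1][j], c[i][j - 1]});
        }
    }
    return c[n - 1][m - 1]; // total cost, C_{x,y}
}
```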

## Clustering

For the clustering algorithm, only the final cost for each pairwise comparison is required; the actual warping path (i.e., mapping between time series) is superfluous. The memory complexity of the cost matrix $C$ is $\mathcal{O}(nm)$, so as the length of the time series increases, the memory required increases greatly. Therefore, significant reductions in memory can be made by not storing the entire $C$ matrix. When the warping path is not required, only a vector containing the previous row for the current step of the dynamic programming sub-problem is required (i.e., the previous three values $c_{i-1,j-1}$, $c_{i-1,j}$, $c_{i,j-1}$), as indicated in \autoref{c}.
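
A minimal sketch of this vector-based idea is given below (again illustrative rather than the actual `DTW-C++` code; `dtw_two_rows` is an invented name), keeping only the previous and current rows of $C$:

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Sketch only: DTW cost using two row vectors instead of the full n-by-m
// matrix, for when the warping path itself is not needed. Memory use drops
// from O(nm) to O(m), since each entry depends only on the previous three
// values c_{i-1,j-1}, c_{i-1,j} and c_{i,j-1}.
double dtw_two_rows(const std::vector<double>& x, const std::vector<double>& y) {
    const std::size_t n = x.size(), m = y.size();
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<double> prev(m, inf), curr(m, inf);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < m; ++j) {
            const double d = (x[i] - y[j]) * (x[i] - y[j]);
            if (i == 0 && j == 0)
                curr[j] = d;
            else if (i == 0)
                curr[j] = d + curr[j - 1];
            else if (j == 0)
                curr[j] = d + prev[j];
            else
                curr[j] = d + std::min({prev[j - 1], prev[j], curr[j - 1]});
        }
        std::swap(prev, curr); // current row becomes previous row
    }
    return prev[m - 1]; // total cost after the final swap
}
```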

The DTW distance $C_{x,y}$ is found for each pairwise comparison. As shown in \autoref{fig:c_to_d}, pairwise distances are then stored in a separate symmetric matrix, $D^{p\times p}$, where $p$ is the total number of time series in the clustering exercise. In other words, the element $d_{i,j}$ gives the distance between time series $i$ and $j$.
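
The assembly of $D$ could be sketched as follows (reusing the illustrative `dtw_two_rows` above; `distance_matrix` and `series` are likewise assumed names). Because every pairwise comparison is independent, this loop is a natural target for the task-level parallelisation mentioned earlier:

```cpp
#include <cstddef>
#include <vector>

double dtw_two_rows(const std::vector<double>& x, const std::vector<double>& y);

// Sketch only: build the symmetric p-by-p distance matrix D, where d_{i,j}
// is the DTW distance between time series i and j. Only the upper triangle
// is computed; each value is mirrored since D is symmetric.
std::vector<std::vector<double>> distance_matrix(
    const std::vector<std::vector<double>>& series) {
    const std::size_t p = series.size();
    std::vector<std::vector<double>> D(p, std::vector<double>(p, 0.0));
    for (std::size_t i = 0; i < p; ++i)
        for (std::size_t j = i + 1; j < p; ++j)
            D[i][j] = D[j][i] = dtw_two_rows(series[i], series[j]);
    return D;
}
```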

