A Review and Evaluation of Elastic Distance Functions for Time Series Clustering

Webpage and repo package to support the paper “A Review and Evaluation of Elastic Distance Functions for Time Series Clustering” submitted to Springer Knowledge and Information Systems (KAIS).

Our results files are stored here.

Datasets

The 112 UCR archive datasets are available at timeseriesclassification.com.
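
The datasets can also be fetched programmatically; a minimal sketch using aeon's load_classification, which downloads a problem from timeseriesclassification.com by name (the printed shapes are indicative):

# sketch: fetch a UCR archive problem by name using aeon's downloader
from aeon.datasets import load_classification

X, y = load_classification("ItalyPowerDemand")
print(X.shape, y.shape)  # e.g. (1096, 1, 24) (1096,)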

Install

To install the latest version of the package with up-to-date algorithms, run:

pip install tsml-eval

To install the package at the time of publication, run:

pip install tsml-eval==0.1.1

To install dependency versions used at the time of publication, use the publication requirements.txt:

pip install -r tsml_eval/publications/y2023/distance_based_clustering/static_publication_reqs.txt

Usage

Command Line

Run run_distance_experiments.py with the following arguments:

  1. Path to the data directory

  2. Path to the results directory

  3. The name of the model to run (see set_distance_clusterer.py for options, e.g. KMeans-dtw, KMeans-msm, KMedoids-dtw)

  4. The name of the problem to run

  5. The resample number to run (0 is base train/test split)

For example, to run the ItalyPowerDemand problem using k-means with the MSM distance on the base train/test split:

python tsml_eval/publications/y2023/distance_based_clustering/run_distance_experiments.py data/ results/ KMeans-msm ItalyPowerDemand 0
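
The script can also be driven from Python to run several resamples in one go; a minimal sketch using subprocess (the resample range here is illustrative, not from the paper):

# sketch: run the experiment script over multiple resamples
# (the range of resamples is illustrative, not from the paper)
import subprocess

script = "tsml_eval/publications/y2023/distance_based_clustering/run_distance_experiments.py"
for resample in range(5):  # resample 0 is the base train/test split
    subprocess.run(
        ["python", script, "data/", "results/", "KMeans-msm", "ItalyPowerDemand", str(resample)],
        check=True,
    )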

Using Distance-based Clusterers

Our clusterers and distances are available in the aeon Python package.

The clusterers used in our experiments extend the scikit-learn interface and can be used like scikit-learn estimators:

[1]:
import warnings

warnings.filterwarnings("ignore")

from aeon.clustering import TimeSeriesKMeans
from aeon.performance_metrics.clustering import clustering_accuracy_score
from tsml.datasets import load_minimal_chinatown

from tsml_eval.publications.y2023.distance_based_clustering import (
    _set_distance_clusterer,
)

Data can be loaded using whichever method is most convenient, but should be formatted as either a 3D numpy array of shape (n_samples, n_channels, n_timesteps) or a list of length n_samples containing 2D numpy arrays of shape (n_channels, n_timesteps); both formats are sketched below.
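
For example, the two accepted collection formats could be constructed as follows (shapes are illustrative):

# sketch: the two accepted collection formats, with illustrative shapes
import numpy as np

# equal-length collection: 3D array of shape (n_samples, n_channels, n_timesteps)
X_equal = np.random.random((10, 1, 24))

# unequal-length collection: list of 2D arrays of shape (n_channels, n_timesteps)
X_ragged = [np.random.random((1, length)) for length in (20, 24, 30)]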

A function is available for loading from .ts files.

[2]:
# load example classification dataset
X_train, y_train = load_minimal_chinatown("TRAIN")
X_test, y_test = load_minimal_chinatown("TEST")

# data can be loaded from .ts files using the following function
# from tsml.datasets import load_from_ts_file
# X, y = load_from_ts_file("data/data.ts")

print(type(X_train), type(y_train))
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
X_train[:5]
<class 'numpy.ndarray'> <class 'numpy.ndarray'>
(20, 1, 24) (20,)
(20, 1, 24) (20,)
[2]:
array([[[ 573.,  375.,  301.,  212.,   55.,   34.,   25.,   33.,  113.,
          143.,  303.,  615., 1226., 1281., 1221., 1081.,  866., 1096.,
         1039.,  975.,  746.,  581.,  409.,  182.]],

       [[ 394.,  264.,  140.,  144.,  104.,   28.,   28.,   25.,   70.,
          153.,  401.,  649., 1216., 1399., 1249., 1240., 1109., 1137.,
         1290., 1137.,  791.,  638.,  597.,  316.]],

       [[ 603.,  348.,  176.,  177.,   47.,   30.,   40.,   42.,  101.,
          180.,  401.,  777., 1344., 1573., 1408., 1243., 1141., 1178.,
         1256., 1114.,  814.,  635.,  304.,  168.]],

       [[ 428.,  309.,  199.,  117.,   82.,   43.,   24.,   64.,  152.,
          183.,  408.,  797., 1288., 1491., 1523., 1460., 1365., 1520.,
         1700., 1797., 1596., 1139.,  910.,  640.]],

       [[ 372.,  310.,  203.,  133.,   65.,   39.,   27.,   36.,  107.,
          139.,  329.,  651.,  990., 1027., 1041.,  971., 1104.,  844.,
         1023., 1019.,  862.,  643.,  591.,  452.]]])

Clusterers can be built using the fit method and predictions can be made using predict.

[3]:
# build a k-means clusterer using the DTW distance and make predictions
km = TimeSeriesKMeans(distance="dtw", n_clusters=2, random_state=0)
km.fit(X_train)
km.predict(X_test)
[3]:
array([1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0])

The labels_ attribute can be used to obtain the cluster labels assigned to the training data during fit, rather than calling predict on the training data again.

[4]:
km.labels_
[4]:
array([0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0])
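
The fitted model also exposes the learned cluster prototypes and the final inertia; a quick inspection sketch (attribute names follow aeon's TimeSeriesKMeans, assuming the same version as above):

# sketch: inspect the fitted k-means model
print(km.cluster_centers_.shape)  # (n_clusters, n_channels, n_timesteps), here (2, 1, 24)
print(km.inertia_)  # sum of distances from each sample to its closest centre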

Here we run some of the clusterers from the publication and compute their clustering accuracy on our example dataset.

[5]:
clusterers = [
    "KMeans-dtw",
    "KMeans-msm",
    "KMedoids-dtw",
]

cl_acc_train = []
cl_acc_test = []
for clusterer_name in clusterers:
    # Select a clusterer by name, see set_distance_clusterer.py for options
    clusterer = _set_distance_clusterer(clusterer_name, random_state=0)

    # fit and predict
    clusterer.fit(X_train)
    test_cl = clusterer.predict(X_test)

    cl_acc_train.append(clustering_accuracy_score(y_train, clusterer.labels_))
    cl_acc_test.append(clustering_accuracy_score(y_test, test_cl))

print(cl_acc_train)
print(cl_acc_test)
[0.35, 0.55, 0.3]
[0.55, 0.6, 0.5]
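
Clustering accuracy here is the accuracy under the best one-to-one matching of cluster labels to class labels. A minimal re-implementation sketch under that definition, using scipy's Hungarian algorithm (for illustration only, not aeon's own code):

# sketch: clustering accuracy via an optimal one-to-one label matching
# (an illustrative re-implementation, not aeon's own code)
import numpy as np
from scipy.optimize import linear_sum_assignment


def cluster_accuracy(y_true, y_pred):
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    # contingency table: rows are true classes, columns are clusters
    table = np.zeros((len(classes), len(clusters)), dtype=int)
    for i, c in enumerate(classes):
        for j, k in enumerate(clusters):
            table[i, j] = np.sum((y_true == c) & (y_pred == k))
    # the Hungarian algorithm finds the matching that maximises agreement
    rows, cols = linear_sum_assignment(-table)
    return table[rows, cols].sum() / len(y_true)


y = np.array([0, 0, 1, 1])
labels = np.array([1, 1, 0, 0])
print(cluster_accuracy(y, labels))  # 1.0: clusters match classes after relabelling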

Generated using nbsphinx. The Jupyter notebook can be found here.