Distance Timings

Our publication uses numba-based distances from the aeon package in our clustering implementations and experiments. In this notebook we compare the performance of the aeon DTW implementation against the implementations in the dtw, tslearn and sktime packages.

We use the following versions of each package for the default output:

  • aeon-0.3.0

  • dtw-1.4.0

  • tslearn-0.5.3.2

  • sktime-0.20.0
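To check which versions are actually installed before re-running the notebook, a small helper using the standard library's `importlib.metadata` can be used (this helper is our own addition, not part of the notebook's experiment code):

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of *package*, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for package in ["aeon", "dtw", "tslearn", "sktime"]:
    print(package, installed_version(package))
```

If any reported version differs from the list above, the timings below may not reproduce exactly.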

[1]:
import time
import warnings

import matplotlib.pyplot as plt
import pandas as pd
from aeon.distances import dtw_distance as aeon_dtw
from aeon.distances.tests._utils import create_test_distance_numpy
from dtw import dtw as dtw_python_dtw
from sktime.dists_kernels import DtwDist as sktime_dtw
from tslearn.metrics import dtw as tslearn_dtw

warnings.filterwarnings("ignore")  # Hide warnings
[2]:
def timing_experiment(x, y, distance_callable, average=10, params=None):
    """Return the mean wall-clock time of distance_callable(x, y, **params) over `average` runs."""
    if params is None:
        params = {}
    total_time = 0.0
    for _ in range(average):
        start = time.time()
        distance_callable(x, y, **params)
        total_time += time.time() - start

    return total_time / average


# Dummy runs so that numba JIT compilation happens here and is not counted in the timings
x = create_test_distance_numpy(1, 10, 10, random_state=0)[0]
timing_experiment(x, x, aeon_dtw, average=1)
timing_experiment(
    x[0], x[0], dtw_python_dtw, params={"dist": lambda x, y: (x - y) ** 2}, average=1
)
timing_experiment(x, x, tslearn_dtw, average=1)
timing_experiment(x, x, sktime_dtw(), average=1)
x
[2]:
array([[ 0.88202617,  0.2000786 ,  0.48936899,  1.1204466 ,  0.933779  ,
        -0.48863894,  0.47504421, -0.0756786 , -0.05160943,  0.20529925],
       [ 0.07202179,  0.72713675,  0.38051886,  0.06083751,  0.22193162,
         0.16683716,  0.74703954, -0.10257913,  0.15653385, -0.42704787],
       [-1.27649491,  0.3268093 ,  0.4322181 , -0.37108251,  1.13487731,
        -0.72718284,  0.02287926, -0.09359193,  0.76638961,  0.73467938],
       [ 0.07747371,  0.18908126, -0.44389287, -0.99039823, -0.17395607,
         0.07817448,  0.61514534,  0.60118992, -0.19366341, -0.15115138],
       [-0.52427648, -0.71000897, -0.8531351 ,  0.9753877 , -0.25482609,
        -0.21903715, -0.62639768,  0.38874518, -0.80694892, -0.10637014],
       [-0.44773328,  0.19345125, -0.25540257, -0.59031609, -0.01409111,
         0.21416594,  0.03325861,  0.15123595, -0.31716105, -0.18137058],
       [-0.33623022, -0.17977658, -0.40657314, -0.8631413 ,  0.08871307,
        -0.20089047, -0.81509917,  0.23139113, -0.45364918,  0.0259727 ],
       [ 0.36454528,  0.06449146,  0.56970034, -0.61741291,  0.20117082,
        -0.34240505, -0.43539857, -0.28942483, -0.15577627,  0.02808267],
       [-0.58257492,  0.45041324,  0.23283122, -0.76812184,  0.7441261 ,
         0.94794459,  0.58938979, -0.08996242, -0.53537631,  0.52722586],
       [-0.20158847,  0.61122254,  0.10413749,  0.48831952,  0.1781832 ,
         0.35328658,  0.00525001,  0.89293525,  0.06345605,  0.20099468]])
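The helper above averages raw `time.time()` deltas, which has limited clock resolution (visible as the `0.0` entries in the outputs further down). As an aside, a lower-noise alternative is `timeit.repeat`; the sketch below uses a hypothetical pure-Python DTW stand-in rather than the library callables, purely to keep the example self-contained:

```python
import timeit

import numpy as np


def dtw_naive(x, y):
    """Illustrative pure-Python DTW with a squared-difference local cost."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]


rng = np.random.default_rng(0)
a, b = rng.standard_normal(50), rng.standard_normal(50)

# repeat=5 independent trials of number=3 calls each; taking the minimum
# is less sensitive to background load than a single wall-clock average
per_call = min(timeit.repeat(lambda: dtw_naive(a, b), number=3, repeat=5)) / 3
print(f"{per_call:.6f}s per call")
```

We keep the simple `time.time()` helper in the experiments below so the recorded outputs match what was originally run.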

Univariate Distance Timings

In this experiment we compare performance on univariate series, increasing the series length by 50 at each step and averaging each timing over 10 runs.

[3]:
aeon_timing = []
dtw_python_timing = []
tslearn_timing = []
sktime_timing = []
lengths = []

for i in range(50, 550, 50):
    lengths.append(i)
    distance_m_d = create_test_distance_numpy(2, 1, i, random_state=0)
    x = distance_m_d[0][0]
    y = distance_m_d[1][0]

    aeon_timing.append(timing_experiment(x, y, aeon_dtw))
    dtw_python_timing.append(
        timing_experiment(
            x, y, dtw_python_dtw, params={"dist": lambda x, y: (x - y) ** 2}
        )
    )
    tslearn_timing.append(timing_experiment(x, y, tslearn_dtw))
    sktime_timing.append(
        timing_experiment(x.reshape((1, 1, i)), y.reshape((1, 1, i)), sktime_dtw())
    )
[4]:
print(aeon_timing)
print(tslearn_timing)
print(dtw_python_timing)
print(sktime_timing)
[0.0015954971313476562, 0.0, 0.00019965171813964843, 0.0001994609832763672, 0.00019311904907226562, 0.0003989219665527344, 0.0005983829498291016, 0.0007977962493896484, 0.0009973526000976562, 0.00139617919921875]
[0.0, 0.0, 0.0, 0.0001995563507080078, 0.00019979476928710938, 0.0005982398986816406, 0.001196908950805664, 0.0009979248046875, 0.0014006137847900391, 0.001595783233642578]
[0.004192161560058594, 0.01595149040222168, 0.03331141471862793, 0.05925397872924805, 0.09237089157104492, 0.1434168338775635, 0.18961143493652344, 0.24489173889160157, 0.37693352699279786, 0.38398103713989257]
[0.03690385818481445, 0.03172025680541992, 0.03410296440124512, 0.0402984619140625, 0.043180322647094725, 0.051262712478637694, 0.05744051933288574, 0.10990238189697266, 0.14532618522644042, 0.11210570335388184]
[5]:
plt.plot(aeon_timing, label="aeon")
plt.plot(tslearn_timing, label="tslearn")
plt.plot(dtw_python_timing, label="dtw")
plt.plot(sktime_timing, label="sktime")
plt.legend()
plt.xticks(list(range(len(lengths))), lengths)
plt.show()
[Figure: DTW computation time against univariate series length for each package.]

Multivariate Distance Timings

In this experiment we compare performance on multivariate series, increasing both the series length and the number of channels by 50 at each step and averaging each timing over 10 runs.

The dtw package does not support multivariate series, so we exclude it.
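As a quick reference for the cell below, the three remaining libraries expect different array layouts for a single multivariate series (conventions inferred from the calls in this notebook):

```python
import numpy as np

d, m = 3, 100  # channels, timepoints
x = np.random.default_rng(0).standard_normal((d, m))

x_aeon = x                       # aeon: (channels, timepoints)
x_tslearn = x.transpose()        # tslearn: (timepoints, channels)
x_sktime = x.reshape((1, d, m))  # sktime: a panel holding one series
```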

[6]:
aeon_timing = []
tslearn_timing = []
sktime_timing = []
lengths = []

for i in range(50, 550, 50):
    lengths.append(i)
    distance_m_d = create_test_distance_numpy(2, i, i, random_state=0)
    x = distance_m_d[0]
    y = distance_m_d[1]

    aeon_timing.append(timing_experiment(x, y, aeon_dtw))
    # tslearn expects the shape to be (m, d) instead of (d, m)
    tslearn_timing.append(timing_experiment(x.transpose(), y.transpose(), tslearn_dtw))
    sktime_timing.append(
        timing_experiment(x.reshape((1, i, i)), y.reshape((1, i, i)), sktime_dtw())
    )
[7]:
print(aeon_timing)
print(tslearn_timing)
print(sktime_timing)
[0.0001918315887451172, 0.000798177719116211, 0.00299220085144043, 0.007026290893554688, 0.014661788940429688, 0.0276275634765625, 0.04847054481506348, 0.0844810962677002, 0.09893527030944824, 0.16029653549194336]
[0.08747973442077636, 0.0005991458892822266, 0.0024773597717285155, 0.00638275146484375, 0.012765932083129882, 0.02264060974121094, 0.05485291481018066, 0.0596405029296875, 0.07699403762817383, 0.12229204177856445]
[0.00932307243347168, 0.010172033309936523, 0.013563966751098633, 0.02044649124145508, 0.03320589065551758, 0.05465383529663086, 0.08779778480529785, 0.21661005020141602, 0.27059736251831057, 0.333675479888916]
[8]:
plt.plot(aeon_timing, label="aeon")
plt.plot(tslearn_timing, label="tslearn")
plt.plot(sktime_timing, label="sktime")
plt.legend()
plt.xticks(list(range(len(lengths))), lengths)
plt.show()
[Figure: DTW computation time against multivariate series size for each package.]

Larger Multivariate Experiment

We keep the series compared above small to keep the notebook runtime low. In the following pre-computed experiment we compare performance for larger series lengths and dimensionalities, averaged over 30 runs per step.

[9]:
# start 1000, end 10000, step 500, 30 average runs
csv = pd.read_csv("results/distance_notebook_timings.csv", index_col=0)
print(csv)
             1000      1500      2000      2500      3000      3500      4000  \
aeon     0.005453  0.012845  0.022739  0.037193  0.051308  0.071451  0.092952
tslearn  0.012304  0.015375  0.028867  0.047718  0.063504  0.086363  0.110766
sktime   0.138374  0.182864  0.222571  0.289339  0.363408  0.457018  0.537899

             4500      5000      5500      6000      6500      7000      7500  \
aeon     0.116585  0.142927  0.178241  0.207480  0.243924  0.287327  0.324288
tslearn  0.148099  0.170989  0.210569  0.249226  0.293594  0.336350  0.396900
sktime   0.639533  0.794833  0.921391  1.073838  1.230945  1.400597  1.703857

             8000      8500      9000      9500     10000
aeon     0.437878  0.404185  0.452040  0.501346  0.561733
tslearn  0.503390  0.466680  0.522787  0.585782  0.649676
sktime   1.772326  1.913860  2.126502  2.408306  2.594621
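The table above was pre-computed. A sketch of the loop shape that could regenerate such a results frame, using a stand-in Euclidean distance in place of the real DTW callables so the example runs quickly (an assumption for illustration, not the exact script used):

```python
import time

import numpy as np
import pandas as pd


def time_distance(x, y, distance_callable, average=30):
    """Mean wall-clock time of distance_callable(x, y) over `average` runs."""
    total = 0.0
    for _ in range(average):
        start = time.time()
        distance_callable(x, y)
        total += time.time() - start
    return total / average


def euclidean(x, y):  # stand-in for aeon_dtw / tslearn_dtw / sktime_dtw
    return float(np.sqrt(((x - y) ** 2).sum()))


lengths = range(1000, 10001, 500)  # start 1000, end 10000, step 500
results = {"euclidean": []}
rng = np.random.default_rng(0)
for n in lengths:
    x, y = rng.standard_normal((50, n)), rng.standard_normal((50, n))
    results["euclidean"].append(time_distance(x, y, euclidean, average=2))

frame = pd.DataFrame(results, index=[str(n) for n in lengths]).T
# frame.to_csv("results/distance_notebook_timings.csv")
```

Each package's callable would add one row to the frame, giving the package-by-length layout read back in the cell above.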
[10]:
plt.plot(csv.iloc[0], label="aeon")
plt.plot(csv.iloc[1], label="tslearn")
plt.plot(csv.iloc[2], label="sktime")
plt.legend()
lengths = list(csv.columns.values)
plt.xticks(list(range(0, len(lengths), 2)), lengths[::2])
plt.show()
[Figure: DTW computation time against series length for the pre-computed larger experiment.]

Generated using nbsphinx.