tsml-eval Data Format

tsml-eval primarily uses numpy arrays as its datatype when running experiments. The exact array format depends on the dataset characteristics (i.e. whether the series are of equal or unequal length) and on the learning task.

Classification, clustering and regression use collections of time series. Forecasting uses a single time series.


Time Series Collections

There are two types of collection datatypes used in tsml-eval:

- A 3D numpy array of shape (n_samples, n_channels, n_timestamps) for equal length time series.
- A list of 2D numpy arrays of shape (n_channels, n_timestamps) for unequal length time series.

Both are designed to accommodate multivariate time series, where n_channels is the number of variables in the time series. For univariate time series, n_channels is 1.
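As a minimal sketch of the two formats, the following builds each one by hand from random data. The variable names and sizes are illustrative assumptions and are not part of tsml-eval.

```python
import numpy as np

rng = np.random.default_rng(0)

# Equal length: one 3D array of shape (n_samples, n_channels, n_timestamps).
X_equal = rng.normal(size=(10, 3, 50))

# Univariate data keeps the channel axis, with n_channels = 1.
X_univariate = rng.normal(size=(10, 1, 50))

# Unequal length: a list of 2D arrays of shape (n_channels, n_timestamps),
# where n_timestamps may differ from case to case.
lengths = [40, 50, 65]
X_unequal = [rng.normal(size=(3, n)) for n in lengths]

# A 2D array of shape (n_samples, n_timestamps) can be given a channel axis
# with plain numpy indexing to obtain the 3D layout.
X_2d = rng.normal(size=(10, 50))
X_3d = X_2d[:, np.newaxis, :]

print(X_equal.shape, X_univariate.shape, X_3d.shape)
print([x.shape for x in X_unequal])
```

Every case in the list shares the same number of channels; only the number of time points varies.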

Below is an example of the equal length format, using a univariate dataset bundled with tsml.

[1]:
from tsml.datasets import load_minimal_chinatown

X, y = load_minimal_chinatown()

print("Shape:", X.shape)
print("Type:", type(X))
print(X[:5])
Shape: (40, 1, 24)
Type: <class 'numpy.ndarray'>
[[[ 573.  375.  301.  212.   55.   34.   25.   33.  113.  143.  303.
    615. 1226. 1281. 1221. 1081.  866. 1096. 1039.  975.  746.  581.
    409.  182.]]

 [[ 394.  264.  140.  144.  104.   28.   28.   25.   70.  153.  401.
    649. 1216. 1399. 1249. 1240. 1109. 1137. 1290. 1137.  791.  638.
    597.  316.]]

 [[ 603.  348.  176.  177.   47.   30.   40.   42.  101.  180.  401.
    777. 1344. 1573. 1408. 1243. 1141. 1178. 1256. 1114.  814.  635.
    304.  168.]]

 [[ 428.  309.  199.  117.   82.   43.   24.   64.  152.  183.  408.
    797. 1288. 1491. 1523. 1460. 1365. 1520. 1700. 1797. 1596. 1139.
    910.  640.]]

 [[ 372.  310.  203.  133.   65.   39.   27.   36.  107.  139.  329.
    651.  990. 1027. 1041.  971. 1104.  844. 1023. 1019.  862.  643.
    591.  452.]]]

The labels for each time series are stored in a 1D numpy array.

[2]:
print("Shape:", y.shape)
print("Type:", type(y))
print(y)
Shape: (40,)
Type: <class 'numpy.ndarray'>
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
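Because the labels are a separate array, it can be worth checking that they line up with the collection. The sketch below uses stand-in arrays with the same shapes and label pattern as the output above rather than the loaded data. The names `X_demo` and `y_demo` are assumptions made for illustration.

```python
import numpy as np

# Stand-ins with the same shapes as the example above (not the real data).
X_demo = np.zeros((40, 1, 24))
y_demo = np.tile(np.repeat([1.0, 2.0], 10), 2)

# One label per case: the first axis of X must match the length of y.
assert y_demo.ndim == 1
assert X_demo.shape[0] == y_demo.shape[0]

# Count how many cases belong to each class.
classes, counts = np.unique(y_demo, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))
```

The same check works for the list format by comparing `len(X)` with `len(y)`.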

A single numpy array requires every series to have the same number of time points, so it cannot hold unequal length time series. A list of 2D numpy arrays is used instead, with one array per case.

[3]:
from tsml.datasets import load_minimal_japanese_vowels

X, _ = load_minimal_japanese_vowels()

print("Size:", len(X))
print("Type:", type(X))
print("Case 1 shape:", X[0].shape)
print("Case 2 shape:", X[1].shape)
print("Case 1 type:", type(X[0]))
Size: 40
Type: <class 'list'>
Case 1 shape: (12, 20)
Case 2 shape: (12, 26)
Case 1 type: <class 'numpy.ndarray'>
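The helpers below are a hedged sketch of how one might inspect a collection and, if an algorithm needs a 3D array, pad the shorter series. Padding is only one possible workaround and is not something this page prescribes. The function names `series_lengths`, `is_equal_length` and `pad_to_3d` are hypothetical and are not part of tsml or tsml-eval.

```python
import numpy as np


def series_lengths(X):
    """Return the number of time points of each case in a collection."""
    if isinstance(X, np.ndarray) and X.ndim == 3:
        return [X.shape[2]] * X.shape[0]
    return [x.shape[1] for x in X]


def is_equal_length(X):
    """True if every case has the same number of time points."""
    return len(set(series_lengths(X))) <= 1


def pad_to_3d(X, fill_value=np.nan):
    """Pad a list of (n_channels, n_timestamps) arrays at the end of each series."""
    n_channels = X[0].shape[0]
    max_len = max(series_lengths(X))
    out = np.full((len(X), n_channels, max_len), fill_value, dtype=float)
    for i, x in enumerate(X):
        out[i, :, : x.shape[1]] = x
    return out


# Stand-in cases with the same shapes as the first two cases shown above.
X_list = [np.zeros((12, 20)), np.zeros((12, 26))]

print(series_lengths(X_list))
print(is_equal_length(X_list))

X_padded = pad_to_3d(X_list)
print(X_padded.shape)
```

Filling with NaN keeps the padded positions distinguishable from real values, but many estimators do not accept NaN, so the fill value is a choice to make per use case.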

Collection datatypes can be loaded from files in the aeon .ts format using the tsml loader below.

[4]:
from tsml.datasets import load_from_ts_file

X, y = load_from_ts_file(
    "../tsml_eval/datasets/MinimalChinatown/MinimalChinatown_TRAIN.ts"
)
X.shape
[4]:
(20, 1, 24)

Single Time Series

Functionality for single series tasks in tsml-eval is currently limited. With the current functions, a 1D numpy array is the best datatype to use. The example below reads a series from a CSV file with pandas and converts it to this format.

[5]:
import pandas as pd

X = pd.read_csv(
    "../tsml_eval/datasets/ShampooSales/ShampooSales_TRAIN.csv",
    index_col=0,
).squeeze("columns")
X = X.astype(float).to_numpy()

print("Shape:", X.shape)
print("Type:", type(X))
print(X)
Shape: (24,)
Type: <class 'numpy.ndarray'>
[266.  145.9 183.1 119.3 180.3 168.5 231.8 224.5 192.8 122.9 336.5 185.9
 194.3 149.5 210.1 273.3 191.4 287.  226.  303.6 289.9 421.6 264.5 342.3]
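When a series arrives in another layout, such as a Python list or a single-column 2D array, it can be coerced to the 1D float format with plain numpy. The helper `as_1d_series` below is a hypothetical illustration and is not a tsml-eval function.

```python
import numpy as np


def as_1d_series(values):
    """Convert array-like input to a 1D float numpy array, or raise ValueError."""
    arr = np.asarray(values, dtype=float)
    # Drop axes of size one, e.g. (n, 1) or (1, n) becomes (n,).
    arr = np.atleast_1d(np.squeeze(arr))
    if arr.ndim != 1:
        raise ValueError(f"Expected a single series, got shape {arr.shape}")
    return arr


# A single-column layout holding the first three values printed above.
column = [[266.0], [145.9], [183.1]]
series = as_1d_series(column)

print(series.shape)
print(series)
```

Input with more than one non-trivial axis, such as a collection, is rejected rather than silently flattened.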
