This tutorial is about sktime, a unified framework for machine learning with time series. sktime is a widely used, scikit-learn compatible library that features various time series algorithms and modular tools for learning with time series. sktime is easily extensible by anyone, and interoperable with the pydata/numfocus stack.
This tutorial explains how to use sktime for three learning tasks with independent instances of time series: time series classification, regression, and clustering. It also explains their close connection to time series distances, kernels, and time series alignment, and how to flexibly combine such estimators into classifiers, regressors, and clusterers with custom distances/kernels or feature extraction steps.
Also recommended:
🎥 general sktime intro tutorial from PyData Global 2021
📺 youtube video of sktime intro at PyData Global 2021
In the tutorial, we will move through notebooks section by section.
There are several ways to run the tutorial notebooks:
- Run the notebooks in the cloud on Binder - for this you don't have to install anything!
- Run the notebooks on your machine. Clone this repository, get conda, install the required packages (sktime, seaborn, jupyter, dtw-python) in an environment, and open the notebooks with that environment. For detailed instructions, see below. For troubleshooting, see sktime's more detailed installation instructions.
- Or, use python venv, and/or an editable install of this repo as a package. Instructions below.
Please let us know on the sktime discord if you have any issues during the conference, or join to ask for help anytime.
In time series or sequence analysis, data often takes the form of multiple, independent observations of the same process, where one independent observation is an entire indexed series. Examples are sequential observations of a patient’s lab values, for many patients; or, spectrograms of independent samples (where the sequence is indexed by wavelength, not time).
As with tabular data, common learning tasks associated with such data are:
- time series or sequence classification, where an algorithm is trained, on examples, to assign one of a set of labels to each time series or sequence. For instance, sick or healthy, for a patient; the type of substance, for a spectrogram
- time series or sequence regression, where an algorithm is trained to assign a real number to each time series or sequence. For instance, length of stay for a patient; the alcohol content as a percentage, for a spectrogram
- time series or sequence clustering, where sequences are assigned a cluster, or are arranged according to their similarity with each other
Unlike with tabular data, sequence and shape information is crucial in the time series (or sequence) setting, and algorithms tend to be heavily based on feature extraction steps, distances or kernel functions that are specific to time series and sequences, and/or registration or alignment of the series.
Of course, feature extraction followed by application of a “standard” tabular classifier, regressor, or clusterer constitutes an important kind of “simpler” baseline for the above estimation tasks.
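The feature extraction baseline can be sketched in a few lines of plain Python. This is only an illustration of the idea, not sktime's implementation: the helper names (`extract_features`, `fit_centroids`, `predict`), the choice of summary features, and the toy data are all assumptions made for this example, and a nearest-centroid rule stands in for the "standard" tabular classifier.

```python
from statistics import mean, pstdev


def extract_features(series):
    """Map one series to a fixed-length vector: (mean, std, net change)."""
    return (mean(series), pstdev(series), series[-1] - series[0])


def fit_centroids(X, y):
    """Average the feature vectors of each class (nearest-centroid 'training')."""
    by_label = {}
    for series, label in zip(X, y):
        by_label.setdefault(label, []).append(extract_features(series))
    return {
        label: tuple(mean(column) for column in zip(*feats))
        for label, feats in by_label.items()
    }


def predict(centroids, series):
    """Assign the label whose centroid is closest in feature space."""
    feats = extract_features(series)

    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(feats, centroid))

    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Toy data (hypothetical): flat vs. rising series, of unequal length.
X_train = [
    [1.0, 1.1, 0.9, 1.0],
    [0.8, 1.0, 1.0, 0.9, 1.0],
    [0.0, 1.0, 2.0, 3.0],
    [0.5, 1.5, 2.5, 3.5, 4.5],
]
y_train = ["flat", "flat", "rising", "rising"]

centroids = fit_centroids(X_train, y_train)
pred_flat = predict(centroids, [1.0, 0.9, 1.1, 1.0, 1.0])
pred_rising = predict(centroids, [0.2, 1.1, 2.3, 3.0])
print(pred_flat, pred_rising)
```

Because every series is reduced to a vector of the same length, any tabular method can be applied afterwards, including to series of unequal length. The price is that order and shape information survives only to the extent that the chosen features capture it.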
sktime provides unified interfaces, and ample state-of-the-art functionality to flexibly construct estimators of the above kind, specifically:
- native implementations and direct interfaces to state-of-the-art references in time series classification, regression, and clustering, as “atomic” algorithms; e.g., k-means clusterer
- sequential pipeline functionality to combine these with time series transformers, also natively available under the unified sktime transformer interfaces; e.g., feature extractors, feature union
- performant implementations of time series distances and time series kernels, exposed under the unified interface abstraction of “pairwise transformers”; e.g., time warping distance
- adaptors and compositors to use classical tabular transformers, distances, kernels in a time series setting; e.g., using the sklearn Gaussian RBF kernel on a flattened time series
- composition and combination estimators for time series distances and kernels, e.g., “n-th differenced distance” from ordinary and pairwise transformers; or, “mean sample kernels”
- interfaces and adaptors for sequence alignment as a separate learning task, which can be used to obtain distances, kernels, and estimators
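To make the notion of a time warping distance and of a "pairwise" distance matrix concrete, here is a minimal pure-Python sketch. It is not sktime's implementation or interface: `dtw_distance` and `pairwise` are hypothetical helper names, the pointwise cost is assumed to be the squared difference, and no warping window is applied.

```python
def dtw_distance(x, y):
    """Dynamic time warping distance between two numeric sequences.

    Assumes squared pointwise cost and no window constraint.
    """
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            # extend the cheapest of the three admissible alignment steps
            cost[i][j] = d + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def pairwise(X, X2=None, dist=dtw_distance):
    """Distance matrix between two collections of series (X2 defaults to X)."""
    X2 = X if X2 is None else X2
    return [[dist(a, b) for b in X2] for a in X]


a = [0.0, 1.0, 2.0, 1.0, 0.0]
b = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]  # same shape as a, delayed by one step
c = [2.0, 2.0, 2.0, 2.0, 2.0]       # constant series

D = pairwise([a, b, c])
print(D[0][1], D[0][2])
```

The delayed copy `b` ends up at distance zero from `a`, because warping absorbs the shift, while the constant series `c` stays far away. A distance matrix of this kind is what distance-based classifiers, regressors, and clusterers consume.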
The framework in sktime is highly customizable and composable; for instance, users can construct a custom time warping distance from an alignment algorithm, combine that with n-th differencing as pre-processing, and use this modified time warping distance in the k-nearest-neighbors algorithm for time series, all parameters and choices exposed and tunable via, say, grid search.
sktime is also highly extensible, and provides guided extension templates for any of the involved types of objects (distances, kernels, aligners) or estimators (time series classification, regression, clustering), which allow users to implement custom components, ready for use with the above-mentioned composition patterns that sktime provides.
If you're interested in contributing to sktime, you can find out more about how to get involved here.
Any contributions are welcome, not just code!
To run the notebooks locally, you will need:
- a local repository clone
- a python environment with required packages installed
To clone the repository locally:
git clone https://github.com/sktime/sktime-tutorial-pydata-london-2023.git
Using conda:
- Create a conda environment:
conda create -y -n pydata_sktime python=3.9
- Install required packages:
conda install -y -n pydata_sktime -c conda-forge pip sktime seaborn jupyter dtw-python
- Activate your environment:
conda activate pydata_sktime
- If using jupyter: make the environment available in jupyter:
python -m ipykernel install --user --name=pydata_sktime
Using python venv:
- Create a python virtual environment:
python -m venv .venv
- Activate your environment:
source .venv/bin/activate
- Install the requirements:
pip install sktime seaborn jupyter dtw-python
- If using jupyter: make the environment available in jupyter:
python -m ipykernel install --user --name=pydata_sktime