related: Matrix Profiles for time series https://www.cs.ucr.edu/~eamonn/MatrixProfile.html

See STUMPY for a handy library to get this working quickly (written in Python): https://github.com/TDAmeritrade/stumpy

Hi all, I am the creator of STUMPY and wanted to thank you for your interest. Please feel free to post questions on our GitHub issues and we'll try to assist where we can.

I am still a little confused about the real-world application of MatrixProfile. It looks really good, but once an MP is made, then what?

Can this be automated to say, for example: "Based on your window, here are all the anomalies"?
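One way to picture "then what": each matrix profile value is the distance from a window to its nearest neighbour elsewhere in the series, so the window with the largest value (the "discord") is a natural anomaly candidate, and the smallest values point at repeated patterns. Below is a minimal brute-force sketch of that idea in plain Python. It is not STUMPY's implementation or API; the function names, the z-normalized Euclidean distance, and the exclusion zone of `m // 2` are assumptions made for illustration.

```python
import math


def znorm(window):
    """Z-normalize a window; a constant window maps to all zeros."""
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / n)
    if std == 0:
        return [0.0] * n
    return [(x - mean) / std for x in window]


def naive_matrix_profile(series, m):
    """Brute-force profile: for every length-m window, the distance to its
    nearest neighbour outside a small exclusion zone (O(n^2 * m) time)."""
    windows = [znorm(series[i:i + m]) for i in range(len(series) - m + 1)]
    excl = max(1, m // 2)  # assumed zone that skips trivial self-matches
    profile = []
    for i, a in enumerate(windows):
        best = math.inf  # stays inf if the series is too short to have a neighbour
        for j, b in enumerate(windows):
            if abs(i - j) <= excl:
                continue
            best = min(best, math.dist(a, b))
        profile.append(best)
    return profile


def top_discord(series, m):
    """Start index of the window whose nearest neighbour is farthest away."""
    profile = naive_matrix_profile(series, m)
    return max(range(len(profile)), key=profile.__getitem__)


# Hypothetical data: a clean repeating pattern with one corrupted point.
series = [0, 1, 0, -1] * 10
series[21] = 5
print("most anomalous window starts at index", top_discord(series, 4))
```

So the automation asked about amounts to sorting the profile and reporting the top few windows; the window length still has to be chosen by the user.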

Don’t forget: "Clustering of Time Series Subsequences is Meaningless": https://www.cs.ucr.edu/~eamonn/meaningless.pdf

But this is not about clustering. It's about figuring out to what extent a certain subclass of features, namely the 'shapelets', are statistically significantly associated with a pre-defined binary outcome.

The paper you mentioned is interesting, though, because it shows an issue that many algorithms are prone to: if the number of samples/features gets too large, at some point, you are only comparing _means_.

(We are working on a paper to show the issues of this when it comes to time series classification.)

Where to store time series data for further analysis? It is possible to use Prometheus for this - see https://medium.com/@valyala/analyzing-prometheus-data-with-e...

The math in their description of the data is in error: they need to state that the T_i (T with a subscript i), for i = 0, 1, 2, ..., n, are distinct.

More standard would be a function d: {0, 1, ..., n} --> R^{1 x m} x {0, 1}.

Seems to be standard terminology for time series classification to me, to be honest. I think the approach would also work if there are duplicates in the data. Although the estimate would be overly optimistic, right?

With their notation, they have not specified that the T's are unique. So, a first fix would be just to state that the T's are distinct. And it would help to be explicit that i = 0, 1, 2, ... corresponds to increasing time. Moreover, is the data equally spaced in time? Likely, yes, and in that case, clearly say so.

No, i indexes the patient, not time. (T_0, y_0) is one patient's entire time series.
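To make the notation concrete, here is a small sketch of the dataset as a list of (T_i, y_i) pairs, where i is the patient index, each T_i is a whole series of length m, and y_i is the binary outcome. It also shows how one might check the distinctness assumption discussed above. The name `duplicate_series` and the toy values are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Sequence, Tuple

# One sample per patient: (T_i, y_i) = (series of length m, binary label).
Sample = Tuple[Sequence[float], int]


def duplicate_series(dataset: List[Sample]) -> List[List[int]]:
    """Return groups of patient indices whose series T_i are identical."""
    seen: Dict[Tuple[float, ...], List[int]] = {}
    for i, (series, _label) in enumerate(dataset):
        seen.setdefault(tuple(series), []).append(i)
    return [indices for indices in seen.values() if len(indices) > 1]


# Toy data: patients 0 and 2 share the same series but have different labels.
dataset = [
    ((1.0, 2.0, 3.0), 0),
    ((4.0, 5.0, 6.0), 1),
    ((1.0, 2.0, 3.0), 1),
]
print(duplicate_series(dataset))
```

If the check returns any groups, the T_i are not distinct, which is the case where a performance estimate could come out overly optimistic.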

This sure reads and looks like technical analysis indicators for time series data.

It's useful, though. Example: the 5-day MA of disk errors rises above the 15-day MA == likely failure.
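A rough sketch of that kind of crossover rule, assuming one disk-error count per day and trailing simple moving averages. The function names and the default 5/15 windows are illustrative only, and a crossover is a heuristic signal, not a validated failure predictor.

```python
def moving_average(values, window):
    """Trailing simple moving average; None until `window` points exist."""
    out = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the point that left the window
        out.append(running / window if i >= window - 1 else None)
    return out


def crossover_days(values, short_window=5, long_window=15):
    """Indices where the short MA moves from at-or-below to above the long MA."""
    short_ma = moving_average(values, short_window)
    long_ma = moving_average(values, long_window)
    days = []
    for i in range(1, len(values)):
        if None in (short_ma[i - 1], long_ma[i - 1], short_ma[i], long_ma[i]):
            continue  # not enough history yet for both averages
        if short_ma[i - 1] <= long_ma[i - 1] and short_ma[i] > long_ma[i]:
            days.append(i)
    return days


# Hypothetical daily error counts: quiet for 20 days, then a sustained jump.
errors = [1] * 20 + [10] * 5
print(crossover_days(errors))
```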