[BUG] Inconsistent Sbd distance with tslearn and other implementations #2674

Open
tanishy7777 opened this issue Mar 22, 2025 · 10 comments · May be fixed by #2715 or #2742
Labels: bug (Something isn't working), clustering (Clustering package), distances (Distances package)

Comments

@tanishy7777 (Contributor)

Describe the bug

While implementing k-Shape clustering (#2661), I noticed that our implementation of sbd_distance differs slightly from tslearn's.

This doesn't matter for univariate series, but for multivariate series we do see differences.

- sbd_distance: computes the SBD for each channel independently and then takes the average over channels.
- normalized_cc: computes the cross-correlation for each channel, sums the correlations across channels, and normalizes by the norm of the entire multivariate series; the maximum over shifts is then taken.
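The two conventions can be sketched in NumPy. This is a simplified illustration of the behaviours described above, not the actual aeon or tslearn source; the function names are mine:

```python
import numpy as np

def sbd_channel_average(x, y):
    # aeon-style (current behaviour as described above): SBD per channel,
    # then the average over channels; x, y shaped (n_channels, n_timepoints)
    dists = []
    for xc, yc in zip(x, y):
        cc_max = np.correlate(xc, yc, mode="full").max()
        dists.append(1.0 - cc_max / (np.linalg.norm(xc) * np.linalg.norm(yc)))
    return float(np.mean(dists))

def sbd_joint_norm(x, y):
    # tslearn / k-Shape-style: sum the per-channel cross-correlations,
    # take the max over shifts, and normalize by the norm of the whole
    # multivariate series
    cc = np.sum(
        [np.correlate(xc, yc, mode="full") for xc, yc in zip(x, y)], axis=0
    )
    return float(1.0 - cc.max() / (np.linalg.norm(x) * np.linalg.norm(y)))
```

For a single channel the two definitions coincide, which is why only multivariate series are affected.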

I found another implementation of k-Shape online whose contributors include the original author of the k-Shape paper (https://dl.acm.org/doi/pdf/10.1145/2723372.2737793). It handles multivariate series the same way as tslearn: https://github.com/TheDatumOrg/kshape-python

I am not sure if this is an intentional choice, but it changes the final clustering output.
If we use aeon's sbd_distance for k-Shape clustering, we will have to change the test cases to match, but then we would have nothing to compare against.

Steps/Code to reproduce the bug

import numpy as np
from aeon.distances import sbd_distance
from tslearn.metrics.cycc import normalized_cc

# Multivariate time series
np.random.seed(1)
x = np.random.rand(6, 100)
y = np.random.rand(6, 100)
a = sbd_distance(x, y, standardize=False)
print(x.shape, y.shape)

# tslearn's normalized_cc expects shape (n_timepoints, n_channels)
z = np.transpose(x, (1, 0))
w = np.transpose(y, (1, 0))
b = 1.0 - normalized_cc(z, w).max()
print(a, b)


# Univariate time series
np.random.seed(1)
x = np.random.rand(1, 200)
y = np.random.rand(1, 200)
print(x.shape, y.shape)

a = sbd_distance(x, y, standardize=False)

x = np.transpose(x, (1, 0))
y = np.transpose(y, (1, 0))

b = 1.0 - normalized_cc(x, y).max()
print(a, b)
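For reference, what tslearn's normalized_cc computes can be sketched with NumPy's FFT. This is a simplified sketch (the function name is mine, not tslearn's source); inputs are shaped (n_timepoints, n_channels) as in the script above:

```python
import numpy as np

def normalized_cc_sketch(s1, s2):
    # cross-correlate each channel via FFT, sum over channels, and
    # normalize by the norms of the whole multivariate series
    sz = s1.shape[0]
    n = 2 ** int(np.ceil(np.log2(2 * sz - 1)))  # pad to avoid circular wrap
    f1 = np.fft.rfft(s1, n, axis=0)
    f2 = np.fft.rfft(s2, n, axis=0)
    cc = np.fft.irfft(f1 * np.conj(f2), n, axis=0)
    cc = np.concatenate((cc[-(sz - 1):], cc[:sz]))  # shifts -(sz-1)..(sz-1)
    denom = np.linalg.norm(s1) * np.linalg.norm(s2)
    return cc.sum(axis=-1) / denom
```

The multivariate SBD under this convention is then `1.0 - normalized_cc_sketch(z, w).max()`, mirroring the call in the script above.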

Expected results

The SBD computed for both univariate and multivariate time series should be consistent with tslearn and https://github.com/TheDatumOrg/kshape-python.

Actual results

The first (multivariate) example is inconsistent with tslearn and https://github.com/TheDatumOrg/kshape-python.
(screenshot of the printed distances omitted)

Versions

System:
python: 3.12.8 | packaged by conda-forge | (main, Dec 5 2024, 14:06:27) [MSC v.1942 64 bit (AMD64)]
executable: D:\Open Source\aeon\aeon-venv\Scripts\python.exe
machine: Windows-11-10.0.22631-SP0
Python dependencies:
aeon: 1.0.0
pip: 24.3.1
setuptools: 75.8.0
scikit-learn: 1.5.2
numpy: 1.26.4
numba: 0.60.0
scipy: 1.14.1
pandas: 2.2.3

@tanishy7777 tanishy7777 added the bug Something isn't working label Mar 22, 2025
@tanishy7777 tanishy7777 changed the title [BUG] Inconsistent Sbd distance with tslearn and other implementations Inconsistent Sbd distance with tslearn and other implementations Mar 23, 2025
@MatthewMiddlehurst MatthewMiddlehurst changed the title Inconsistent Sbd distance with tslearn and other implementations [BUG] Inconsistent Sbd distance with tslearn and other implementations Mar 23, 2025
@MatthewMiddlehurst MatthewMiddlehurst added clustering Clustering package distances Distances package labels Mar 23, 2025
pvprajwal added a commit to pvprajwal/aeon that referenced this issue Mar 31, 2025
…kit#2674)

* Updated sbd_distance() to handle multivariate data consistently with tslearn and other implementations

* added _multivariate_sbd_distance() which finds the correlations for each of the channels and then normalizes using the norm of the multivariate series.
a61927f
@tanishy7777 tanishy7777 linked a pull request Apr 13, 2025 that will close this issue
@SebastianSchmidl (Member)

We now have two PRs trying to fix the SBD distance:

* [[BUG] Updated sbd_distance() to handle multivariate series (#2674) #2715](https://github.com/aeon-toolkit/aeon/pull/2715) by [@pvprajwal](https://github.com/pvprajwal)
* [[BUG] Fixes sbd distances for multivariate case #2742](https://github.com/aeon-toolkit/aeon/pull/2742) by [@tanishy7777](https://github.com/tanishy7777)

Both PRs currently fail the CI and do not prove that they are correct (w.r.t. the original implementation). #2715 seems to be closer to the formula proposed in the paper. I would suggest the following:

1. Please talk to each other.
2. Fix your solutions and check their correctness by comparing the results to the original implementation, the current aeon implementation (univariate), and to the respective other solution.
3. Benchmark your implementations. I would then merge the faster/better-scaling solution. (benchmark inspiration: [[ENH] SBD and MSM: Support pairwise distance calculation for unequal length time series #1287 (comment)](https://github.com/aeon-toolkit/aeon/pull/1287#issuecomment-2018039516))

@tanishy7777 (Contributor, issue author)

tanishy7777 commented Apr 13, 2025

Moved from #2661 (comment)

SBD (both the official implementation and tslearn) works with unequal-length time series but requires them to have the same number of channels.

(screenshot omitted)

But in aeon's test_distances.py, under UNEQUAL_LENGTH_SUPPORT for the multivariate case, we check both unequal lengths and unequal numbers of channels. Shouldn't we keep the number of channels the same, and, if we do want to check that the distance works with different numbers of channels, test that separately?

# ================== Test unequal length ==================
    if dist["name"] in UNEQUAL_LENGTH_SUPPORT_DISTANCES:
        # ... previous checks here ...

        # Test multivariate unequal length of shape (n_channels, n_timepoints)
        _validate_distance_result(
            make_example_2d_numpy_series(5, 5, random_state=1),
            make_example_2d_numpy_series(10, 10, random_state=2),
            dist["name"],
            dist["distance"],
            dist["symmetric"],
            _expected_distance_results[dist["name"]][3],
        )
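The suggested split (equal channel count, varying length) can be sketched in plain NumPy under the tslearn-style joint normalization. `sbd_unequal` is a hypothetical helper for illustration, not aeon's API:

```python
import numpy as np

def sbd_unequal(x, y):
    # hypothetical tslearn-style multivariate SBD: tolerates unequal
    # lengths (np.correlate "full" handles them) but requires the same
    # number of channels; x, y shaped (n_channels, n_timepoints)
    if x.shape[0] != y.shape[0]:
        raise ValueError("series must have the same number of channels")
    cc = np.sum(
        [np.correlate(xc, yc, mode="full") for xc, yc in zip(x, y)], axis=0
    )
    return float(1.0 - cc.max() / (np.linalg.norm(x) * np.linalg.norm(y)))

rng = np.random.default_rng(1)
x = rng.random((3, 5))   # 3 channels, 5 timepoints
y = rng.random((3, 10))  # same channels, different length: fine
d = sbd_unequal(x, y)
```

Mismatched channel counts would then be rejected explicitly rather than silently averaged, which is the separate test case suggested above.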

@tanishy7777 (Contributor, issue author)

For now I have set unequal_support=False in aeon/distances/_distance.py to make test_distances pass, and opened PR #2742.

Below is the equivalence between aeon's sbd_distance and tslearn:
(screenshot omitted)

I also tested the equivalence for the multivariate and pairwise-distance cases:
(screenshot omitted)

@tanishy7777 (Contributor, issue author)

tanishy7777 commented Apr 13, 2025

> We now got two PRs trying to fix the SBD distance: [#2715](https://github.com/aeon-toolkit/aeon/pull/2715) by @pvprajwal and [#2742](https://github.com/aeon-toolkit/aeon/pull/2742) by @tanishy7777. Both PRs currently fail the CI and do not prove that they are correct (w.r.t. the original implementation). [...]

The CI is now passing for #2742, and correctness is tested in #2674 (comment) and #2674 (comment). I will benchmark my implementation.

Also, since the k-Shape PR #2676 depends on this, I went ahead with the PR, as #2715 hasn't been active since last week.

I am open to collaborating on this PR; please let me know your thoughts, @pvprajwal.

@tanishy7777 (Contributor, issue author)

tanishy7777 commented Apr 13, 2025

Equivalence between the original implementation, tslearn's implementation, and the fixed implementation (#2742):
(screenshot omitted)

@tanishy7777 (Contributor, issue author)

tanishy7777 commented Apr 13, 2025

> [...] 3. Benchmark your implementations. I would then merge the faster/better-scaling solution. (benchmark inspiration: [#1287 (comment)](https://github.com/aeon-toolkit/aeon/pull/1287#issuecomment-2018039516))

I had a question about the warmups for the benchmark in https://github.com/SebastianSchmidl/aeon/blob/bench/numba-distances/numba-benchmark/benchmark.py

Here, when warming up for the 3D numpy array case, you iterate over several values of i. Is this needed? As far as I understand from the numba docs, compiled functions are cached by function signature, and in this case that will be func(ndarray[float64, 3d, C], ndarray[float64, 3d, C]). Since, for the same number of channels, the flow of execution is the same (it calls the same univariate function dispatcher, specific to PR #1287), could we instead have just called the function a few times on one univariate dataset, e.g. shape (20, 1, 50) called 5 times?

Just asking out of curiosity, to better understand numba and how to write better benchmarks.

# warmup for Numba JIT
print("Warming up Numba JIT...")
ts1 = rng.random(100)
ts2 = rng.random(100)
for i in [1, 2, 5, 10]:
    for func in distance_funcs:
        func(ts1.reshape(i, 1, -1), ts2.reshape(i, 1, -1))
        func(ts1.reshape(i, 2, -1), ts2.reshape(i, 2, -1))

@SebastianSchmidl (Member)

Sure, you could just reuse the same input array multiple times for the warmup. But it is always a good idea to have your warmup input as close as possible to the benchmark input, and I was already iterating over i, so I just used it.
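The warmup-then-measure pattern being discussed can be sketched generically. This is a minimal stdlib harness of my own, not the benchmark script itself:

```python
import time

def bench(func, args, warmup=3, reps=10):
    # run untimed warmup calls first so one-off costs (e.g. numba JIT
    # compilation on the first call with a new argument signature) are
    # excluded from the measurement
    for _ in range(warmup):
        func(*args)
    # then time steady-state calls and report the best of reps
    times = []
    for _ in range(reps):
        t0 = time.perf_counter()
        func(*args)
        times.append(time.perf_counter() - t0)
    return min(times)
```

As noted above, the warmup arguments should match the measured arguments as closely as possible, since JIT compilation is keyed on the argument types and layouts.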

@SebastianSchmidl (Member)

https://github.com/tanishy7777/aeon/blob/ceb552e5b4c1ecafbe907da9aab5e0dd7fdeed21/benchmarks/benchmark.py#L68-L69

Why do you execute the function twice?
@tanishy7777 (Contributor, issue author)

tanishy7777 commented Apr 15, 2025

> https://github.com/tanishy7777/aeon/blob/ceb552e5b4c1ecafbe907da9aab5e0dd7fdeed21/benchmarks/benchmark.py#L68-L69
>
> Why do you execute the function twice?

Oh, that's not supposed to be there. Here are the updated benchmarks.
Is this acceptable, or should I try to match the original implementation more closely?

(benchmark screenshots omitted)

@tanishy7777 (Contributor, issue author)

> Sure, you could just reuse the same input array multiple times for the warmup. But it is always a good idea to have your warmup input as close as possible to the benchmark input, and I was already iterating over i, so I just used it.

Ah OK, that makes sense.
