[BUG] Inconsistent Sbd distance with tslearn and other implementations #2674

Open
tanishy7777 opened this issue Mar 22, 2025 · 10 comments · May be fixed by #2715 or #2742
Labels: bug (Something isn't working), clustering (Clustering package), distances (Distances package)

Comments

@tanishy7777 (Contributor)

Describe the bug

While implementing k-Shape clustering (#2661), I noticed that our implementation of sbd_distance differs slightly from tslearn's.

This doesn't matter for univariate series, but for multivariate series we do see differences.

- sbd_distance: computes the SBD for each channel independently and then takes the average over channels.
- normalized_cc: computes the cross-correlation for each channel, sums the correlations across channels, and normalizes by the norm of the entire multivariate series; the maximum over shifts is then taken.
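The two conventions can be sketched in NumPy. This is a simplified illustration of the behaviours described above, not the actual aeon or tslearn source; the function names are mine:

```python
import numpy as np

def sbd_channel_average(x, y):
    # aeon-style (current behaviour as described above): SBD per channel,
    # then the average over channels; x, y shaped (n_channels, n_timepoints)
    dists = []
    for xc, yc in zip(x, y):
        cc_max = np.correlate(xc, yc, mode="full").max()
        dists.append(1.0 - cc_max / (np.linalg.norm(xc) * np.linalg.norm(yc)))
    return float(np.mean(dists))

def sbd_joint_norm(x, y):
    # tslearn / k-Shape-style: sum the per-channel cross-correlations,
    # take the max over shifts, and normalize by the norm of the whole
    # multivariate series
    cc = np.sum(
        [np.correlate(xc, yc, mode="full") for xc, yc in zip(x, y)], axis=0
    )
    return float(1.0 - cc.max() / (np.linalg.norm(x) * np.linalg.norm(y)))
```

For a single channel the two definitions coincide, which is why only multivariate series are affected.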

I found another implementation of k-Shape online whose contributors include the original author of the k-Shape paper (https://dl.acm.org/doi/pdf/10.1145/2723372.2737793). It handles multivariate series the same way as tslearn: https://github.com/TheDatumOrg/kshape-python

I am not sure if this is an intentional choice, but it changes the final clustering output.
If we use aeon's sbd_distance for k-Shape clustering, we will have to change the test cases to match, but then we would have nothing to compare against.

Steps/Code to reproduce the bug

import numpy as np
from aeon.distances import sbd_distance
from tslearn.metrics.cycc import normalized_cc

# Multivariate time series
np.random.seed(1)
x = np.random.rand(6, 100)
y = np.random.rand(6, 100)
a = sbd_distance(x, y, standardize=False)
print(x.shape, y.shape)

# tslearn's normalized_cc expects shape (n_timepoints, n_channels)
z = np.transpose(x, (1, 0))
w = np.transpose(y, (1, 0))
b = 1.0 - normalized_cc(z, w).max()
print(a, b)


# Univariate time series
np.random.seed(1)
x = np.random.rand(1, 200)
y = np.random.rand(1, 200)
print(x.shape, y.shape)

a = sbd_distance(x, y, standardize=False)

x = np.transpose(x, (1, 0))
y = np.transpose(y, (1, 0))

b = 1.0 - normalized_cc(x, y).max()
print(a, b)
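For reference, what tslearn's normalized_cc computes can be sketched with NumPy's FFT. This is a simplified sketch (the function name is mine, not tslearn's source); inputs are shaped (n_timepoints, n_channels) as in the script above:

```python
import numpy as np

def normalized_cc_sketch(s1, s2):
    # cross-correlate each channel via FFT, sum over channels, and
    # normalize by the norms of the whole multivariate series
    sz = s1.shape[0]
    n = 2 ** int(np.ceil(np.log2(2 * sz - 1)))  # pad to avoid circular wrap
    f1 = np.fft.rfft(s1, n, axis=0)
    f2 = np.fft.rfft(s2, n, axis=0)
    cc = np.fft.irfft(f1 * np.conj(f2), n, axis=0)
    cc = np.concatenate((cc[-(sz - 1):], cc[:sz]))  # shifts -(sz-1)..(sz-1)
    denom = np.linalg.norm(s1) * np.linalg.norm(s2)
    return cc.sum(axis=-1) / denom
```

The multivariate SBD under this convention is then `1.0 - normalized_cc_sketch(z, w).max()`, mirroring the call in the script above.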

Expected results

The SBD computed for both univariate and multivariate time series should be consistent with tslearn and https://github.com/TheDatumOrg/kshape-python.

Actual results

The first (multivariate) example is inconsistent with tslearn and https://github.com/TheDatumOrg/kshape-python.
(screenshot of the printed distances omitted)

Versions

System:
python: 3.12.8 | packaged by conda-forge | (main, Dec 5 2024, 14:06:27) [MSC v.1942 64 bit (AMD64)]
executable: D:\Open Source\aeon\aeon-venv\Scripts\python.exe
machine: Windows-11-10.0.22631-SP0
Python dependencies:
aeon: 1.0.0
pip: 24.3.1
setuptools: 75.8.0
scikit-learn: 1.5.2
numpy: 1.26.4
numba: 0.60.0
scipy: 1.14.1
pandas: 2.2.3

@tanishy7777 tanishy7777 added the bug Something isn't working label Mar 22, 2025
@tanishy7777 tanishy7777 changed the title [BUG] Inconsistent Sbd distance with tslearn and other implementations Inconsistent Sbd distance with tslearn and other implementations Mar 23, 2025
@MatthewMiddlehurst MatthewMiddlehurst changed the title Inconsistent Sbd distance with tslearn and other implementations [BUG] Inconsistent Sbd distance with tslearn and other implementations Mar 23, 2025
@MatthewMiddlehurst MatthewMiddlehurst added clustering Clustering package distances Distances package labels Mar 23, 2025
pvprajwal added a commit to pvprajwal/aeon that referenced this issue Mar 31, 2025
…kit#2674)

* Updated sbd_distance() to handle multivariate data consistently with tslearn and other implementations

* added _multivariate_sbd_distance() which finds the correlations for each of the channels and then normalizes using the norm of the multivariate series.
a61927f
@tanishy7777 tanishy7777 linked a pull request Apr 13, 2025 that will close this issue
@SebastianSchmidl (Member)

We now have two PRs trying to fix the SBD distance:

* [[BUG] Updated sbd_distance() to handle multivariate series (#2674) #2715](https://github.com/aeon-toolkit/aeon/pull/2715) by [@pvprajwal](https://github.com/pvprajwal)
* [[BUG] Fixes sbd distances for multivariate case #2742](https://github.com/aeon-toolkit/aeon/pull/2742) by [@tanishy7777](https://github.com/tanishy7777)

Both PRs currently fail the CI and do not prove that they are correct (w.r.t. the original implementation). #2715 seems to be closer to the formula proposed in the paper. I would suggest the following:

1. Please talk to each other.
2. Fix your solutions and check their correctness by comparing the results to the original implementation, the current aeon implementation (univariate), and to the respective other solution.
3. Benchmark your implementations. I would then merge the faster/better-scaling solution. (benchmark inspiration: [[ENH] SBD and MSM: Support pairwise distance calculation for unequal length time series #1287 (comment)](https://github.com/aeon-toolkit/aeon/pull/1287#issuecomment-2018039516))

@tanishy7777 (Contributor, issue author)

tanishy7777 commented Apr 13, 2025

Moved from #2661 (comment)

SBD (both the official implementation and tslearn) works with unequal-length time series but requires them to have the same number of channels.

(screenshot omitted)

But in aeon's test_distances.py, under UNEQUAL_LENGTH_SUPPORT for the multivariate case, we check both unequal lengths and unequal numbers of channels. Shouldn't we keep the number of channels the same, and, if we do want to check that the distance works with different numbers of channels, test that separately?

# ================== Test unequal length ==================
    if dist["name"] in UNEQUAL_LENGTH_SUPPORT_DISTANCES:
        # ... previous checks here ...

        # Test multivariate unequal length of shape (n_channels, n_timepoints)
        _validate_distance_result(
            make_example_2d_numpy_series(5, 5, random_state=1),
            make_example_2d_numpy_series(10, 10, random_state=2),
            dist["name"],
            dist["distance"],
            dist["symmetric"],
            _expected_distance_results[dist["name"]][3],
        )
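The suggested split (equal channel count, varying length) can be sketched in plain NumPy under the tslearn-style joint normalization. `sbd_unequal` is a hypothetical helper for illustration, not aeon's API:

```python
import numpy as np

def sbd_unequal(x, y):
    # hypothetical tslearn-style multivariate SBD: tolerates unequal
    # lengths (np.correlate "full" handles them) but requires the same
    # number of channels; x, y shaped (n_channels, n_timepoints)
    if x.shape[0] != y.shape[0]:
        raise ValueError("series must have the same number of channels")
    cc = np.sum(
        [np.correlate(xc, yc, mode="full") for xc, yc in zip(x, y)], axis=0
    )
    return float(1.0 - cc.max() / (np.linalg.norm(x) * np.linalg.norm(y)))

rng = np.random.default_rng(1)
x = rng.random((3, 5))   # 3 channels, 5 timepoints
y = rng.random((3, 10))  # same channels, different length: fine
d = sbd_unequal(x, y)
```

Mismatched channel counts would then be rejected explicitly rather than silently averaged, which is the separate test case suggested above.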

@tanishy7777 (Contributor, issue author)

For now I have set unequal_support=False in aeon/distances/_distance.py to make test_distances pass, and opened PR #2742.

Below is the equivalence between aeon's sbd_distance and tslearn:
(screenshot omitted)

I also tested the equivalence for the multivariate and pairwise-distance cases:
(screenshot omitted)

@tanishy7777 (Contributor, issue author)

tanishy7777 commented Apr 13, 2025

> We now got two PRs trying to fix the SBD distance: [#2715](https://github.com/aeon-toolkit/aeon/pull/2715) by @pvprajwal and [#2742](https://github.com/aeon-toolkit/aeon/pull/2742) by @tanishy7777. Both PRs currently fail the CI and do not prove that they are correct (w.r.t. the original implementation). [...]

The CI is now passing for #2742, and correctness is tested in #2674 (comment) and #2674 (comment). I will benchmark my implementation.

Also, since the k-Shape PR #2676 depends on this, I went ahead with the PR, as #2715 hasn't been active since last week.

I am open to collaborating on this PR; please let me know your thoughts, @pvprajwal.

@tanishy7777 (Contributor, issue author)

tanishy7777 commented Apr 13, 2025

Equivalence between the original implementation, tslearn's implementation, and the fixed implementation (#2742):
(screenshot omitted)

@tanishy7777 (Contributor, issue author)

tanishy7777 commented Apr 13, 2025

> [...] 3. Benchmark your implementations. I would then merge the faster/better-scaling solution. (benchmark inspiration: [#1287 (comment)](https://github.com/aeon-toolkit/aeon/pull/1287#issuecomment-2018039516))

I had a question about the warmups for the benchmark in https://github.com/SebastianSchmidl/aeon/blob/bench/numba-distances/numba-benchmark/benchmark.py

Here, when warming up for the 3D numpy array case, you iterate over several values of i. Is this needed? As far as I understand from the numba docs, compiled functions are cached by function signature, and in this case that will be func(ndarray[float64, 3d, C], ndarray[float64, 3d, C]). Since, for the same number of channels, the flow of execution is the same (it calls the same univariate function dispatcher, specific to PR #1287), could we instead have just called the function a few times on one univariate dataset, e.g. shape (20, 1, 50) called 5 times?

Just asking out of curiosity, to better understand numba and how to write better benchmarks.

# warmup for Numba JIT
print("Warming up Numba JIT...")
ts1 = rng.random(100)
ts2 = rng.random(100)
for i in [1, 2, 5, 10]:
    for func in distance_funcs:
        func(ts1.reshape(i, 1, -1), ts2.reshape(i, 1, -1))
        func(ts1.reshape(i, 2, -1), ts2.reshape(i, 2, -1))

@SebastianSchmidl (Member)

Sure, you could just reuse the same input array multiple times for the warmup. But it is always a good idea to have your warmup input as close as possible to the benchmark input, and I was already iterating over i, so I just used it.
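The warmup-then-measure pattern being discussed can be sketched generically. This is a minimal stdlib harness of my own, not the benchmark script itself:

```python
import time

def bench(func, args, warmup=3, reps=10):
    # run untimed warmup calls first so one-off costs (e.g. numba JIT
    # compilation on the first call with a new argument signature) are
    # excluded from the measurement
    for _ in range(warmup):
        func(*args)
    # then time steady-state calls and report the best of reps
    times = []
    for _ in range(reps):
        t0 = time.perf_counter()
        func(*args)
        times.append(time.perf_counter() - t0)
    return min(times)
```

As noted above, the warmup arguments should match the measured arguments as closely as possible, since JIT compilation is keyed on the argument types and layouts.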

@SebastianSchmidl (Member)

https://github.com/tanishy7777/aeon/blob/ceb552e5b4c1ecafbe907da9aab5e0dd7fdeed21/benchmarks/benchmark.py#L68-L69

Why do you execute the function twice?
@tanishy7777 (Contributor, issue author)

tanishy7777 commented Apr 15, 2025

> https://github.com/tanishy7777/aeon/blob/ceb552e5b4c1ecafbe907da9aab5e0dd7fdeed21/benchmarks/benchmark.py#L68-L69
>
> Why do you execute the function twice?

Oh, that's not supposed to be there. Here are the updated benchmarks.
Is this acceptable, or should I try to match the original implementation more closely?

(benchmark screenshots omitted)

@tanishy7777 (Contributor, issue author)

> Sure, you could just reuse the same input array multiple times for the warmup. But it is always a good idea to have your warmup input as close as possible to the benchmark input, and I was already iterating over i, so I just used it.

Ah OK, that makes sense.
