[BUG] Inconsistent SBD distance with tslearn and other implementations #2674
Comments
…kit#2674)
* Updated sbd_distance() to handle multivariate data consistently with tslearn and other implementations.
* Added _multivariate_sbd_distance(), which finds the correlations for each of the channels and then normalizes using the norm of the multivariate series.
We now have two PRs trying to fix the SBD distance:
Both PRs currently fail the CI and have not been shown to be correct (w.r.t. the original implementation). #2715 seems to be closer to the formula proposed in the paper. I would suggest the following:
Moved from #2661 (comment). SBD (the official implementation and tslearn) works with unequal-length time series but requires them to have an equal number of channels. But in the unequal-length test:
# ================== Test unequal length ==================
if dist["name"] in UNEQUAL_LENGTH_SUPPORT_DISTANCES:
    # prev code here
    # Test multivariate unequal length of shape (n_channels, n_timepoints)
    _validate_distance_result(
        make_example_2d_numpy_series(5, 5, random_state=1),
        make_example_2d_numpy_series(10, 10, random_state=2),
        dist["name"],
        dist["distance"],
        dist["symmetric"],
        _expected_distance_results[dist["name"]][3],
    )
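For context on why unequal lengths are not a problem for SBD itself: the full cross-correlation is defined for inputs of different lengths, and each series contributes only its own norm to the normalization. A minimal pure-NumPy sketch (the function name `sbd_unequal` is mine, not aeon's):

```python
import numpy as np

def sbd_unequal(x, y):
    # Shape-based distance for two 1-D series of possibly different
    # lengths: np.correlate with mode="full" evaluates the
    # cross-correlation at every shift, and the normalization uses
    # the norm of each series separately.
    cc = np.correlate(x, y, mode="full")
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return 1.0 - cc.max() / denom

rng = np.random.default_rng(1)
d = sbd_unequal(rng.random(5), rng.random(10))  # lengths 5 and 10
```

By Cauchy-Schwarz the normalized cross-correlation lies in [-1, 1], so the distance lies in [0, 2], and the distance of a series to itself is 0.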
For now I have set the … Below is the equivalence between the sbd and tslearn implementations. Also tested the equivalence for multivariate and pairwise_distances.
The CI is now passing for #2742 and the correctness is tested in #2674 (comment) and #2674 (comment). I will benchmark my implementation. Also, since the KShape PR #2676 depends on this, I went ahead with this PR, as #2715 hasn't been active since last week. I am open to collaborating on this PR; please let me know your thoughts, @pvprajwal.
Equivalence between the original implementation, tslearn's implementation, and the fixed implementation as per #2742:
I had a question about the warmups for the benchmark in https://github.com/SebastianSchmidl/aeon/blob/bench/numba-distances/numba-benchmark/benchmark.py. Here, for warming up the 3D numpy array case, you have iterated over various values of i. Just asking out of curiosity, to better understand Numba and how to write better benchmarks.
# warmup for Numba JIT
print("Warming up Numba JIT...")
ts1 = rng.random(100)
ts2 = rng.random(100)
for i in [1, 2, 5, 10]:
    for func in distance_funcs:
        func(ts1.reshape(i, 1, -1), ts2.reshape(i, 1, -1))
        func(ts1.reshape(i, 2, -1), ts2.reshape(i, 2, -1))
Sure, you could just reuse the same input array multiple times for the warmup. But it is always a good idea to have your warmup input as close as possible to the benchmark input, and I was already iterating over i.
Why do you execute the function twice?
Oh, that's not supposed to be there. Here are the updated benchmarks:
Ah ok, makes sense.
Describe the bug
While implementing k-Shape clustering (#2661), I noticed that our implementation of
sbd_distance
differs slightly from how tslearn handles it. This doesn't matter for univariate series, but for multivariate series we do see differences.
sbd_distance: finds the distance for each channel independently and then takes its average.
normalized_cc: finds the correlations for each of the channels, sums the max of each channel, and then normalizes using the norm of the entire multivariate series.
I found another implementation of k-Shape online; one of its contributors is the original author of the k-Shape paper (https://dl.acm.org/doi/pdf/10.1145/2723372.2737793). They handle this the same way as tslearn: https://github.com/TheDatumOrg/kshape-python. I am not sure if this is an intentional choice, but it changes the final clustering output.
If we use aeon's SBD distance for k-Shape clustering, we will have to change the test cases to reflect that, but then we wouldn't have anything to compare against.
Steps/Code to reproduce the bug
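A minimal pure-NumPy sketch of the two behaviours described above (function names are mine; this is a sketch of the described logic, not aeon's or tslearn's actual code):

```python
import numpy as np

def sbd_channel_averaged(x, y):
    # Per-channel SBD, then average over channels -- the behaviour
    # described for aeon's original sbd_distance.
    dists = []
    for xc, yc in zip(x, y):
        cc = np.correlate(xc, yc, mode="full")
        denom = np.linalg.norm(xc) * np.linalg.norm(yc)
        dists.append(1.0 - cc.max() / denom)
    return float(np.mean(dists))

def sbd_jointly_normalized(x, y):
    # Sum the per-channel cross-correlations at every shift, take the
    # max, and normalize by the norms of the whole multivariate
    # series -- the behaviour described for tslearn's normalized_cc
    # and kshape-python.
    cc = sum(np.correlate(xc, yc, mode="full") for xc, yc in zip(x, y))
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(1.0 - cc.max() / denom)

rng = np.random.default_rng(0)
x_uni, y_uni = rng.random((1, 30)), rng.random((1, 30))
x_multi, y_multi = rng.random((3, 30)), rng.random((3, 30))

# Univariate: the two definitions coincide.
# Multivariate: they generally disagree.
print(sbd_channel_averaged(x_uni, y_uni) - sbd_jointly_normalized(x_uni, y_uni))
print(sbd_channel_averaged(x_multi, y_multi) - sbd_jointly_normalized(x_multi, y_multi))
```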
Expected results
The expected result is that the SBD distance computed for both univariate and multivariate time series is consistent with tslearn and https://github.com/TheDatumOrg/kshape-python.
Actual results
The first example is a multivariate time series, which is inconsistent with tslearn and https://github.com/TheDatumOrg/kshape-python.
Versions
System:
python: 3.12.8 | packaged by conda-forge | (main, Dec 5 2024, 14:06:27) [MSC v.1942 64 bit (AMD64)]
executable: D:\Open Source\aeon\aeon-venv\Scripts\python.exe
machine: Windows-11-10.0.22631-SP0
Python dependencies:
aeon: 1.0.0
pip: 24.3.1
setuptools: 75.8.0
scikit-learn: 1.5.2
numpy: 1.26.4
numba: 0.60.0
scipy: 1.14.1
pandas: 2.2.3