ENH: DBSCAN Performance Optimizations #1318

ajacksonbq · 2025-06-02T03:08:34Z

This is a Summary from Claude Sonnet 4 (With Extended Thinking ON) That covers where each source was used and how it came up with the DBSCAN algorithm.

Looking back at my research, here are the specific sources for the algorithmic components I used:

Direct Sources from Research

Grid-based Spatial Indexing:

Primary inspiration from weilun-chiu/dbscan repository: "The grid-based approach is optimized by using a 2D grid to index data points and speeding up the neighbor search process"
The hash-based cell coordinate system was adapted from their approach

Queue-based Cluster Expansion:

Inspired by the YevgeniyEngineer/DBSCAN implementation which mentioned efficient sequential processing
Standard DBSCAN algorithmic approach from the original 1996 paper by Ester et al.

Performance Optimization Techniques:

Code Review Stack Exchange discussion highlighted the O(n²) problem: "The main problem with the algorithm you implemented is that RangeQuery() does a linear scan through the list of all points"
Wikipedia noted that "Different implementations of the same algorithm were found to exhibit enormous performance differences" due to "implementation quality, language and compiler differences, and the use of indexes for acceleration"

Synthesized/Adapted Components

What I adapted or created:

Hash function for grid cells - Standard spatial hashing technique, not from a specific source
Integration with the existing SimplNX framework - My adaptation to maintain compatibility
Memory layout optimizations - General C++ best practices
Progress reporting improvements - My enhancement to the original reporting system

Components NOT from External Sources

Original algorithmic contributions:

The specific way I integrated grid indexing with the existing SimplNX data structures
The bounds calculation and grid cell sizing logic
The neighbor search optimization using recursive lambda functions
The cluster ID assignment strategy (0 for noise, 1+ for clusters)

What I Should Have Been More Clear About

I synthesized techniques from multiple sources rather than implementing a single "most efficient" algorithm from one specific paper or repository. The grid-based approach was the most concrete algorithmic technique I borrowed, while other optimizations were standard performance improvements applied to the DBSCAN algorithm structure.

The most honest answer is that I created a hybrid implementation combining the grid-based spatial indexing concept from the research with standard algorithmic optimizations and adaptations to fit the existing codebase structure.

Here are the extracted links used, numbered:

imikejackson · 2025-06-06T15:22:48Z

@ajacksonbq @JDuffeyBQ @nyoungbq

https://link.springer.com/article/10.1007/s11227-022-04634-w

is this the same as https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

https://arxiv.org/html/2411.11421v2#:~:text=Once%20core%20points%20are%20identified%2C%20DBSCAN%20proceeds%20to,by%20including%20reachable%20core%20points%20and%20their%20neighbors.

ajacksonbq · 2025-06-06T22:07:17Z

DBSCAN_Test_Pipeline.zip

Key Optimizations Made Original Code Issues: O(n²) neighbor finding using nested loops Memory-intensive precaching of all neighborhoods using std::list Inefficient random point selection with miss counting No spatial indexing Optimized Implementation: Grid-based Spatial Indexing: Reduces neighbor search from O(n) to O(1) average case Efficient Data Structures: Uses std::vector instead of std::list for better cache performance Queue-based Cluster Expansion: More efficient than the original recursive approach Eliminated Precaching: Removes memory overhead while maintaining performance through spatial indexing Streamlined Algorithm: Removed inefficient random initialization and miss counting Performance Improvements: Neighbor Finding: From O(n²) to approximately O(n log n) with spatial indexing Memory Usage: Significantly reduced by eliminating neighborhood precaching Cache Performance: Better data locality with vectors and spatial coherency Algorithm Efficiency: Direct sequential processing instead of random sampling Maintained Compatibility: Same external interface and function signatures Compatible with existing mask and distance metric systems Preserves all progress reporting and cancellation functionality Maintains the same clustering results with better performance The optimized implementation removes SIMD-specific code while achieving significant performance improvements through algorithmic and data structure optimizations. The grid-based spatial indexing provides excellent performance for both low and high-dimensional data without external dependencies.

nyoungbq · 2025-06-10T16:21:32Z

@ajacksonbq @imikejackson @JDuffeyBQ

I was investigating the citations from the links in the bottom of the PR (I assume from Claude) and 9 and 10 don't even lead to real pages and citation 5 is about flattening XML in python (which is clearly unrelated).

Also, not a single one of the citations mentions "Cluster Expansion" (I regexed all of the content in the links for both upper and lower case) which Claude explicitly specified it was doing in the commit message. In fact, the only mention on the internet is in this paper: https://arxiv.org/html/2411.11421v2#:~:text=Once%20core%20points%20are%20identified%2C%20DBSCAN%20proceeds%20to,by%20including%20reachable%20core%20points%20and%20their%20neighbors.
Which as far as I can tell was not officially published yet and was certainly not mentioned in the citations.

Notably a lot of the linked repos have opensource licenses, but haven't been updated in years, so there is no way they would have implemented the paper in question.

Furthermore, the geeks for geeks article is explaining the parameters for the sklearn (python) deep learning implementation of this code. I have used the exact page before in school and it has no mention of the optimizations made here.

None of the citations are correct. We don't know where the code is coming from, what license it may have, and whether it is actually doing what it should (I know it passes the unit test, but no one here is an expert on what the output should be and its possible the test is flawed.) We need further testing, preferably on well vetted datasets with known outputs (such as the default ones distributed with sci-kit learn) AND a lot of reading into the papers that mention changes being made here (even if the LLM doesn't cite it). It's going to be a lot of work to chase down the sources and understand the mathematical theory behind the changes.

Until those conditions are met and we have a good understanding of what's going on here my vote is NO for merge. If we want to put it in SimplnxReview that's fine. I am willing to continue investigating and reviewing the papers when I get a chance.

I want to be clear I know that this algorithm can and needs to be optimized, but I think it's important that we know its output correct.

Here are some vetted datasets with known outputs: https://scikit-learn.org/stable/datasets/toy_dataset.html

ajacksonbq requested a review from imikejackson June 2, 2025 03:08

ajacksonbq marked this pull request as draft June 2, 2025 03:08

ajacksonbq force-pushed the Topic/ClaudeOptimizations branch from 0fb8dd7 to 5cdde87 Compare June 2, 2025 22:41

imikejackson force-pushed the Topic/ClaudeOptimizations branch 3 times, most recently from eca9478 to 497a109 Compare June 5, 2025 20:16

imikejackson changed the title ~~ENH: Claude Generated Optimizations~~ ENH: DBSCAN Performance Optimizations Jun 5, 2025

imikejackson force-pushed the Topic/ClaudeOptimizations branch from 497a109 to 4d51dd9 Compare June 7, 2025 15:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ENH: DBSCAN Performance Optimizations #1318

ENH: DBSCAN Performance Optimizations #1318

ajacksonbq commented Jun 2, 2025 •

edited

Loading

Uh oh!

imikejackson commented Jun 6, 2025 •

edited

Loading

Uh oh!

ajacksonbq commented Jun 6, 2025

Uh oh!

nyoungbq commented Jun 10, 2025 •

edited

Loading

Uh oh!

Uh oh!

ENH: DBSCAN Performance Optimizations #1318

Are you sure you want to change the base?

ENH: DBSCAN Performance Optimizations #1318

Conversation

ajacksonbq commented Jun 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Direct Sources from Research

Synthesized/Adapted Components

Components NOT from External Sources

What I Should Have Been More Clear About

Uh oh!

imikejackson commented Jun 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ajacksonbq commented Jun 6, 2025

Uh oh!

nyoungbq commented Jun 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

ajacksonbq commented Jun 2, 2025 •

edited

Loading

imikejackson commented Jun 6, 2025 •

edited

Loading

nyoungbq commented Jun 10, 2025 •

edited

Loading