Add implementation for exponential histogram merging and percentiles #131220

Open
wants to merge 34 commits into
base: main
from

Conversation

JonasKunz

Still a WIP.

  • Have you signed the contributor license agreement?
  • Have you followed the contributor guidelines?
  • If submitting code, have you built your formula locally prior to submission with gradle check?
  • If submitting code, is your pull request against main? Unless there is a good reason otherwise, we prefer pull requests against main and will backport as needed.
  • If submitting code, have you checked that your submission is for an OS and architecture that we support?
  • If you are submitting this code for a class then read our policy for that.

@elasticsearchmachine elasticsearchmachine added v9.2.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Jul 14, 2025
@JonasKunz JonasKunz force-pushed the exponentional-histos branch from 7711093 to a46914a Compare July 15, 2025 12:40

This offers significant benefits for distributions with fewer distinct values:
If we have at least as many buckets as we have distinct values to store in the histogram, we can almost exactly represent this distribution.
This can be achieved by simply maintaining the scale at the maximum supported value (so the buckets become the smallest).
Contributor

These two lines are tricky to grasp. Is there some external reference explaining the differences between dense and sparse storage, and how the number of buckets and scale are associated?

Author

I don't think there are any sources explaining this clearly. The DDSketch and UDDSketch papers describe sparse representations, but practically all implementations are dense, I assume because of the O(1) insertion time.

I added a concrete example in 5e1ca08, hope this makes it clearer?
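
For readers of this thread, here is a minimal sketch of how scale and bucket count relate in a sparse layout. It is not code from this PR: the class name, the helper names, and the scale values 0 and 20 are hypothetical, and the bucket boundaries are assumed to follow the usual exponential histogram convention, where bucket `i` covers `(base^i, base^(i+1)]` with `base = 2^(2^-scale)`. Only populated buckets are stored, so a handful of distinct values needs a handful of buckets even at a very fine scale.

```java
import java.util.TreeMap;

final class SparseBucketsSketch {

    // Index of the bucket (base^i, base^(i+1)] that contains a positive value,
    // assuming base = 2^(2^-scale). A higher scale means narrower buckets.
    static long bucketIndex(double value, int scale) {
        double log2 = Math.log(value) / Math.log(2);
        // Math.scalb(x, n) computes x * 2^n.
        return (long) Math.ceil(Math.scalb(log2, scale)) - 1;
    }

    // Upper boundary base^(index + 1) of the given bucket.
    static double upperBound(long index, int scale) {
        return Math.pow(2, Math.scalb((double) (index + 1), -scale));
    }

    // Sparse storage: only buckets that actually received a value get an entry.
    static TreeMap<Long, Long> sparseBuckets(double[] values, int scale) {
        TreeMap<Long, Long> buckets = new TreeMap<>();
        for (double v : values) {
            buckets.merge(bucketIndex(v, scale), 1L, Long::sum);
        }
        return buckets;
    }

    public static void main(String[] args) {
        double[] values = { 100.0, 101.0, 102.0, 100.0 };
        // Coarse scale: all values collapse into one wide bucket.
        System.out.println("scale 0:  " + sparseBuckets(values, 0).size() + " bucket(s)");
        // Fine scale: one bucket per distinct value, each very narrow.
        System.out.println("scale 20: " + sparseBuckets(values, 20).size() + " bucket(s)");
    }
}
```

With three distinct values, the fine scale still needs only three buckets, and each bucket boundary lies within a relative distance of `base - 1` of the stored value. A dense array covering the same index range at that scale would need far more slots, which is the trade-off the comment above refers to.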

long negCount = getTotalCount(histo.negativeBuckets());
long posCount = getTotalCount(histo.positiveBuckets());

long totalCount = zeroCount + negCount + posCount;
Contributor

It's likely that we'll get many quantile calls for the same histogram, so we might as well store this number in the histogram.

Author

Yeah this was part of the TODOs, but should be simple enough to do it now.

If we expect many calls for the same histogram, it could make sense to build an array representing the prefix-sum of the counts. This allows for percentile computation in O(log n) instead of O(n) at the cost of a little memory. But this is definitely something for later if we see the need for it.
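
A minimal sketch of the prefix-sum idea, not code from this PR: the class `PrefixSumQuantileSketch` and its fields are hypothetical, each bucket is reduced to a single representative value, and the quantile uses a simple nearest-rank rule with no interpolation, which may differ from what the PR's quantile code does. The cumulative counts are built once in O(n); each lookup is then a binary search, and the cached total count falls out of the last array element.

```java
final class PrefixSumQuantileSketch {

    private final double[] bucketValues;   // one representative value per bucket, ascending
    private final long[] cumulativeCounts; // cumulativeCounts[i] = counts[0] + ... + counts[i]

    PrefixSumQuantileSketch(double[] bucketValues, long[] counts) {
        if (bucketValues.length != counts.length) {
            throw new IllegalArgumentException("bucketValues and counts must have the same length");
        }
        this.bucketValues = bucketValues.clone();
        this.cumulativeCounts = new long[counts.length];
        long running = 0;
        for (int i = 0; i < counts.length; i++) {
            running += counts[i];
            cumulativeCounts[i] = running;
        }
    }

    // The total count is the last prefix sum, so it is never recomputed.
    long totalCount() {
        return cumulativeCounts.length == 0 ? 0 : cumulativeCounts[cumulativeCounts.length - 1];
    }

    // Nearest-rank quantile for q in [0, 1], found in O(log n).
    double quantile(double q) {
        long total = totalCount();
        if (total == 0) {
            return Double.NaN;
        }
        // 1-based rank of the element we are looking for.
        long rank = Math.max(1, (long) Math.ceil(q * total));
        // Binary search for the first bucket whose cumulative count reaches the rank.
        int lo = 0;
        int hi = cumulativeCounts.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (cumulativeCounts[mid] >= rank) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return bucketValues[lo];
    }

    public static void main(String[] args) {
        PrefixSumQuantileSketch sketch = new PrefixSumQuantileSketch(
            new double[] { 1.0, 2.0, 4.0, 8.0 },
            new long[] { 5, 0, 3, 2 }
        );
        System.out.println("total=" + sketch.totalCount() + " median=" + sketch.quantile(0.5));
    }
}
```

The memory cost is one `long` per populated bucket, and the array has to be rebuilt after every merge or insertion, so it only pays off when a histogram is queried many times between changes.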

@JonasKunz JonasKunz force-pushed the exponentional-histos branch from ba259bd to b308838 Compare July 18, 2025 07:01
@JonasKunz
Author

Applied the changes from the first reviews, plus a bit of cleanup and refactoring. I'm also hopeful that the tests do not introduce flakes, as I ran the randomized ones locally continuously for quite a while.

@JonasKunz JonasKunz marked this pull request as ready for review July 18, 2025 13:20
@JonasKunz JonasKunz requested a review from a team as a code owner July 18, 2025 13:20
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Jul 18, 2025
@felixbarny felixbarny added >non-issue :StorageEngine/Mapping The storage related side of mappings labels Jul 18, 2025
@elasticsearchmachine elasticsearchmachine added Team:StorageEngine and removed needs:triage Requires assignment of a team area label labels Jul 18, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

Labels
external-contributor Pull request authored by a developer outside the Elasticsearch team >non-issue :StorageEngine/Mapping The storage related side of mappings Team:StorageEngine v9.2.0
4 participants