-
I don't think most people actually store serialized HLLs in a table. cc @electrum in case he's thought about interop in the past.
-
My use case is nearly 1T events aggregated down to a reasonable grain, which the BI team can then further aggregate across many dimensions. I don't believe it's an unusual use case.
-
The only way to support reading HLLs generated by Spark would be to add functions to Trino that can decode them and compute their unions and cardinalities. The two implementations are fundamentally different, and there's no way to make them compatible: even minor aspects, such as the choice of hash function or which bits are used to compute the bucket index, result in data structures that cannot be mixed together to produce a sensible result.
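To make the incompatibility concrete, here's a small illustration (not from the thread; a sketch assuming the `datasketches-java` and `airlift-stats` artifacts are on the classpath). Both sketches below are configured with 4096 buckets and see the same value, but each library applies its own hash function and bucket-bit extraction, and their wire formats are unrelated:

```java
import org.apache.datasketches.hll.HllSketch;
import io.airlift.stats.cardinality.HyperLogLog;

public class HllFormatMismatch {
    public static void main(String[] args) {
        HllSketch sparkSide = new HllSketch(12);               // DataSketches: lgK = 12, i.e. 2^12 = 4096 buckets
        HyperLogLog trinoSide = HyperLogLog.newInstance(4096); // Airlift: bucket count passed directly

        sparkSide.update(42L); // DataSketches' MurmurHash3 variant and seed
        trinoSide.add(42L);    // Airlift's own hashing and bucket extraction

        // Unrelated serialization formats: neither library can read the
        // other's bytes, and no union operation spans the two.
        byte[] dsBytes = sparkSide.toCompactByteArray();
        byte[] airliftBytes = trinoSide.serialize().getBytes();
        System.out.printf("DataSketches: %d bytes, Airlift: %d bytes%n",
                dsBytes.length, airliftBytes.length);
    }
}
```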
-
I'm interested in using Spark to write HLL data structures that are compatible with Trino (and, implicitly, AWS Athena). From what I can tell, Trino uses the HLL data structure from the associated Airlift project, while Spark leverages the Apache DataSketches library for HLL.
I'm hitting a dead end trying to find anyone who has done this. Before I go down the rabbit hole of writing a custom Spark aggregator that uses the Airlift HLL data structure (sketched below), does anyone know of prior art, or whether there are plans in the Trino pipeline to improve this compatibility?
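For concreteness, here is roughly the shape of the aggregator I have in mind: a minimal, untested sketch that assumes the `airlift-stats` artifact is on the Spark classpath and keeps the buffer as serialized bytes. The class name and bucket count are my own choices, and whether the output merges cleanly with sketches from Trino's `approx_set` depends on the hashing matching exactly, which I'd verify with a test query first.

```java
import io.airlift.slice.Slices;
import io.airlift.stats.cardinality.HyperLogLog;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.expressions.Aggregator;

// Hypothetical aggregator that accumulates longs into an Airlift
// HyperLogLog and emits its serialized form as a binary column.
public class AirliftHllAggregator extends Aggregator<Long, byte[], byte[]> {
    // 4096 buckets; assumed to match Trino's default precision. Verify
    // against your Trino configuration before relying on it.
    private static final int NUM_BUCKETS = 4096;

    @Override
    public byte[] zero() {
        return HyperLogLog.newInstance(NUM_BUCKETS).serialize().getBytes();
    }

    @Override
    public byte[] reduce(byte[] buffer, Long value) {
        HyperLogLog hll = HyperLogLog.newInstance(Slices.wrappedBuffer(buffer));
        hll.add(value); // Airlift hashes the value internally
        return hll.serialize().getBytes();
    }

    @Override
    public byte[] merge(byte[] left, byte[] right) {
        HyperLogLog hll = HyperLogLog.newInstance(Slices.wrappedBuffer(left));
        hll.mergeWith(HyperLogLog.newInstance(Slices.wrappedBuffer(right)));
        return hll.serialize().getBytes();
    }

    @Override
    public byte[] finish(byte[] buffer) {
        return buffer; // already in Airlift's serialized format
    }

    @Override
    public Encoder<byte[]> bufferEncoder() {
        return Encoders.BINARY();
    }

    @Override
    public Encoder<byte[]> outputEncoder() {
        return Encoders.BINARY();
    }
}
```

Registered with `spark.udf().register("airlift_hll", functions.udaf(new AirliftHllAggregator(), Encoders.LONG()))`, this would write a binary column that Trino should be able to `CAST` to `HyperLogLog` (the Trino docs describe casts between `varbinary` and `HyperLogLog`). Serializing the sketch on every `reduce` call is obviously wasteful; a real version would probably keep the live `HyperLogLog` in the buffer with a custom encoder, but bytes keep the sketch self-contained for a first test.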