-
I don't think most people actually store serialized HLLs in a table. cc @electrum in case he's thought about interop in the past.
-
My use case is nearly 1T events aggregated down to a reasonable grain, which the BI team can then further aggregate across many dimensions. I don't believe it's an unusual use case.
-
The only way to support reading HLLs generated by Spark would be to add functions to Trino that can decode them and compute their unions and cardinalities. The two implementations are fundamentally different, and there's no way to make them compatible: even minor aspects, such as the choice of hash function or which bits are used to compute the bucket index, result in data structures that cannot be mixed together to produce a sensible result.
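To make the incompatibility concrete, here's a small illustration (not from the thread; a sketch assuming the `datasketches-java` and `airlift-stats` artifacts are on the classpath). Both sketches below are configured with 4096 buckets and see the same value, but each library applies its own hash function and bucket-bit extraction, and their wire formats are unrelated:

```java
import org.apache.datasketches.hll.HllSketch;
import io.airlift.stats.cardinality.HyperLogLog;

public class HllFormatMismatch {
    public static void main(String[] args) {
        HllSketch sparkSide = new HllSketch(12);               // DataSketches: lgK = 12, i.e. 2^12 = 4096 buckets
        HyperLogLog trinoSide = HyperLogLog.newInstance(4096); // Airlift: bucket count passed directly

        sparkSide.update(42L); // DataSketches' MurmurHash3 variant and seed
        trinoSide.add(42L);    // Airlift's own hashing and bucket extraction

        // Unrelated serialization formats: neither library can read the
        // other's bytes, and no union operation spans the two.
        byte[] dsBytes = sparkSide.toCompactByteArray();
        byte[] airliftBytes = trinoSide.serialize().getBytes();
        System.out.printf("DataSketches: %d bytes, Airlift: %d bytes%n",
                dsBytes.length, airliftBytes.length);
    }
}
```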
-
I'm interested in using Spark to write HLL data structures that are compatible with Trino (and, implicitly, AWS Athena). From what I can tell, Trino uses the HLL data structure from the associated Airlift project, while Spark leverages the Apache DataSketches library for HLL.
I'm hitting a dead end trying to find anyone who has done this. Before I go down the rabbit hole of writing a custom Spark aggregator that uses the Airlift HLL data structure (sketched below), does anyone know of prior art, or whether there are plans in the Trino pipeline to improve this compatibility?
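For concreteness, here is roughly the shape of the aggregator I have in mind: a minimal, untested sketch that assumes the `airlift-stats` artifact is on the Spark classpath and keeps the buffer as serialized bytes. The class name and bucket count are my own choices, and whether the output merges cleanly with sketches from Trino's `approx_set` depends on the hashing matching exactly, which I'd verify with a test query first.

```java
import io.airlift.slice.Slices;
import io.airlift.stats.cardinality.HyperLogLog;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.expressions.Aggregator;

// Hypothetical aggregator that accumulates longs into an Airlift
// HyperLogLog and emits its serialized form as a binary column.
public class AirliftHllAggregator extends Aggregator<Long, byte[], byte[]> {
    // 4096 buckets; assumed to match Trino's default precision. Verify
    // against your Trino configuration before relying on it.
    private static final int NUM_BUCKETS = 4096;

    @Override
    public byte[] zero() {
        return HyperLogLog.newInstance(NUM_BUCKETS).serialize().getBytes();
    }

    @Override
    public byte[] reduce(byte[] buffer, Long value) {
        HyperLogLog hll = HyperLogLog.newInstance(Slices.wrappedBuffer(buffer));
        hll.add(value); // Airlift hashes the value internally
        return hll.serialize().getBytes();
    }

    @Override
    public byte[] merge(byte[] left, byte[] right) {
        HyperLogLog hll = HyperLogLog.newInstance(Slices.wrappedBuffer(left));
        hll.mergeWith(HyperLogLog.newInstance(Slices.wrappedBuffer(right)));
        return hll.serialize().getBytes();
    }

    @Override
    public byte[] finish(byte[] buffer) {
        return buffer; // already in Airlift's serialized format
    }

    @Override
    public Encoder<byte[]> bufferEncoder() {
        return Encoders.BINARY();
    }

    @Override
    public Encoder<byte[]> outputEncoder() {
        return Encoders.BINARY();
    }
}
```

Registered with `spark.udf().register("airlift_hll", functions.udaf(new AirliftHllAggregator(), Encoders.LONG()))`, this would write a binary column that Trino should be able to `CAST` to `HyperLogLog` (the Trino docs describe casts between `varbinary` and `HyperLogLog`). Serializing the sketch on every `reduce` call is obviously wasteful; a real version would probably keep the live `HyperLogLog` in the buffer with a custom encoder, but bytes keep the sketch self-contained for a first test.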