
Consider adding theta_sketch_agg_int64_lgk() #144

Open
nikunjbhartia opened this issue Mar 11, 2025 · 13 comments

Comments

@nikunjbhartia

nikunjbhartia commented Mar 11, 2025

Currently there are two functions to create theta sketches over INT64 values:

  • theta_sketch_agg_int64(value INT64)
  • theta_sketch_agg_int64_lgk_seed_p(value INT64, params STRUCT<lg_k BYTEINT, seed INT64, p FLOAT64> NOT AGGREGATE)

If I need to increase the precision of the sketch, I am forced to also pass a seed and a p value.
If a sketch is created with a non-default seed, end users can no longer use theta_sketch_agg_union(sketch BYTES); they are forced to remember the seed value and use theta_sketch_agg_union_lgk_seed(sketch BYTES, params STRUCT<lg_k BYTEINT, seed INT64> NOT AGGREGATE).

Using theta_sketch_agg_union() on a sketch initialized with a seed gives the following error:

seed hash mismatch: expected 37836, actual 54156 at bqutil.datasketches.theta_sketch_agg_union_lgk_seed(BYTES, STRUCT<lg_k INT64, seed INT64>) line 55, columns 6-7; reason: invalidQuery, location: query, message: Error: seed hash mismatch: expected 37836, actual 54156 at bqutil.datasketches.theta_sketch_agg_union_lgk_seed(BYTES, STRUCT<lg_k INT64, seed INT64>) line 55, columns 6-7
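For context, a minimal reproduction of this mismatch looks roughly like the following (table and column names are hypothetical, and the seed value 111 is purely illustrative):

```sql
-- Build a sketch with lg_k = 14; the full-signature function forces us to
-- also choose a seed and a sampling probability p.
CREATE TABLE my_dataset.sketches AS
SELECT bqutil.datasketches.theta_sketch_agg_int64_lgk_seed_p(
         user_id, STRUCT(14, 111, 1.0)) AS sketch
FROM my_dataset.events;

-- Downstream, the plain union assumes the default seed, so it fails with
-- the "seed hash mismatch" error quoted above:
SELECT bqutil.datasketches.theta_sketch_agg_union(sketch)
FROM my_dataset.sketches;
```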

Can we consider adding a function theta_sketch_agg_int64_lgk(value INT64, lg_k INT64)?

@AlexanderSaydakov
Contributor

We could add such a function, but we have tried to avoid a combinatorial explosion of them. One can pass NULL for the default parameters, so for lg_k = 14 I would suggest passing STRUCT(14, NULL, NULL) and STRUCT(14, NULL) respectively. Would that work for you?
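In other words, a sketch of the suggested call pattern (table and column names hypothetical):

```sql
-- lg_k = 14, default seed and default sampling probability p:
SELECT bqutil.datasketches.theta_sketch_agg_int64_lgk_seed_p(
         user_id, STRUCT(14, NULL, NULL)) AS sketch
FROM my_dataset.events;

-- Matching union: lg_k = 14, default seed:
SELECT bqutil.datasketches.theta_sketch_agg_union_lgk_seed(
         sketch, STRUCT(14, NULL))
FROM my_dataset.sketches;
```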

@jmalkin

jmalkin commented Mar 11, 2025

Or BQ can add support for default parameter values when things are unspecified? :D

@nikunjbhartia
Author

I agree, having default params and function overloading would have made lives much easier :)

Regarding the other comment:
As an end user, I wouldn't know whether to use theta_sketch_agg_union() or theta_sketch_agg_union_lgk_seed()
unless I had information about how the sketch was initialized. I would imagine that sketch creation and sketch extraction would be handled by different flows and teams.

But now I think that just having an lg_k version of the aggregation wouldn't help; we might also need an lg_k version of the union, and I now understand what you mean by combinatorial explosion :D

@AlexanderSaydakov
Contributor

Also, I think we had some problem with passing a subset of parameters from functions with fewer parameters to the full-signature functions. BQ said something like the struct was not a constant or literal. So we ended up with an all-or-nothing approach.

@AlexanderSaydakov
Contributor

To clarify my previous comment: we tried partial signatures, but it did not work.
Suppose we want to have
theta_sketch_agg_union_lgk(sketch, lgk) delegating to theta_sketch_agg_union_lgk_seed(sketch, STRUCT(lgk, NULL)).
BQ did not allow that, saying that the struct is not a literal or something like that.
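A rough sketch of the kind of wrapper that was attempted (exact syntax aside — this is the delegation BQ rejected, not working code; the wrapper name and dataset are hypothetical):

```sql
-- Hypothetical thin wrapper delegating to the full-signature function.
-- BigQuery rejects the STRUCT built from the lgk parameter because the
-- NOT AGGREGATE argument must be a constant expression / literal.
CREATE OR REPLACE AGGREGATE FUNCTION my_dataset.theta_sketch_agg_union_lgk(
  sketch BYTES, lgk INT64 NOT AGGREGATE)
AS (
  bqutil.datasketches.theta_sketch_agg_union_lgk_seed(
    sketch, STRUCT(lgk, NULL))
);
```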

@nikunjbhartia
Author

I see what you mean.
Just wondering: why do we need lg_k while merging in the first place? Shouldn't we unconditionally downgrade the precision to the lowest of all merged sketches?

If I have a sketch initialized with lg_k 12, and as a user I get an option to provide higher precision while merging, I would imagine I would get more precise results, but that wouldn't be the case, would it?

@AlexanderSaydakov
Contributor

When we create a fresh union object we need to know lg_k. Incoming compact theta sketches don't carry any notion of lg_k in them. This can be different for other sketches (like HLL or CPC), but we decided to always pass lg_k to the union (implicitly or explicitly).
Regarding accuracy: the user can get slightly more accurate results by setting a higher lg_k on the union even with lower-lg_k input sketches. In general I would suggest starting from the desired accuracy and setting lg_k accordingly for every step, but in some cases the user might not have control over previous steps.
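For example, a union can be configured with a higher lg_k than the incoming sketches (hypothetical table of pre-built lg_k = 12 sketches):

```sql
-- Sketches in `my_dataset.daily_sketches` were built with lg_k = 12.
-- The union itself is configured with lg_k = 16 and the default seed;
-- this can give slightly better accuracy, especially when the inputs
-- are still in exact mode.
SELECT bqutil.datasketches.theta_sketch_agg_union_lgk_seed(
         sketch, STRUCT(16, NULL))
FROM my_dataset.daily_sketches;
```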

@AlexanderSaydakov
Contributor

Consider this extreme example: all input sketches happen to be in exact mode (none saw enough distinct values to saturate). Then the space/accuracy trade-off is imposed only during the union operation (by setting lg_k for the union).

@nikunjbhartia
Author

Would the documented error bounds (https://datasketches.apache.org/docs/Theta/ThetaErrorTable.html) still be valid in these cases?

@AlexanderSaydakov
Contributor

I am not sure I understand your question. The documented bounds are valid for both sketch and union.

@nikunjbhartia
Author

nikunjbhartia commented Mar 12, 2025

I'm just not sure how sketches with lower precision (larger error bounds), when merged using higher precision, can guarantee lower error bounds. The information is already lost before merging.

I guess I have more reading to do to understand this better.

Without diverging too much from the main topic here, we can close this issue since there doesn't seem to be an easy way around this.

@AlexanderSaydakov
Contributor

Oh, you mean the case where the union has a higher lg_k. Of course, if the information is already lost (the sketches are in estimation mode), then the bounds for the lower lg_k apply.

@will-lauer

One way to think about it is that information could potentially be lost at multiple places: building the original sketches and each level of merging the sketches. By having a higher lg_k during the merge, you haven't avoided losing info at the first stage (creating the sketches) but you can potentially avoid losing more info at the later stages. This actually results in better error rates in the merge, even though some information has already been lost.

The theta sketch accounts for this in its calculation of relative error. If you didn't have enough items in the sketch to trigger additional sampling when doing your merge, you may end up with a merged sketch that isn't completely filled. Having the larger lg_k doesn't help you there. But if you have enough unique values that you do additional sampling as you merge, you end up with a sketch that contains more values, but values that would have been kept by the initial sketches regardless of which lg_k value would have been chosen in the first stage. This means you are actually better off and your relative error has improved even though lg_k was lower when building the initial sketches. The values that the initial sketches discarded are ones that would be discarded anyway by the better lg_k.
