
Consider adding theta_sketch_agg_int64_lgk() #144

Open
nikunjbhartia opened this issue Mar 11, 2025 · 13 comments

Comments

@nikunjbhartia

nikunjbhartia commented Mar 11, 2025

Currently there are two functions to create theta sketches over INT64 values:

  • theta_sketch_agg_int64(value INT64)
  • theta_sketch_agg_int64_lgk_seed_p(value INT64, params STRUCT<lg_k BYTEINT, seed INT64, p FLOAT64> NOT AGGREGATE)

If I need to increase the precision of the sketch, I am forced to also pass a seed and a p value.
If a sketch is created with a non-default seed, end users can no longer use theta_sketch_agg_union(sketch BYTES); they are forced to remember the seed value and use theta_sketch_agg_union_lgk_seed(sketch BYTES, params STRUCT<lg_k BYTEINT, seed INT64> NOT AGGREGATE).

Using theta_sketch_agg_union() on a sketch initialized with a seed gives the following error:

seed hash mismatch: expected 37836, actual 54156 at bqutil.datasketches.theta_sketch_agg_union_lgk_seed(BYTES, STRUCT<lg_k INT64, seed INT64>) line 55, columns 6-7; reason: invalidQuery, location: query, message: Error: seed hash mismatch: expected 37836, actual 54156 at bqutil.datasketches.theta_sketch_agg_union_lgk_seed(BYTES, STRUCT<lg_k INT64, seed INT64>) line 55, columns 6-7
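For context, a minimal reproduction of this mismatch looks roughly like the following (table and column names are hypothetical, and the seed value 111 is purely illustrative):

```sql
-- Build a sketch with lg_k = 14; the full-signature function forces us to
-- also choose a seed and a sampling probability p.
CREATE TABLE my_dataset.sketches AS
SELECT bqutil.datasketches.theta_sketch_agg_int64_lgk_seed_p(
         user_id, STRUCT(14, 111, 1.0)) AS sketch
FROM my_dataset.events;

-- Downstream, the plain union assumes the default seed, so it fails with
-- the "seed hash mismatch" error quoted above:
SELECT bqutil.datasketches.theta_sketch_agg_union(sketch)
FROM my_dataset.sketches;
```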

Can we consider adding a function theta_sketch_agg_int64_lgk(value INT64, lg_k INT64)?

@AlexanderSaydakov
Contributor

We could add such a function, but we have tried to avoid a combinatorial explosion of them. One can pass NULL for the default parameters, so for lg_k = 14 I would suggest passing STRUCT(14, NULL, NULL) and STRUCT(14, NULL) respectively. Would that work for you?
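In other words, a sketch of the suggested call pattern (table and column names hypothetical):

```sql
-- lg_k = 14, default seed and default sampling probability p:
SELECT bqutil.datasketches.theta_sketch_agg_int64_lgk_seed_p(
         user_id, STRUCT(14, NULL, NULL)) AS sketch
FROM my_dataset.events;

-- Matching union: lg_k = 14, default seed:
SELECT bqutil.datasketches.theta_sketch_agg_union_lgk_seed(
         sketch, STRUCT(14, NULL))
FROM my_dataset.sketches;
```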

@jmalkin

jmalkin commented Mar 11, 2025

Or BQ can add support for default parameter values when things are unspecified? :D

@nikunjbhartia
Author

I agree, having default params and function overloading would have made lives much easier :)

Regarding the other comment:
As an end user, I wouldn't know whether to use theta_sketch_agg_union() or theta_sketch_agg_union_lgk_seed()
unless I had information about how the sketch was initialized. I would imagine that sketch creation and sketch extraction would be handled by different flows and teams.

But now I think that just having an lg_k version of the aggregation wouldn't help; we might also need an lg_k version of the union, and I now understand what you mean by combinatorial explosion :D

@AlexanderSaydakov
Contributor

Also, I think we had some problem with passing a subset of parameters from functions with fewer parameters to the full-signature functions. BQ said something like the struct was not a constant or literal. So we ended up with an all-or-nothing approach.

@AlexanderSaydakov
Contributor

To clarify my previous comment: we tried partial signatures, but it did not work.
Suppose we want to have
theta_sketch_agg_union_lgk(sketch, lgk) delegating to theta_sketch_agg_union_lgk_seed(sketch, STRUCT(lgk, NULL)).
BQ did not allow that, saying that the struct is not a literal or something like that.
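A rough sketch of the kind of wrapper that was attempted (exact syntax aside — this is the delegation BQ rejected, not working code; the wrapper name and dataset are hypothetical):

```sql
-- Hypothetical thin wrapper delegating to the full-signature function.
-- BigQuery rejects the STRUCT built from the lgk parameter because the
-- NOT AGGREGATE argument must be a constant expression / literal.
CREATE OR REPLACE AGGREGATE FUNCTION my_dataset.theta_sketch_agg_union_lgk(
  sketch BYTES, lgk INT64 NOT AGGREGATE)
AS (
  bqutil.datasketches.theta_sketch_agg_union_lgk_seed(
    sketch, STRUCT(lgk, NULL))
);
```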

@nikunjbhartia
Author

I see what you mean.
Just wondering: why do we need lg_k while merging in the first place? Shouldn't we unconditionally downgrade the precision to the lowest of all merged sketches?

If I have a sketch initialized with lg_k 12, and as a user I get an option to provide higher precision while merging, I would imagine I would get more precise results, but that wouldn't be the case, would it?

@AlexanderSaydakov
Contributor

When we create a fresh union object we need to know lg_k. Incoming compact theta sketches don't carry any notion of lg_k in them. This can be different for other sketches (like HLL or CPC), but we decided to always pass lg_k to the union (implicitly or explicitly).
Regarding accuracy: the user can get slightly more accurate results by setting a higher lg_k on the union even with lower-lg_k input sketches. In general I would suggest starting from the desired accuracy and setting lg_k accordingly for every step, but in some cases the user might not have control over previous steps.
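For example, a union can be configured with a higher lg_k than the incoming sketches (hypothetical table of pre-built lg_k = 12 sketches):

```sql
-- Sketches in `my_dataset.daily_sketches` were built with lg_k = 12.
-- The union itself is configured with lg_k = 16 and the default seed;
-- this can give slightly better accuracy, especially when the inputs
-- are still in exact mode.
SELECT bqutil.datasketches.theta_sketch_agg_union_lgk_seed(
         sketch, STRUCT(16, NULL))
FROM my_dataset.daily_sketches;
```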

@AlexanderSaydakov
Contributor

Consider this extreme example: all input sketches happen to be in exact mode (none saw enough distinct values to saturate). Then the space/accuracy trade-off is imposed only during the union operation (by setting lg_k for the union).

@nikunjbhartia
Author

Would the documented error bounds (https://datasketches.apache.org/docs/Theta/ThetaErrorTable.html) still be valid in these cases?

@AlexanderSaydakov
Contributor

I am not sure I understand your question. The documented bounds are valid for both sketch and union.

@nikunjbhartia
Author

nikunjbhartia commented Mar 12, 2025

I'm just not sure how sketches with lower precision (larger error bounds), when merged using higher precision, can guarantee lower error bounds. The information is already lost before merging.

I guess I have more reading to do to understand this better.

Without diverging too much from the main topic here, we can close this issue since there doesn't seem to be an easy way around this.

@AlexanderSaydakov
Contributor

Oh, you mean the case where the union has a higher lg_k. Of course, if the information is already lost (the sketches are in estimation mode), then the bounds for the lower lg_k apply.

@will-lauer

One way to think about it is that information could potentially be lost at multiple places: building the original sketches and each level of merging the sketches. By having a higher lg_k during the merge, you haven't avoided losing info at the first stage (creating the sketches) but you can potentially avoid losing more info at the later stages. This actually results in better error rates in the merge, even though some information has already been lost.

The theta sketch accounts for this in its calculation of relative error. If you didn't have enough items in the sketch to trigger additional sampling when doing your merge, you may end up with a merged sketch that isn't completely filled. Having the larger lg_k doesn't help you there. But if you have enough unique values that you do additional sampling as you merge, you end up with a sketch that contains more values, but values that would have been kept by the initial sketches regardless of which lg_k value would have been chosen in the first stage. This means you are actually better off and your relative error has improved even though lg_k was lower when building the initial sketches. The values that the initial sketches discarded are ones that would be discarded anyway by the better lg_k.
