Consider adding theta_sketch_agg_int64_lgk() #144
Comments
We could add a function, but we have tried to avoid a combinatorial explosion of them. One can pass NULL for the default parameters. So for lgk=14 I would suggest passing STRUCT(14, NULL, NULL) and STRUCT(14, NULL) respectively. Would that work for you?
Or BQ can add support for default parameter values when things are unspecified? :D
I agree, having default params and function overloading would have made life much easier :) Regarding the other comment: I now think that just having the lg_k version of agg wouldn't help and that we might also need an lg_k version of union, and I now understand when you say
Also, I think we had some problem with passing a subset of parameters from functions with fewer parameters to full-signature functions. BQ said something like the struct was not a constant or literal. So we ended up with an all-or-nothing approach.
To clarify my previous comment: we tried partial signatures, but they did not work.
I see what you mean. If I have a sketch initialized with lg_k 12 |
When we create a fresh union object we need to know lgk. Incoming compact theta sketches don't have any notion of lgk in them. This can be different for other sketches (like HLL or CPC), but we decided to always pass lgk to union (implicitly or explicitly). |
Consider this extreme example: all sketches happened to be in exact mode (did not see enough distinct values to saturate). So the space/accuracy trade-off will be imposed during the union operation (by setting lgk for the union). |
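The exact-mode case above can be illustrated with a toy model. This is a simplified KMV-style stand-in written in Python, not the DataSketches implementation: `unit_hash`, `build`, `union`, and `estimate` are hypothetical helpers, and a sketch is modeled as just a (theta, retained hashes) pair, which mirrors the point that a compact sketch carries no lgk. Eight inputs of 100 distinct values each stay in exact mode at lgk=12, so the lgk given to the union decides whether the result stays exact.

```python
import hashlib

def unit_hash(item):
    """Map an item to a pseudo-uniform float in [0, 1) (assumed hash, for illustration)."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53

def build(items, lg_k):
    """Toy sketch: keep the k = 2**lg_k smallest hashes; theta is the (k+1)-th smallest."""
    k = 1 << lg_k
    hashes = sorted({unit_hash(x) for x in items})
    if len(hashes) <= k:
        return 1.0, hashes            # exact mode: nothing was discarded
    return hashes[k], hashes[:k]      # estimation mode

def union(sketches, lg_k):
    """Toy union: the inputs carry only (theta, hashes); lg_k is supplied here."""
    k = 1 << lg_k
    theta = min(t for t, _ in sketches)
    merged = sorted({v for _, kept in sketches for v in kept if v < theta})
    if len(merged) > k:               # the union itself has to start sampling
        theta, merged = merged[k], merged[:k]
    return theta, merged

def estimate(sketch):
    theta, kept = sketch
    return len(kept) / theta

# Eight disjoint inputs of 100 values each, all in exact mode at lg_k=12.
inputs = [build(range(i * 100, (i + 1) * 100), lg_k=12) for i in range(8)]
small = union(inputs, lg_k=8)         # 800 distinct values > 256, so the union samples
large = union(inputs, lg_k=12)        # 800 distinct values <= 4096, so the union stays exact

print("union lg_k=8 :", len(small[1]), "retained, estimate", estimate(small))
print("union lg_k=12:", len(large[1]), "retained, estimate", estimate(large))
```

In this model the only place any information is lost is the union with lgk=8, which is the trade-off being described.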
Would the documented error bounds (https://datasketches.apache.org/docs/Theta/ThetaErrorTable.html) still be valid in these cases?
I am not sure I understand your question. The documented bounds are valid for both sketch and union.
I am just not sure how sketches with lower precision (larger error bounds), when merged using higher precision, can guarantee lower error bounds. The information is already lost before merging. I guess I have more reading to do to understand this better. Without diverging too much from the main topic here, we can close this issue, since there doesn't seem to be an easy way around this.
Oh, you mean the case when the union has a higher lgk. Of course, if the information is lost (the sketches are in estimation mode), then the bounds for the lower lgk apply.
One way to think about it is that information could potentially be lost at multiple places: building the original sketches and each level of merging the sketches. By having a higher lg_k during the merge, you haven't avoided losing info at the first stage (creating the sketches) but you can potentially avoid losing more info at the later stages. This actually results in better error rates in the merge, even though some information has already been lost. The theta sketch accounts for this in its calculation of relative error. If you didn't have enough items in the sketch to trigger additional sampling when doing your merge, you may end up with a merged sketch that isn't completely filled. Having the larger lg_k doesn't help you there. But if you have enough unique values that you do additional sampling as you merge, you end up with a sketch that contains more values, but values that would have been kept by the initial sketches regardless of which lg_k value would have been chosen in the first stage. This means you are actually better off and your relative error has improved even though lg_k was lower when building the initial sketches. The values that the initial sketches discarded are ones that would be discarded anyway by the better lg_k. |
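A small simulation can make this argument concrete. It uses a toy KMV-style model in Python, not the DataSketches code: `unit_hash`, `build`, and `union` are hypothetical helpers, and the stream sizes are arbitrary. The inputs are built with lg_k=8 and are in estimation mode. They are then merged once with lg_k=8 and once with lg_k=12, and the retained values are compared with what lg_k=12 inputs would have kept.

```python
import hashlib

def unit_hash(item):
    """Map an item to a pseudo-uniform float in [0, 1) (assumed hash, for illustration)."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53

def build(items, lg_k):
    """Toy sketch: keep the k smallest hashes; theta is the (k+1)-th smallest."""
    k = 1 << lg_k
    hashes = sorted({unit_hash(x) for x in items})
    if len(hashes) <= k:
        return 1.0, hashes
    return hashes[k], hashes[:k]

def union(sketches, lg_k):
    """Toy union: take the smallest input theta, then trim again only if over k."""
    k = 1 << lg_k
    theta = min(t for t, _ in sketches)
    merged = sorted({v for _, kept in sketches for v in kept if v < theta})
    if len(merged) > k:
        theta, merged = merged[k], merged[:k]
    return theta, merged

# Eight disjoint streams of 5000 values each (40000 distinct values in total).
streams = [range(i * 5000, (i + 1) * 5000) for i in range(8)]
coarse = [build(s, lg_k=8) for s in streams]    # first stage with the lower lg_k
fine = [build(s, lg_k=12) for s in streams]     # what a higher lg_k would have kept

lo = union(coarse, lg_k=8)     # samples again during the merge
hi = union(coarse, lg_k=12)    # no additional sampling during the merge

kept_by_fine = set().union(*(kept for _, kept in fine))
for name, (theta, kept) in (("lg_k=8", lo), ("lg_k=12", hi)):
    print(name, "retained:", len(kept), "estimate:", round(len(kept) / theta))
```

In this model the lg_k=12 union keeps every input value below the smallest input theta, so it retains more samples than the lg_k=8 union, and all of them are values the lg_k=12 inputs would also have retained.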
Currently there are two methods to create theta sketches for int64:
If I need to increase the precision of the sketch, I am forced to pass a seed and a p value.
If a sketch is created with a seed, end users can no longer use theta_sketch_agg_union(sketch BYTES); they are forced to remember the seed value and use theta_sketch_agg_union_lgk_seed(sketch BYTES, params STRUCT<lg_k BYTEINT, seed INT64> NOT AGGREGATE)
Using theta_sketch_agg_union() on a sketch initialized with a seed gives me the following error:
Can we consider adding a function
theta_sketch_agg_int64_lgk(value INT64, lg_k INT64)
?