Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NULL handling in sketch functions #145

Open
nikunjbhartia opened this issue Mar 12, 2025 · 3 comments
Open

NULL handling in sketch functions #145

nikunjbhartia opened this issue Mar 12, 2025 · 3 comments

Comments

@nikunjbhartia
Copy link

Can we handle nulls passed as sketches in sketch functions more gracefully ?
Current error messages are confusing and unless we dig deeper into a sql query and intermediate data, its difficult to find root cause of the error.

Example:
select bqutil.datasketches.theta_sketch_a_not_b(NULL, NULL)

returns error: BindingError: Cannot pass non-string to std::string at UDF$1(BYTES, BYTES, INT64) line 6, columns 25-26

select bqutil.datasketches.theta_sketch_union(NULL, NULL)

returns : BindingError: Cannot pass non-string to std::string at UDF$1(BYTES, BYTES, INT64, INT64) line 12, columns 25-26

Possible handling:

  • Can nulls be treated as a no-op and we return null ?
    Or
  • Return more meaningful error message
@will-lauer
Copy link

I believe Alex has already been addressing this issue, as we also ran into it and needed a solution. I'm not sure when the fix will be available, but I believe the semantics will end up being that a NULL sketch is treated the same as an non-null, empty sketch.

@AlexanderSaydakov
Copy link
Contributor

This issue is slightly different. I addressed aggregate functions here: #140
Now aggregate functions always produce a sketch. Input of all nulls results in an empty sketch.
This issue is about better handling in scalar functions. I agree that we can do better (at least give a better error message). I am inclined to require valid sketches in scalar functions. This is debatable, of course. We can look at how we handle this in other systems like Hive, Pig, Druid and such.
Given that aggregates always produce a sketch, there should be no problem doing things like:
get_estimate(union(column of nulls)) - this will result in 0 estimate
a_not_b(union(nulls), union(nulls)) - this will produce empty sketch
and so on

@will-lauer
Copy link

For scalar functions, I strongly believe that NULL input should result in valid (possibly NULL) output and that we should NOT generate an error on NULL. There are multiple ways NULLs can creep into the system, and we should not fail a query if we encounter one. I believe this would match normal SQL expectations. So, I suggest that the various single argument scalar functions (such as theta_get_estimate) produce null when given null input. The multi-argument scalar functions, such as theta_sketch_union and theta_sketch_a_not_b are a bit more controversial. For those, while traditional SQL semantics would require one null input to nullify the output, I think we want to treat nulls the same way aggregation should - null sketches behave like empty sketches. So A union NULL produces A (not null). A a_not_b NULL also produces A (not null). Intersection is slightly weird, but I think A intersect NULL could produce the empty sketch rather than NULL and not confuse anyone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants