airframe-sql: Fixes #2646 Resolve AllColumns inputs as ResolvedAttributes #2649
Codecov Report
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2649      +/-   ##
==========================================
+ Coverage   82.12%   82.14%   +0.01%
==========================================
  Files         334      334
  Lines       14030    14050      +20
  Branches     2182     2213      +31
==========================================
+ Hits        11522    11541      +19
- Misses       2508     2509       +1
case expr => if (isSelectItem) SingleColumn(expr, Some(i.value), None, expr.nodeLocation) else expr
case f: FunctionCall =>
  // Extract source columns
  val src = Set.newBuilder[SourceColumn]
SourceColumn contains a TableCatalog parameter, so using Set operation might be a bit heavy if the table schema is large.
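A rough sketch of the concern, not the airframe-sql code: a case class hashes all of its fields, so putting a column that carries its whole table schema into a Set re-hashes the schema for every element. `TableSketch`, `SourceColumnSketch`, and `DedupSketch` are hypothetical stand-ins, and whether this cost matters in practice has not been measured here. One possible alternative is to deduplicate by a small key:

```scala
// Hypothetical stand-ins, NOT the real TableCatalog / SourceColumn classes.
case class TableSketch(name: String, columns: Seq[String])
case class SourceColumnSketch(table: TableSketch, column: String)

object DedupSketch {
  // Deduplicate by (table name, column name) only, so the full schema in
  // `table.columns` is never hashed. Requires Scala 2.13+ for distinctBy.
  def dedup(cols: Seq[SourceColumnSketch]): Seq[SourceColumnSketch] =
    cols.distinctBy(c => (c.table.name, c.column))
}
```

This keeps the first occurrence of each column and preserves input order, which a plain Set would not guarantee.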
// override def dataType: DataType = super.dataType
override def dataType: DataType = {
  if (functionName == "count") {
    DataType.LongType
❤️
}
col shouldMatch { case ResolvedAttribute("cnt", DataType.LongType, _, sourceColumns, _) => }
Just wondering, what's sourceColumns here? It might be better to verify.
I think sourceColumns needs to be all of the column data used for generating this ResolvedAttribute. If we use airframe-sql for tracking data lineage, we may need to maintain this property across SQL expression nodes.
For example, if the function is max/min/max_by/min_by, etc., source column information will be necessary to show whether the result contains any PID. If the function is count(1), personal information is lost, so sourceColumns might be unnecessary because raw column values are not exposed.
sourceColumns seems to be an empty List in this pull request. Is this expected?
How about defining sourceColumns as input columns without any modifications, so that data lineage with nested function calls, such as count(max(..(col))), can be traced in a different manner outside TypeResolver?
If that sounds good, I will fix the code and tests accordingly. In this case, cnt is from count(), so sourceColumns should be empty, but the function argument (AllColumns) of FunctionCall("count", Seq(AllColumns(, sourceColumns(...)) ) ) should have a source column.
It looks like AllColumns (in SELECT clause) isn't resolved before cnt (in HAVING clause) is resolved.
I noticed that this query is actually invalid on Presto:
select id, count(*) cnt from A group by id having cnt > 10
Got the following error:
Query 20221219_122609_00005_r3q5y failed: line 1:84: Column 'cnt' cannot be resolved
Meaning we don't need to resolve cnt in the HAVING clause in this case.
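A minimal sketch of why the alias fails to resolve, assuming attributes are plain names and that table A exposes only an id column (both are assumptions for illustration; `HavingScopeSketch` and `resolveName` are hypothetical and not part of airframe-sql):

```scala
object HavingScopeSketch {
  // Resolve a name against the attributes visible in a given scope.
  def resolveName(name: String, visible: Seq[String]): Either[String, String] =
    if (visible.contains(name)) Right(name)
    else Left(s"Column '$name' cannot be resolved")

  // select id, count(*) cnt from A group by id having cnt > 10
  val childOutput: Seq[String]  = Seq("id")        // assumed columns of input relation A
  val selectOutput: Seq[String] = Seq("id", "cnt") // select list, including the alias
}
```

If HAVING is resolved against `childOutput`, the alias `cnt` is not visible, which is consistent with the Presto error above; resolving against `selectOutput` would accept the query.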
Good catch 😅
- Fixed a bug in resolving the HAVING clause: #2649 (comment)
- Made explicit that ResolvedAttribute.sourceColumns is given only when the attribute is a direct reference to table columns without any change (e.g., via a function). https://github.com/wvlet/airframe/pull/2649/files#diff-5f3b4602b1d9bc6f2aeff669d9b32b2a17d1e369c856d09b295eac7de8118c97R63-R64
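The rule for sourceColumns could be sketched roughly as follows. The types `Expr`, `ColumnRef`, `Func`, `Resolved`, and the `LineageSketch` helpers are hypothetical simplifications for illustration, not the airframe-sql classes:

```scala
// Hypothetical, simplified stand-ins for SQL expressions.
sealed trait Expr
case class ColumnRef(table: String, name: String) extends Expr
case class Func(name: String, args: Seq[Expr]) extends Expr

case class Resolved(name: String, sourceColumns: Seq[ColumnRef])

object LineageSketch {
  // sourceColumns is populated only for a direct, unmodified column reference.
  def resolve(alias: String, e: Expr): Resolved = e match {
    case c: ColumnRef => Resolved(alias, Seq(c))
    case _: Func      => Resolved(alias, Seq.empty)
  }

  // Lineage through nested calls such as count(max(col)) is recovered
  // separately, by walking the function arguments.
  def inputColumns(e: Expr): Seq[ColumnRef] = e match {
    case c: ColumnRef  => Seq(c)
    case Func(_, args) => args.flatMap(inputColumns)
  }
}
```

Under this split, an attribute like cnt carries no source columns itself, while a separate traversal can still find the columns feeding the function.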
Make explicit that ResolvedAttribute.sourceColumns is given only when the attribute is a direct reference of table columns without any change
Oh, I see.
@@ -379,7 +385,27 @@ object TypeResolver extends LogSupport {
case a: Attribute => a
case m: MultiColumn => m
// retain alias for select column
case expr => if (isSelectItem) SingleColumn(expr, Some(i.value), None, expr.nodeLocation) else expr
case f: FunctionCall =>
Wondering if the same approach can be applied to any expression, not only FunctionCall. 🤔
That's also my concern. Maybe converting every type of expression other than already resolved Attribute/MultiColumn as ResolvedAttribute?
I'm not 100% sure, but maybe it should be.
Marked as TODO for now
Resolved every matched expression as ResolvedAttribute here
      GroupingKey(e, e.nodeLocation)
    })
    val resolvedHaving = having.map {
      _.transformUpExpression { case x: Expression =>
-       resolveExpression(context, x, a.outputAttributes, false)
+       // Having recognize attributes only from the input relation
+       resolveExpression(context, x, childOutputAttributes, false)
This part had a bug, which was found in #2649 (comment)
val prefix = m match {
  case t: TableScan =>
    s"${ws}[${m.modelName}] ${t.table.fullName}${functionSig}"
For showing the input table name after TableRef is resolved to TableScan
@@ -97,7 +99,7 @@ case class ResolvedAttribute(
       case (Some(q), columns) if columns.nonEmpty =>
         columns
           .map(_.fullName)
-          .mkString(s"${q},${typeDescription} <- [", ", ", "]")
+          .mkString(s"${q}.${typeDescription} <- [", ", ", "]")
Fixed a typo
      GroupingKey(e, e.nodeLocation)
    })
    val resolvedHaving = having.map {
      _.transformUpExpression { case x: Expression =>
-       resolveExpression(context, x, a.outputAttributes, false)
+       // Having recognize attributes only from the input relation
+       resolveExpression(context, x, childOutputAttributes)
This needed to use the resolved child attributes
LGTM
@@ -324,15 +331,14 @@ object TypeResolver extends LogSupport {
  private def findMatchInInputAttributes(
      context: AnalyzerContext,
      expr: Expression,
      inputAttributes: Seq[Attribute],
      isSelectItem: Boolean
MultiColumn is changed to Attribute, so we can use it as SelectItem as is.
An example plan after this fix: [image]
To resolve the data type of the arbitrary aggregation function, we need to provide function tables. When no function table is available, resolving its DataType as name: ? (Unknown type) is ok. count(...) always returns Long, so I added a quick path.
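The quick path plus the unknown-type fallback might look roughly like this. `DataTypeSketch`, `FunctionCallSketch`, and the `functionTable` parameter are hypothetical names for illustration; the PR itself only shows the count check:

```scala
// Hypothetical, simplified stand-ins for the data types involved.
sealed trait DataTypeSketch
case object LongTypeSketch extends DataTypeSketch
case object UnknownTypeSketch extends DataTypeSketch

case class FunctionCallSketch(functionName: String) {
  // count(...) always returns a long, so it needs no lookup.
  // Any other function is looked up in an (assumed) function table and
  // falls back to an unknown type when no entry is available.
  def dataType(functionTable: Map[String, DataTypeSketch] = Map.empty): DataTypeSketch =
    if (functionName == "count") LongTypeSketch
    else functionTable.getOrElse(functionName, UnknownTypeSketch)
}
```

With an empty function table, only count resolves to a concrete type; everything else stays unknown until a table is supplied.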