Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

airframe-sql: Fixes #2646 Resolve AllColumns inputs as ResolvedAttributes #2649

Merged
merged 16 commits into from
Dec 20, 2022

Conversation

xerial
Copy link
Member

@xerial xerial commented Dec 17, 2022

An example plan after this fix:
image

  • To resolve arbitrary aggregation function data type, we need to provide function tables. When no function table is available, resolving its DataType as name: ? (Unknown type) is ok. count(...) always returns Long, so added a quick path.
  • I think we don't need to propagate detailed expressions like FunctionCall to upper nodes. For example, the following item (iii) can be ResolvedAttribute with source column information:
    1. table scan (source columns: c1, c2, ...)
    2. function call (source columns: c1, c2, ...)
    3. reference to the result of (ii) (source columns: c1, c2, ...)

@xerial xerial changed the base branch from master to agg-fix December 17, 2022 08:54
@github-actions github-actions bot added the bug label Dec 17, 2022
@xerial xerial changed the title agg resolver fix airframe-sql: Fixes #2646 for count(*) queries Dec 17, 2022
@codecov
Copy link

codecov bot commented Dec 17, 2022

Codecov Report

Merging #2649 (900ae00) into master (6b766a6) will increase coverage by 0.01%.
The diff coverage is 87.75%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2649      +/-   ##
==========================================
+ Coverage   82.12%   82.14%   +0.01%     
==========================================
  Files         334      334              
  Lines       14030    14050      +20     
  Branches     2182     2213      +31     
==========================================
+ Hits        11522    11541      +19     
- Misses       2508     2509       +1     
Impacted Files Coverage Δ
.../scala/wvlet/airframe/sql/model/ResolvedPlan.scala 56.25% <0.00%> (ø)
...scala/wvlet/airframe/sql/parser/SQLGenerator.scala 91.03% <ø> (ø)
.../wvlet/airframe/sql/model/LogicalPlanPrinter.scala 96.55% <50.00%> (-3.45%) ⬇️
...in/scala/wvlet/airframe/sql/model/Expression.scala 67.20% <75.00%> (+1.09%) ⬆️
...ala/wvlet/airframe/sql/analyzer/TypeResolver.scala 93.46% <100.00%> (-0.16%) ⬇️
...n/scala/wvlet/airframe/sql/model/LogicalPlan.scala 89.43% <100.00%> (+0.03%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6b766a6...900ae00. Read the comment docs.

@xerial xerial requested a review from takezoe December 17, 2022 10:31
case expr => if (isSelectItem) SingleColumn(expr, Some(i.value), None, expr.nodeLocation) else expr
case f: FunctionCall =>
// Extract source columns
val src = Set.newBuilder[SourceColumn]
Copy link
Member Author

@xerial xerial Dec 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SourceColumn contains a TableCatalog parameter, so using Set operation might be a bit heavy if the table schema is large.

// override def dataType: DataType = super.dataType
override def dataType: DataType = {
if (functionName == "count") {
DataType.LongType
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

}
col shouldMatch { case ResolvedAttribute("cnt", DataType.LongType, _, sourceColumns, _) => }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering, what's sourceColumns here? It might be better to verify.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the sourceColumns need to be the all of the column data used for generating this ResolvedAttribute. If we use airframe-sql for tracking data lineage, we may need to maintain this property across SQL expression nodes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, if the function is max/min/max_by/min_by etc., source column information will be necessary to show the result contains any PID. If the function is count(1), personal information will be lost, so sourceColumns might be unnecessary because raw column values will not be exposed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sourceColumns seems an empty List in this pull request. Is this expected?

Copy link
Member Author

@xerial xerial Dec 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about defining sourceColumns as input columns without any modifications so that data lineage with nested function calls, such as count(max(..(col))), can be traced in a different manner outside TypeResolver?

If it sounds good, I will fix code and tests accordingly. In this case, cnt is from count(), so sourceColumn should be empty, but the function argument (AllColumns) of FunctionCall("count", Seq(AllColumns(, sourceColumns(...)) ) ) should have a source column.

Copy link
Member

@takezoe takezoe Dec 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like AllColumns (in SELECT clause) isn't resolved before cnt (in HAVING clause) is resolved.

Copy link
Member

@takezoe takezoe Dec 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that this query is actually invalid on Presto:

select id, count(*) cnt from A group by id having cnt > 10

Got the following error:

Query 20221219_122609_00005_r3q5y failed: line 1:84: Column 'cnt' cannot be resolved

Meaning we don't need to resolve cnt in HAVING clause in this case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch 😅

Copy link
Member Author

@xerial xerial Dec 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make explicit that ResolvedAttribute.sourceColumns is given only when the attribute is a direct reference of table columns without any change

Oh, I see.

@@ -379,7 +385,27 @@ object TypeResolver extends LogSupport {
case a: Attribute => a
case m: MultiColumn => m
// retain alias for select column
case expr => if (isSelectItem) SingleColumn(expr, Some(i.value), None, expr.nodeLocation) else expr
case f: FunctionCall =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if the same way can be applied to any expression not only FunctionCall. 🤔

Copy link
Member Author

@xerial xerial Dec 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's also my concern. Maybe converting every type of expression other than already resolved Attribute/MultiColumn as ResolvedAttribute?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% but maybe that should be.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marked as TODO for now

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved every matched expressions as ResolvedAttribute here

Base automatically changed from agg-fix to master December 17, 2022 19:39
GroupingKey(e, e.nodeLocation)
})
val resolvedHaving = having.map {
_.transformUpExpression { case x: Expression =>
resolveExpression(context, x, a.outputAttributes, false)
// Having recognize attributes only from the input relation
resolveExpression(context, x, childOutputAttributes, false)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part had a bug, which was found in
#2649 (comment)

@xerial xerial changed the title airframe-sql: Fixes #2646 for count(*) queries airframe-sql: Fixes #2646 Resolve input parameters as ResolvedAttributes Dec 19, 2022

val prefix = m match {
case t: TableScan =>
s"${ws}[${m.modelName}] ${t.table.fullName}${functionSig}"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For showing the input table name after TableRef is resolved to TableScan

@@ -97,7 +99,7 @@ case class ResolvedAttribute(
case (Some(q), columns) if columns.nonEmpty =>
columns
.map(_.fullName)
.mkString(s"${q},${typeDescription} <- [", ", ", "]")
.mkString(s"${q}.${typeDescription} <- [", ", ", "]")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed a typo

@xerial xerial changed the title airframe-sql: Fixes #2646 Resolve input parameters as ResolvedAttributes airframe-sql: Fixes #2646 Resolve AllColumns inputs as ResolvedAttributes Dec 19, 2022
GroupingKey(e, e.nodeLocation)
})
val resolvedHaving = having.map {
_.transformUpExpression { case x: Expression =>
resolveExpression(context, x, a.outputAttributes, false)
// Having recognize attributes only from the input relation
resolveExpression(context, x, childOutputAttributes)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needed to use the resolved child attributes

Copy link
Member

@takezoe takezoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -324,15 +331,14 @@ object TypeResolver extends LogSupport {
private def findMatchInInputAttributes(
context: AnalyzerContext,
expr: Expression,
inputAttributes: Seq[Attribute],
isSelectItem: Boolean
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MultiColumn is changed to Attribute, so we can use it as SelectItem as is.

@xerial xerial merged commit 7c0616e into master Dec 20, 2022
@xerial xerial deleted the agg-resolver-fix branch December 20, 2022 00:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants