Answer Query Using Materialized Views.

AQUMV for short, is used to compute part or all of a Query from materialized views during planning. It could provide massive improvements in query processing time, especially for aggregation queries over large tables[1]. AQUMV usually uses Incremental Materialized Views(IMV) as candidates, as IMV usually have real time data when there are writable operations on related tables. AQUMV is actually a Equivalent Transformation on Query tree. A materialized view(MV) could be use to compute a Query if: 1.The view contains all rows needed by the query expression. If MV has more rows than query wants, additional filter may be added if possible. 2.All output expressions can be computed from the output of the view. The output expressions could be fully or partially matched from MV's target list. 3.Cost-based. There may be multiple valid MV candidates, or select from MV is not better than select from origin table(ex: has an index and etc), let planner decide the best one. Construct rows by splitting MV query quals(mv_query_quals) and Query quals (origin_query_quals) to difference set and intersection set. And post_quals formed by:{origin_query_quals - mv_query_quals} will be processed by MV query's target list, and rewritten to MV relation's target list expressions. Construct columns expressions using a Greedy Algorithm. Sort the MV query's target list by complexity, and try to rewrite expressions by that order. Expressions that have no Vars are kept to upper(Const Expressions) or be rewritten if there were corresponding expressions. Reference: [1] Optimizing Queries Using Materialized Views: A Practical, Scalable Solution. https://courses.cs.washington.edu/courses/cse591d/01sp/opt_views.pdf Authored-by: Zhang Mingli [email protected]
apache · Dec 1, 2023 · 7522c1c · 7522c1c
1 parent 2bf926f
commit 7522c1c
Show file tree

Hide file tree

Showing 9 changed files with 1,976 additions and 1 deletion.
diff --git a/src/backend/optimizer/README.cbdb.aqumv b/src/backend/optimizer/README.cbdb.aqumv
@@ -0,0 +1,241 @@
+src/backend/optimizer/README.cbdb.aqumv
+
+Portions Copyright (c) 2023, HashData Technology Limited.
+
+Author
+============
+Zhang Mingli   [email protected]
+
+
+Answer Query Using Materialized Views
+=====================================
+
+AQUMV for short, is used to compute part or all of a Query from materialized views during planning.
+It could provide massive improvements in query processing time, especially for aggregation queries over large tables[1].
+
+AQUMV usually uses Incremental Materialized Views(IMV) as candidates, as IMV has real time data
+when there are writable operations on related tables.
+
+Basic Theory
+------------
+
+A materialized view(MV) could be use to compute a Query if:
+1. The view contains all rows needed by the query expression(Construct Rows).
+      If MV has more rows than query wants, additional filter may be added if possible.
+2. All output expressions can be computed from the output of the view(Construct Columns).
+      The output expressions could be fully or partially matched from MV's TargetList.
+3. Cost-based Equivalent Transformation.
+      There may be multiple valid MV candidates, or select from MV is not better than
+      select from origin table(ex: has an index and etc), let planner decide the best one.
+
+Vocabulary:
+      origin_query: the SQL we want to query.
+      mv_query: for materialized view's corresponding query, the SELECT part of a Create Materialized View.
+
+Construct Rows
+--------------
+
+If MV has all rows of query needed, it means that MV query's restrictions are looser than query's restrictions.
+For AQUMV_MVP0, we only do logistic transformation.
+All rewrites are on the Query tree, neither Equivalent Classes nor Restrictions are used.
+For a single relation:
+process mv_query and origin_query's WHERE part to set:
+mv_query_quals and origin_query_quals.
+
+example0:
+      CREATE MATERIALIZED VIEW mv0 AS SELECT * FROM t WHERE a = 1 AND b = 2;
+      Query: SELECT * FROM t WHERE a = 1;
+
+mv_query_quals = {a = 1, b = 2}.
+origin_query_quals = {a = 1}.
+
+1) A MV can't be used if the difference set:  {mv_query_quals - origin_query_quals} is not empty.
+
+It 'typically' means that the MV has less rows than origin_query wants.
+For example0, the difference set is:
+      mv_query_quals - origin_query_quals = {b = 2}.
+
+mv0's all rows meet requirement {a = 1 and b = 2}, but we only want rows {a = 1}.
+mv0 couldn't provide all the rows we want, we can't use it to answer the query.
+
+'typically' means that if there are range quals, this conclusion is not sure.
+But we couldn't handle that for now.
+
+2) The intersection set: {mv_query_quals ∩ origin_query_quals} should be dropped.
+
+If the intersection set is not empty, we choose to drop it.
+example1:
+      CREATE MATERIALIZED VIEW mv1 AS SELECT * FROM t WHERE a = 1;
+      Query: SELECT * FROM t WHERE a = 1;
+
+{mv_query_quals ∩ origin_query_quals} = {a = 1};
+
+It seems everything is good and we have nothing more to do.
+Because the two quals are the same and we could rewrite the SQL to:
+      Rewritten SQL: SELECT * FROM mv1 WHERE a = 1;
+
+As all mv1's rows meet the requirement: a = 1, it's pointless that we do filter a = 1 again at execution.
+
+What's worse is the unnecessary filter {a = 1} will mislead the clause-selectivity of the relation.
+For the example1, a {a = 1} will estimate less rows from relation MV, but as we are clear that all rows
+meet the requirement, and the selectivity from mv1 should be 100%.
+
+Another reason we dropped the intersection set is: we couldn't just keep the intersection set.
+example2:
+      CREATE MATERIALIZED VIEW mv2 AS SELECT b FROM t WHERE a = 1;
+      Query: SELECT b FROM t WHERE a = 1;
+
+{mv_query_quals ∩ origin_query_quals} = {a = 1};
+
+mv2 and origin_query only select column b from t with the same quals {a = 1}.
+If the intersection set is kept, we will get a wrong SQL:
+      Wrong: SELECT b FROM mv2 WHERE a = 1;
+
+mv2 doesn't have column a, the SQL will get a syntax error.
+
+It's not always impossible to keep the intersection set, for
+example3:
+      CREATE MATERIALIZED VIEW mv3 AS SELECT a, b FROM t WHERE a = 1;
+      Query: SELECT a, b FROM t WHERE a = 1;
+
+We could rewrite it to:
+      SELECT a, b FROM mv3 WHERE a = 1;
+
+There is a way to see if it's possible to rewrite that, but it isn't worth trying according to
+what we mentioned above.
+
+The disadvantages of dropping the intersection set of mv_query_quals and origin_query_quals is:
+We may lose some Equivalent Classes if there are equal operations like: a = 1.
+But not for other operations, ex: c > 1, because Postgres only have Equivalent Class for equal operations.
+And we haven't taken Equivalent Class into account for AQUMV_MVP0, it's reasonable to drop that.
+
+3) process difference set: {origin_query_quals - mv_query_quals}
+If 1) and 2) passed, the difference set on the other hand, we call it post_quals:
+
+      post_quals = {origin_query_quals - mv_query_quals}
+
+The MV has more rows than query if post_quals is not empty.
+We have to add it to MV to filter the rows query want.
+example4:
+      CREATE MATERIALIZED VIEW mv4 AS SELECT a, b FROM t WHERE a = 1;
+      Query: SELECT a, b FROM t WHERE a = 1 and b = 2;
+
+We could rewrite it to:
+      SELECT a, b FROM mv4 WHERE b = 2;
+
+All rows in MV are {a = 1} ones as the MV defination, we only need to add the extra filter {b = 2}.
+
+But it's not always true, if we don't have the columns that the post_quals need.
+example5:
+      CREATE MATERIALIZED VIEW mv5 AS SELECT a FROM t WHERE a = 1;
+      Query: SELECT a FROM t WHERE a = 1 and b = 2;
+
+mv5 has all rows {a = 1} and only have column 'a', but the query want additional filter {b = 2}.
+We couldn't rewrite it by just adding the {b = 2} to MV as no equivalent b in MV relation.
+      Wrong: SELECT a FROM mv5 WHERE b = 2;
+
+The algorithem behind that is: all quals's expression could be computed from a mv_query's target list.
+That's what Construct Columns does.
+
+Construct Columns
+-----------------
+
+A MV could be a candidate if the query's target list and the post_quals could be computed form
+mv_query's target list and rewrite to expressions bases on MV relation's columns itself.
+
+example6:
+      CREATE MATERIALIZED VIEW mv6 AS SELECT abs(c) as mc1, b as mc2 FROM t WHERE a = 1;
+      Query: SELECT abs(c) as res1 FROM t WHERE a = 1 and b = 2;
+
+The post_quals is: {b = 2} while column b exists in mv6, corresponding to mc2 with alias.
+We can rewrite post_quals to {mc2 = 2}.
+
+The query wants a target abs(c) with an alias res1, while expression abs(c) exists in mv6,
+corresponding to column mc1 with alias.
+Then we can rewrite SQL to:
+
+      Rewrite: SELECT mc1 as res1 FROM mv6 WHERE mc2 = 2;
+
+The expression abs(c) is eliminated and simplified to a column reference mc1, and the alias is kept.
+
+Things become complex when there are multiple expression candidates, and some ones could be
+part of others.
+
+example7:
+      CREATE MATERIALIZED VIEW mv7 AS
+            SELECT c1 AS mc1, c2 AS mc2, abs(c2) AS mc3, abs(abs(c2) - c1 - 1) AS mc4
+            FROM t1 WHERE c1 > 30 AND c1 < 40;
+      Query: SELECT sqrt(abs(abs(c2) - c1 - 1) + abs(c2)) AS res1 FROM t1 WHERE c1 > 30 AND c1 < 40 AND c2 > 23;
+
+There are many choices to construct column res1:
+      sqrt(abs(abs(mc2) - mc1 - 1) + abs(mc2))  // constructed by mc1, mc2
+      sqrt(abs(abs(mc2) - mc1 - 1) + mc3))      // constructed by mc1, mc2, mc3
+      sqrt(abs(mc3 - mc1 - 1) + abs(mc2)))      // constructed by mc1, mc2, mc3
+      sqrt(abs(mc3 - mc1 - 1) + mc3))           // constructed by mc1, mc3
+      sqrt(mc4 + mc3))                          // constructed by mc3, mc4
+
+Obviously, the best one is sqrt(mc4 + mc3) which avoids much of expression execution for each row.
+
+We try to use the most-submatched expression to do a first rewrite and then next.
+It's not only optimization, but also unnecessary for some cases that a less-matched expression
+rewrite may close the door for more-matched ones, especially for post_quals rewrite.
+example8:
+      CREATE MATERIALIZED VIEW mv8 AS
+            SELECT c2 as mc3, c2 AS mc2, abs(c2) AS m_abs_c2
+            FROM t1 WHERE c1 > 1;
+      Query: SELECT c3 AS res1 FROM t1 WHERE c1 > 1 and (abs(c2) - c1 - 1) > 10;
+
+      post_quals: {(abs(c2) - c1 - 1) > 10}
+
+If we choose less-matched mc2 to rewrite, an intermediate expression would be:
+      {(abs(mc2) - c1 - 1) > 10}
+
+But mv8 don't have a corresponding column c1 to continue the work, that's bad and we will lose
+the chance to use it.
+
+The approach is: use a Greedy Algorithm to rewrite the target expression.
+
+First,  Split the MV query's expressions to pure-Var and nonpure-Var ones.
+Because pure Var expression is always the leaf of an expression tree if it needs to be rewritten.
+
+Sort the nonpure-Var expressions by complexity.
+We don't need an absolute order for every expression.
+All we need to guarantee is that:
+if expression A is sub part of expression B, put A after B.
+
+The approach applies to post_quals rewrite too.
+
+Expressions that have no Vars are kept to upper(ex: Const Expressions) or rewritten if there were
+corresponding expressions.
+
+
+Cost-based
+----------
+
+There could be multiple candidates after equivalent transformation.
+After all things is done for a materialized view candidate, build a plan to compare with current one.
+Let the planner decide the best one.
+
+AQUMV_MVP
+---------
+Support SELECT FROM a single relation both for mv_query and the origin_query.
+Below are not supported now:
+      AGG
+      Subquery
+      Order by(for origin_query)
+      Join
+      Sublink
+      Group by
+      Window Functions
+      CTE
+      Distinct On
+      Refresh Materialized View
+      Create AS
+      Partition Tables
+      Inherit Tables
+
+Reference:
+   [1] Optimizing Queries Using Materialized Views: A Practical, Scalable Solution.
+         https://courses.cs.washington.edu/courses/cse591d/01sp/opt_views.pdf
+   [2] Automated Selection of Materialized Views and Indexes for SQL Databases.
+         https://www.vldb.org/conf/2000/P496.pdf
diff --git a/src/backend/optimizer/plan/planmain.c b/src/backend/optimizer/plan/planmain.c
@@ -240,7 +240,12 @@ query_planner(PlannerInfo *root,
 	 * generate pathkeys in canonical form; so compute query_pathkeys and
 	 * other pathkeys fields in PlannerInfo.
 	 */
+	/* AQUMV_FIXME_MVP: qp_callback may be NULL. */
+#if 0
 	(*qp_callback) (root, qp_extra);
+#endif
+	if (qp_callback != NULL)
+		(*qp_callback) (root, qp_extra);
 
 	/*
 	 * Examine any "placeholder" expressions generated during subquery pullup.