[feature](mtmv) pick some mtmv pr from master #37651

Conversation

seawinde
Contributor

@seawinde seawinde commented Jul 11, 2024

Proposed changes

cherry-pick to 2.1
pr: #36318
commitId: c199947

pr: #36111
commitId: 35ebef6

pr: #36175
commitId: 4c8e66b

pr: #36414
commitId: 5e009b5

pr: #36770
commitId: 19e2126

pr: #36567
commitId: 3da8351

…e function is distinct (apache#36318)

## Proposed changes

This extends the ability to rewrite queries using materialized views.
For example, given the following mv definition:
>            CREATE MATERIALIZED VIEW mv1
>             BUILD IMMEDIATE REFRESH COMPLETE ON MANUAL
>             DISTRIBUTED BY RANDOM BUCKETS 2
>             PROPERTIES ('replication_num' = '1') 
>             AS
>              select
>              count(o_totalprice),
>              o_shippriority,
>              o_orderstatus,
>              bin(o_orderkey)
>              from orders
>              group by
>              o_orderstatus,
>              o_shippriority,
>              bin(o_orderkey);

The query below can be rewritten successfully using the materialized view.
Although `sum(distinct o_shippriority)` in the query does not appear in the mv
output, the query aggregate function is distinct and its argument is a
group by dimension of the mv. In this case, `sum(distinct o_shippriority)`
can be computed directly from the mv group by dimension `o_shippriority`,
and the result is correct.

The following distinct aggregate functions are supported currently; others
will be supported in the future on demand:

- max(distinct arg)
- min(distinct arg)
- sum(distinct arg)
- avg(distinct arg)
- count(distinct arg)

>             select
>              count(o_totalprice),
>              max(distinct o_shippriority),
>              min(distinct o_shippriority),
>              avg(distinct o_shippriority),
>              sum(distinct o_shippriority) / count(distinct o_shippriority),
>              o_orderstatus,
>              bin(o_orderkey)
>              from orders
>              group by
>              o_orderstatus,
>              bin(o_orderkey);
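
Why the rewrite is safe can be checked with a small sketch. It assumes SQLite (via Python's `sqlite3`) as a stand-in engine, a plain table `mv1` in place of a real materialized view, invented sample rows, and `o_orderkey` in place of `bin(o_orderkey)` because SQLite has no `bin` function. It only illustrates the equivalence; it is not how Doris performs the rewrite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (o_orderkey INT, o_orderstatus TEXT,
                         o_shippriority INT, o_totalprice REAL);
    INSERT INTO orders VALUES
        (1, 'o', 1, 10.0), (1, 'o', 1, 20.0), (1, 'o', 2, 30.0),
        (2, 'f', 2, 40.0), (2, 'f', 2, 50.0), (2, 'f', 3, 60.0);
    -- Stand-in for the mv: a plain table holding the grouped result.
    CREATE TABLE mv1 AS
        SELECT count(o_totalprice) AS cnt, o_shippriority,
               o_orderstatus, o_orderkey
        FROM orders
        GROUP BY o_orderstatus, o_shippriority, o_orderkey;
""")

# The query evaluated directly on the base table.
direct = conn.execute("""
    SELECT o_orderstatus, o_orderkey, count(o_totalprice),
           max(DISTINCT o_shippriority), min(DISTINCT o_shippriority),
           sum(DISTINCT o_shippriority), count(DISTINCT o_shippriority),
           avg(DISTINCT o_shippriority)
    FROM orders
    GROUP BY o_orderstatus, o_orderkey
    ORDER BY o_orderstatus, o_orderkey
""").fetchall()

# The same query answered from the mv: count is rolled up with sum(cnt),
# while the distinct aggregates read the group by dimension directly.
rewritten = conn.execute("""
    SELECT o_orderstatus, o_orderkey, sum(cnt),
           max(DISTINCT o_shippriority), min(DISTINCT o_shippriority),
           sum(DISTINCT o_shippriority), count(DISTINCT o_shippriority),
           avg(DISTINCT o_shippriority)
    FROM mv1
    GROUP BY o_orderstatus, o_orderkey
    ORDER BY o_orderstatus, o_orderkey
""").fetchall()

print(direct == rewritten)
```

Grouping by `o_shippriority` in the mv keeps every distinct value of that column per group, which is all a distinct aggregate over it needs.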
@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@seawinde
Contributor Author

run buildall

seawinde added 5 commits July 11, 2024 15:00
…async mv (apache#36111)

Support using current_date() when creating an async materialized view by
adding 'enable_nondeterministic_function' = 'true' to the properties when
creating the materialized view. `enable_nondeterministic_function` defaults
to false.

Here is an example; it will succeed:

>        CREATE MATERIALIZED VIEW mv_name
>        BUILD DEFERRED REFRESH AUTO ON MANUAL
>        DISTRIBUTED BY RANDOM BUCKETS 2
>        PROPERTIES (
>        'replication_num' = '1',
>        'enable_nondeterministic_function' = 'true'
>        )
>        AS
>       SELECT *, unix_timestamp(k3, '%Y-%m-%d %H:%i-%s') from ${tableName} where current_date() > k3;

Note:
unix_timestamp is nondeterministic when it has no params. It is
deterministic when it has params, in which case it converts the given
date value (here, column k3).

Another example; it will also succeed:

>        CREATE MATERIALIZED VIEW mv_name
>        BUILD DEFERRED REFRESH AUTO ON MANUAL
>        DISTRIBUTED BY RANDOM BUCKETS 2
>        PROPERTIES (
>        'replication_num' = '1',
>        'enable_nondeterministic_function' = 'true'
>        )
>        AS
>       SELECT *, unix_timestamp() from ${tableName} where current_date() > k3;

Although unix_timestamp() is nondeterministic, creation succeeds because
'enable_nondeterministic_function' = 'true' is set in the properties.
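
The rule described above can be sketched as a small check. All names here (`is_nondeterministic`, `can_create_mv`, the `(function_name, arg_count)` representation) are hypothetical and chosen for illustration; this is not Doris's actual implementation, and only the two functions mentioned in this commit are modelled.

```python
# Hypothetical sketch of the rule; not Doris's real code.
ALWAYS_NONDETERMINISTIC = {"current_date"}


def is_nondeterministic(name, arg_count):
    """Return True if a call to `name` with `arg_count` args is nondeterministic."""
    if name == "unix_timestamp":
        # Nondeterministic only when called without params.
        return arg_count == 0
    return name in ALWAYS_NONDETERMINISTIC


def can_create_mv(function_calls, properties):
    """function_calls: iterable of (function_name, arg_count) used in the mv query."""
    # The property defaults to false when it is not given.
    allowed = properties.get("enable_nondeterministic_function", "false") == "true"
    has_nondeterministic = any(
        is_nondeterministic(name, count) for name, count in function_calls
    )
    return allowed or not has_nondeterministic


props_on = {"replication_num": "1", "enable_nondeterministic_function": "true"}
props_off = {"replication_num": "1"}

# First example: unix_timestamp(k3, fmt) plus current_date().
print(can_create_mv([("unix_timestamp", 2), ("current_date", 0)], props_on))
# Same query without the property.
print(can_create_mv([("unix_timestamp", 2), ("current_date", 0)], props_off))
```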
…by (apache#36175)

This was introduced by apache#35562.

After that PR, creating a partitioned materialized view like the
following fails with the message:
Unable to find a suitable base table for partitioning

CREATE MATERIALIZED VIEW mvName
BUILD IMMEDIATE REFRESH AUTO ON MANUAL
PARTITION BY (date_trunc(month_alias, 'month'))
DISTRIBUTED BY RANDOM BUCKETS 2
PROPERTIES (
  'replication_num' = '1'
)
AS
SELECT date_trunc(`k2`,'day') AS month_alias, k3, count(*) 
FROM tableName GROUP BY date_trunc(`k2`,'day'), k3;

This PR supports creating a partitioned materialized view when `date_trunc`
is used in the group by clause.
… rewrite by partition rolled up mv (apache#36414)

This was introduced by apache#35562.

When the mv is a partition rolled-up mv (rolled up by date_trunc) and the
base table adds a new partition, a query that is rewritten successfully by
the partitioned mv loses the new partition's data. This PR fixes the
problem. For example:

mv def is:

CREATE MATERIALIZED VIEW roll_up_mv
BUILD IMMEDIATE REFRESH AUTO ON MANUAL
partition by (date_trunc(`col1`, 'month'))
DISTRIBUTED BY RANDOM BUCKETS 2
PROPERTIES ('replication_num' = '1')
AS
select date_trunc(`l_shipdate`, 'day') as col1, l_shipdate, o_orderdate, l_partkey,
   l_suppkey, sum(o_totalprice) as sum_total
   from lineitem
left join orders on lineitem.l_orderkey = orders.o_orderkey and l_shipdate = o_orderdate
   group by
   col1,
   l_shipdate,
   o_orderdate,
   l_partkey,
   l_suppkey;

If we run the insert command:

insert into lineitem values
    (1, 2, 3, 4, 5.5, 6.5, 7.5, 8.5, 'o', 'k', '2023-11-21', '2023-11-21', '2023-11-21', 'a', 'b', 'yyyyyyyyy');

and then run the following query, the result does not return the data in the 2023-11-21 partition:

select date_trunc(`l_shipdate`, 'day') as col1, l_shipdate, o_orderdate, l_partkey,
   l_suppkey, sum(o_totalprice) as sum_total
   from lineitem
left join orders on lineitem.l_orderkey = orders.o_orderkey and l_shipdate = o_orderdate
   group by
   col1,
   l_shipdate,
   o_orderdate,
   l_partkey,
   l_suppkey;
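
The data loss can be reproduced in miniature. The sketch below assumes SQLite (via Python's `sqlite3`), a heavily simplified `lineitem` schema with invented rows, and a plain table `roll_up_mv` that is built once and never refreshed. It shows only why answering the query from a stale mv drops the new row; it does not model Doris partition tracking or the fix itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-in schema; the real lineitem has many more columns.
    CREATE TABLE lineitem (l_orderkey INT, l_shipdate TEXT, l_quantity INT);
    INSERT INTO lineitem VALUES (1, '2023-10-17', 4), (2, '2023-10-18', 6);

    -- Stand-in for the mv: built once, not refreshed afterwards.
    CREATE TABLE roll_up_mv AS
        SELECT l_shipdate, sum(l_quantity) AS sum_qty
        FROM lineitem
        GROUP BY l_shipdate;

    -- New data arrives in the base table after the mv was built.
    INSERT INTO lineitem VALUES (3, '2023-11-21', 5);
""")

from_base = conn.execute("""
    SELECT l_shipdate, sum(l_quantity)
    FROM lineitem
    GROUP BY l_shipdate
    ORDER BY l_shipdate
""").fetchall()

# What a rewrite would return if it read only the stale mv.
from_mv = conn.execute(
    "SELECT l_shipdate, sum_qty FROM roll_up_mv ORDER BY l_shipdate"
).fetchall()

# Rows the mv-only answer is missing.
missing = sorted(set(from_base) - set(from_mv))
print(missing)
```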
…stability (apache#36770)

When a query is rewritten by a materialized view using union rewrite, the
final plan chosen by the CBO is unstable.
So the regression test only checks whether the mv rewrite succeeds, and
does not check whether the rewritten plan is chosen by the CBO.
Making sure it is chosen by the CBO will be addressed thoroughly in
another PR.
…n, because low level mv aggregate roll up (apache#36567)

The query is an aggregate, and it has fewer group by expressions than the
materialized view.
When the extra dimensions in the materialized view can be eliminated with
functional dependencies, the query can be rewritten without a roll-up
aggregate.
For example, the mv definition is:

CREATE MATERIALIZED VIEW mv
BUILD IMMEDIATE REFRESH AUTO ON MANUAL
DISTRIBUTED BY RANDOM BUCKETS 2
PROPERTIES ('replication_num' = '1')
AS 
select 
  l_orderkey, 
  l_partkey, 
  l_suppkey, 
  o_orderkey, 
  o_custkey, 
  ps_partkey, 
  cast(
    sum(
      IFNULL(ps_suppkey, 0) * IFNULL(ps_partkey, 0)
    ) as decimal(28, 8)
  ) as agg2 
from 
  lineitem_1 
  inner join orders_1 on lineitem_1.l_orderkey = orders_1.o_orderkey
  inner join partsupp_1 on l_partkey = partsupp_1.ps_partkey 
  and l_suppkey = partsupp_1.ps_suppkey 
where 
  partsupp_1.ps_suppkey > 1 
group by 
  l_orderkey, 
  l_partkey, 
  l_suppkey, 
  o_orderkey, 
  o_custkey, 
  ps_partkey;

The query is as follows:

select 
  l_orderkey, 
  l_partkey, 
  l_suppkey, 
  o_orderkey, 
  o_custkey, 
  cast(
    sum(
      IFNULL(ps_suppkey, 0) * IFNULL(ps_partkey, 0)
    ) as decimal(28, 8)
  ) as agg2 
from 
  lineitem_1 
  inner join orders_1 on lineitem_1.l_orderkey = orders_1.o_orderkey
  inner join partsupp_1 on l_partkey = partsupp_1.ps_partkey 
  and l_suppkey = partsupp_1.ps_suppkey 
where 
  partsupp_1.ps_suppkey > 1 
group by 
  l_orderkey, 
  l_partkey, 
  l_suppkey, 
  o_orderkey, 
  o_custkey;

We can see that the query doesn't use `ps_partkey`, which is in the mv
group by expressions.
Normally a roll-up aggregate is added on top of the materialized view if
the mv has more group by dimensions than the query.
In this case, however, we can derive the functional dependency from
`l_suppkey = ps_suppkey`, so we don't need to add a roll-up aggregate on the
materialized view in the rewritten plan. This improves performance and
benefits nested materialized view rewrite.
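
A small check of this claim, assuming SQLite (via Python's `sqlite3`) as a stand-in engine, a plain table `mv` in place of the materialized view, invented rows, and the `decimal` cast dropped for brevity. In the example's join, `l_partkey = ps_partkey`, so every mv group carries exactly one `ps_partkey` value and dropping that column cannot merge groups. The sketch is an illustration of that reasoning, not Doris's rewrite logic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lineitem_1 (l_orderkey INT, l_partkey INT, l_suppkey INT);
    CREATE TABLE orders_1 (o_orderkey INT, o_custkey INT);
    CREATE TABLE partsupp_1 (ps_partkey INT, ps_suppkey INT);
    INSERT INTO lineitem_1 VALUES (1, 10, 2), (1, 10, 2), (2, 11, 3), (3, 12, 1);
    INSERT INTO orders_1 VALUES (1, 100), (2, 200), (3, 300);
    INSERT INTO partsupp_1 VALUES (10, 2), (11, 3), (12, 1), (10, 3);

    -- Stand-in for the mv: grouped by the query dimensions plus ps_partkey.
    CREATE TABLE mv AS
        SELECT l_orderkey, l_partkey, l_suppkey, o_orderkey, o_custkey,
               ps_partkey,
               sum(IFNULL(ps_suppkey, 0) * IFNULL(ps_partkey, 0)) AS agg2
        FROM lineitem_1
        INNER JOIN orders_1 ON lineitem_1.l_orderkey = orders_1.o_orderkey
        INNER JOIN partsupp_1 ON l_partkey = partsupp_1.ps_partkey
                             AND l_suppkey = partsupp_1.ps_suppkey
        WHERE partsupp_1.ps_suppkey > 1
        GROUP BY l_orderkey, l_partkey, l_suppkey, o_orderkey, o_custkey,
                 ps_partkey;
""")

# The query evaluated directly on the base tables (no ps_partkey dimension).
direct = conn.execute("""
    SELECT l_orderkey, l_partkey, l_suppkey, o_orderkey, o_custkey,
           sum(IFNULL(ps_suppkey, 0) * IFNULL(ps_partkey, 0)) AS agg2
    FROM lineitem_1
    INNER JOIN orders_1 ON lineitem_1.l_orderkey = orders_1.o_orderkey
    INNER JOIN partsupp_1 ON l_partkey = partsupp_1.ps_partkey
                         AND l_suppkey = partsupp_1.ps_suppkey
    WHERE partsupp_1.ps_suppkey > 1
    GROUP BY l_orderkey, l_partkey, l_suppkey, o_orderkey, o_custkey
    ORDER BY l_orderkey
""").fetchall()

# Answered from the mv with a plain projection: no GROUP BY, no roll-up.
from_mv = conn.execute("""
    SELECT l_orderkey, l_partkey, l_suppkey, o_orderkey, o_custkey, agg2
    FROM mv
    ORDER BY l_orderkey
""").fetchall()

print(direct == from_mv)
```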
@seawinde
Contributor Author

run buildall

@morrySnow morrySnow changed the title from "[feature](mtmv) Support to use mv group dimension when query aggregate function is distinct (#36318)" to "[feature](mtmv) pick some mtmv pr from master" on Jul 12, 2024
@morrySnow morrySnow merged commit ffa9e49 into apache:branch-2.1 Jul 12, 2024
22 of 23 checks passed
@yiguolei yiguolei mentioned this pull request Jul 19, 2024
1 task
3 participants