关于StarRocks物化视图分区刷新的验证
背景
本次验证的场景是订单表基于日期按天分区的场景,主要验证物化视图是否能正常的按天增量刷新,以及验证是否能忽略基表刷新(excluded_refresh_tables)
涉及表:
- 字典表ods_dict、字典物化视图dwd_dict_mv
- 订单表ods_order
- 订单表关联字典表 物化视图 dwd_order_mv1(关联table字典)、dwd_order_mv2(关联mv字典)
版本3.3.6
-3.3.8
,存算分离
本次验证结论非常惊恐,无法控制基表的刷新导致的增量刷新。
测试数据
ods_order
1
2
3
4
5
6
7
8
9
10
11CREATE TABLE IF NOT EXISTS test.ods_order
(
id INT,
created DATETIME,
type STRING,
k1 INT,
v2 INT
)PRIMARY KEY (`id`,created)
PARTITION BY date_trunc('day', `created`);
insert into ods_order values(1,'2024-11-01 11:17:41',"1",1,1),(2,'2024-11-02 22:17:41',"1",1,1),(3,'2024-11-02 22:17:41',"1",1,1)1
2
3
4id created type k1 v2
1 2024-11-01 11:17:41 1 1 1
2 2024-11-02 22:17:41 1 1 1
3 2024-11-02 22:17:41 1 1 1ods_dict
1
2
3
4
5
6
7
8CREATE TABLE IF NOT EXISTS test.ods_dict
(
id INT,
dict_label STRING,
dict_value STRING
)PRIMARY KEY (`id`)
insert into ods_dict values(1,'电商订单',"1"),(2,'普通订单',"2")1
2
3id dict_label dict_value
2 普通订单 2
1 电商订单 1dwd_dict_mv
建这张表的目的是验证关联table和关联mv对分区刷新的差异,注意这里将刷新时间设置成了10天
1
2
3
4
5CREATE MATERIALIZED VIEW dwd_dict_mv
DISTRIBUTED BY HASH(id)
REFRESH ASYNC START('2024-10-01 10:00:00') EVERY (interval 10 DAY)
AS SELECT *
FROM ods_dict
创建物化视图
定时3分钟刷新
- dwd_order_mv1
关联的ods_dict
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18CREATE MATERIALIZED VIEW dwd_order_mv1
COMMENT "dwd-订单"
REFRESH ASYNC START('2024-10-01 10:00:00') EVERY(INTERVAL 3 MINUTE)
PARTITION BY date_trunc('day', created)
PROPERTIES(
"partition_refresh_number" = "30"
)
AS
SELECT
o.id,
o.created,
o.type,
dict.dict_label as type_label,
o.k1,
o.v2
FROM
ods_order o
LEFT JOIN ods_dict dict ON o.type = dict.dict_value - dwd_order_mv2
关联的dwd_dict_mv
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18CREATE MATERIALIZED VIEW dwd_order_mv2
COMMENT "dwd-订单"
REFRESH ASYNC START('2024-10-01 10:00:00') EVERY(INTERVAL 3 MINUTE)
PARTITION BY date_trunc('day', created)
PROPERTIES(
"partition_refresh_number" = "30"
)
AS
SELECT
o.id,
o.created,
o.type,
dict.dict_label as type_label,
o.k1,
o.v2
FROM
ods_order o
LEFT JOIN dwd_dict_mv dict ON o.type = dict.dict_value
验证
我的需求是ods_order订单表关联字典清洗一张dwd_order_mv的物化视图,这里分别验证关联 table或mv的差异
ods_order更新
增加1条11月3号的数据
1
insert into ods_order values(4,'2024-11-03 11:17:41',"1",1,1)
✅ 结论: dwd_order_mv1、dwd_order_mv2各自增量刷新了11月3号的分区数据,并且在后续没有新增数据的情况下,没有再刷新任何分区数据
ods_dict更新
1
UPDATE `ods_dict` SET `dict_label` = '电商订单_new' WHERE id = 1;
这里我们一定要记得dwd_dict_mv表是设置了10天刷新,所以这时候dwd_dict_mv的数据是还没有更新的
1
2
3
4
5
6
7
8
9--- ods_dict
id dict_label dict_value
1 电商订单_new 1
2 普通订单 2
--- dwd_dict_mv
id dict_label dict_value
1 电商订单 1
2 普通订单 2那么dwd_order_mv1和dwd_order_mv2的刷新会有什么表现呢?
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53-- dwd_order_mv1
{
"forceRefresh": false,
"mvPartitionsToRefresh": ["p20241102_20241103", "p20241101_20241102", "p20241103_20241104"],
"refBasePartitionsToRefreshMap": {
"ods_order": ["p20241101", "p20241102", "p20241103"]
},
"basePartitionsToRefreshMap": {
"ods_dict": ["ods_dict"],
"ods_order": ["p20241101", "p20241102", "p20241103"]
},
"processStartTime": 1733076180617,
"executeOption": {
"priority": 0,
"taskRunProperties": {
"mvId": "1337190"
},
"isMergeRedundant": true,
"isManual": false,
"isSync": false,
"isReplay": false
},
"planBuilderMessage": {
"ods_order": "p20241101,p20241102,p20241103"
}
}
-- dwd_order_mv2
{
"forceRefresh": false,
"mvPartitionsToRefresh": ["p20241102_20241103", "p20241101_20241102", "p20241103_20241104"],
"refBasePartitionsToRefreshMap": {
"ods_order": ["p20241101", "p20241102", "p20241103"]
},
"basePartitionsToRefreshMap": {
"dwd_dict_mv": ["dwd_dict_mv"],
"ods_order": ["p20241101", "p20241102", "p20241103"]
},
"processStartTime": 1733076181616,
"executeOption": {
"priority": 0,
"taskRunProperties": {
"mvId": "1337239"
},
"isMergeRedundant": true,
"isManual": false,
"isSync": false,
"isReplay": false
},
"planBuilderMessage": {
"ods_order": "p20241101,p20241102,p20241103"
}
}
⚠️结论1:我们惊恐的发现,即使dwd_dict_mv数据还没有因为设置的10天刷新过来,但是,物化视图dwd_order_mv2(ods_order 关联 dwd_dict_mv)却进行了刷新!!
⚠️结论2:接着就是更为糟糕的在每一次dwd_order_mv2物化视图3分钟刷新时间,每一次都进行了全分区刷新!!!
在出现这种异常情况后,我决定试一试excluded_trigger_tables
和excluded_refresh_tables
重新创建物化视图
记得先恢复一下数据
UPDATE
ods_dictSET
dict_label= '电商订单' WHERE id = 1;
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43---dwd_order_mv1配置忽略ods_dict更新导致物化视图更新
CREATE MATERIALIZED VIEW dwd_order_mv1
COMMENT "dwd-订单"
REFRESH ASYNC START('2024-10-01 10:00:00') EVERY(INTERVAL 3 MINUTE)
PARTITION BY date_trunc('day', created)
PROPERTIES(
"partition_refresh_number" = "30",
"excluded_trigger_tables" = "test.ods_dict",
"excluded_refresh_tables" = "test.ods_dict"
)
AS
SELECT
o.id,
o.created,
o.type,
dict.dict_label as type_label,
o.k1,
o.v2
FROM
ods_order o
LEFT JOIN ods_dict dict ON o.type = dict.dict_value
--- dwd_order_mv2配置忽略dwd_dict_mv更新导致物化视图更新
CREATE MATERIALIZED VIEW dwd_order_mv2
COMMENT "dwd-订单"
REFRESH ASYNC START('2024-10-01 10:00:00') EVERY(INTERVAL 3 MINUTE)
PARTITION BY date_trunc('day', created)
PROPERTIES(
"partition_refresh_number" = "30",
"excluded_trigger_tables" = "test.dwd_dict_mv",
"excluded_refresh_tables" = "test.dwd_dict_mv"
)
AS
SELECT
o.id,
o.created,
o.type,
dict.dict_label as type_label,
o.k1,
o.v2
FROM
ods_order o
LEFT JOIN dwd_dict_mv dict ON o.type = dict.dict_value⚠️结论3:你猜出现了什么事情,在没有任何基表刷新的时候,dwd_order_mv2居然在第一次全量全分区刷新后的定时刷新中全分区刷新,而且是每一次3分钟刷新物化视图的时候再全量全分区刷新
更新 ods_dict
1
UPDATE `ods_dict` SET `dict_label` = '电商订单_new' WHERE id = 1;
✅结论4:dwd_order_mv1对excluded_trigger_tables和excluded_refresh_tables的配置生效, 配置的基表刷新了。没有让dwd_order_mv1进行刷新
⚠️结论5:dwd_order_mv2对excluded_trigger_tables和excluded_refresh_tables的配置不生效,永远在全量全分区刷新
结论
如果物化视图关联的基表是一个table,那么和分区刷新相关的都正常,
如果物化视图关联的基表是一个mv,即使这个基表没有任何更新,在第一次刷新后的任意一次定时刷新,都是全量刷新全分区,
这究竟是本身不支持,还是一个究极大BUG?
其他
查看dwd_order_mv2和dwd_order_mv1的最近10次刷新历史
1 | (SELECT |