关于StarRocks物化视图分区刷新的验证
背景
本次验证的场景是订单表基于日期按天分区的场景,主要验证物化视图是否能正常的按天增量刷新,以及验证是否能忽略基表刷新(excluded_refresh_tables)
涉及表:
- 字典表ods_dict、字典物化视图dwd_dict_mv
- 订单表ods_order
- 订单表关联字典表 物化视图 dwd_order_mv1(关联table字典)、dwd_order_mv2(关联mv字典)
版本3.3.6-3.3.8,存算分离
本次验证结论非常惊恐,无法控制基表的刷新导致的增量刷新。
测试数据
- ods_order - 1 
 2
 3
 4
 5
 6
 7
 8
 9
 10
 11- CREATE TABLE IF NOT EXISTS test.ods_order 
 (
 id INT,
 created DATETIME,
 type STRING,
 k1 INT,
 v2 INT
 )PRIMARY KEY (`id`,created)
 PARTITION BY date_trunc('day', `created`);
 insert into ods_order values(1,'2024-11-01 11:17:41',"1",1,1),(2,'2024-11-02 22:17:41',"1",1,1),(3,'2024-11-02 22:17:41',"1",1,1)- 1 
 2
 3
 4- id created type k1 v2 
 1 2024-11-01 11:17:41 1 1 1
 2 2024-11-02 22:17:41 1 1 1
 3 2024-11-02 22:17:41 1 1 1
- ods_dict - 1 
 2
 3
 4
 5
 6
 7
 8- CREATE TABLE IF NOT EXISTS test.ods_dict 
 (
 id INT,
 dict_label STRING,
 dict_value STRING
 )PRIMARY KEY (`id`)
 insert into ods_dict values(1,'电商订单',"1"),(2,'普通订单',"2")- 1 
 2
 3- id dict_label dict_value 
 2 普通订单 2
 1 电商订单 1
- dwd_dict_mv - 建这张表的目的是验证关联table和关联mv对分区刷新的差异,注意这里将刷新时间设置成了10天 - 1 
 2
 3
 4
 5- CREATE MATERIALIZED VIEW dwd_dict_mv 
 DISTRIBUTED BY HASH(id)
 REFRESH ASYNC START('2024-10-01 10:00:00') EVERY (interval 10 DAY)
 AS SELECT *
 FROM ods_dict
创建物化视图
定时3分钟刷新
- dwd_order_mv1关联的ods_dict 1 
 2
 3
 4
 5
 6
 7
 8
 9
 10
 11
 12
 13
 14
 15
 16
 17
 18CREATE MATERIALIZED VIEW dwd_order_mv1 
 COMMENT "dwd-订单"
 REFRESH ASYNC START('2024-10-01 10:00:00') EVERY(INTERVAL 3 MINUTE)
 PARTITION BY date_trunc('day', created)
 PROPERTIES(
 "partition_refresh_number" = "30"
 )
 AS
 SELECT
 o.id,
 o.created,
 o.type,
 dict.dict_label as type_label,
 o.k1,
 o.v2
 FROM
 ods_order o
 LEFT JOIN ods_dict dict ON o.type = dict.dict_value
- dwd_order_mv2关联的dwd_dict_mv 1 
 2
 3
 4
 5
 6
 7
 8
 9
 10
 11
 12
 13
 14
 15
 16
 17
 18CREATE MATERIALIZED VIEW dwd_order_mv2 
 COMMENT "dwd-订单"
 REFRESH ASYNC START('2024-10-01 10:00:00') EVERY(INTERVAL 3 MINUTE)
 PARTITION BY date_trunc('day', created)
 PROPERTIES(
 "partition_refresh_number" = "30"
 )
 AS
 SELECT
 o.id,
 o.created,
 o.type,
 dict.dict_label as type_label,
 o.k1,
 o.v2
 FROM
 ods_order o
 LEFT JOIN dwd_dict_mv dict ON o.type = dict.dict_value
验证
我的需求是ods_order订单表关联字典清洗一张dwd_order_mv的物化视图,这里分别验证关联 table或mv的差异
- ods_order更新 - 增加1条11月3号的数据 - 1 - insert into ods_order values(4,'2024-11-03 11:17:41',"1",1,1)  - ✅ 结论: dwd_order_mv1、dwd_order_mv2各自增量刷新了11月3号的分区数据,并且在后续没有新增数据的情况下,没有再刷新任何分区数据 
- ods_dict更新 - 1 - UPDATE `ods_dict` SET `dict_label` = '电商订单_new' WHERE id = 1; - 这里我们一定要记得dwd_dict_mv表是设置了10天刷新,所以这时候dwd_dict_mv的数据是还没有更新的 - 1 
 2
 3
 4
 5
 6
 7
 8
 9- --- ods_dict 
 id dict_label dict_value
 1 电商订单_new 1
 2 普通订单 2
 --- dwd_dict_mv
 id dict_label dict_value
 1 电商订单 1
 2 普通订单 2- 那么dwd_order_mv1和dwd_order_mv2的刷新会有什么表现呢? - 1 
 2
 3
 4
 5
 6
 7
 8
 9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53- -- dwd_order_mv1 
 {
 "forceRefresh": false,
 "mvPartitionsToRefresh": ["p20241102_20241103", "p20241101_20241102", "p20241103_20241104"],
 "refBasePartitionsToRefreshMap": {
 "ods_order": ["p20241101", "p20241102", "p20241103"]
 },
 "basePartitionsToRefreshMap": {
 "ods_dict": ["ods_dict"],
 "ods_order": ["p20241101", "p20241102", "p20241103"]
 },
 "processStartTime": 1733076180617,
 "executeOption": {
 "priority": 0,
 "taskRunProperties": {
 "mvId": "1337190"
 },
 "isMergeRedundant": true,
 "isManual": false,
 "isSync": false,
 "isReplay": false
 },
 "planBuilderMessage": {
 "ods_order": "p20241101,p20241102,p20241103"
 }
 }
 -- dwd_order_mv2
 {
 "forceRefresh": false,
 "mvPartitionsToRefresh": ["p20241102_20241103", "p20241101_20241102", "p20241103_20241104"],
 "refBasePartitionsToRefreshMap": {
 "ods_order": ["p20241101", "p20241102", "p20241103"]
 },
 "basePartitionsToRefreshMap": {
 "dwd_dict_mv": ["dwd_dict_mv"],
 "ods_order": ["p20241101", "p20241102", "p20241103"]
 },
 "processStartTime": 1733076181616,
 "executeOption": {
 "priority": 0,
 "taskRunProperties": {
 "mvId": "1337239"
 },
 "isMergeRedundant": true,
 "isManual": false,
 "isSync": false,
 "isReplay": false
 },
 "planBuilderMessage": {
 "ods_order": "p20241101,p20241102,p20241103"
 }
 }

⚠️结论1:我们惊恐的发现,即使dwd_dict_mv数据还没有因为设置的10天刷新过来,但是,物化视图dwd_order_mv2(ods_order 关联 dwd_dict_mv)却进行了刷新!!

⚠️结论2:接着就是更为糟糕的在每一次dwd_order_mv2物化视图3分钟刷新时间,每一次都进行了全分区刷新!!!
在出现这种异常情况后,我决定试一试excluded_trigger_tables和excluded_refresh_tables
- 重新创建物化视图 - 记得先恢复一下数据 - UPDATEods_dict- SETdict_label- = '电商订单' WHERE id = 1;- 1 
 2
 3
 4
 5
 6
 7
 8
 9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43- ---dwd_order_mv1配置忽略ods_dict更新导致物化视图更新 
 CREATE MATERIALIZED VIEW dwd_order_mv1
 COMMENT "dwd-订单"
 REFRESH ASYNC START('2024-10-01 10:00:00') EVERY(INTERVAL 3 MINUTE)
 PARTITION BY date_trunc('day', created)
 PROPERTIES(
 "partition_refresh_number" = "30",
 "excluded_trigger_tables" = "test.ods_dict",
 "excluded_refresh_tables" = "test.ods_dict"
 )
 AS
 SELECT
 o.id,
 o.created,
 o.type,
 dict.dict_label as type_label,
 o.k1,
 o.v2
 FROM
 ods_order o
 LEFT JOIN ods_dict dict ON o.type = dict.dict_value
 --- dwd_order_mv2配置忽略dwd_dict_mv更新导致物化视图更新
 CREATE MATERIALIZED VIEW dwd_order_mv2
 COMMENT "dwd-订单"
 REFRESH ASYNC START('2024-10-01 10:00:00') EVERY(INTERVAL 3 MINUTE)
 PARTITION BY date_trunc('day', created)
 PROPERTIES(
 "partition_refresh_number" = "30",
 "excluded_trigger_tables" = "test.dwd_dict_mv",
 "excluded_refresh_tables" = "test.dwd_dict_mv"
 )
 AS
 SELECT
 o.id,
 o.created,
 o.type,
 dict.dict_label as type_label,
 o.k1,
 o.v2
 FROM
 ods_order o
 LEFT JOIN dwd_dict_mv dict ON o.type = dict.dict_value - ⚠️结论3:你猜出现了什么事情,在没有任何基表刷新的时候,dwd_order_mv2居然在第一次全量全分区刷新后的定时刷新中全分区刷新,而且是每一次3分钟刷新物化视图的时候再全量全分区刷新 
- 更新 ods_dict - 1 - UPDATE `ods_dict` SET `dict_label` = '电商订单_new' WHERE id = 1;  - ✅结论4:dwd_order_mv1对excluded_trigger_tables和excluded_refresh_tables的配置生效, 配置的基表刷新了。没有让dwd_order_mv1进行刷新 
 ⚠️结论5:dwd_order_mv2对excluded_trigger_tables和excluded_refresh_tables的配置不生效,永远在全量全分区刷新
结论
如果物化视图关联的基表是一个table,那么和分区刷新相关的都正常,
如果物化视图关联的基表是一个mv,即使这个基表没有任何更新,在第一次刷新后的任意一次定时刷新,都是全量刷新全分区,
这究竟是本身不支持,还是一个究极大BUG?
其他
查看dwd_order_mv2和dwd_order_mv1的最近10次刷新历史
| 1 | (SELECT |