关于StarRocks物化视图分区刷新的验证

背景

本次验证的场景是订单表基于日期按天分区的场景,主要验证物化视图是否能正常的按天增量刷新,以及验证是否能忽略基表刷新(excluded_refresh_tables)

涉及表:

  • 字典表ods_dict、字典物化视图dwd_dict_mv
  • 订单表ods_order
  • 订单表关联字典表 物化视图 dwd_order_mv1(关联table字典)、dwd_order_mv2(关联mv字典)

版本3.3.6-3.3.8,存算分离

本次验证结论非常惊恐,无法控制基表的刷新导致的增量刷新。

测试数据

  • ods_order

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    CREATE TABLE IF NOT EXISTS test.ods_order
    (
    id INT,
    created DATETIME,
    type STRING,
    k1 INT,
    v2 INT
    )PRIMARY KEY (`id`,created)
    PARTITION BY date_trunc('day', `created`);

    insert into ods_order values(1,'2024-11-01 11:17:41',"1",1,1),(2,'2024-11-02 22:17:41',"1",1,1),(3,'2024-11-02 22:17:41',"1",1,1)
    1
    2
    3
    4
    id	created	type	k1	v2
    1 2024-11-01 11:17:41 1 1 1
    2 2024-11-02 22:17:41 1 1 1
    3 2024-11-02 22:17:41 1 1 1
  • ods_dict

    1
    2
    3
    4
    5
    6
    7
    8
    CREATE TABLE IF NOT EXISTS test.ods_dict
    (
    id INT,
    dict_label STRING,
    dict_value STRING
    )PRIMARY KEY (`id`)

    insert into ods_dict values(1,'电商订单',"1"),(2,'普通订单',"2")
    1
    2
    3
    id	dict_label	dict_value
    2 普通订单 2
    1 电商订单 1
  • dwd_dict_mv

    建这张表的目的是验证关联table和关联mv对分区刷新的差异,注意这里将刷新时间设置成了10天

    1
    2
    3
    4
    5
    CREATE MATERIALIZED VIEW dwd_dict_mv
    DISTRIBUTED BY HASH(id)
    REFRESH ASYNC START('2024-10-01 10:00:00') EVERY (interval 10 DAY)
    AS SELECT *
    FROM ods_dict

创建物化视图

定时3分钟刷新

  • dwd_order_mv1

    关联的ods_dict

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    CREATE MATERIALIZED VIEW dwd_order_mv1
    COMMENT "dwd-订单"
    REFRESH ASYNC START('2024-10-01 10:00:00') EVERY(INTERVAL 3 MINUTE)
    PARTITION BY date_trunc('day', created)
    PROPERTIES(
    "partition_refresh_number" = "30"
    )
    AS
    SELECT
    o.id,
    o.created,
    o.type,
    dict.dict_label as type_label,
    o.k1,
    o.v2
    FROM
    ods_order o
    LEFT JOIN ods_dict dict ON o.type = dict.dict_value
  • dwd_order_mv2

    关联的dwd_dict_mv

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    CREATE MATERIALIZED VIEW dwd_order_mv2
    COMMENT "dwd-订单"
    REFRESH ASYNC START('2024-10-01 10:00:00') EVERY(INTERVAL 3 MINUTE)
    PARTITION BY date_trunc('day', created)
    PROPERTIES(
    "partition_refresh_number" = "30"
    )
    AS
    SELECT
    o.id,
    o.created,
    o.type,
    dict.dict_label as type_label,
    o.k1,
    o.v2
    FROM
    ods_order o
    LEFT JOIN dwd_dict_mv dict ON o.type = dict.dict_value

验证

我的需求是ods_order订单表关联字典清洗一张dwd_order_mv的物化视图,这里分别验证关联 table或mv的差异

  • ods_order更新

    增加1条11月3号的数据

    1
    insert into ods_order values(4,'2024-11-03 11:17:41',"1",1,1)

    T4Ag4M

    ✅ 结论: dwd_order_mv1、dwd_order_mv2各自增量刷新了11月3号的分区数据,并且在后续没有新增数据的情况下,没有再刷新任何分区数据

  • ods_dict更新

    1
    UPDATE `ods_dict` SET `dict_label` = '电商订单_new' WHERE id = 1;

    这里我们一定要记得dwd_dict_mv表是设置了10天刷新,所以这时候dwd_dict_mv的数据是还没有更新的

    1
    2
    3
    4
    5
    6
    7
    8
    9
    --- ods_dict
    id dict_label dict_value
    1 电商订单_new 1
    2 普通订单 2

    --- dwd_dict_mv
    id dict_label dict_value
    1 电商订单 1
    2 普通订单 2

    那么dwd_order_mv1和dwd_order_mv2的刷新会有什么表现呢?

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    -- dwd_order_mv1
    {
    "forceRefresh": false,
    "mvPartitionsToRefresh": ["p20241102_20241103", "p20241101_20241102", "p20241103_20241104"],
    "refBasePartitionsToRefreshMap": {
    "ods_order": ["p20241101", "p20241102", "p20241103"]
    },
    "basePartitionsToRefreshMap": {
    "ods_dict": ["ods_dict"],
    "ods_order": ["p20241101", "p20241102", "p20241103"]
    },
    "processStartTime": 1733076180617,
    "executeOption": {
    "priority": 0,
    "taskRunProperties": {
    "mvId": "1337190"
    },
    "isMergeRedundant": true,
    "isManual": false,
    "isSync": false,
    "isReplay": false
    },
    "planBuilderMessage": {
    "ods_order": "p20241101,p20241102,p20241103"
    }
    }

    -- dwd_order_mv2
    {
    "forceRefresh": false,
    "mvPartitionsToRefresh": ["p20241102_20241103", "p20241101_20241102", "p20241103_20241104"],
    "refBasePartitionsToRefreshMap": {
    "ods_order": ["p20241101", "p20241102", "p20241103"]
    },
    "basePartitionsToRefreshMap": {
    "dwd_dict_mv": ["dwd_dict_mv"],
    "ods_order": ["p20241101", "p20241102", "p20241103"]
    },
    "processStartTime": 1733076181616,
    "executeOption": {
    "priority": 0,
    "taskRunProperties": {
    "mvId": "1337239"
    },
    "isMergeRedundant": true,
    "isManual": false,
    "isSync": false,
    "isReplay": false
    },
    "planBuilderMessage": {
    "ods_order": "p20241101,p20241102,p20241103"
    }
    }

E2VUMK

⚠️结论1:我们惊恐的发现,即使dwd_dict_mv数据还没有因为设置的10天刷新过来,但是,物化视图dwd_order_mv2(ods_order 关联 dwd_dict_mv)却进行了刷新!!

WxEW5E

⚠️结论2:接着就是更为糟糕的在每一次dwd_order_mv2物化视图3分钟刷新时间,每一次都进行了全分区刷新!!!

在出现这种异常情况后,我决定试一试excluded_trigger_tablesexcluded_refresh_tables

  • 重新创建物化视图

    记得先恢复一下数据 UPDATE ods_dictSETdict_label = '电商订单' WHERE id = 1;

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    ---dwd_order_mv1配置忽略ods_dict更新导致物化视图更新
    CREATE MATERIALIZED VIEW dwd_order_mv1
    COMMENT "dwd-订单"
    REFRESH ASYNC START('2024-10-01 10:00:00') EVERY(INTERVAL 3 MINUTE)
    PARTITION BY date_trunc('day', created)
    PROPERTIES(
    "partition_refresh_number" = "30",
    "excluded_trigger_tables" = "test.ods_dict",
    "excluded_refresh_tables" = "test.ods_dict"
    )
    AS
    SELECT
    o.id,
    o.created,
    o.type,
    dict.dict_label as type_label,
    o.k1,
    o.v2
    FROM
    ods_order o
    LEFT JOIN ods_dict dict ON o.type = dict.dict_value

    --- dwd_order_mv2配置忽略dwd_dict_mv更新导致物化视图更新
    CREATE MATERIALIZED VIEW dwd_order_mv2
    COMMENT "dwd-订单"
    REFRESH ASYNC START('2024-10-01 10:00:00') EVERY(INTERVAL 3 MINUTE)
    PARTITION BY date_trunc('day', created)
    PROPERTIES(
    "partition_refresh_number" = "30",
    "excluded_trigger_tables" = "test.dwd_dict_mv",
    "excluded_refresh_tables" = "test.dwd_dict_mv"
    )
    AS
    SELECT
    o.id,
    o.created,
    o.type,
    dict.dict_label as type_label,
    o.k1,
    o.v2
    FROM
    ods_order o
    LEFT JOIN dwd_dict_mv dict ON o.type = dict.dict_value

    MCc8ha

    ⚠️结论3:你猜出现了什么事情,在没有任何基表刷新的时候,dwd_order_mv2居然在第一次全量全分区刷新后的定时刷新中全分区刷新,而且是每一次3分钟刷新物化视图的时候再全量全分区刷新

  • 更新 ods_dict

    1
    UPDATE `ods_dict` SET `dict_label` = '电商订单_new' WHERE id = 1;

    dChmrL

    ✅结论4:dwd_order_mv1对excluded_trigger_tables和excluded_refresh_tables的配置生效, 配置的基表刷新了。没有让dwd_order_mv1进行刷新
    ⚠️结论5:dwd_order_mv2对excluded_trigger_tables和excluded_refresh_tables的配置不生效,永远在全量全分区刷新

结论

如果物化视图关联的基表是一个table,那么和分区刷新相关的都正常,
如果物化视图关联的基表是一个mv,即使这个基表没有任何更新,在第一次刷新后的任意一次定时刷新,都是全量刷新全分区,

这究竟是本身不支持,还是一个究极大BUG?

其他

查看dwd_order_mv2和dwd_order_mv1的最近10次刷新历史

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
(SELECT
TASK_NAME,
CREATE_TIME,
DEFINITION,
EXTRA_MESSAGE
FROM
information_schema.task_runs
WHERE
task_name IN (
SELECT
TASK_NAME
FROM
information_schema.materialized_views
WHERE
TABLE_NAME IN ( "dwd_order_mv2" ))
ORDER BY
TASK_NAME ASC,
CREATE_TIME DESC
LIMIT 10 )
UNION ALL
(SELECT
TASK_NAME,
CREATE_TIME,
DEFINITION,
EXTRA_MESSAGE
FROM
information_schema.task_runs
WHERE
task_name IN (
SELECT
TASK_NAME
FROM
information_schema.materialized_views
WHERE
TABLE_NAME IN ( "dwd_order_mv1" ))
ORDER BY
TASK_NAME ASC,
CREATE_TIME DESC
LIMIT 10)
ORDER BY
TASK_NAME ASC,
CREATE_TIME DESC