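Background
We previously set up monitoring by following our kube-prometheus-stack deployment walkthrough, and Grafana now gives us good visibility into the metrics of the k8s cluster.
There is one more monitoring need, though, and it comes from the data warehouse. In day-to-day management of the warehouse I keep running into the following requirements:
- Data cached to disk
Our TKE cluster uses Tencent Cloud CFS for storage, and CFS is billed by usage. So we urgently need to know how much disk space the StarRocks cache actually occupies and whether it needs cleaning up.
- Traffic between the data warehouse and object storage
We need to keep a daily eye on the bandwidth used between StarRocks and object storage.
- Materialized view success/failure, with monitoring and alerting
We have created a large number of materialized views in StarRocks, and their refresh successes, failures, and timing need to be monitored properly.
With these requirements in mind, let's try to solve them.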
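Configuring Prometheus metrics scrape for StarRocks
Following the StarRocks Cluster Integration With Prometheus and Grafana Service guide, we first configure Prometheus metrics scraping for StarRocks.
I installed StarRocks with the operator rather than with helm, so according to the guide my configuration looks like this.
Pay particular attention to spec.starRocksBeSpec.service.annotations and spec.starRocksFeSpec.service.annotations: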
```yaml
apiVersion: starrocks.com/v1
kind: StarRocksCluster
metadata:
  name: kube-starrocks
  # The scrape jobs in this post keep only the starrocks namespace;
  # the namespace here and the one in the relabel rules must match.
  namespace: default
spec:
  starRocksBeSpec:
    configMapInfo:
      configMapName: kube-starrocks-be-cm
      resolveKey: be.conf
    image: starrocks/be-ubuntu:3.3-latest
    limits:
      cpu: 4
      memory: 4Gi
    replicas: 1
    requests:
      cpu: 1
      memory: 2Gi
    service:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8040"
        prometheus.io/scrape: "true"
  starRocksFeSpec:
    configMapInfo:
      configMapName: kube-starrocks-fe-cm
      resolveKey: fe.conf
    image: starrocks/fe-ubuntu:3.3-latest
    limits:
      cpu: 4
      memory: 4Gi
    replicas: 1
    requests:
      cpu: 1
      memory: 2Gi
    service:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8030"
        prometheus.io/scrape: "true"
```
Dynamic scraping based on Service annotations (reference)
prometheus-additional.yaml
```yaml
- job_name: 'StarRocks_Cluster'
  kubernetes_sd_configs:
    - role: endpoints
  relabel_configs:
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
      action: replace
      target_label: __scheme__
      regex: (https?)
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
      action: replace
      target_label: __address__
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
    - action: labelmap
      regex: __meta_kubernetes_service_label_(.+)
    - source_labels: [__meta_kubernetes_namespace]
      action: keep
      regex: starrocks # keep only the starrocks namespace
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_service_name]
      action: replace
      target_label: kubernetes_name
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: kubernetes_pod_name
```
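The least obvious rule in the job above is the one that rewrites `__address__` from the `prometheus.io/port` annotation. The following is a minimal Python sketch of what that rule does, not Prometheus code: it assumes the documented behavior that the values of `source_labels` are joined with `;` and that the regex must match the whole joined string. The function name and the sample addresses are made up for illustration.

```python
import re

# Regex copied from the relabel rule. Prometheus anchors relabel regexes at
# both ends, which re.fullmatch reproduces here.
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def relabel_address(address: str, port_annotation: str) -> str:
    """Return the rewritten __address__, or the original if the rule does not match."""
    # source_labels values are joined with ";" before matching.
    joined = f"{address};{port_annotation}"
    match = ADDRESS_RE.fullmatch(joined)
    if match is None:
        # With action "replace", a non-matching regex leaves the label untouched.
        return address
    # Equivalent of replacement "$1:$2": host from the address, port from the annotation.
    return f"{match.group(1)}:{match.group(2)}"


# Hypothetical endpoint addresses, for illustration only.
print(relabel_address("10.0.3.17:9030", "8030"))  # endpoint port replaced by the annotated port
print(relabel_address("10.0.3.17", "8040"))       # no port on the address, annotated port appended
print(relabel_address("10.0.3.17:9030", ""))      # annotation missing, address unchanged
```

In other words, whichever port the endpoint exposes, the scrape goes to the port named in the Service annotation (8030 for FE and 8040 for BE in the cluster spec above).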
```yaml
scrape_configs:
  - job_name: starrocks-fe-monitor
    honor_labels: true
    scrape_interval: 15s
    metrics_path: /metrics
    scheme: http
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - starrocks
    relabel_configs:
      - source_labels:
          - __meta_kubernetes_endpoint_port_name
        regex: http
        action: keep
      - source_labels:
          - __meta_kubernetes_service_name
        regex: starrockscluster-fe-service
        action: keep
      - source_labels:
          - __meta_kubernetes_pod_node_name
        target_label: node
      - source_labels:
          - __meta_kubernetes_namespace
        target_label: namespace
      - source_labels:
          - __meta_kubernetes_service_name
        target_label: service
      - source_labels:
          - __meta_kubernetes_pod_name
        target_label: pod
```

```yaml
scrape_configs:
  - job_name: starrocks-be-monitor
    honor_labels: true
    scrape_interval: 15s
    metrics_path: /metrics
    scheme: http
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - starrocks
    relabel_configs:
      - source_labels:
          - __meta_kubernetes_endpoint_port_name
        regex: webserver
        action: keep
      - source_labels:
          - __meta_kubernetes_service_name
        regex: starrockscluster-cn-service
        action: keep
      - source_labels:
          - __meta_kubernetes_pod_node_name
        target_label: node
      - source_labels:
          - __meta_kubernetes_namespace
        target_label: namespace
      - source_labels:
          - __meta_kubernetes_service_name
        target_label: service
      - source_labels:
          - __meta_kubernetes_pod_name
        target_label: pod
```
Scrape configuration with kube-prometheus-stack
If you installed Prometheus with kube-prometheus-stack, add the scrape configuration under either additionalScrapeConfigs or additionalScrapeConfigsSecret. For example:
- Using additionalScrapeConfigsSecret
```bash
kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
```

```yaml
# Goes under prometheus.prometheusSpec in the chart values.
additionalScrapeConfigsSecret:
  enabled: true
  name: additional-configs
  key: prometheus-additional.yaml
```
- Using additionalScrapeConfigs
```yaml
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: 'StarRocks_Cluster'
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (https?)
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: keep
            regex: starrocks # keep only the starrocks namespace
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: kubernetes_name
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: kubernetes_pod_name
```
With the configuration in place, we check the Prometheus web UI and see that the targets are being scraped normally.
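The same check can be scripted against the Prometheus HTTP API instead of the web UI. This is a rough sketch, assuming Prometheus is reachable at a URL you supply (for example through a port-forward) and that the job is named `StarRocks_Cluster` as in the config above. The helper names and the sample payload are made up for illustration; only the `/api/v1/targets` endpoint and its `activeTargets`, `labels`, and `health` fields come from the Prometheus API.

```python
import json
import urllib.request
from collections import Counter


def fetch_targets(base_url: str) -> dict:
    """GET /api/v1/targets from a Prometheus server and return the parsed JSON."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)


def summarize_targets(payload: dict, job: str) -> Counter:
    """Count the active targets of one job by health ("up", "down", "unknown")."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return Counter(
        t.get("health", "unknown")
        for t in targets
        if t.get("labels", {}).get("job") == job
    )


# Trimmed, hand-written payload in the shape of the API response (not real output).
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "StarRocks_Cluster"}, "health": "up"},
            {"labels": {"job": "StarRocks_Cluster"}, "health": "down"},
            {"labels": {"job": "kubelet"}, "health": "up"},
        ]
    },
}
print(summarize_targets(sample, "StarRocks_Cluster"))
```

Against a live server, the call would be `summarize_targets(fetch_targets("http://localhost:9090"), "StarRocks_Cluster")`, where the address is whatever your own setup exposes.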
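Grafana dashboards: a pitfall
Following the Import StarRocks Grafana Dashboard documentation, I imported the Grafana template and found that it showed no data at all, hahaha 🤣. For now we wait for a fix from the StarRocks maintainers.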
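An empty dashboard can mean either that no StarRocks series reached Prometheus or that the panel queries filter on labels the series do not carry. One way to tell the two apart is to list the metric names Prometheus knows about through `/api/v1/label/__name__/values`. The sketch below only filters a response of that shape. The `starrocks_` prefix is an assumption about how the metrics are named, and the sample names are placeholders, not real StarRocks metrics.

```python
from typing import List


def metric_names_with_prefix(payload: dict, prefix: str) -> List[str]:
    """Return the sorted metric names from a label-values response that start with prefix."""
    names = payload.get("data", [])
    return sorted(name for name in names if name.startswith(prefix))


# Hand-written payload in the shape of /api/v1/label/__name__/values.
# The starrocks_* names are placeholders for illustration.
sample = {
    "status": "success",
    "data": ["up", "starrocks_fe_placeholder_a", "starrocks_be_placeholder_b"],
}
print(metric_names_with_prefix(sample, "starrocks_"))
```

If the list is non-empty while the panels stay blank, the next thing to compare is the label filters in the dashboard queries against the labels on those series.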
Let's try a few other templates instead.
Dashboard templates
Other references
Kubernetes Monitoring: Prometheus Operator