The main components in this article are installed from the YAML manifests collected in Qist's repository, which is located at:

    https://github.com/qist/k8s/tree/main/k8s-yaml/monitoring

This article uses the following components:
- Prometheus (installation is not covered in detail)
- blackbox-exporter
- alertmanager
- dingtalk
I. Installing Prometheus

    vim monitoring/prometheus/prometheus-k8s-rulefiles.yaml

    # 1. Add the domain-monitoring alert rules
    domain-monitor.yaml: |
      groups:
      - name: black-box
        rules:
        # Alert name means "SSL certificate expiry warning"
        - alert: SSL证书到期预警
          expr: (probe_ssl_earliest_cert_expiry-time())/86400 < 10
          for: 5m
          labels:
            severity: critical
          annotations:
            # "Domain: <instance>, days of validity remaining: <value>"
            description: '域名: {{ $labels.instance }} , 剩余有效天数: {{$value}}'
            summary: 'SSL Warning'
        # Alert name means "Domain unreachable"
        - alert: 域名无法访问
          expr: probe_success < 1
          for: 1m
          labels:
            severity: critical
          annotations:
            # "Domain: <instance>, unreachable"
            description: '域名: {{ $labels.instance }} , 无法访问'
            summary: 'Domain Warning'

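1. Review the rule file (adjust the advance-warning threshold, in days, to suit your needs)
The rules can also be added as a standalone manifest, black-box.yaml: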

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      labels:
        prometheus: k8s
        role: alert-rules
      name: ssl-rules
      namespace: monitoring
    spec:
      groups:
      - name: black-box
        rules:
        - alert: SSL证书到期预警
          expr: (probe_ssl_earliest_cert_expiry-time())/86400 < 14
          for: 1d
          labels:
            severity: critical
          annotations:
            description: '域名: {{ $labels.instance }} , 剩余有效天数: {{$value}}'
            summary: 'SSL Warning'
        - alert: 域名无法访问
          expr: probe_success < 1
          for: 1m
          labels:
            severity: critical
          annotations:
            description: '域名: {{ $labels.instance }} , 无法访问'
            summary: 'Domain Warning'

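The fields of a rule are:
- alert: the name of the alerting rule.
- expr: the trigger condition, written as a PromQL expression. Prometheus evaluates it to determine whether any time series satisfies the condition.
- for: an optional evaluation wait time. The alert is sent only after the trigger condition has held continuously for this long (for example, only after the condition has lasted 1 minute, or 5m).
- labels: custom labels, a set of additional labels to attach to the alert.
- annotations: a set of additional information, such as text describing the alert in detail. The annotations are sent to Alertmanager as parameters when the alert is raised. summary gives a brief overview of the alert and description gives the details. The Alertmanager UI also displays alert information based on these two values.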
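The `expr` used by the certificate rule is plain arithmetic on Unix timestamps: the expiry time minus the current time, divided by 86400 seconds per day, compared with a threshold in days. The Python sketch below only mirrors that arithmetic; the function names are hypothetical and nothing here talks to Prometheus.

```python
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(cert_expiry_ts: float, now_ts: float) -> float:
    """Mirror (probe_ssl_earliest_cert_expiry - time()) / 86400."""
    return (cert_expiry_ts - now_ts) / SECONDS_PER_DAY


def should_alert(cert_expiry_ts: float, now_ts: float, threshold_days: float = 14) -> bool:
    """True when fewer than threshold_days of validity remain."""
    return days_until_expiry(cert_expiry_ts, now_ts) < threshold_days


if __name__ == "__main__":
    now = time.time()
    expiry = now + 10 * SECONDS_PER_DAY  # hypothetical certificate expiring in 10 days
    print(days_until_expiry(expiry, now), should_alert(expiry, now))
```

Changing `threshold_days` corresponds to changing the `< 10` or `< 14` part of the rule expression.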
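State notes: a Prometheus alert is in one of three states: Inactive, Pending, or Firing.
- Inactive: the rule is being evaluated, but its condition is not currently met, so no alert has been triggered.
- Pending: the condition has become true but has not yet held for the full for duration. If it stays true until the for duration has elapsed, the alert moves to Firing.
- Firing: the condition has held for the for duration, and Prometheus sends the alert to Alertmanager, which applies grouping, inhibition, and silencing and then delivers it to the configured receivers. Once the condition no longer holds, the alert is resolved and returns to Inactive, and the cycle repeats.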
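The interaction between the three states and the `for` setting can be sketched as a small state machine. This is an illustrative Python model under simplified assumptions (one alert, explicit timestamps), not Prometheus's actual implementation; `AlertState` and `evaluate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    state: str = "inactive"
    active_since: Optional[float] = None  # when the condition first became true


def evaluate(alert: AlertState, condition_true: bool, now: float, for_seconds: float) -> str:
    """Advance the alert by one rule evaluation and return the new state."""
    if not condition_true:
        # Condition no longer holds: reset to inactive.
        alert.state = "inactive"
        alert.active_since = None
        return alert.state
    if alert.active_since is None:
        alert.active_since = now
    # Fire only once the condition has held for the whole `for` window.
    if now - alert.active_since >= for_seconds:
        alert.state = "firing"
    else:
        alert.state = "pending"
    return alert.state
```

With `for_seconds=60`, an alert whose condition is true at t=0 and t=30 stays pending, fires at t=60, and drops back to inactive as soon as an evaluation finds the condition false.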
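2. Edit the ConfigMap to set the domains you want to monitor (blackbox-exporter-files-discover.yaml)
The configuration files in Qist's repository are split into three protocol types:
- http/https (sufficient for ordinary domain reachability and SSL checks)
- TCP
- ICMP
Note: configure only what you need. I only monitor certificates, so the first type is enough.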
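3. Apply all the YAML to create Prometheus

    # Pass the directory rather than prometheus/*: a shell glob would expand
    # to several file arguments, and only the first would belong to -f.
    kubectl apply -f prometheus/

    # Access: http://prometheus.tycng.com/
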
II. Installing blackbox
Blackbox exporter is the black-box probing tool that Prometheus uses to check whether a site is alive over protocols such as TCP, HTTP, and ICMP.
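To make the idea of a black-box probe concrete, here is a minimal Python sketch of a TCP liveness check that returns 1 or 0, in the spirit of the `probe_success` metric used by the alert rules. It is an illustration only and not how blackbox-exporter is implemented; `tcp_probe` is a hypothetical name.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> int:
    """Return 1 if a TCP connection can be opened within the timeout, else 0."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return 1
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return 0
```

A rule such as `probe_success < 1` would fire for a target where this kind of check keeps returning 0.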

    cd blackbox-exporter/
    kubectl apply -f .

Access: http://blackbox.shooter.com/
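III. Deploying Alertmanager
Alertmanager receives the alerts sent by Prometheus. It supports a wide range of notification channels and makes it easy to deduplicate, reduce noise in, group, and route alerts by policy, which makes it a modern alert notification system.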
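1. Edit the alert configuration Secret

    cd alertmanager
    # file to edit: alertmanager-secret.yaml

Note: newer versions of Alertmanager may not accept the plaintext form used below, so base64 is recommended. For details, see: https://www.yoyoask.com/?p=9777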

    apiVersion: v1
    kind: Secret
    metadata:
      name: alertmanager-main
      namespace: monitoring
    stringData:
      alertmanager.yaml: |-
        "global":
          "resolve_timeout": "5m"
        "inhibit_rules":
        - "source_match":
            "severity": "critical"
          "target_match":
            "severity": "warning"
          # equal and group_by must be YAML lists, not quoted strings
          "equal": ['alertname', 'cluster', 'service']
        "receivers":
        - "name": "webhook"
          "webhook_configs":
          # localhost assumes the DingTalk webhook runs alongside Alertmanager;
          # with the separate Service from section IV, use that Service's name as the host.
          - "url": "http://localhost:8060/dingtalk/webhook/send"
            "send_resolved": true
        - "name": "Critical"
        "route":
          "group_by": ['alertname', 'cluster', 'service']
          "group_interval": "5m"
          "group_wait": "30s"
          "receiver": "webhook"
          "repeat_interval": "3h"
          "routes":
          - "match_re":
              "service": "^(blackbox-exporter|grafana|etcd)$"
            "receiver": "webhook"
            "routes":
            - "match":
                "severity": "critical"
              "receiver": "webhook"
    type: Opaque

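The note before the manifest recommends base64. In a Kubernetes Secret, values under `stringData` are plaintext, while values under `data` must be base64-encoded. As a minimal sketch, assuming you want to move `alertmanager.yaml` into `data`, the encoding step could look like this in Python; the helper names and the sample config string are hypothetical.

```python
import base64


def encode_secret_value(plaintext: str) -> str:
    """Base64-encode a string for use under a Secret's `data` field."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Reverse of encode_secret_value, handy for checking what is stored."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    config = "global:\n  resolve_timeout: 5m\n"  # stand-in for the real alertmanager.yaml
    print(encode_secret_value(config))
```

The printed string is what would go after `alertmanager.yaml:` under `data`, replacing the `stringData` block.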

    # Deploy
    kubectl apply -f .

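The `group_by` setting in the route decides which alerts end up in the same notification: alerts whose values for the listed labels are identical are grouped together. The Python sketch below illustrates that idea only; it is not Alertmanager's code, and `group_alerts` and the sample label sets are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_alerts(alerts: List[Dict[str, str]], group_by: List[str]) -> Dict[Tuple[str, ...], List[Dict[str, str]]]:
    """Bucket alert label sets by their values for the group_by labels."""
    groups: Dict[Tuple[str, ...], List[Dict[str, str]]] = defaultdict(list)
    for labels in alerts:
        # A missing label is treated as an empty value.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        {"alertname": "DomainDown", "instance": "a.example.com"},
        {"alertname": "DomainDown", "instance": "b.example.com"},
        {"alertname": "CertExpiry", "instance": "a.example.com"},
    ]
    for key, members in group_alerts(sample, ["alertname"]).items():
        print(key, len(members))
```

Grouping on fewer labels produces fewer, larger notifications; adding `instance` to `group_by` would split them per domain.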
Check Grafana
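IV. Installing the DingTalk alert plugin
Official repository (go here if you want to build the image yourself):

    git clone https://github.com/timonwong/prometheus-webhook-dingtalk.git

The code cloned from this repository has to be compiled and then packaged into an image before it can be used. For how to compile it, see here (link in the original post).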
1. Install the dingtalk service

webhook-dingtalk.yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: webhook-dingtalk
      name: webhook-dingtalk
      namespace: monitoring
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: webhook-dingtalk
      template:
        metadata:
          labels:
            app: webhook-dingtalk
        spec:
          imagePullSecrets:
            - name: aliyun-registry
          containers:
          - image: juestnow/prometheus-webhook-dingtalk:2.1.0
            imagePullPolicy: IfNotPresent
            name: webhook-dingtalk
            args:
            - "--config.file=/etc/prometheus-webhook-dingtalk/config.yml"
            volumeMounts:
            - name: webdingtalk-configmap
              mountPath: /etc/prometheus-webhook-dingtalk/config.yml
              subPath: config.yml
            - name: webdingtalk-template
              mountPath: /etc/prometheus-webhook-dingtalk/templates/default.tmpl
              subPath: default.tmpl
            ports:
            - containerPort: 8060
              protocol: TCP
            resources:
              limits:
                cpu: 1000m
                memory: 1024Mi
          volumes:
          - name: webdingtalk-configmap
            configMap:
              name: dingtalk-config
          - name: webdingtalk-template
            configMap:
              name: dingtalk-template
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: webhook-dingtalk
      name: webhook-dingtalk
      namespace: monitoring
    spec:
      ports:
      - name: http
        port: 8060
        protocol: TCP
        targetPort: 8060
      selector:
        app: webhook-dingtalk
      type: ClusterIP

2. Create the ConfigMap that holds the alert template

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: dingtalk-template
      namespace: monitoring
      labels:
        app: dingtalk-template
    data:
      default.tmpl: |
        {{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
        {{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver }}{{ end }}

        {{ define "__text_alert_list" }}{{ range . }}
        **Labels**
        {{ range .Labels.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
        {{ end }}
        **Annotations**
        {{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
        {{ end }}
        **Source:** [{{ .GeneratorURL }}]({{ .GeneratorURL }})
        {{ end }}{{ end }}

        {{/* Message labels: 告警级别 = severity, 运营团队 = team, 触发时间 = start time, 结束时间 = end time, 事件信息 = annotations, 事件标签 = labels */}}
        {{ define "default.__text_alert_list" }}{{ range . }}
        ---
        **告警级别:** {{ .Labels.severity | upper }}

        **运营团队:** {{ .Labels.team | upper }}

        **触发时间:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

        **事件信息:**
        {{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
        {{ end }}

        **事件标签:**
        {{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }}> - {{ .Name }}: {{ .Value | markdown | html }}
        {{ end }}{{ end }}
        {{ end }}
        {{ end }}

        {{ define "default.__text_alertresovle_list" }}{{ range . }}
        ---
        **告警级别:** {{ .Labels.severity | upper }}

        **运营团队:** {{ .Labels.team | upper }}

        **触发时间:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

        **结束时间:** {{ dateInZone "2006.01.02 15:04:05" (.EndsAt) "Asia/Shanghai" }}

        **事件信息:**
        {{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
        {{ end }}

        **事件标签:**
        {{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }}> - {{ .Name }}: {{ .Value | markdown | html }}
        {{ end }}{{ end }}
        {{ end }}
        {{ end }}

        {{/* Default */}}
        {{ define "default.title" }}{{ template "__subject" . }}{{ end }}
        {{ define "default.content" }}#### \[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
        {{ if gt (len .Alerts.Firing) 0 -}}

        ![警报 图标](https://images.gitee.com/uploads/images/2022/0523/165317_9d5104b4_1042431.jpeg)

        **====侦测到故障====**
        {{ template "default.__text_alert_list" .Alerts.Firing }}

        {{- end }}

        {{ if gt (len .Alerts.Resolved) 0 -}}
        {{ template "default.__text_alertresovle_list" .Alerts.Resolved }}

        {{- end }}
        {{- end }}

        {{/* Legacy */}}
        {{ define "legacy.title" }}{{ template "__subject" . }}{{ end }}
        {{ define "legacy.content" }}#### \[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
        {{ template "__text_alert_list" .Alerts.Firing }}
        {{- end }}

        {{/* Following names for compatibility */}}
        {{ define "ding.link.title" }}{{ template "default.title" . }}{{ end }}
        {{ define "ding.link.content" }}{{ template "default.content" . }}{{ end }}

3. Create the configuration file with your DingTalk webhook token and secret

dingtalk-configmap.yaml

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: dingtalk-config
      namespace: monitoring
      labels:
        app: dingtalk-config
    data:
      config.yml: |
        templates:
          - /etc/prometheus-webhook-dingtalk/templates/default.tmpl
        targets:
          webhook:
            url: https://oapi.dingtalk.com/robot/send?access_token=fb35b6a7160c412bb402eeb16356fexxxxxxxxxxxxxxxxxxxxx
            secret: SECa2cce3925e5b1d40dd88d9fa6dcddxxxxxxxxxxxxxxxxxxx
            message:
              title: '{{ template "ding.link.title" . }}'
              text: '{{ template "ding.link.content" . }}'

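The `secret` value is what DingTalk's custom-robot signature check uses. As background, the signing scheme documented by DingTalk takes a millisecond timestamp, signs the string `timestamp + "\n" + secret` with HMAC-SHA256 using the secret as key, then base64-encodes and URL-encodes the result. The Python sketch below shows that calculation; the webhook plugin presumably performs an equivalent step internally, and the function names, token, and secret here are hypothetical placeholders.

```python
import base64
import hashlib
import hmac
import urllib.parse


def dingtalk_sign(secret: str, timestamp_ms: int) -> str:
    """Compute the URL-encoded signature for a DingTalk custom robot."""
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(
        secret.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        digestmod=hashlib.sha256,
    ).digest()
    return urllib.parse.quote_plus(base64.b64encode(digest))


def signed_url(webhook_url: str, secret: str, timestamp_ms: int) -> str:
    """Append the timestamp and sign parameters to a webhook URL that already has access_token."""
    return f"{webhook_url}&timestamp={timestamp_ms}&sign={dingtalk_sign(secret, timestamp_ms)}"


if __name__ == "__main__":
    import time

    url = "https://oapi.dingtalk.com/robot/send?access_token=EXAMPLE_TOKEN"  # placeholder token
    print(signed_url(url, "SECexample", int(time.time() * 1000)))  # placeholder secret
```

If alerts are rejected with a signature error, comparing the robot's security settings against the `secret` in `config.yml` is a reasonable first check.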
4. Apply

    kubectl apply -f .

5. Check the service
6. Check the logs (the blackbox.shooter.com domain went offline a moment ago, so an alert is generated)
7. Check the alert in DingTalk
(Done)
- Permalink to this article: https://www.yoyoask.com/?p=8588
- When reposting, please credit: published by shooter on SHOOTER