1. Installing Alertmanager
<1>. Download Alertmanager
Alertmanager can be downloaded from the official site: https://prometheus.io/download/#alertmanager

    wget https://github.com/prometheus/alertmanager/releases/download/v0.17.0/alertmanager-0.17.0.linux-amd64.tar.gz
    tar xvf alertmanager-0.17.0.linux-amd64.tar.gz
    mv alertmanager-0.17.0.linux-amd64 /opt/alertmanager

<2>. Start Alertmanager
On Linux, change into the Alertmanager root directory (the /opt/alertmanager directory created above) and start it:

    cd /opt/alertmanager
    ./alertmanager --config.file=alertmanager.yml

<3>. Create a systemd service to start Alertmanager
    vim /etc/systemd/system/alertmanager.service

    [Unit]
    Description=alertmanager
    Documentation=https://prometheus.io/download/#alertmanager
    After=network.target

    [Service]
    Type=simple
    ExecStart=/opt/alertmanager/alertmanager --config.file=/opt/alertmanager/alertmanager.yml
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target
Start the service:

    systemctl start alertmanager
    systemctl status alertmanager
    systemctl enable alertmanager
Alternatively, start it with nohup:

    nohup /opt/alertmanager/alertmanager --config.file=/opt/alertmanager/alertmanager.yml &

<4>. Check that Alertmanager is running
Once started, Alertmanager is reachable on port 9093 at http://ip:9093
<5>. Configure Alertmanager in Prometheus
Edit prometheus.yml and set the Alertmanager address:

    alerting:
      alertmanagers:
        - static_configs:
            - targets: ["localhost:9093"]
Restart Prometheus after the change:

    systemctl daemon-reload
    systemctl restart prometheus

From this point on, alerts raised by Prometheus are sent to Alertmanager.
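To check that Alertmanager accepts alerts independently of Prometheus, a hand-made alert can be pushed to it. The sketch below is a minimal example under stated assumptions: Alertmanager listens on localhost:9093 and serves its v2 HTTP API at /api/v2/alerts. The alert name `ManualTestAlert`, its labels, and both helper functions are invented for illustration.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Assumption: Alertmanager listens on localhost:9093 and exposes the v2 API.
ALERTMANAGER_URL = "http://localhost:9093/api/v2/alerts"


def build_test_alert(name="ManualTestAlert", severity="warning"):
    """Return a one-element alert list shaped like a v2 API request body."""
    return [{
        "labels": {"alertname": name, "severity": severity, "instance": "test-host"},
        "annotations": {"summary": "Manual test alert"},
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }]


def post_alerts(alerts, url=ALERTMANAGER_URL):
    """POST the alerts as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

Calling `post_alerts(build_test_alert())` on the Alertmanager host should make the test alert appear in the web UI on port 9093, provided the assumptions above hold.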
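2. Installing prometheus-webhook-dingtalk
Installation notes for prometheus-webhook-dingtalk, version 1.4.0
prometheus-webhook-dingtalk extends Alertmanager alerting by relaying its webhook notifications to DingTalk robots, with support for custom alert templates.
The latest version moves the notification settings into the configuration file, which makes them easier to add and change.
The rest of this section covers installing the latest prometheus-webhook-dingtalk and the pitfalls encountered along the way.
Original prometheus-webhook-dingtalk article (with corrections made in practice): https://blog.csdn.net/buster_zr/article/details/105848811
<1>. Get the prometheus-webhook-dingtalk package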
    git clone https://github.com/timonwong/prometheus-webhook-dingtalk.git

<2>. Resolve dependencies
(1) yarn

    curl --silent --location https://dl.yarnpkg.com/rpm/yarn.repo | sudo tee /etc/yum.repos.d/yarn.repo
    curl --silent --location https://rpm.nodesource.com/setup_8.x | sudo bash -
    yum install yarn
(2) Go build environment

    wget https://dl.google.com/go/go1.14.2.linux-amd64.tar.gz
    tar zxf go1.14.2.linux-amd64.tar.gz -C /usr/local/
    # Single quotes keep $PATH unexpanded until ~/.bashrc is sourced
    echo 'export PATH=$PATH:/usr/local/go/bin' >> ~/.bashrc
    source ~/.bashrc
(3) Node.js version
Node.js 10 or later is required:

    # Download a newer Node.js release
    wget https://nodejs.org/dist/v12.16.1/node-v12.16.1-linux-x64.tar.xz
    tar xf node-v12.16.1-linux-x64.tar.xz -C /usr/local/
    # Keep the old binary as a backup, then link the new one into place
    mv /usr/bin/node /usr/bin/node_bakv6.17.1
    ln -s /usr/local/node-v12.16.1-linux-x64/bin/node /usr/bin/node

<3>. Build and install
    cd prometheus-webhook-dingtalk
    make build
After the build finishes, do the following:

    mkdir -p /opt/prometheus-webhook-dingtalk/template/
    # 1. Copy the compiled prometheus-webhook-dingtalk binary to the target directory
    cp prometheus-webhook-dingtalk /opt/prometheus-webhook-dingtalk/
    # 2. Copy the default template file
    cp template/default.tmpl /opt/prometheus-webhook-dingtalk/template/
    # 3. Copy the configuration file
    cp config.yml /opt/prometheus-webhook-dingtalk/
    # 4. Create a symlink
    ln -s /opt/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk /usr/local/bin/
List the supported flags:

    /opt/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk --help
    usage: prometheus-webhook-dingtalk [<flags>]

    Flags:
      -h, --help                  Show context-sensitive help (also try --help-long and --help-man).
          --web.listen-address=":8060"
                                  The address to listen on for web interface.
          --web.enable-ui         Enable Web UI mounted on /ui path
          --web.enable-lifecycle  Enable reload via HTTP request.
          --config.file="config.yml"
                                  Path to the configuration file.
          --log.level=info        Only log messages with the given severity or above. One of: [debug, info, warn, error]
          --log.format=logfmt     Output format of log messages. One of: [logfmt, json]
          --version               Show application version.
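Pitfall notes
When specifying the alert template, do not pass it with --template.file= as in older versions. Doing so raises an error: only the template file is picked up, and the DingTalk webhook configuration file is not.
Newer versions fold the template file setting into the configuration file, so you only need to set the template location there:

    #templates:
    #  - contrib/templates/legacy/template.tmpl

<4>. Configure prometheus-webhook-dingtalk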
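(1). Create a robot in DingTalk and obtain its token and secret (signing). Custom keywords are not required; any one of the three security options is enough
Select the group -> Group Settings -> Add Group Assistant -> Add Robot
Note: during setup you choose among three security settings (this guide uses signing, which is why the configuration file below carries a secret):
1. Custom keywords: every message you send afterwards must contain the keyword, otherwise sending fails.
2. Signing: a dedicated signing scheme. Take timestamp + "\n" + secret as the string to sign, compute the signature with the HmacSHA256 algorithm, Base64-encode it, and finally URL-encode the result to get the final signature (UTF-8 character set required).
3. IP address: the source IP of each request is checked, and sending fails if it does not match.
You are free to pick any of these; we choose signing. If you prefer the IP option, you can look up your public IP at https://ip.cn/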
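The signing steps described above can be sketched in a few lines. This is a minimal illustration, assuming the robot secret also serves as the HMAC key and that the result is passed as `timestamp` and `sign` query parameters, as in DingTalk's published sample. The function names are hypothetical, and the sketch is only needed for manual tests against the robot URL.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse


def dingtalk_sign(timestamp_ms, secret):
    """Return the URL-encoded signature for a millisecond timestamp and a secret."""
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(
        secret.encode("utf-8"),          # assumption: the secret is also the HMAC key
        string_to_sign.encode("utf-8"),
        digestmod=hashlib.sha256,
    ).digest()
    # Base64-encode the raw digest, then URL-encode it.
    return urllib.parse.quote_plus(base64.b64encode(digest))


def signed_url(webhook_url, secret, timestamp_ms=None):
    """Append timestamp and sign query parameters to a robot webhook URL."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    sign = dingtalk_sign(timestamp_ms, secret)
    return f"{webhook_url}&timestamp={timestamp_ms}&sign={sign}"
```

For example, `signed_url("https://oapi.dingtalk.com/robot/send?access_token=XXX", "SECxxx")` returns the robot URL with both parameters appended; the token and secret here are placeholders.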
(2) Create the configuration file config.yml (back up the default configuration file first)

    ## Request timeout
    # timeout: 5s

    ## Uncomment following line in order to write template from scratch (be careful!)
    #no_builtin_template: true

    ## Customizable templates path
    #templates:
    #  - contrib/templates/legacy/template.tmpl
    # Message template file
    templates:
      - /opt/prometheus-webhook-dingtalk/template/template.tmpl

    ## You can also override default template using `default_message`
    ## The following example to use the 'legacy' template from v0.3.0
    #default_message:
    #  title: '{{ template "legacy.title" . }}'
    #  text: '{{ template "legacy.content" . }}'

    ## Targets, previously was known as "profiles"
    targets:
      webhook:
        url: https://oapi.dingtalk.com/robot/send?access_token=0df42dc863ec08274b
        secret: SECcef7ffa8990cdd29b9d0cbe5c08b121c
        message:
          title: '{{ template "ding.link.title" . }}'
          text: '{{ template "ding.link.content" . }}'
(3). Create the template file (the default template is default.tmpl; here I create my own)

    vim /opt/prometheus-webhook-dingtalk/template/template.tmpl
Template 1

    {{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
    {{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver }}{{ end }}

    {{ define "__text_alert_list" }}{{ range . }}
    **Labels**
    {{ range .Labels.SortedPairs }} - {{ .Name }}: {{ .Value | markdown | html }}
    {{ end }}
    **Annotations**
    {{ range .Annotations.SortedPairs }} - {{ .Name }}: {{ .Value | markdown | html }}
    {{ end }}
    **Source:** [{{ .GeneratorURL }}]({{ .GeneratorURL }})
    {{ end }}{{ end }}

    {{ define "default.__text_alert_list" }}{{ range . }}
    ---
    **Severity:** {{ .Labels.severity | upper }}

    **Team:** {{ .Labels.team | upper }}

    **Started at:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

    **Details:**
    {{ range .Annotations.SortedPairs }} - {{ .Name }}: {{ .Value | markdown | html }}
    {{ end }}

    **Labels:**
    {{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }} - {{ .Name }}: {{ .Value | markdown | html }}
    {{ end }}{{ end }}
    {{ end }}
    {{ end }}

    {{ define "default.__text_alertresovle_list" }}{{ range . }}
    ---
    **Severity:** {{ .Labels.severity | upper }}

    **Team:** {{ .Labels.team | upper }}

    **Started at:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

    **Ended at:** {{ dateInZone "2006.01.02 15:04:05" (.EndsAt) "Asia/Shanghai" }}

    **Details:**
    {{ range .Annotations.SortedPairs }} - {{ .Name }}: {{ .Value | markdown | html }}
    {{ end }}

    **Labels:**
    {{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }} - {{ .Name }}: {{ .Value | markdown | html }}
    {{ end }}{{ end }}
    {{ end }}
    {{ end }}

    {{/* Default */}}
    {{ define "default.title" }}{{ template "__subject" . }}{{ end }}
    {{ define "default.content" }}#### \[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
    {{ if gt (len .Alerts.Firing) 0 -}}
    {{ template "default.__text_alert_list" .Alerts.Firing }}
    {{- end }}
    {{ if gt (len .Alerts.Resolved) 0 -}}
    {{ template "default.__text_alertresovle_list" .Alerts.Resolved }}
    {{- end }}
    {{- end }}

    {{/* Legacy */}}
    {{ define "legacy.title" }}{{ template "__subject" . }}{{ end }}
    {{ define "legacy.content" }}#### \[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
    {{ template "__text_alert_list" .Alerts.Firing }}
    {{- end }}

    {{/* Following names for compatibility */}}
    {{ define "ding.link.title" }}{{ template "default.title" . }}{{ end }}
    {{ define "ding.link.content" }}{{ template "default.content" . }}{{ end }}
Template 2

    {{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
    {{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver }}{{ end }}

    {{ define "__text_alert_list" }}{{ range . }}
    **Labels**
    {{ range .Labels.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
    {{ end }}
    **Annotations**
    {{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
    {{ end }}
    **Source:** [{{ .GeneratorURL }}]({{ .GeneratorURL }})
    {{ end }}{{ end }}

    {{ define "default.__text_alert_list" }}{{ range . }}
    ---
    **Severity:** {{ .Labels.severity | upper }}

    **Team:** {{ .Labels.team | upper }}

    **Started at:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

    **Details:**
    {{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
    {{ end }}

    **Labels:**
    {{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }}> - {{ .Name }}: {{ .Value | markdown | html }}
    {{ end }}{{ end }}
    {{ end }}
    {{ end }}

    {{ define "default.__text_alertresovle_list" }}{{ range . }}
    ---
    **Severity:** {{ .Labels.severity | upper }}

    **Team:** {{ .Labels.team | upper }}

    **Started at:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

    **Ended at:** {{ dateInZone "2006.01.02 15:04:05" (.EndsAt) "Asia/Shanghai" }}

    **Details:**
    {{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
    {{ end }}

    **Labels:**
    {{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }}> - {{ .Name }}: {{ .Value | markdown | html }}
    {{ end }}{{ end }}
    {{ end }}
    {{ end }}

    {{/* Default */}}
    {{ define "default.title" }}{{ template "__subject" . }}{{ end }}
    {{ define "default.content" }}#### \[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
    {{ if gt (len .Alerts.Firing) 0 -}}
    ![alert icon](https://ss0.bdstatic.com/70cFuHSh_Q1YnxGkpoWK1HF6hhy/it/u=3626076420,1196179712&fm=15&gp=0.jpg)
    **==== Fault detected ====**
    {{ template "default.__text_alert_list" .Alerts.Firing }}
    {{- end }}
    {{ if gt (len .Alerts.Resolved) 0 -}}
    {{ template "default.__text_alertresovle_list" .Alerts.Resolved }}
    {{- end }}
    {{- end }}

    {{/* Legacy */}}
    {{ define "legacy.title" }}{{ template "__subject" . }}{{ end }}
    {{ define "legacy.content" }}#### \[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
    {{ template "__text_alert_list" .Alerts.Firing }}
    {{- end }}

    {{/* Following names for compatibility */}}
    {{ define "ding.link.title" }}{{ template "default.title" . }}{{ end }}
    {{ define "ding.link.content" }}{{ template "default.content" . }}{{ end }}

<5>. Configure the prometheus-webhook-dingtalk service file
    vim /etc/systemd/system/prometheus-webhook-dingtalk.service

    [Unit]
    Description=prometheus-webhook-dingtalk
    After=network-online.target

    [Service]
    Restart=on-failure
    ExecStart=/opt/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk --config.file=/opt/prometheus-webhook-dingtalk/config.yml

    [Install]
    WantedBy=multi-user.target
Start it from the command line and confirm that port 8060 is listening:

    systemctl daemon-reload
    systemctl start prometheus-webhook-dingtalk
    systemctl enable prometheus-webhook-dingtalk
    ss -tnl | grep 8060

<6>. Configure the Alertmanager configuration file
    global:
      resolve_timeout: 5m    # an alert is declared resolved if it has not been updated for 5 minutes
    route:
      group_by: ['alertname','cluster']    # labels used to group alerts
      group_wait: 1s         # how long to wait before sending the first notification for a group
      group_interval: 1s     # interval between notifications for the same group
      repeat_interval: 1s    # interval before an already-sent notification is repeated
      receiver: 'web.hook'
    # Receivers hold the notification settings, e.g. email, wechat, slack, webhook
    receivers:
    - name: 'web.hook'
      webhook_configs:
      - send_resolved: true  # whether to notify when an alert is resolved
        url: 'http://localhost:8060/dingtalk/webhook/send'
    # Inhibit rules: while an alert matching the source side is firing,
    # alerts matching the target side are muted.
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'dev', 'instance']
Configuration file (without comments, using longer intervals):

    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 5m
      receiver: 'web.hook'
    receivers:
    - name: 'web.hook'
      webhook_configs:
      - send_resolved: true
        url: 'http://localhost:8060/dingtalk/webhook/send'
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'dev', 'instance']
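How the inhibit_rules entry in the configuration above behaves can be illustrated with a small simulation. This is a simplified sketch rather than Alertmanager's actual implementation: it assumes plain equality matchers, treats a label missing on both alerts as equal, and uses hypothetical function names and alert data.

```python
def matches(alert_labels, matcher):
    """True if every label in the matcher has the same value on the alert."""
    return all(alert_labels.get(name) == value for name, value in matcher.items())


def is_inhibited(target, firing, rule):
    """Decide whether `target` is muted by any alert in `firing` under `rule`."""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing:
        if source is target or not matches(source, rule["source_match"]):
            continue
        # Labels listed in `equal` must agree; a missing label counts as empty.
        if all(source.get(name, "") == target.get(name, "") for name in rule["equal"]):
            return True
    return False


RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}

critical = {"alertname": "DiskUsageAlert", "instance": "node1:9100", "severity": "critical"}
warning = {"alertname": "DiskUsageAlert", "instance": "node1:9100", "severity": "warning"}
other = {"alertname": "DiskUsageAlert", "instance": "node2:9100", "severity": "warning"}
firing = [critical, warning, other]

print(is_inhibited(warning, firing, RULE))  # True: same alertname and instance as the critical alert
print(is_inhibited(other, firing, RULE))    # False: the instance label differs
```

The warning on node1 is muted because a critical alert with the same alertname and instance is firing, while the warning on node2 still goes out.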
Restart Alertmanager

    # Restart from the command line
    systemctl restart alertmanager
    systemctl status alertmanager
    # Or kill the process and start it again with nohup
    nohup /opt/alertmanager/alertmanager --config.file=/opt/alertmanager/alertmanager.yml &
    # Check that it started successfully
    netstat -anput | grep 9093
Test sending to DingTalk with curl (copy the second and third commands below)

    # First call the DingTalk robot directly to see whether a message arrives
    curl 'https://oapi.dingtalk.com/robot/send?access_token=0df42dc863ec08274b3f3226ca1fc6cd3a85564343' \
      -H 'Content-Type: application/json' \
      -d '{"msgtype": "text", "text": {"content": "shooter DingTalk robot group message test"}}'

    # Then test prometheus-webhook-dingtalk
    curl 'http://localhost:8060/dingtalk/webhook/send' \
      -H 'Content-Type: application/json' \
      -d '{"msgtype": "text", "text": {"content": "shooter DingTalk robot group message test"}}'

    curl 'http://localhost:8060/dingtalk/webhook/send' \
      -H 'Content-Type: application/json' \
      -d '{"msgtype": "ding.link.text","text": {"ding.link.content": "'"Salted fish, here I come"'"}}'

    curl 'http://[Podip]:8060/dingtalk/webhook/send' \
      -H 'Content-Type: application/json' \
      -d '{"msgtype": "ding.link.text","text": {"ding.link.content": "'"Salted fish, here I come"'"}}'
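If the message shows up in DingTalk, the setup works. (Ignore the empty message body for now. It comes from the request parameters: the /dingtalk/webhook/send endpoint expects Alertmanager's webhook payload, not the hand-written JSON used above.)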
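To get a non-empty test message, the request body needs the shape of an Alertmanager webhook notification. The sketch below builds such a payload. It assumes version "4" of the webhook format and the receiver name web.hook used in this guide; the alert name, label values, and function name are made up for illustration.

```python
import json
from datetime import datetime, timezone


def build_webhook_payload(alertname, severity, summary, status="firing"):
    """Return a dict shaped like an Alertmanager webhook notification."""
    labels = {"alertname": alertname, "severity": severity, "team": "ops"}
    annotations = {"summary": summary}
    return {
        "version": "4",
        "groupKey": '{}:{alertname="%s"}' % alertname,
        "status": status,
        "receiver": "web.hook",
        "groupLabels": {"alertname": alertname},
        "commonLabels": labels,
        "commonAnnotations": annotations,
        "externalURL": "http://localhost:9093",   # assumption: local Alertmanager
        "alerts": [{
            "status": status,
            "labels": labels,
            "annotations": annotations,
            "startsAt": datetime.now(timezone.utc).isoformat(),
            "endsAt": "0001-01-01T00:00:00Z",     # zero time: the alert has not ended
            "generatorURL": "http://localhost:9090/graph",
        }],
    }


payload = build_webhook_payload("ManualTestAlert", "warning", "Hand-made test alert")
print(json.dumps(payload, indent=2))
```

Saving the printed JSON to a file and posting it with `curl -H 'Content-Type: application/json' -d @payload.json http://localhost:8060/dingtalk/webhook/send` should then render through the template with the labels and annotations filled in.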
3. Next, configure the Prometheus alerting rules
Edit the Prometheus configuration file prometheus.yml:

    rule_files:
      - "/opt/prometheus/rules/*.yml"
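Add rule files ending in .yml to the rules directory; Prometheus monitors and alerts according to these rule files.
Related monitoring templates:
A rough walkthrough of the rule fields: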
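    # One rule file can contain several groups
    groups:
    - name: example    # group name
      rules:    # list of alerting rules
      - alert: HighErrorRate    # alert name
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5    # alert condition
        for: 10m    # how long the condition must hold before the alert is sent
        # Extra labels attached to the alert
        labels:    # custom labels
          severity: page    # alert severity, e.g. warning
        # Extra annotations attached to the alert
        annotations:
          summary: High request latency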
node_alived.yml (instance liveness alert rule)

    groups:
    - name: Instance liveness alert rules
      rules:
      - alert: InstanceLivenessAlert
        expr: up == 0
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          summary: "Host is down !!!"
          description: "This instance has been down for more than one minute."
memory_over.yml (memory alert rule)

    groups:
    - name: Memory alert rules
      rules:
      - alert: MemoryUsageAlert
        expr: (1 - (node_memory_MemAvailable_bytes / (node_memory_MemTotal_bytes))) * 100 > 50
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Server is running low on available memory."
          description: "Memory usage is above 50% (current value: {{ $value }}%)"
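The arithmetic behind the memory expression is easy to check by hand. The sketch below reproduces it for one host with made-up sample values (8 GiB total, 3 GiB available); the function name is hypothetical.

```python
def memory_usage_percent(mem_available_bytes, mem_total_bytes):
    """Mirror of (1 - MemAvailable / MemTotal) * 100 from the rule's expr."""
    return (1 - mem_available_bytes / mem_total_bytes) * 100


GIB = 1024 ** 3
usage = memory_usage_percent(3 * GIB, 8 * GIB)
print(usage)       # 62.5
print(usage > 50)  # True, so the rule's condition holds for this sample
```

With 3 of 8 GiB available the host is at 62.5% usage, which is above the 50% threshold; the alert would still only fire after the condition has held for the `for: 1m` window.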
cpu_over.yml (CPU alert rule)

    groups:
    - name: CPU alert rules
      rules:
      - alert: CPUUsageAlert
        expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[1m]) )) * 100 > 50
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage is spiking."
          description: "CPU usage is above 50% (current value: {{ $value }}%)"
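The CPU expression combines three steps: irate over the idle counter, an average across the cores of one instance, and the conversion to a busy percentage. The sketch below walks through them with invented samples for a two-core host. It is a simplification that ignores counter resets, and the function names are hypothetical.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp, value) samples of a counter."""
    (t1, v1), (t2, v2) = samples[-2:]
    return (v2 - v1) / (t2 - t1)


def cpu_usage_percent(idle_counters_by_cpu):
    """100 minus the average idle rate across cores, expressed as a percentage."""
    idle_rates = [irate(samples) for samples in idle_counters_by_cpu.values()]
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Idle-seconds counters sampled 15 seconds apart (made-up numbers).
idle = {
    "cpu0": [(0, 100.0), (15, 107.5)],   # idle for 7.5 of 15 seconds
    "cpu1": [(0, 200.0), (15, 203.75)],  # idle for 3.75 of 15 seconds
}
print(cpu_usage_percent(idle))  # 62.5
```

The two cores are idle 50% and 25% of the time, so the average idle share is 37.5% and the computed usage of 62.5% exceeds the rule's 50% threshold.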
disk_over.yml (disk usage alert rule)

    groups:
    - name: Disk usage alert rules
      rules:
      - alert: DiskUsageAlert
        expr: 100 - node_filesystem_free_bytes{fstype=~"xfs|ext4"} / node_filesystem_size_bytes{fstype=~"xfs|ext4"} * 100 > 80
        for: 20m
        labels:
          severity: warning
        annotations:
          summary: "Disk partition usage is too high"
          description: "Partition usage is above 80% (current value: {{ $value }}%)"
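The disk rule filters filesystems with a regular-expression matcher before computing the usage. The sketch below imitates that selection on invented data, assuming that Prometheus regex matchers are fully anchored (modelled here with `fullmatch`). The function name and sample filesystems are hypothetical.

```python
import re

FSTYPE_PATTERN = re.compile(r"xfs|ext4")
GIB = 1024 ** 3


def disk_alerts(filesystems, threshold=80):
    """Yield (mountpoint, used_percent) for matching filesystems over the threshold."""
    for fs in filesystems:
        # Anchored match: "ext4" passes, "tmpfs" does not.
        if not FSTYPE_PATTERN.fullmatch(fs["fstype"]):
            continue
        used = 100 - fs["free_bytes"] / fs["size_bytes"] * 100
        if used > threshold:
            yield fs["mountpoint"], used


filesystems = [
    {"mountpoint": "/", "fstype": "xfs", "free_bytes": 1 * GIB, "size_bytes": 8 * GIB},
    {"mountpoint": "/data", "fstype": "ext4", "free_bytes": 4 * GIB, "size_bytes": 8 * GIB},
    {"mountpoint": "/run", "fstype": "tmpfs", "free_bytes": 0, "size_bytes": 1 * GIB},
]
print(list(disk_alerts(filesystems)))  # [('/', 87.5)]
```

Only the root filesystem is reported: /data is at 50%, and /run is full but excluded because tmpfs does not match the fstype pattern.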
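Restart Prometheus to load the rules:

    systemctl restart prometheus

Or hot-reload the configuration (the reload endpoint is only available when Prometheus was started with --web.enable-lifecycle):

    curl -XPOST 127.0.0.1:9090/-/reload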
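Log in to the Prometheus web UI and check the rules on the Alerts page. Each alert is in one of three states:
- Inactive: the threshold has not been crossed
- Pending: the threshold has been crossed, but not yet for the configured alert duration
- Firing: the threshold has been crossed for at least the configured alert duration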
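The three states above can be modelled as a small state machine driven by rule evaluations. The sketch below is a simplified illustration, not Prometheus's implementation: it takes a list of (timestamp in seconds, condition met) pairs plus the rule's `for` duration, and the function name and evaluation data are made up.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state after each (timestamp, condition_met) evaluation."""
    states = []
    active_since = None  # time of the first evaluation in the current breach
    for timestamp, condition_met in evaluations:
        if not condition_met:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Evaluations every 15 seconds against a rule with `for: 1m`.
evaluations = [(0, False), (15, True), (30, True), (45, True),
               (60, True), (75, True), (90, False)]
print(alert_states(evaluations, for_seconds=60))
# ['inactive', 'pending', 'pending', 'pending', 'pending', 'firing', 'inactive']
```

The alert stays pending until the condition has held for a full 60 seconds (from t=15 to t=75) and drops back to inactive as soon as one evaluation no longer matches.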
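A few words on how alerts are triggered:
Prometheus plays the role of the Zabbix server and is also a database: the data scraped from node_exporter is stored in the Prometheus server. (The prometheus.yml configuration file lists the hosts to monitor, or more precisely the jobs to monitor.)
Each rule's expr is a query that runs against the Prometheus server's database; when the query result exceeds the threshold you set, an alert is raised.
So whichever machine or job you want to alert on can be selected with a PromQL query.
PromQL documentation: https://yunlzheng.gitbook.io/prometheus-book/parti-prometheus-ji-chu/promql/prometheus-query-language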
- Permanent link to this article: https://www.yoyoask.com/?p=4514
- When reposting, please credit: published by shooter on SHOOTER