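In modern IT infrastructure, real-time monitoring of system health and performance metrics is essential. Prometheus, a popular open-source monitoring solution, provides a time-series database and a flexible query language, and can collect and process large volumes of monitoring data efficiently. Getting the most out of it, however, depends on configuring and integrating the data sources correctly.
Node Exporter is an important part of the Prometheus ecosystem. It is designed to collect hardware- and OS-level metrics from UNIX-like systems, such as CPU usage, memory usage, disk I/O, and network statistics. With Node Exporter deployed, Prometheus can pull these metrics directly from the target machine, which makes it possible to monitor physical machines, virtual machines, and container environments.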
Deploying node_exporter
[root@dev-centos7-shanghai-area1 packages]# wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
[root@dev-centos7-shanghai-area1 packages]# tar xvf node_exporter-1.8.1.linux-amd64.tar.gz
[root@dev-centos7-shanghai-area1 packages]# mv node_exporter-1.8.1.linux-amd64/node_exporter /usr/local/bin/
[root@dev-centos7-shanghai-area1 packages]# vim /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
User=root
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
[root@dev-centos7-shanghai-area1 packages]# systemctl daemon-reload
[root@dev-centos7-shanghai-area1 packages]# systemctl enable node_exporter.service --now
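Once the service is enabled, it helps to confirm that the exporter is actually serving metrics. The following is a minimal sketch, assuming node_exporter listens on its default port 9100 and that curl is installed; count_metric_families is a hypothetical helper name, not part of node_exporter. It counts the "# TYPE" lines in the text exposition format, which gives a rough number of exposed metric families.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read Prometheus text-format metrics on stdin and
# print how many metric families are declared (one "# TYPE" line each).
count_metric_families() {
  # grep -c prints 0 and exits non-zero when nothing matches;
  # "|| true" keeps the function's exit status at 0 in that case.
  grep -c '^# TYPE ' || true
}

# Example usage against a running exporter (assumes the default port 9100):
#   curl -s http://localhost:9100/metrics | count_metric_families
```

A result of 0 (or a curl connection error) suggests the exporter is not running or is listening on a different address.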
Adding the Node Exporter scrape job to prometheus.yml
# prometheus.yml - Prometheus server v2.53.0 configuration file, edited on 2024-07-11 09:10
global:
  # Global settings
  scrape_interval: 15s # How often to scrape targets; here, every 15 seconds
  evaluation_interval: 15s # How often to evaluate rules; here, alerting rules are evaluated every 15 seconds
  external_labels: # External labels that identify this Prometheus instance
    monitor: 'primary_monitor' # The label value can be customized
# Scrape configuration: defines how data is scraped from the monitored targets
scrape_configs:
  - job_name: 'prometheus' # Changed to the default job_name 'prometheus'
    honor_labels: true # On a label conflict, keep the labels exposed by the target
    static_configs:
      - targets: ['localhost:9090'] # Target list: scrape the Prometheus server's own metrics on port 9090 of the local host
  - job_name: 'nginx-exporter'
    honor_labels: true
    static_configs:
      - targets: ['203.0.113.128:9113'] # Target list: scrape the Nginx Prometheus Exporter on port 9113 of the remote host
  - job_name: 'node-exporter'
    honor_labels: true
    static_configs:
      - targets: ['198.51.100.128:9100'] # Target list: scrape the metrics exposed by Node Exporter on the remote host
# Rule files: load the custom alerting rules
rule_files:
  - "/etc/prometheus/rules/*.yml" # Wildcard that matches every .yml rule file
# Alertmanager configuration: Prometheus sends alerts to Alertmanager
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "localhost:9093" # Alertmanager runs on the same host and listens on the default port 9093
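Before restarting the service with an edited configuration, the file can be validated with promtool, which ships in the same release archive as the Prometheus binary. A minimal sketch follows; the promtool path is an assumption based on the install directory used in this post, and check_prometheus_config is a hypothetical wrapper name.

```shell
#!/usr/bin/env bash
# Assumed location of promtool; override PROMTOOL if it lives elsewhere.
PROMTOOL="${PROMTOOL:-/usr/local/prometheus/promtool}"

# Hypothetical wrapper around "promtool check config".
# Returns 2 if the config file is unreadable, 3 if promtool is missing,
# otherwise promtool's own exit status.
check_prometheus_config() {
  local cfg="$1"
  if [ ! -r "$cfg" ]; then
    echo "config file not readable: $cfg" >&2
    return 2
  fi
  if [ ! -x "$PROMTOOL" ]; then
    echo "promtool not found at $PROMTOOL" >&2
    return 3
  fi
  # promtool exits non-zero when the config is invalid.
  "$PROMTOOL" check config "$cfg"
}

# Example usage:
#   check_prometheus_config /etc/prometheus/prometheus.yml \
#     && systemctl restart prometheus.service
```

Chaining the restart behind the check means a broken configuration never reaches the running service. Note that this only rules out configuration errors; it does not detect file-permission problems in the data directory.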
Troubleshooting notes
After editing prometheus.yml, the service would no longer start through systemd, yet running the binary by hand worked without problems and Prometheus came up:
/usr/local/prometheus/prometheus --config.file /etc/prometheus/prometheus.yml --storage.tsdb.path /var/lib/prometheus/ --web.console.libraries /usr/local/prometheus/consoles/ --web.console.templates /usr/local/prometheus/consoles/
[root@dev-centos7-shanghai-area1 ~]# systemctl restart prometheus.service
[root@dev-centos7-shanghai-area1 ~]# systemctl status prometheus.service
● prometheus.service - Prometheus
Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; vendor preset: disabled)
Active: activating (auto-restart) (Result: exit-code) since 四 2024-07-11 10:19:44 CST; 4s ago
Process: 3680 ExecStart=/usr/local/prometheus/prometheus --config.file /etc/prometheus/prometheus.yml --storage.tsdb.path /var/lib/prometheus/ --web.console.libraries /usr/local/prometheus/consoles/ --web.console.templates /usr/local/prometheus/consoles/ (code=exited, status=2)
Main PID: 3680 (code=exited, status=2)
7月 11 10:19:44 dev-centos7-shanghai-area1 systemd[1]: Unit prometheus.service entered failed state.
7月 11 10:19:44 dev-centos7-shanghai-area1 systemd[1]: prometheus.service failed.
Analysis with journalctl:
The output of journalctl -u prometheus.service contains:
7月 11 10:24:17 dev-centos7-shanghai-area1 prometheus[4123]: ts=2024-07-11T02:24:17.431Z caller=query_logger.go:114 level=error component=activeQueryTracker msg="Error opening query log file" file=/var/lib/prometheus/queries.active err="open /var/lib/prometheus/querie
7月 11 10:24:17 dev-centos7-shanghai-area1 systemd[1]: prometheus.service: main process exited, code=exited, status=2/INVALIDARGUMENT
7月 11 10:24:17 dev-centos7-shanghai-area1 systemd[1]: Unit prometheus.service entered failed state.
7月 11 10:24:17 dev-centos7-shanghai-area1 systemd[1]: prometheus.service failed.
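The err= field in the journal output is cut off, which is typical when long lines are viewed through the pager. Piping the journal with --no-pager usually yields the full lines, and a small filter makes the relevant entries easier to spot. This is a minimal sketch; filter_prometheus_errors is a hypothetical helper name, and it relies only on the level=... key that Prometheus writes in its log lines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only log lines that Prometheus emitted at
# error or warn level. Reads from stdin, writes matching lines to stdout.
filter_prometheus_errors() {
  # "|| true" avoids a non-zero exit status when no line matches.
  grep -E 'level=(error|warn)' || true
}

# Example usage (last 50 journal entries of the unit, without the pager):
#   journalctl -u prometheus.service --no-pager -n 50 | filter_prometheus_errors
```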
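Analysis: Prometheus tried to open /var/lib/prometheus/queries.active and could not, because of insufficient permissions. I had wiped the TSDB while debugging, and when Prometheus was started afterwards it created a new database whose permissions I did not pay attention to. Most likely the manual run as root created files that the service's prometheus user cannot write.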
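An ownership mismatch of this kind can be confirmed before changing anything. The sketch below lists every path under a directory that is not owned by the expected user; it assumes GNU find (for -printf), and list_files_not_owned_by is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "owner:group path" for every entry under DIR
# that is not owned by USER. Returns 1 if USER does not exist.
list_files_not_owned_by() {
  local dir="$1" user="$2" uid
  # Resolve the user to a numeric uid first so an unknown name is
  # reported clearly instead of surfacing as a find error.
  if ! uid=$(id -u "$user" 2>/dev/null); then
    echo "unknown user: $user" >&2
    return 1
  fi
  find "$dir" ! -uid "$uid" -printf '%u:%g %p\n'
}

# Example usage for the data directory from this post:
#   list_files_not_owned_by /var/lib/prometheus prometheus
```

Empty output means everything already belongs to the expected user; any listed path is a candidate for the permission error above.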
Fix:
[root@dev-centos7-shanghai-area1 ~]# chown -R prometheus:prometheus /var/lib/prometheus/
[root@dev-centos7-shanghai-area1 ~]# chmod -R 755 /var/lib/prometheus/
[root@dev-centos7-shanghai-area1 ~]# systemctl daemon-reload
[root@dev-centos7-shanghai-area1 ~]# systemctl restart prometheus.service
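After the restart, it is worth checking that Prometheus actually came up rather than falling back into the restart loop. A minimal sketch, assuming Prometheus listens on localhost:9090 as in the configuration above and that curl is installed; wait_for_ready is a hypothetical helper that polls the /-/ready endpoint Prometheus exposes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll URL once per second until it answers with an
# HTTP success status, or give up after TIMEOUT seconds (default 30).
wait_for_ready() {
  local url="$1" timeout="${2:-30}" waited=0
  while [ "$waited" -lt "$timeout" ]; do
    # -f makes curl fail on HTTP errors; --max-time bounds each attempt.
    if curl -fsS -o /dev/null --max-time 2 "$url" 2>/dev/null; then
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 1
}

# Example usage:
#   wait_for_ready http://localhost:9090/-/ready 30 && echo "Prometheus is ready"
```

If the function times out, systemctl status prometheus.service and the journal are the next places to look.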