Exploring the Prometheus Monitoring Ecosystem: A Hands-On Guide to Scraping Data with Node Exporter

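In modern IT infrastructure, real-time monitoring of system health and performance metrics is essential. Prometheus, a popular open-source monitoring system, provides a time-series database and a flexible query language (PromQL) for collecting and processing large volumes of monitoring data. To get the most out of it, however, the data sources have to be configured and integrated correctly.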

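Node Exporter is an important part of the Prometheus ecosystem. It is designed to collect hardware and operating-system metrics from Unix-like systems, such as CPU usage, memory usage, disk I/O, and network statistics. With Node Exporter deployed, Prometheus pulls these metrics directly from the target machine, which makes it possible to monitor physical hosts, virtual machines, and container environments.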


Deploying node_exporter

[root@dev-centos7-shanghai-area1 packages]# wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
[root@dev-centos7-shanghai-area1 packages]# tar xvf node_exporter-1.8.1.linux-amd64.tar.gz
[root@dev-centos7-shanghai-area1 packages]# mv node_exporter-1.8.1.linux-amd64/node_exporter /usr/local/bin/
[root@dev-centos7-shanghai-area1 packages]# vim /etc/systemd/system/node_exporter.service

   [Unit]
   Description=Node Exporter
   Wants=network-online.target
   After=network-online.target

   [Service]
   Type=simple
   User=root
   ExecStart=/usr/local/bin/node_exporter
   Restart=on-failure

   [Install]
   WantedBy=multi-user.target
[root@dev-centos7-shanghai-area1 packages]# systemctl daemon-reload
[root@dev-centos7-shanghai-area1 packages]# systemctl enable node_exporter.service --now
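
Once the service is enabled, it is worth checking that the exporter actually serves metrics before touching Prometheus. The sketch below is a minimal helper, not part of Node Exporter itself: `count_node_metrics` is a hypothetical function name, and the sample text is made up to resemble the exporter's text format so the snippet runs without a live endpoint.

```shell
#!/usr/bin/env bash
# Count distinct node_* metric names in Prometheus text-format input on stdin.
# count_node_metrics is a hypothetical helper name used for illustration.
count_node_metrics() {
  grep -v '^#' | grep -oE '^node_[a-zA-Z0-9_:]+' | sort -u | wc -l | tr -d ' '
}

# Made-up sample resembling Node Exporter output (the values are not real).
sample='# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.21
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
node_cpu_seconds_total{cpu="0",mode="user"} 56.7
go_goroutines 8'

printf '%s\n' "$sample" | count_node_metrics
```

On the real host, assuming the exporter listens on its default port 9100, the same helper can be fed with `curl -s http://localhost:9100/metrics | count_node_metrics`. A result of zero suggests the exporter is not running or not reachable.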

Adding a Node Exporter scrape job to prometheus.yml

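# prometheus.yml - Prometheus server v2.53.0 configuration file, edited on 2024-07-11 09:10
global:
  # Global settings
  scrape_interval:     15s # how often to scrape targets; here, every 15 seconds
  evaluation_interval: 15s # how often to evaluate rules; here, alerting rules are evaluated every 15 seconds
  external_labels:     # external labels that identify this Prometheus instance
    monitor: 'primary_monitor' # the label value can be customized

# Scrape configuration: defines how metrics are scraped from the monitored targets
scrape_configs:
  - job_name: 'prometheus' # changed to the default job_name 'prometheus'
    honor_labels: true # on label conflicts, keep the labels from the scraped data
    static_configs:
      - targets: ['localhost:9090'] # target list: scrape the Prometheus server's own metrics on local port 9090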

  - job_name: 'nginx-exporter'
    honor_labels: true
    static_configs:
      - targets: ['203.0.113.128:9113'] # target list: scrape the Nginx Prometheus Exporter on the remote host, port 9113

  - job_name: 'node-exporter'
    honor_labels: true
    static_configs:
      - targets: ['198.51.100.128:9100'] # target list: scrape the metrics exposed by Node Exporter on the remote host

# Rule files: load custom alerting rules
rule_files:
  - "/etc/prometheus/rules/*.yml" # glob matching every .yml rule file

# Alertmanager configuration: Prometheus sends alerts to Alertmanager
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - "localhost:9093" # Alertmanager runs on the same host and listens on its default port 9093
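
A syntax error in prometheus.yml also prevents the service from starting, so it can help to validate the file before every restart. The sketch below wraps `promtool check config`, which ships in the Prometheus release tarball. The promtool path is an assumption based on where this post runs the prometheus binary, and `check_prom_config` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Validate a Prometheus config file with promtool before restarting the service.
# ASSUMPTION: promtool sits next to the prometheus binary in /usr/local/prometheus/.
PROMTOOL="${PROMTOOL:-/usr/local/prometheus/promtool}"

# check_prom_config is a hypothetical helper name used for illustration.
# Returns 1 if the config file is missing, 2 if promtool is missing,
# otherwise the exit status of "promtool check config".
check_prom_config() {
  local cfg="$1"
  if [ ! -f "$cfg" ]; then
    echo "config file not found: $cfg" >&2
    return 1
  fi
  if [ ! -x "$PROMTOOL" ]; then
    echo "promtool not found at $PROMTOOL" >&2
    return 2
  fi
  "$PROMTOOL" check config "$cfg"
}

check_prom_config /etc/prometheus/prometheus.yml || echo "validation failed, not restarting Prometheus"
```

Chaining the restart behind the check, for example `check_prom_config /etc/prometheus/prometheus.yml && systemctl restart prometheus.service`, keeps a broken config from taking the running server down.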

Troubleshooting notes

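After editing prometheus.yml, the service would no longer start under systemd. Running the binary by hand with the same arguments worked, and Prometheus came up normally:

/usr/local/prometheus/prometheus --config.file /etc/prometheus/prometheus.yml --storage.tsdb.path /var/lib/prometheus/ --web.console.libraries /usr/local/prometheus/consoles/ --web.console.templates /usr/local/prometheus/consoles/

Restarting through systemd, however, failed: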

[root@dev-centos7-shanghai-area1 ~]# systemctl restart prometheus.service 
[root@dev-centos7-shanghai-area1 ~]# systemctl status prometheus.service 
● prometheus.service - Prometheus
   Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since 四 2024-07-11 10:19:44 CST; 4s ago
  Process: 3680 ExecStart=/usr/local/prometheus/prometheus --config.file /etc/prometheus/prometheus.yml --storage.tsdb.path /var/lib/prometheus/ --web.console.libraries /usr/local/prometheus/consoles/ --web.console.templates /usr/local/prometheus/consoles/ (code=exited, status=2)
 Main PID: 3680 (code=exited, status=2)

7月 11 10:19:44 dev-centos7-shanghai-area1 systemd[1]: Unit prometheus.service entered failed state.
7月 11 10:19:44 dev-centos7-shanghai-area1 systemd[1]: prometheus.service failed.

Analyzing with journalctl, the output of journalctl -u prometheus.service contains:

7月 11 10:24:17 dev-centos7-shanghai-area1 prometheus[4123]: ts=2024-07-11T02:24:17.431Z caller=query_logger.go:114 level=error component=activeQueryTracker msg="Error opening query log file" file=/var/lib/prometheus/queries.active err="open /var/lib/prometheus/querie
7月 11 10:24:17 dev-centos7-shanghai-area1 systemd[1]: prometheus.service: main process exited, code=exited, status=2/INVALIDARGUMENT
7月 11 10:24:17 dev-centos7-shanghai-area1 systemd[1]: Unit prometheus.service entered failed state.
7月 11 10:24:17 dev-centos7-shanghai-area1 systemd[1]: prometheus.service failed.
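
The error line in the journal output above is cut off, most likely at the terminal width by the pager. A small filter makes the relevant part easier to spot. This is a minimal sketch: `extract_errors` is a hypothetical helper, and the sample line is the log entry from this post with the truncated `err` field left out.

```shell
#!/usr/bin/env bash
# Print "component: message" for every level=error logfmt line on stdin.
# extract_errors is a hypothetical helper name used for illustration.
extract_errors() {
  grep 'level=error' | sed -E 's/.*component=([^ ]+) msg="([^"]*)".*/\1: \2/'
}

# Log entry taken from the journal output in this post; the err field is
# omitted because it is truncated in the original.
sample_line='ts=2024-07-11T02:24:17.431Z caller=query_logger.go:114 level=error component=activeQueryTracker msg="Error opening query log file" file=/var/lib/prometheus/queries.active'

printf '%s\n' "$sample_line" | extract_errors
```

On the host itself, `journalctl -u prometheus.service --no-pager -o cat | extract_errors` feeds the filter with untruncated lines: `--no-pager` disables the pager and `-o cat` prints only the message text.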

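Analysis: Prometheus tried to open /var/lib/prometheus/queries.active and could not, because it lacked permission. The cause was my own earlier debugging. I had wiped the TSDB data, and when Prometheus was started again it created a new database, and I did not check the ownership and permissions of the newly created files. This is consistent with the manual run from a root shell succeeding while the systemd service failed.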
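
Before changing ownership, it can be useful to see exactly which files are affected. The following sketch lists everything under a directory that is not owned by the expected user. `list_foreign_files` is a hypothetical helper name, and the demo runs against a throwaway temporary directory rather than the real data directory.

```shell
#!/usr/bin/env bash
# List every entry under a directory that is NOT owned by the expected user.
# list_foreign_files is a hypothetical helper name used for illustration.
list_foreign_files() {
  local dir="$1" owner="$2"
  find "$dir" ! -user "$owner" -print
}

# Demo on a throwaway directory: everything in it belongs to the current
# user, so nothing is listed.
demo_dir="$(mktemp -d)"
touch "$demo_dir/queries.active"
list_foreign_files "$demo_dir" "$(id -u)"
rm -rf "$demo_dir"
```

On the affected host, assuming the service runs as the `prometheus` user (as the chown in the fix suggests), `list_foreign_files /var/lib/prometheus prometheus` would show the files left behind by the manual run.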

Fix:

[root@dev-centos7-shanghai-area1 ~]# chown -R prometheus:prometheus /var/lib/prometheus/
[root@dev-centos7-shanghai-area1 ~]# chmod -R 755 /var/lib/prometheus/
[root@dev-centos7-shanghai-area1 ~]# systemctl daemon-reload
[root@dev-centos7-shanghai-area1 ~]# systemctl restart prometheus.service
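
To confirm that the restart worked, one option is to poll the server's health endpoint. The sketch below assumes Prometheus listens on localhost:9090, as in the scrape config, and uses its `/-/healthy` endpoint. `wait_healthy` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Poll a health URL until it answers successfully, or give up after N tries.
# wait_healthy is a hypothetical helper name used for illustration.
wait_healthy() {
  local url="$1" tries="${2:-5}" i=1
  while [ "$i" -le "$tries" ]; do
    if curl -fsS --max-time 2 -o /dev/null "$url" 2>/dev/null; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  echo "still unhealthy after $tries attempt(s)" >&2
  return 1
}

# ASSUMPTION: Prometheus listens on localhost:9090, as in the scrape config.
wait_healthy http://localhost:9090/-/healthy 3 || echo "check: systemctl status prometheus.service"
```

Once the server is healthy, the Targets page of the Prometheus web UI should list the node-exporter job as up after the next scrape.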
