Prometheus

Linux, Python, Golang, Java, DevOps, Front End, English 2023-03-28

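Preface

An operations engineer has three core responsibilities: server resource management, change management, and incident management.

I currently maintain several cloud-native projects deployed on K8s. Compared with traditional monitoring, monitoring Kubernetes in the cloud runs into the following thorny problems:

Containers are sealed off and isolated
Containers are scheduled dynamically
Container networking is virtualized and software-defined

I want a single monitoring system that covers all of the following layers, with no blind spots:

The physical server layer
The operating system layer
The network layer
The K8s cluster layer
The infrastructure applications and business applications running on top of the K8s cluster (the application layer)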

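A monitoring system should provide the following core functions

Data collection: gather data by pull or push
Data storage: SQL or NoSQL (key/value, document, column-oriented, or a time-series database, TSDB)
Visualization: Grafana
Alerting: notify the alert recipients through various channels (email, SMS, WeChat, DingTalk)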

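I. Overview of the Prometheus monitoring system

Prometheus is an open-source monitoring system built by former Google engineers and modeled on Google's Borgmon monitoring system.

Kubernetes is a container orchestration tool modeled on Google's Borg system, so Prometheus is particularly well suited to monitoring Kubernetes clusters.

Advantages of Prometheus

Its own metric format
Multi-dimensional labels: every distinct combination of label values is a separate time series
Built-in aggregation, slicing, and dicing of time-series data
Sample values are double-precision floating-point numbers (logs cannot be stored)
Automatic discovery of monitored targets
Time-series storage: Prometheus has a built-in TSDB that is queried with PromQL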
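To make the label model concrete, here is a minimal sketch in plain Python of how a metric name plus its label set identifies one time series. It is not how Prometheus is implemented; the `series_key` and `append` helpers and the in-memory `store` are hypothetical names used only for illustration.

```python
from collections import defaultdict

def series_key(name, labels):
    """Identity of a series: the metric name plus its sorted label pairs."""
    return (name, tuple(sorted(labels.items())))

# Hypothetical in-memory store: one list of (timestamp, value) per series.
store = defaultdict(list)

def append(name, labels, ts, value):
    """Append one sample to the series identified by name and labels."""
    store[series_key(name, labels)].append((ts, float(value)))

# Same labels in a different order still address the same series.
append("cpu_usage", {"core_number": "1", "machine_ip": "192.168.56.18"}, 0, 25)
append("cpu_usage", {"machine_ip": "192.168.56.18", "core_number": "1"}, 15, 26)
# Changing any label value creates a new, independent series.
append("cpu_usage", {"core_number": "2", "machine_ip": "192.168.56.18"}, 0, 25)

print(len(store))  # 2
```

Three samples end up in two series, which is why adding a label with many possible values multiplies the number of series to store.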

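Disadvantages of Prometheus

It cannot store logs; use Loki or ELK/EFK to collect logs.
The built-in time-series database is not meant for long-term storage: local retention defaults to 15 days and is set with the --storage.tsdb.retention.time flag, so keeping about a month of history has to be configured explicitly.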

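II. Components of the Prometheus monitoring system

Prometheus Server

Once service discovery is configured, Prometheus Server discovers the Targets to monitor automatically, without each one being listed by hand.

The Retrieval component (the metrics collector) of Prometheus Server only pulls: it fetches monitoring data from the monitored endpoints.

Built-in instrumentation in applications

Any application that Prometheus can scrape for metrics must first have an instrumentation system of its own.

In Prometheus terminology, instrumentation refers to the client libraries embedded in an application to expose its metrics.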

from prometheus_client import Gauge, Counter, Histogram, Summary  # 1. Prometheus metric types
from prometheus_client.core import CollectorRegistry

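Developers use these client libraries to write code that produces metrics ready to be exposed.

Prometheus scrapes metrics from Targets over HTTP through three main paths

Exporter: for applications that have no built-in instrumentation, the Prometheus community provides exporters
Instrumentation: the application was developed with instrumentation built into its code
Pushgateway: short-lived jobs push their metrics to it

Of these

Monitoring infrastructure applications: use an exporter. For MySQL, Redis, Nginx, and so on, the matching exporter can be obtained from the Prometheus community and installed on the Target.

Monitoring business applications (instrumentation built into the application): integrate a Prometheus client library into the application code.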

import prometheus_client
from prometheus_client import Gauge, Counter, Histogram, Summary  # 1. Prometheus metric types
from prometheus_client.core import CollectorRegistry
from flask import Response, Flask

app = Flask(__name__)
REGISTRY = CollectorRegistry(auto_describe=False)

# 2. Define the data model of the Prometheus metric
cpu_gauge = Gauge(
    "cpu_usage",  # metric name (key)
    "CPU usage",  # metric description (HELP text)
    # Labels: one metric may match several targets or devices, so labels act as
    # metadata that describes the metric along several dimensions and can be
    # used as filters when selecting and aggregating.
    ["core_number", "machine_ip"],
    registry=REGISTRY)


@app.route('/metrics')
def metrics():
    # 3. Fill the metric's data model with values
    cpu_gauge.labels("1", "192.168.56.18").set(25)
    cpu_gauge.labels("2", "192.168.56.18").set(25)
    cpu_gauge.labels("3", "192.168.56.18").set(25)
    cpu_gauge.labels("4", "192.168.56.18").set(25)

    cpu_gauge.labels("1", "192.168.56.19").set(23)
    cpu_gauge.labels("2", "192.168.56.19").set(27)
    cpu_gauge.labels("3", "192.168.56.19").set(28)
    cpu_gauge.labels("4", "192.168.56.19").set(22)

    cpu_gauge.labels("1", "192.168.56.20").set(21)
    cpu_gauge.labels("2", "192.168.56.20").set(29)
    cpu_gauge.labels("3", "192.168.56.20").set(32)
    cpu_gauge.labels("4", "192.168.56.20").set(28)

    return Response(prometheus_client.generate_latest(REGISTRY),
                    mimetype="text/plain")


@app.route('/')
def index():
    return "<h1>Customized Exporter</h1><br> <a href='metrics'>Metrics</a>"


if __name__ == "__main__":
    app.run(host='0.0.0.0', port=9100, debug=True)

"""
# HELP cpu_usage CPU usage
# TYPE cpu_usage gauge
cpu_usage{core_number="1",machine_ip="192.168.56.18"} 25.0
cpu_usage{core_number="2",machine_ip="192.168.56.18"} 25.0
cpu_usage{core_number="3",machine_ip="192.168.56.18"} 25.0
cpu_usage{core_number="4",machine_ip="192.168.56.18"} 25.0
cpu_usage{core_number="1",machine_ip="192.168.56.19"} 23.0
cpu_usage{core_number="2",machine_ip="192.168.56.19"} 27.0
cpu_usage{core_number="3",machine_ip="192.168.56.19"} 28.0
cpu_usage{core_number="4",machine_ip="192.168.56.19"} 22.0
cpu_usage{core_number="1",machine_ip="192.168.56.20"} 21.0
cpu_usage{core_number="2",machine_ip="192.168.56.20"} 29.0
cpu_usage{core_number="3",machine_ip="192.168.56.20"} 32.0
cpu_usage{core_number="4",machine_ip="192.168.56.20"} 28.0

"""

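A Python program with built-in Prometheus instrumentation

Exporters

For applications that have no built-in instrumentation and do not expose metrics in a format Prometheus understands, the usual approach is to deploy a separate metrics-exposing program alongside the application. Such programs are collectively called exporters.

In other words, an exporter actively collects raw data from the monitored application and converts and aggregates it into the Prometheus metric format.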
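The conversion step an exporter performs can be sketched with the standard library alone. The example below is only an illustration under assumptions: `raw_stats`, the `app_keys` metric, and the helper names are made up, and a real exporter would normally use a client library instead of formatting text by hand.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

def render_gauge(name, help_text, samples):
    """Render (labels, value) pairs of one gauge as text exposition lines."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        pairs = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        lines.append(f"{name}{{{pairs}}} {float(value)}")
    return "\n".join(lines) + "\n"

# Hypothetical raw statistics read from the monitored application.
raw_stats = {"db0": 12, "db1": 3}
text = render_gauge(
    "app_keys",
    "Number of keys per database (illustrative).",
    [({"db": db}, count) for db, count in raw_stats.items()],
)
print(text)
```

The exporter would serve this text at its /metrics path so that Prometheus Server can pull it.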

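The Prometheus community provides a large number of exporters, such as Node Exporter, MongoDB Exporter, and MySQL Exporter.

Pushgateway

Short-lived jobs may exit before Prometheus gets a chance to scrape them, so they can only push their metrics. The Pushgateway component exists for this case.

The job pushes its monitoring data to the Pushgateway, and the Retrieval component of Prometheus Server then pulls that data from the Pushgateway.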
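As a rough sketch of the push side, the snippet below builds the HTTP request a short-lived job could send to a Pushgateway, using only the standard library. The gateway address, job name, and metric are assumptions made for illustration, and the request is built but not sent. It assumes job and instance names that contain no slashes.

```python
import urllib.request
from urllib.parse import quote

def build_push_request(gateway, job, body, instance=None):
    """Build (but do not send) a PUT that replaces the metrics of one group."""
    path = f"/metrics/job/{quote(job, safe='')}"
    if instance is not None:
        path += f"/instance/{quote(instance, safe='')}"
    return urllib.request.Request(
        url=gateway.rstrip("/") + path,
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )

# Hypothetical metric reported by a nightly backup job; the body ends with a newline.
body = (
    "# TYPE backup_last_success_timestamp_seconds gauge\n"
    "backup_last_success_timestamp_seconds 1700000000\n"
)
req = build_push_request(
    "http://pushgateway.example:9091", "nightly_backup", body, instance="db01"
)
print(req.get_method(), req.full_url)
# To actually push: urllib.request.urlopen(req, timeout=5)
```

Prometheus Server is then configured to scrape the Pushgateway like any other Target.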

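Alertmanager

Prometheus Server evaluates alerting rules against the metrics it scrapes, but it does not deliver notifications itself. When it detects an abnormal value, the alerting mechanism sends feedback or warnings to users so that they can respond in time.

Prometheus Server pushes alerts to the Alertmanager component.

Alertmanager then follows the alert routes configured by the user and notifies the right recipients through the configured channels.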
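The routing idea can be modeled in a few lines of Python. This is a simplified sketch, not Alertmanager's actual routing tree: the flat `ROUTES` table, the receiver names, and the sample alerts are all hypothetical.

```python
from collections import defaultdict

# Hypothetical routing table: the first matching route wins.
ROUTES = [
    ({"severity": "critical"}, "oncall-sms"),
    ({"team": "db"}, "dba-email"),
]
DEFAULT_RECEIVER = "ops-email"

def pick_receiver(labels):
    """Return the receiver of the first route whose matchers all equal the labels."""
    for matchers, receiver in ROUTES:
        if all(labels.get(key) == val for key, val in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER

def group_alerts(alerts, group_by=("alertname",)):
    """Bucket alerts per receiver and per group_by label values."""
    groups = defaultdict(list)
    for labels in alerts:
        key = (pick_receiver(labels), tuple(labels.get(name, "") for name in group_by))
        groups[key].append(labels)
    return dict(groups)

alerts = [
    {"alertname": "HighCPU", "severity": "critical", "instance": "192.168.56.18:9100"},
    {"alertname": "HighCPU", "severity": "critical", "instance": "192.168.56.19:9100"},
    {"alertname": "SlowQuery", "team": "db", "severity": "warning"},
    {"alertname": "DiskFull", "severity": "warning"},
]
grouped = group_alerts(alerts)
for (receiver, group), members in grouped.items():
    print(receiver, group, len(members))
```

The two HighCPU alerts land in one group, so the on-call recipient would get a single notification covering both hosts.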

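Remote TSDB

Data is a gold mine, but the TSDB built into Prometheus is not meant for long-term storage (local retention defaults to 15 days and is configurable).

To keep historical monitoring data for longer, use a remote TSDB such as InfluxDB for long-term storage.

With a large volume of historical alert data and logs, machine learning can be used to predict alert trends, working toward intelligent operations (AIOps).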

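III. Deploying Prometheus

1. Download the components

2. Start Prometheus Server

Prometheus Server is written in Go, so you just run the binary.

Prometheus Server is instrumented itself and exposes its own metrics at http://<external IP of the host>:9090/metrics.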

[root@localhost prometheus-2.40.2]# ls
console_libraries  consoles  data  LICENSE  NOTICE  prometheus  prometheus.yml  promtool
[root@localhost prometheus-2.40.2]# ./prometheus

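3. Start node_exporter

Deploy node_exporter on each monitored host. node_exporter targets Unix-like systems; on Windows, use the community windows_exporter or write your own exporter.

Once started, node_exporter listens at http://<monitored host IP>:9100/metrics, and Prometheus Server pulls the data over HTTP.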

[root@localhost node_exporter-1.4.0.linux-amd64]# ls
LICENSE  node_exporter  NOTICE
[root@localhost node_exporter-1.4.0.linux-amd64]# ./node_exporter 
ts=2022-11-18T10:22:19.410Z caller=node_exporter.go:182 level=info msg="Starting node_exporter" version="(version=1.4.0, branch=HEAD, revision=7da1321761b3b8dfc9e496e1a60e6a476fec6018)"
ts=2022-11-18T10:22:19.410Z caller=node_exporter.go:183 level=info msg="Build context" build_context="(go=go1.19.1, user=root@83d90983e89c, date=20220926-12:32:56)"
ts=2022-11-18T10:22:19.410Z caller=node_exporter.go:185 level=warn msg="Node Exporter is running as root user. This exporter is designed to run as unprivileged user, root is not required."
ts=2022-11-18T10:22:19.410Z caller=filesystem_common.go:111 level=info collector=filesystem msg="Parsed flag --collector.filesystem.mount-points-exclude" flag=^/(dev|proc|run/credentials/.+|sys|var/lib/docker/.+|var/lib/containers/storage/.+)($|/)
ts=2022-11-18T10:22:19.410Z caller=filesystem_common.go:113 level=info collector=filesystem msg="Parsed flag --collector.filesystem.fs-types-exclude" flag=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$
ts=2022-11-18T10:22:19.411Z caller=diskstats_common.go:100 level=info collector=diskstats msg="Parsed flag --collector.diskstats.device-exclude" flag=^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$
ts=2022-11-18T10:22:19.411Z caller=node_exporter.go:108 level=info msg="Enabled collectors"
ts=2022-11-18T10:22:19.411Z caller=node_exporter.go:115 level=info collector=arp

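4. Configure prometheus.yml

Start with static configuration to tell Prometheus which Targets to pull from.

Prometheus rereads prometheus.yml when it receives SIGHUP, or an HTTP POST to /-/reload if it was started with --web.enable-lifecycle; it does not poll the file on a timer. Later, service discovery backends such as Eureka, Consul, or ZooKeeper make it possible to discover monitored services automatically without restarting Prometheus Server.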

global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "zhanggen_nodes"
    metrics_path: "/metrics"
    static_configs:
      - targets:
        - "192.168.56.18:9100"
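One simple step beyond static targets is file-based service discovery, where a job points at a JSON file through file_sd_configs and Prometheus picks up changes to that file without a restart. The sketch below, which assumes a hypothetical file name `nodes.json` and an illustrative `env` label, shows how a script could generate such a file.

```python
import json
import os
import tempfile

def write_targets(path, targets, labels):
    """Atomically write a file_sd target list (a JSON array of target groups)."""
    groups = [{"targets": sorted(targets), "labels": labels}]
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    # Rename into place so a reader never sees a half-written file.
    os.replace(tmp_path, path)
    return groups

# A temporary directory stands in for the directory Prometheus would watch.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "nodes.json")
written = write_targets(
    out_path, ["192.168.56.19:9100", "192.168.56.18:9100"], {"env": "demo"}
)
with open(out_path) as handle:
    print(handle.read())
```

In this setup the job in prometheus.yml would list the file under file_sd_configs instead of static_configs, and adding a host means rewriting the JSON file only.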

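5. View the monitored Targets

6. Query the alerts under one Job with PromQL

Standardized operations are the prerequisite for automated operations.

If the system layer, the K8s layer, and the business application layer all carry the same label in the configuration (for example, by being grouped under one Job), the alerts from every layer can be correlated and sent together.

IV. Prometheus metrics

Four metric types

Prometheus describes monitored metrics with four metric types

Counter

A counter holds a monotonically increasing value, such as the number of visits to a site. It cannot be negative and cannot decrease, but it can be reset to 0.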
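Because a counter only grows or resets to 0, a consumer can still compute how much it increased across a reset. The following is a simplified sketch of that idea with made-up sample values; PromQL's own rate() and increase() additionally extrapolate to the window boundaries, which this sketch does not attempt.

```python
def increase(samples):
    """Total increase over a list of counter values, treating any drop as a reset."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter restarted from 0, so everything since the reset counts.
            total += current
    return total

def per_second_rate(samples, interval_seconds):
    """Average per-second rate across evenly spaced samples."""
    elapsed = interval_seconds * (len(samples) - 1)
    return increase(samples) / elapsed

# Hypothetical site-visit counter; the process restarted before the fourth scrape.
visits = [100, 130, 160, 10, 40]
print(increase(visits))  # 100.0
print(per_second_rate(visits, 15))
```

Subtracting the first value from the last would give a negative number here, which is why reset handling matters when querying counters.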

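Gauge

A gauge stores values that rise and fall, such as the amount of free memory.

A gauge is a superset of a counter, but compared with a counter it can lose information.

A counter lets you know exactly how a metric changed over time, whereas a gauge only captures the value at each scrape, so whatever happened between scrapes is lost.

Histogram

A histogram describes how the observed values are distributed. Take request latency: out of 100,000 requests, 50,000 took less than 10 ms, 20,000 took between 10 and 50 ms, and 30,000 took between 50 and 100 ms. Prometheus exposes histogram buckets cumulatively, so these would appear as le="0.01" 50000, le="0.05" 70000, and le="0.1" 100000.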
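The latency example can be worked through in code. The sketch below turns the per-range counts (50,000, 20,000, and 30,000) into cumulative buckets and then estimates a quantile by linear interpolation inside a bucket, which is the general idea behind PromQL's histogram_quantile(). The function names are hypothetical and edge cases are handled only minimally.

```python
import math

def to_cumulative(upper_bounds, per_bucket_counts):
    """Turn per-range counts into cumulative buckets and add the +Inf bucket."""
    buckets, running = [], 0
    for bound, count in zip(upper_bounds, per_bucket_counts):
        running += count
        buckets.append((bound, running))
    buckets.append((math.inf, running))
    return buckets

def estimate_quantile(q, buckets):
    """Interpolate linearly inside the bucket that contains rank q * total."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == lower_count:
                return lower_bound  # nothing to interpolate over
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (bound - lower_bound) * fraction
        lower_bound, lower_count = bound, count
    return lower_bound

# Upper bounds in seconds: 10 ms, 50 ms, 100 ms.
buckets = to_cumulative([0.01, 0.05, 0.1], [50000, 20000, 30000])
print(buckets)
print(estimate_quantile(0.5, buckets))
print(estimate_quantile(0.9, buckets))
```

The estimate is only as precise as the bucket layout: every request inside a bucket is assumed to be spread evenly between its bounds.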

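Summary

Like a histogram, a summary describes how the observed values are distributed, but it presents the result differently. For the same request latency, a summary would say: out of 100,000 requests, 20% took less than 10 ms, 30% less than 50 ms, and 50% less than 100 ms. In other words, it reports quantiles rather than bucket counts.

Metric data format

Prometheus only stores aggregated time-series data in key/value form; it does not support storing text.

# HELP cpu_usage CPU usage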
# TYPE cpu_usage gauge
cpu_usage{core_number="1",machine_ip="192.168.56.18"} 25.0
cpu_usage{core_number="2",machine_ip="192.168.56.18"} 25.0
cpu_usage{core_number="3",machine_ip="192.168.56.18"} 25.0
cpu_usage{core_number="4",machine_ip="192.168.56.18"} 25.0
cpu_usage{core_number="1",machine_ip="192.168.56.19"} 23.0
cpu_usage{core_number="2",machine_ip="192.168.56.19"} 27.0
cpu_usage{core_number="3",machine_ip="192.168.56.19"} 28.0
cpu_usage{core_number="4",machine_ip="192.168.56.19"} 22.0
cpu_usage{core_number="1",machine_ip="192.168.56.20"} 21.0
cpu_usage{core_number="2",machine_ip="192.168.56.20"} 29.0
cpu_usage{core_number="3",machine_ip="192.168.56.20"} 32.0
cpu_usage{core_number="4",machine_ip="192.168.56.20"} 28.0

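Key

The key is the metric name. It usually indicates what is measured, such as CPU rate, memory usage, or partition usage ratio.

Value

A floating-point number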
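To show how the key, labels, and value fit together, here is a small parser for a single sample line, written with the standard library. It is a sketch under assumptions: the regular expressions cover the common case only, escaped characters in label values are not unescaped, and `parse_sample` is a hypothetical helper rather than part of any client library.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(?P<key>[a-zA-Z_][a-zA-Z0-9_]*)="(?P<val>(?:[^"\\]|\\.)*)"')

def parse_sample(line):
    """Split one exposition line into (metric name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = {
        m.group("key"): m.group("val")
        for m in LABEL_RE.finditer(match.group("labels") or "")
    }
    return match.group("name"), labels, float(match.group("value"))

name, labels, value = parse_sample(
    'cpu_usage{core_number="1",machine_ip="192.168.56.18"} 25.0'
)
print(name, labels, value)
```

Lines starting with # (HELP and TYPE) are metadata and would be skipped before calling the parser.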

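V. PromQL

Prometheus provides a built-in query language, PromQL (Prometheus Query Language), which lets users query and aggregate data in real time.

PromQL works with two kinds of vectors and has a built-in set of functions for processing data.

Instant vector: a set of time series that each contribute one sample at a single timestamp, for example the most recent one (one column of N rows)
Range vector: a set of time series that each contribute every sample within a given time range (N rows by N columns, a matrix)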

min_over_time(kube_pod_container_status_ready{namespace=~"aimaster-user-namespace.*",pod=~"app-workspace.*"}[4h]) == 1
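The query above asks whether a container stayed ready for the whole 4 hours: the minimum of a 0/1 readiness series is 1 only if it never dropped to 0. The sketch below models the difference between an instant selection and a range selection for one series in plain Python. The sample data is made up, samples are assumed to be sorted by time, and the exact window-boundary handling in Prometheus may differ from this model.

```python
def select_range(samples, now, window_seconds):
    """Range selection: every (timestamp, value) sample inside the window."""
    return [(t, v) for t, v in samples if now - window_seconds < t <= now]

def instant_value(samples, now, lookback_seconds=300):
    """Instant selection: the most recent sample within the lookback window."""
    recent = select_range(samples, now, lookback_seconds)
    return recent[-1][1] if recent else None

def min_over_time(samples, now, window_seconds):
    """Smallest value among the samples inside the window, or None if empty."""
    window = select_range(samples, now, window_seconds)
    return min(v for _, v in window) if window else None

# Hypothetical readiness samples, one per hour: ready, ready, not ready, ready, ready.
ready = [(0, 1), (3600, 1), (7200, 0), (10800, 1), (14400, 1)]
now = 14400
print(instant_value(ready, now))
print(min_over_time(ready, now, 4 * 3600))
print(min_over_time(ready, now, 4 * 3600) == 1)
```

The instant value says the pod is ready now, while the range function reveals that it was not ready at some point during the window.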

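VI. Alerting strategy

Beyond basic alert notification, Alertmanager also provides features such as grouping, inhibition, and silencing.

Configuration example

groups:
- name: test-mysql-rule
  rules:
  - alert: "Connection count alert"
    expr: mysql_global_variables_mysqlx_max_connections > 90   # PromQL: alert when the value exceeds 90
    for: 1s
    labels:
      severity: warning
    annotations:
      summary: "Service name: {{$labels.alertname}}"
      description: "Business MySQL connection count alert: current value is {{ $value }}"
      value: "{{ $value }}"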
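The `for` field means the expression has to stay true for that long before the alert moves from pending to firing. A minimal model of that state machine is sketched below, with a made-up series of connection counts and an assumed 15-second evaluation interval; it is an illustration, not the Prometheus rule evaluator.

```python
def alert_states(values, threshold, for_seconds, interval_seconds):
    """Replay rule evaluations and return the alert state after each one."""
    states, active_since = [], None
    for step, value in enumerate(values):
        now = step * interval_seconds
        if value > threshold:
            if active_since is None:
                active_since = now  # the condition just became true
            held = now - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states

# Hypothetical connection counts observed at consecutive evaluations.
connections = [80, 95, 97, 85, 92, 93, 99]
states = alert_states(connections, threshold=90, for_seconds=30, interval_seconds=15)
print(states)
```

With a longer `for`, the brief spike at the second and third evaluations never fires because the dip at the fourth resets the timer, which is how `for` filters out flapping values.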
