
Requirements

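In the Grafana monitoring dashboard, the pods of some services show CPU usage as high as 3 cores:
[Screenshot: Grafana panel showing the pod's CPU usage]
The deployment manifest for this pod is: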

spec:
  template:
    spec:
      containers:
        - name: data-fusion-server
          resources:
            limits:
              cpu: '2'
              memory: 4Gi
            requests:
              cpu: 10m
              memory: 500Mi
          livenessProbe:
            httpGet:
              path: /health
              port: tcp
              scheme: HTTP
            initialDelaySeconds: 30
            timeoutSeconds: 1
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health
              port: tcp
              scheme: HTTP
            initialDelaySeconds: 30
            timeoutSeconds: 1
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          startupProbe:
            httpGet:
              path: /health
              port: tcp
              scheme: HTTP
            initialDelaySeconds: 80
            timeoutSeconds: 1
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3

In other words, k8s limits the CPU available to this pod to 2 cores.
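
To compare the observed usage with the values in the manifest, it helps to convert Kubernetes CPU quantities to cores. The following is a minimal sketch, not part of the original setup: the helper name cpu_quantity_to_cores is hypothetical, and it only handles plain numbers and the m (millicore) suffix used in the manifest above.

```python
from decimal import Decimal


def cpu_quantity_to_cores(quantity: str) -> Decimal:
    """Convert a Kubernetes CPU quantity such as '2', '0.5' or '500m' to cores.

    Hypothetical helper: only plain numbers and the 'm' (millicore) suffix
    are supported here.
    """
    q = quantity.strip()
    if q.endswith("m"):
        # 1000m equals one core
        return Decimal(q[:-1]) / Decimal(1000)
    return Decimal(q)


limit = cpu_quantity_to_cores("2")      # resources.limits.cpu
request = cpu_quantity_to_cores("10m")  # resources.requests.cpu
observed = Decimal("3")                 # value read from the Grafana panel

print(f"limit={limit} cores, request={request} cores")
print(f"usage / limit = {observed / limit}")
```

A ratio above 1 means the dashboard reports more CPU than the configured limit, which is the situation described here.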

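So the CPU used by the pod is higher than the CPU limit configured in k8s.

This roughly falls into two cases:

The pod genuinely needs 3 CPUs. In the screenshot above, for example, CPU usage stays very stable (the screenshot is cropped and covers only 2 hours). In this case developers should ask ops to raise resources.limits.cpu. This does not excuse developers from investigating why the pod uses so much CPU, but some pods simply do need more CPU than others.
The pod's CPU usage is low most of the time (say 0.9 CPU). Then, for some not-yet-understood reason, it jumps to 3 CPUs for a while (say 10 minutes) and still has not come down after those 10 minutes. Alternatively, usage does drop after 10 minutes, in which case check whether a scheduled job was triggered.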

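In the second case, if usage stays high, the service is already unhealthy and will most likely fail to serve requests properly. Without a suitable load-balancing strategy, user requests routed to this pod will not be handled correctly.

Hence the requirement: monitor the pod's CPU usage and restart the pod automatically when it is abnormal, instead of relying on a developer or ops engineer noticing by chance and restarting it by hand.

Of course, when a pod's CPU usage spikes abnormally, developers still need to find out what is wrong in the application or code.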

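Research

This mostly meant asking ChatGPT, which answered as follows:

Automatic restart indirectly via resource limits

Set a CPU limit (limits). When the Pod's CPU usage exceeds this value, Kubernetes throttles it, which may make the application unavailable or put it into an abnormal state, so that the health check fails and the Pod is restarted: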

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: my-container
      image: my-image
      resources:
        requests:
          cpu: "500m"  # requested CPU
        limits:
          cpu: "1000m"  # maximum usable CPU
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5

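Custom monitoring and restart mechanism

To monitor CPU usage more precisely and restart Pods automatically, Prometheus can be combined with the Kubernetes API. The steps are:

Deploy Prometheus and Alertmanager
Prometheus monitors the Pods' CPU usage, and Alertmanager triggers actions based on Prometheus alerting rules. Example Prometheus alert configuration, which raises an alert when a Pod's CPU usage exceeds 90%: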

groups:
  - name: pod-cpu-usage
    rules:
      - alert: HighCpuUsage
        expr: sum(rate(container_cpu_usage_seconds_total[1m])) by (pod) / sum(kube_pod_container_resource_limits_cpu_cores) by (pod) > 0.9
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} CPU usage is above 90%"
          description: "Pod {{ $labels.pod }} is using more than 90% of its allocated CPU resources."

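[Note] The expr expression in this configuration is wrong; the correction is given later in this post.

Configure the Alertmanager action
Use an Alertmanager webhook to trigger an automation script that restarts the affected Pod through the Kubernetes API: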

receivers:
  - name: "restart-pod"
    webhook_configs:
      - url: "http://your-webhook-service/restart-pod"

Implement the webhook service
Use a script that listens for Alertmanager's webhook requests and restarts the Pod (here by shelling out to kubectl):

from flask import Flask, request
import subprocess

app = Flask(__name__)


@app.route('/restart-pod', methods=['POST'])
def restart_pod():
    data = request.json
    # Only the first alert in the payload is handled here.
    pod_name = data['alerts'][0]['labels']['pod']
    namespace = "default"  # adjust as needed
    subprocess.run(["kubectl", "delete", "pod", pod_name, "-n", namespace])
    return "Pod restarted", 200


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

Deploy the webhook service and configure it as Alertmanager's webhook.
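
The script above shells out to kubectl. As an alternative, the Pod can be deleted through the Kubernetes API directly. The following is a minimal sketch assuming the official kubernetes Python client is installed (pip install kubernetes); the helper names delete_pod and make_core_v1_api are hypothetical and were not part of the original answer.

```python
def delete_pod(api, pod_name: str, namespace: str) -> None:
    """Delete one Pod via a CoreV1Api-like object.

    If the Pod is managed by a controller (Deployment, StatefulSet, ...),
    the controller recreates it, which acts as a restart. A bare Pod
    would simply be gone.
    """
    api.delete_namespaced_pod(name=pod_name, namespace=namespace)


def make_core_v1_api():
    """Build a CoreV1Api client from in-cluster or local kubeconfig credentials."""
    # Imported lazily so delete_pod() can be exercised without the library.
    from kubernetes import client, config

    try:
        config.load_incluster_config()   # when running inside the cluster
    except config.ConfigException:
        config.load_kube_config()        # fall back to ~/.kube/config
    return client.CoreV1Api()


# Usage (placeholder names, requires cluster access):
#   delete_pod(make_core_v1_api(), "my-pod", "default")
```

Passing the API object in as a parameter keeps the delete logic easy to test with a stand-in object, and avoids depending on a kubectl binary being present where the webhook runs.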

Use a Kubernetes Operator

Write a custom Kubernetes Operator that monitors Pod CPU usage in real time and, when a threshold is reached, restarts the Pod through the Kubernetes API (a delete or restart call).

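Practice

Option 1: automatic restart indirectly via resource limits

Principle: when the Pod's CPU usage exceeds the limits value, Kubernetes throttles its CPU, which may make the application unavailable or abnormal, so that the health check fails and the Pod is restarted.

In practice, I set resources.limits.cpu and lowered livenessProbe.timeoutSeconds (the minimum is 1, i.e. 1s), but the probe did not detect the pod as dead. So combining resources.limits.cpu, a livenessProbe and restartPolicy: Always to watch a pod's CPU usage and restart it automatically does not work, or at least is unreliable (imprecise). Besides, restricting the resources available to a service may hurt its stability.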

Option 2: custom monitoring and restart mechanism

Prometheus Alert

Write the Prometheus alert configuration file restart-pod.yaml:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    # Keep these labels consistent with ruleSelector -> matchLabels in the Prometheus CRD.
    release: kube-prom-stack
  name: kube-state-metrics
spec:
  groups:
    - name: pod-cpu-usage
      rules:
        - alert: HighCpuUsage
          expr: sum by (pod) (rate(container_cpu_usage_seconds_total[1m])) / sum by (pod) (kube_pod_container_resource_limits{resource="cpu"}) > 0.9
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} CPU usage is above 90%"
            description: "Pod {{ $labels.pod }} is using more than 90% of its allocated CPU resources."

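Then run: kubectl apply -f restart-pod.yaml -n observe

In fact, there were quite a few pitfalls here.

Prerequisite knowledge: a Prometheus alert has 3 states: Inactive, Pending and Firing.
[Screenshot: Prometheus Alerts page showing the three states]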

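Inactive: the alert expression currently matches nothing, so the alert is not active.
Pending: the expression matches, but has not yet held for the duration set in for.
Firing: the expression has held for the full for duration, and the alert is sent to Alertmanager.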
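
The three states can be illustrated with a small simulation. This is a simplified sketch of the behaviour, not Prometheus code: the function alert_states is hypothetical, and it assumes the for duration is expressed as a whole number of evaluation intervals (for example, for: 1m with an assumed 30s evaluation interval gives 2).

```python
from typing import List


def alert_states(condition: List[bool], for_intervals: int) -> List[str]:
    """Return the alert state after each rule evaluation.

    condition[i] says whether expr matched at evaluation i.
    for_intervals is the `for` duration measured in evaluation intervals.
    """
    states = []
    active = False
    held = 0  # intervals elapsed since the condition first became true
    for matched in condition:
        if not matched:
            # Any evaluation without a match resets the alert.
            active = False
            states.append("inactive")
            continue
        if not active:
            active = True
            held = 0
        else:
            held += 1
        states.append("firing" if held >= for_intervals else "pending")
    return states


print(alert_states([False, True, True, True, False, True], for_intervals=2))
```

The point of the sketch is that a rule only reaches Firing after the condition has held continuously, so short spikes stay in Pending and never reach Alertmanager.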

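After deploying the alert rule with kubectl apply, the custom rule pod-cpu-usage showed up on the Prometheus Alerts page but stayed Inactive, even though several pods in the test environment were already using more CPU than their k8s limit.

To troubleshoot, I again used Prometheus itself: click Graph and enter the expr given by ChatGPT: sum(rate(container_cpu_usage_seconds_total[1m])) by (pod) / sum(kube_pod_container_resource_limits_cpu_cores) by (pod) > 0.9
The query returns no data.

It turned out that the metric kube_pod_container_resource_limits_cpu_cores does not exist at all (that name comes from kube-state-metrics releases before v2.0). What exists is kube_pod_container_resource_limits, and more specifically the series kube_pod_container_resource_limits{resource="cpu"}.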
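
The same check can be scripted against the Prometheus HTTP API instead of the Graph page. The following is a minimal sketch: /api/v1/query is the standard instant-query endpoint, while the function names and the base URL in the usage comment are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, expr: str) -> str:
    """Build the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": expr})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def pods_from_response(body: dict) -> dict:
    """Map pod name -> sample value from an instant-query response of type vector."""
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error')}")
    ratios = {}
    for sample in body["data"]["result"]:
        pod = sample["metric"].get("pod")
        if pod is not None:
            # value is [timestamp, "<number as string>"]
            ratios[pod] = float(sample["value"][1])
    return ratios


def query_pod_ratios(base_url: str, expr: str, timeout: float = 10.0) -> dict:
    """Run the query over HTTP and return the parsed result."""
    with urllib.request.urlopen(build_query_url(base_url, expr), timeout=timeout) as resp:
        return pods_from_response(json.load(resp))


# Usage (placeholder URL):
#   query_pod_ratios("http://prometheus.example:9090",
#                    'sum by (pod) (kube_pod_container_resource_limits{resource="cpu"})')
```

An empty result for an expression is a quick hint that one of the metric names in it does not exist in this Prometheus instance.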

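With the corrected expression, the query returns data.
After applying the modified restart-pod.yaml with kubectl apply, several pods (3 in my case) show up in the Pending state.

The Prometheus Graph page is fairly basic. Grafana can be used instead, i.e. the expr expression of a Prometheus alert can also be validated in Grafana.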

Alertmanager

The next step is to modify Alertmanager. Here is the alertmanager.yaml configuration first:

global:
  resolve_timeout: 5m
inhibit_rules:
  - equal:
      - namespace
      - alertname
    source_matchers:
      - severity = critical
    target_matchers:
      - severity =~ warning|info
  - equal:
      - namespace
      - alertname
    source_matchers:
      - severity = warning
    target_matchers:
      - severity = info
  - equal:
      - namespace
    source_matchers:
      - alertname = InfoInhibitor
    target_matchers:
      - severity = info
  - target_matchers:
      - alertname = InfoInhibitor
receivers:
  - name: "restart-pod"
    webhook_configs:
      - url: "http://33.44.55.66:5000/restart-pod"
route:
  group_by:
    - namespace
  group_interval: 5m
  group_wait: 30s
  receiver: "restart-pod"
  repeat_interval: 12h
  routes:
    - matchers:
        - alertname = "Watchdog"
      receiver: "restart-pod"
templates:
  - /etc/alertmanager/config/*.tmpl

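The main change is the added receiver. Note that because the top-level route also uses receiver: "restart-pod", every alert, not only HighCpuUsage, is delivered to the webhook.

Background: the Alertmanager configuration is stored in a Secret.

If you are comfortable with k8s and Helm, you can change it from the command line.

I am not, so I used KubeSphere: open Configuration > Secrets and search for alertmanager (note: the KubeSphere search is case-sensitive, so searching for Alertmanager returns nothing).
Open it; the page shows a configuration item named alertmanager.yaml.

Click the show/hide button in the upper right corner to see the plain text, which has the structure shown above.

Copy the plain text, edit it in Sublime Text to add the receivers configuration, and then Base64-encode the modified alertmanager.yaml content, for example with an online Base64 encode/decode tool.

Then click alertmanager-kube-prom-stack-alertmanager, choose Edit YAML, and replace the value with the new Base64-encoded content.
Watch out for line breaks and spaces.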
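
The Base64 step can also be done locally rather than with an online tool, which avoids pasting the configuration into a third-party site and sidesteps stray line breaks. The following is a minimal sketch using only the standard library; the function names and the sample text are illustrative.

```python
import base64


def encode_secret_value(text: str) -> str:
    """Base64-encode text as a single line, suitable for a Secret's data field."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_secret_value(value: str) -> str:
    """Decode a Base64 value back to text, to double-check the result."""
    return base64.b64decode(value).decode("utf-8")


# Illustrative fragment; in practice this would be the full alertmanager.yaml text.
sample = 'receivers:\n  - name: "restart-pod"\n'
encoded = encode_secret_value(sample)

print(encoded)
print(decode_secret_value(encoded) == sample)
```

Decoding the value again before saving it is a cheap way to confirm that indentation and newlines in the YAML survived the round trip.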

Webhook

The last step is debugging the Python script. Here is the final version first:

from flask import Flask, request
import subprocess
import logging
import jsonpath

# Configure logging once at import time; messages go to restart-pod.log.
logging.basicConfig(filename='restart-pod.log', level=logging.INFO)
logger = logging.getLogger(__name__)
app = Flask(__name__)


@app.route('/restart-pod', methods=['POST'])
def restart_pod():
    data = request.json
    # Collect the pod label of every alert in the payload.
    pods_name = jsonpath.jsonpath(data, '$.alerts[*].labels.pod')
    logger.info(pods_name)
    # The result is sometimes a bool instead of a list (invalid JSON?); skip it.
    if isinstance(pods_name, bool):
        logger.info("ignored")
        return "ignored", 200
    namespace = "test-tesla"  # adjust as needed
    for pod_name in pods_name:
        subprocess.run(["kubectl", "delete", "pod", pod_name, "-n", namespace])
        logger.info("kubectl delete pod:%s", pod_name)
    return "Pod restarted", 200


if __name__ == "__main__":
    app.run(host="33.44.55.66", port=5000)

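Problems encountered here:

ModuleNotFoundError: No module named 'flask'
Hard to debug, so the logging module was added
Parsing the JSON payload
The JsonPath module was added for that
More than one pod can be affected: as noted earlier, 3 pods were in the Pending state, hence the for loop
Other errors, such as: for pod_name in pods_name: TypeError: 'bool' object is not iterable

The command to run the Python script as a background process is: nohup python restart-pod-webhook.py > tmp.log &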

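Open issues

Parsing the webhook body with the JsonPath expression '$.alerts[*].labels.pod' sometimes yields a bool, and I do not know why. One likely explanation is that the jsonpath package returns False rather than an empty list when the expression matches nothing, for example when none of the alerts in the payload carries a pod label.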
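
One way to avoid the bool result altogether is to walk the payload with plain dictionary access. The following is a minimal sketch, not the script used in this post: the function pods_to_restart is hypothetical, it assumes the standard Alertmanager webhook payload layout (an alerts list whose items have status and labels), and filtering on the alert name and on status are added assumptions.

```python
from typing import Any, Dict, List, Optional, Tuple


def pods_to_restart(
    payload: Optional[Dict[str, Any]],
    alertname: str = "HighCpuUsage",
    default_namespace: str = "default",
) -> List[Tuple[str, str]]:
    """Return unique (namespace, pod) pairs for firing alerts of the given name."""
    targets: List[Tuple[str, str]] = []
    seen = set()
    for alert in (payload or {}).get("alerts", []):
        labels = alert.get("labels", {})
        if alert.get("status") != "firing":
            continue  # skip notifications for resolved alerts
        if labels.get("alertname") != alertname:
            continue  # skip unrelated alerts routed to the same webhook
        pod = labels.get("pod")
        if not pod:
            continue  # alert carries no pod label
        target = (labels.get("namespace", default_namespace), pod)
        if target not in seen:
            seen.add(target)
            targets.append(target)
    return targets


example = {
    "status": "firing",
    "alerts": [
        {"status": "firing", "labels": {"alertname": "HighCpuUsage", "pod": "a"}},
        {"status": "resolved", "labels": {"alertname": "HighCpuUsage", "pod": "b"}},
        {"status": "firing", "labels": {"alertname": "Watchdog"}},
    ],
}
print(pods_to_restart(example))
```

The function always returns a list, so the caller can loop over it without a type check. Skipping resolved alerts also matters when send_resolved is enabled, since otherwise a pod would be deleted a second time when its alert clears.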

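Option 3: Kubernetes Operator

Write a custom Kubernetes Operator that monitors Pod CPU usage in real time and restarts the Pod through the Kubernetes API when a threshold is reached.
The bar is fairly high: you need to be familiar with a K8S Operator framework, know Go syntax, and know the APIs that k8s provides.

References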

ChatGPT
Google