Monitoring Pod status in K3s with Prometheus
Step 1: Install kube-state-metrics
1. Option 1: Download the chart file
wget https://charts.bitnami.com/bitnami/kube-state-metrics-3.4.1.tgz
tar -zxvf kube-state-metrics-3.4.1.tgz
cd kube-state-metrics
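2. If the image in values.yaml already points to a domestic (China) mirror, leave it unchanged; otherwise edit the image section as follows: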
image:
  registry: docker.io
  repository: bitnami/kube-state-metrics
  tag: 1.9.7
3. Install
helm install kube-state-metrics . \
  --namespace kube-system \
  --create-namespace
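As an alternative to editing values.yaml, the image registry can be overridden on the command line. The sketch below wraps the install command from this step in a small function. The function name `install_ksm` and the example registry are hypothetical, and the `image.registry` key is assumed to match the values.yaml layout shown above.

```shell
# Install the chart from the current directory, overriding the image
# registry with --set instead of editing values.yaml.
# $1 = registry to pull from (hypothetical example: registry.example.com)
install_ksm() {
  registry="$1"
  helm install kube-state-metrics . \
    --namespace kube-system \
    --create-namespace \
    --set image.registry="$registry"
}
```

Calling `install_ksm registry.example.com` from inside the unpacked chart directory runs the same install as above with only the registry changed.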
Option 2: Deploy directly with kubectl
1. If Helm is completely unavailable, you can deploy directly from YAML:
kubectl apply -f https://raw.githubusercontent.com/bitnami/charts/main/bitnami/kube-state-metrics/templates/deployment.yaml
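# Manually change the image in the YAML to a domestic mirror before applying.
# Note: files under a chart's templates/ directory are Helm templates, so this file must be rendered to plain YAML before kubectl can apply it.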
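Changing the image by hand can also be scripted. The following is a minimal sketch that assumes you already have a plain (rendered) manifest saved locally and that its image references start with `docker.io/`. The function name `rewrite_registry` and the mirror host are hypothetical.

```shell
# Rewrite "image: docker.io/..." references in a manifest read from stdin
# so they point at another registry. Other lines pass through unchanged.
# $1 = replacement registry (must not contain '#' or '&', which sed treats specially here)
rewrite_registry() {
  sed -E "s#(image:[[:space:]]*)docker\.io/#\1${1}/#"
}

# Usage (deployment.yaml is a hypothetical local file name):
#   rewrite_registry registry.example.com < deployment.yaml | kubectl apply -f -
```

Quoted image values (`image: "docker.io/..."`) are not handled by this pattern and would need a slightly wider expression.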
2. Verify the installation
kubectl get pods -n kube-system -l app.kubernetes.io/name=kube-state-metrics
kubectl logs -n kube-system -l app.kubernetes.io/name=kube-state-metrics
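Beyond checking that the pod is running, it can help to confirm that the metrics endpoint actually serves `kube_pod_status_phase`, since the alert rules in Step 4 rely on it. The sketch below counts pods currently in a given phase from the metrics text. The function name is hypothetical, and the Service name and port 8080 are assumed from the scrape target used in Step 2.

```shell
# Count pods whose kube_pod_status_phase sample for the given phase is 1.
# Reads Prometheus text-format metrics from stdin.
# $1 = phase name, e.g. Pending, Running, Failed
count_pods_in_phase() {
  awk -v phase="$1" '
    index($0, "kube_pod_status_phase{") == 1 &&
    index($0, "phase=\"" phase "\"") > 0 &&
    $NF == 1 { n++ }
    END { print n + 0 }'
}

# Usage (assumes the Service is named kube-state-metrics and listens on 8080):
#   kubectl port-forward -n kube-system svc/kube-state-metrics 8080:8080 &
#   curl -s http://localhost:8080/metrics | count_pods_in_phase Pending
```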
Step 2: Edit the prometheus.yaml file and add a scrape job for kube-state-metrics
- job_name: 'kube-state-metrics'
  static_configs:
    - targets: ['kube-state-metrics.kube-system.svc.cluster.local:8080']
      labels:
        service: kube-state-metrics
Step 3: After restarting Prometheus, open the Targets page in the Prometheus UI. The kube-state-metrics target you just added should be shown as UP.
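The same check can be made from the command line by querying the `up` metric through the Prometheus HTTP API. This is a dependency-free sketch that pattern-matches the JSON response; a JSON tool such as jq would be more robust. The function name is hypothetical and the address localhost:9090 is an assumption about where Prometheus listens.

```shell
# Succeed if the instant-query response on stdin contains at least one
# sample whose value is "1" (that is, at least one target of the job is up).
has_up_sample() {
  grep -Eq '"value":\[[0-9.]+,"1"\]'
}

# Usage (assumes Prometheus is reachable on localhost:9090):
#   curl -sG http://localhost:9090/api/v1/query \
#     --data-urlencode 'query=up{job="kube-state-metrics"}' | has_up_sample && echo UP
```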
Step 4: Edit the rules.yaml file and add alert rules for monitoring pods
# Pod stuck in Pending
- alert: PodStuckPending
  expr: kube_pod_status_phase{phase="Pending"} == 1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Pod {{ $labels.pod }} stuck in Pending"
    description: "Pod {{ $labels.pod }} in {{ $labels.namespace }} has been Pending for more than 5 minutes"
# Pod in Failed state
- alert: PodFailed
  expr: kube_pod_status_phase{phase="Failed"} == 1
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Pod {{ $labels.pod }} in Failed state"
# Container restarting frequently
# rate() is per second, so > 0.5 means more than one restart every 2 seconds on average; lower the threshold if this alert never fires
- alert: PodCrashLooping
  expr: rate(kube_pod_container_status_restarts_total[5m]) > 0.5
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Pod {{ $labels.pod }} crash looping"
# Container killed by the OOM killer
- alert: ContainerOOMKilled
  expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
  labels:
    severity: critical
  annotations:
    summary: "Container {{ $labels.container }} OOM killed"
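Prometheus expects alert rules to sit inside a named group under a top-level `groups:` key. If your rules.yaml does not already have such a group, the rule list above can be wrapped in one. The sketch below does this with a small helper; the function name, the group name `pod-alerts`, and the file names in the usage comment are all hypothetical.

```shell
# Wrap a list of alert rules (read from stdin) in a named rule group.
# $1 = group name
wrap_rules() {
  group_name="$1"
  printf 'groups:\n  - name: %s\n    rules:\n' "$group_name"
  # Indent every rule line so it nests under "rules:".
  sed 's/^/      /'
}

# Usage (pod-rules.txt is a hypothetical file holding the rule list):
#   wrap_rules pod-alerts < pod-rules.txt > rules.yaml
#   promtool check rules rules.yaml
```

`promtool check rules` ships with Prometheus and reports syntax errors before the server is restarted. The rules file also has to be listed under `rule_files:` in the Prometheus configuration for the rules to be loaded.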
Step 5: After restarting Prometheus, open the Alerts page. The alert rules you just added should be listed there.
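To confirm from a terminal that all four rules were loaded, the `/api/v1/rules` endpoint of the Prometheus HTTP API can be checked for the alert names defined in Step 4. This is a rough sketch that searches the raw JSON rather than parsing it; the function name is hypothetical and localhost:9090 is an assumed address.

```shell
# Print every expected alert name that does not appear in the
# /api/v1/rules response read from stdin. Prints nothing when all
# four rules from Step 4 are present.
missing_alerts() {
  payload="$(cat)"
  for name in PodStuckPending PodFailed PodCrashLooping ContainerOOMKilled; do
    if ! printf '%s' "$payload" | grep -q "\"name\":\"$name\""; then
      echo "$name"
    fi
  done
}

# Usage (assumes Prometheus is reachable on localhost:9090):
#   curl -s http://localhost:9090/api/v1/rules | missing_alerts
```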