NVSentinel: Main Functions and Corresponding Components

📅 2026/7/3 4:25:49
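NVSentinel: Five Core Functions and Their Components

1. Full-coverage hardware and cluster monitoring

Summary: detects faults at the source.

Description: Runs mainly on every GPU node (most monitors are DaemonSets). It integrates with DCGM, system logs, NICs, cloud provider APIs, and K8s resources, collects all GPU, NIC, and scheduling anomalies, and normalizes them into HealthEvent records. This is the data input side of the whole system.

Components:
- gpu-health-monitor (DaemonSet): integrates with DCGM to collect GPU temperature, ECC, XID, and NVLink faults.
- syslog-health-monitor (DaemonSet): watches the system journal and captures GPU driver crashes and PCI errors.
- nic-health-monitor (DaemonSet): monitors hardware anomalies on NICs, RDMA, and high-speed interconnect links.
- metadata-collector (DaemonSet): collects hardware topology metadata for GPUs, NICs, and NVSwitch.
- csp-health-monitor (Deployment): polls public cloud APIs for scheduled VM maintenance events.
- kubernetes-object-monitor (Deployment): watches K8s Node/Pod/CRD resource changes and generates cluster resource events.
- preflight (init container template): runs a hardware self-check before a GPU Pod starts and reports an event on failure.
- platform-connectors (DaemonSet): the ingestion gateway. It receives gRPC events from all collectors, validates and normalizes them, writes them to the database, and keeps the K8s Node Condition in sync.

2. Storage and event bus

Summary: the data coordination foundation for the whole system.

Description: Persists all HealthEvent records in one place and exposes Change Streams so that all control-plane components can cooperate asynchronously and stay decoupled. It stores the full lifecycle state of each fault.

Components:
- mongodb-store (StatefulSet): production default and primary event store; provides Change Streams subscriptions.
- k8s-datastore (StatefulSet): lightweight alternative for testing; stores events in K8s CRDs and has no Change Streams.
- postgresql (StatefulSet): auxiliary store for audit logs and analytics data.
- incluster-file-server (Deployment): global configuration center. It distributes monitoring thresholds, CEL quarantine rules, and fault mapping templates, and supports all monitoring and self-healing modules.

3. Fully automated fault self-healing loop

Summary: the core function, a fault-handling pipeline.

Description: Reads fault events from the database and runs the complete automated flow "quarantine node → gracefully evict workload Pods → create hardware repair ticket → reboot or replace the server". It supports custom CEL rules and adapts to hybrid Slurm clusters.

Components:
- fault-quarantine (Deployment): CEL rule engine. It decides whether to cordon a node, adds fault taints, and updates the quarantine status of the event.
- node-drainer (Deployment): gracefully evicts all workload Pods from a faulty node according to per-namespace policy. The companion slinky-drainer adapts it to hybrid K8s/Slurm clusters.
- slurm-drain-monitor (Deployment): two-way sync between K8s and Slurm scheduling; automatically drains Slurm jobs from faulty nodes.
- fault-remediation (Deployment): after eviction completes, creates a RebootNode or TerminateNode maintenance CR (the repair ticket) according to the fault action.
- janitor (Deployment): CR controller that watches repair tickets and issues hardware operation commands.
- janitor-provider (Deployment): low-level hardware driver abstraction layer. It talks to IPMI, public cloud, and data-center hardware APIs to perform reboots or card replacement.

4. Event analysis, dashboards, and alert output

Summary: observation, post-incident review, and early warning of batch risks.

Description: Consumes all fault events read-only and performs multi-dimensional aggregation, cascading-failure detection, and time-series metric output. It can also push individual or aggregated alerts to external operations platforms.

Components:
- health-events-analyzer (Deployment): fault aggregation and analysis, batch and cascading-failure detection, Prometheus metric generation, composite cluster-risk alerts, and historical audit statistics.
- event-exporter (Deployment): forwards individual raw fault events to a Webhook or third-party alerting system.
- labeler (Deployment): automatically labels K8s nodes with GPU driver, DCGM, and hardware model information to help with filtering during analysis.

5. External integration and extension

Summary: integrates with third-party schedulers, operations tools, and monitoring platforms.

Description: Connects heterogeneous scheduling clusters, external operations systems, and monitoring platforms to provide cross-platform fault synchronization and two-way coordination.

Components:
- slurm-drain-monitor: two-way K8s ↔ Slurm scheduling sync.
- csp-health-monitor: public cloud maintenance notifications.
- event-exporter: external alerting platforms (DingTalk, WeCom, monitoring systems).
- janitor and janitor-provider: hardware operations through data-center DCIM, IPMI, and public cloud APIs.
- preflight: admission Webhook that validates hardware before a container is scheduled.
- slinky-drainer: eviction in hybrid clusters managed by the Slinky Slurm Operator.

6. Overall component architecture

[Architecture diagram. Layers: Health Detection Layer, Event Processing Core, Analysis Layer, Response Automation Layer, Support Services. Kubernetes API targets: Nodes (Cordon, Taint, Labels), Pods (Eviction), Custom Resources (RebootNode, TerminateNode).]

Components shown in the diagram, with their source directories:
- labeler (Deployment): modules/labeler/
- metadata-collector (Deployment): modules/metadata-collector/
- log-collector (Job): modules/log-collector/
- fault-quarantine (Deployment): modules/fault-quarantine/
- node-drainer (Deployment): modules/node-drainer/
- fault-remediation (Deployment): modules/fault-remediation/
- janitor (Deployment): modules/janitor/
- health-events-analyzer (Deployment): modules/health-events-analyzer/
- platform-connectors (DaemonSet): modules/platform-connectors/
- mongodb-store / postgresql (StatefulSet): distros/kubernetes/nvsentinel/charts/mongodb-store/
- gpu-health-monitor (DaemonSet): health-monitors/gpu-health-monitor/
- syslog-health-monitor (DaemonSet): health-monitors/syslog-health-monitor/
- csp-health-monitor (Deployment): health-monitors/csp-health-monitor/
- kubernetes-object-monitor (Deployment): health-monitors/kubernetes-object-monitor/

Summary

- Collection layer: the various *-health-monitor components and platform-connectors
- Storage and configuration layer: mongodb-store / postgresql / incluster-file-server
- Self-healing layer: fault-quarantine → node-drainer → fault-remediation → janitor suite
- Observation and analysis layer: health-events-analyzer, event-exporter, labeler
- Heterogeneous integration layer: slurm-drain-monitor, csp-health-monitor, janitor-provider

| Component | Type | Language | Description | Default state | Source |
| --- | --- | --- | --- | --- | --- |
| gpu-health-monitor | DaemonSet | Python | Monitors GPU hardware state through DCGM (XID errors, temperature, ECC, etc.) | Enabled | distros/kubernetes/nvsentinel/values.yaml:144 |
| syslog-health-monitor | DaemonSet | Go | Parses system logs through journalctl to detect hardware faults | Enabled | distros/kubernetes/nvsentinel/values.yaml:160 |
| csp-health-monitor | Deployment | Go | Polls cloud provider APIs for maintenance events | Disabled | distros/kubernetes/nvsentinel/values.yaml:158 |
| kubernetes-object-monitor | Deployment | Go | Monitors Kubernetes resources using CEL policies | Disabled | distros/kubernetes/nvsentinel/values.yaml:172 |
| platform-connectors | DaemonSet | Go | gRPC service that receives health events and persists them to the datastore | Enabled | distros/kubernetes/nvsentinel/values-tilt.yaml:131 |
| mongodb-store | StatefulSet | n/a | Persists health events and exposes them through change streams | Disabled (internal) | distros/kubernetes/nvsentinel/values.yaml:170 |
| health-events-analyzer | Deployment | Go | Detects event patterns using aggregation pipelines | Disabled | distros/kubernetes/nvsentinel/values.yaml:146 |
| fault-quarantine | Deployment | Go | Quarantines nodes using a CEL rule engine | Disabled | distros/kubernetes/nvsentinel/values.yaml:148 |
| node-drainer | Deployment | Go | Evicts Pods according to per-namespace policy | Disabled | distros/kubernetes/nvsentinel/values.yaml:150 |
| fault-remediation | Deployment | Go | Creates maintenance CRs from Go templates | Disabled | distros/kubernetes/nvsentinel/values.yaml:152 |
| janitor | Deployment | Go | Reboots or terminates nodes through cloud provider APIs | Disabled | distros/kubernetes/nvsentinel/values.yaml:154 |
| labeler | Deployment | Go | Labels nodes with DCGM and driver version information | Enabled | distros/kubernetes/nvsentinel/values.yaml:162 |
| metadata-collector | Deployment | Go | Collects GPU topology information (PCI, UUID, etc.) | Enabled | distros/kubernetes/nvsentinel/values.yaml:164 |
| log-collector | Job | Python | Collects diagnostic logs when a fault occurs | Disabled | distros/kubernetes/nvsentinel/values-full.yaml:226 |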
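The self-healing flow described in section 3 (quarantine, drain, repair ticket, hardware action) can be pictured as a chain of stages that each advance the state of one event. The sketch below is only an illustration under stated assumptions: the `HealthEvent` fields, the `Stage` enum, the `is_fatal` check standing in for a CEL rule, and the synchronous `run_pipeline` loop are all hypothetical. The real components are separate services that react to database change streams and call the Kubernetes API; none of that is modeled here.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Stage(Enum):
    """Hypothetical lifecycle states for one fault event."""
    DETECTED = 1
    QUARANTINED = 2
    DRAINED = 3
    TICKETED = 4
    REMEDIATED = 5


@dataclass
class HealthEvent:
    """Simplified stand-in for a HealthEvent; field names are assumptions."""
    node: str
    check: str
    is_fatal: bool
    action: str = "RebootNode"  # or "TerminateNode"
    stage: Stage = Stage.DETECTED
    history: List[str] = field(default_factory=list)


def fault_quarantine(ev: HealthEvent) -> bool:
    # Stand-in for the CEL rule evaluation: only fatal events are quarantined.
    if not ev.is_fatal:
        return False
    ev.history.append(f"cordon and taint {ev.node}")
    ev.stage = Stage.QUARANTINED
    return True


def node_drainer(ev: HealthEvent) -> bool:
    ev.history.append(f"evict workload pods from {ev.node}")
    ev.stage = Stage.DRAINED
    return True


def fault_remediation(ev: HealthEvent) -> bool:
    ev.history.append(f"create {ev.action} CR for {ev.node}")
    ev.stage = Stage.TICKETED
    return True


def janitor(ev: HealthEvent) -> bool:
    ev.history.append(f"execute {ev.action} on {ev.node}")
    ev.stage = Stage.REMEDIATED
    return True


# Stage order follows the flow described in section 3.
PIPELINE: List[Callable[[HealthEvent], bool]] = [
    fault_quarantine,
    node_drainer,
    fault_remediation,
    janitor,
]


def run_pipeline(ev: HealthEvent) -> HealthEvent:
    """Run stages in order; stop at the first stage that declines the event."""
    for step in PIPELINE:
        if not step(ev):
            break
    return ev


if __name__ == "__main__":
    event = run_pipeline(HealthEvent("gpu-node-1", "GpuXidError", is_fatal=True))
    for line in event.history:
        print(line)
```

Keeping each stage as a separate function mirrors the way the document assigns one responsibility per component, so a stage such as `fault_quarantine` can decline an event without the later stages ever seeing it.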