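1. What a "Cry Wolf" Incident Taught Me

In 2018 I lived through an incident I will never forget. At 2 a.m. my phone started buzzing nonstop as alert SMS messages arrived one after another: Redis connections over the limit, database CPU at 99%, API response time above 5 seconds. According to the alerting system, the whole system was down. I jumped out of bed to deal with it and found that nothing was wrong at all. Redis was fine, the database was fine, and API response times were a few dozen milliseconds.

The next day I asked around. An ops colleague had run database maintenance at 1 a.m., which triggered a large number of transient alerts, and those alerts were then delivered in one batch at 2 a.m. Every one of them was noise: a system under maintenance is expected to look abnormal.

From then on, our team started thinking seriously about how an alerting system should be designed: what kind of alert is actually valuable?

2. Metric System Design: Making the System "Visible"

Before you can monitor anything, you have to be clear about what to monitor. The industry has two classic monitoring methodologies: the RED method and the USE method.

2.1 RED Method (Service-Oriented)

Applies to stateless services (such as HTTP APIs):

- Rate: request rate (QPS/TPS)
- Errors: error rate
- Duration: response time distribution (p50/p90/p99)

2.2 USE Method (Resource-Oriented)

Applies to system resources (such as CPU, memory, disk):

- Utilization: how busy the resource is
- Saturation: how much work is queued
- Errors: error count

2.3 Our Metric System

The metric system we ended up with looks like this:

```yaml
# Microservice metric collection config (Prometheus format)

# Business metrics
app_business:
  order_count:
    type: counter
    description: Number of orders created
    labels: [service, status]
  payment_amount:
    type: counter
    description: Payment amount
    labels: [service, payment_type]
  active_users:
    type: gauge
    description: Number of active users
    labels: [service]

# HTTP metrics
http_requests:
  total:
    type: counter
    description: Total number of HTTP requests
    labels: [method, path, status]
  duration_seconds:
    type: histogram
    description: HTTP response time
    labels: [method, path]
    buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

# JVM metrics
jvm:
  memory_used_bytes:
    type: gauge
    description: JVM memory in use
    labels: [area, service]
  gc_count:
    type: counter
    description: Number of GC runs
    labels: [gc_type]
  thread_count:
    type: gauge
    description: Number of live threads
    labels: [thread_type]

# Database metrics
database:
  connections:
    type: gauge
    description: Number of database connections
    labels: [pool_name]
  query_duration_seconds:
    type: histogram
    description: SQL execution time
    labels: [operation, table]

# Cache metrics
redis:
  commands_total:
    type: counter
    description: Total number of Redis commands
    labels: [command, status]
  keyspace_keys:
    type: gauge
    description: Number of keys
    labels: [db]
  memory_used_bytes:
    type: gauge
    description: Memory used by Redis
```

2.4 Key Alert Threshold Design

```yaml
# Prometheus alerting rules (firing alerts are forwarded to AlertManager)
groups:
  - name: business_alerts
    rules:
      - alert: HighErrorRate
        # Aggregate by service so that $labels.service is available below
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (service) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate too high for service {{ $labels.service }}"
          # $value is a ratio (0.05 = 5%), so format it as a percentage
          description: "Error rate above 5% over the last 5 minutes, current value: {{ $value | humanizePercentage }}"

      - alert: OrderCountDrop
        expr: |
          sum(increase(app_business_order_count[10m])) < 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Abnormal drop in order volume"
          description: "Fewer than 100 orders in the last 10 minutes, possible business problem"

  - name: infrastructure_alerts
    rules:
      - alert: HighCPUUsage
        expr: |
          100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage too high"
          description: "CPU usage on server {{ $labels.instance }} is above 80%"

      - alert: JVMHeapMemoryHigh
        expr: |
          jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"} > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "JVM heap usage too high"
          description: "Heap usage of service {{ $labels.service }} is above 85%"

      - alert: DatabaseConnectionPoolExhausted
        expr: |
          datasource_connections_active / datasource_connections_max > 0.9
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool about to be exhausted"
          description: "Connection pool {{ $labels.pool_name }} is more than 90% used"
```

3. Grafana Dashboard Design

Alerts alone are not enough. You also need dashboards that let the team see the state of the system at a glance.

3.1 Layered Dashboard Design

```text
┌───────────────────────────────────────────────────────────────────────┐
│                            System Overview                            │
├─────────────────┬─────────────────┬─────────────────┬─────────────────┤
│ Request QPS     │ Error rate      │ Avg resp. time  │ Online users    │
│ ▓▓▓▓▓░░░        │ 0.12% ✅        │ 45ms ✅         │ 12,345          │
├─────────────────┴─────────────────┴─────────────────┴─────────────────┤
│                            Service Health                             │
├─────────────────┬─────────────────┬─────────────────┬─────────────────┤
│ Order service   │ Payment service │ User service    │ Product service │
│ ● Healthy 42ms  │ ● Healthy 38ms  │ ● Healthy 25ms  │ ● Healthy 55ms  │
├─────────────────┴─────────────────┴─────────────────┴─────────────────┤
│                         Infrastructure Status                         │
├─────────────────┬─────────────────┬─────────────────┬─────────────────┤
│ CPU: 45%        │ Memory: 62%     │ Disk: 78%       │ Network: OK     │
│ ▓▓▓▓░░░░░       │ ▓▓▓▓▓▓░░░       │ ▓▓▓▓▓▓▓▓░░      │ ● OK            │
└─────────────────┴─────────────────┴─────────────────┴─────────────────┘
```

3.2 Grafana Dashboard JSON Configuration

```json
{
  "dashboard": {
    "title": "Microservice Monitoring Dashboard",
    "tags": ["microservice", "production"],
    "timezone": "Asia/Shanghai",
    "panels": [
      {
        "title": "Service Request QPS",
        "type": "graph",
        "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
        "targets": [
          {
            "refId": "A",
            "expr": "sum(rate(http_requests_total{service=~\"$service\"}[1m])) by (service)",
            "legendFormat": "{{service}}"
          }
        ],
        "alert": {
          "name": "Abnormal QPS alert",
          "conditions": [
            {
              "evaluator": {"params": [10], "type": "lt"},
              "operator": {"type": "and"},
              "query": {"params": ["A", "5m", "now"]},
              "reducer": {"type": "avg"}
            }
          ],
          "frequency": "1m",
          "noDataState": "no_data"
        }
      },
      {
        "title": "Service Error Rate",
        "type": "stat",
        "gridPos": {"x": 12, "y": 0, "w": 6, "h": 4},
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 1},
                {"color": "red", "value": 5}
              ]
            },
            "unit": "percent",
            "decimals": 2
          }
        }
      },
      {
        "title": "JVM Heap Usage",
        "type": "gauge",
        "gridPos": {"x": 18, "y": 0, "w": 6, "h": 4},
        "targets": [
          {
            "expr": "jvm_memory_used_bytes{area=\"heap\"} / jvm_memory_max_bytes{area=\"heap\"} * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "mode": "percentage",
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 70},
                {"color": "red", "value": 85}
              ]
            },
            "unit": "percent",
            "max": 100
          }
        }
      },
      {
        "title": "Database Connection Pool",
        "type": "graph",
        "gridPos": {"x": 0, "y": 8, "w": 12, "h": 8},
        "targets": [
          {"expr": "datasource_connections_active{pool_name=~\"$pool\"}", "legendFormat": "Active connections"},
          {"expr": "datasource_connections_idle{pool_name=~\"$pool\"}", "legendFormat": "Idle connections"},
          {"expr": "datasource_connections_max{pool_name=~\"$pool\"}", "legendFormat": "Max connections"}
        ]
      }
    ]
  }
}
```

4. Alert Grading and Convergence

4.1 Alert Grading Strategy

We split alerts into five levels:

| Level | Name | Definition | Response time | Notification |
|-------|------|------------|---------------|--------------|
| P0 | Highest | Core business completely unavailable | Within 5 minutes | Phone + SMS + DingTalk |
| P1 | High | Core functionality degraded | Within 15 minutes | SMS + DingTalk |
| P2 | Medium | Non-core functionality abnormal | Within 1 hour | DingTalk |
| P3 | Low | Potential risk | Handled during working hours | Email |
| P4 | Info | Informational, no impact | Not handled | Log |

4.2 Alert Convergence Strategy

Alert storms are the biggest enemy. We use AlertManager's grouping and inhibition features:

```yaml
# AlertManager configuration
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alert@example.com'

# Alert routing
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s        # wait 30 seconds to collect a group
  group_interval: 5m     # send updates for a group every 5 minutes
  repeat_interval: 4h    # repeat an unresolved alert every 4 hours
  receiver: 'default'
  routes:
    # Per-service routing. Routes are tried in order and the first match wins
    # (unless `continue: true` is set), so this must come before the
    # severity routes or order-service alerts would never reach it.
    - match:
        service: order-service
      receiver: 'order-team'
      routes:
        - match:
            severity: critical
          receiver: 'order-team-critical'
          group_wait: 0s
    # P0/P1 alerts are sent immediately
    - match:
        severity: critical
      receiver: 'critical-alerts'
      group_wait: 0s
      repeat_interval: 1h
    # P2 alerts are sent after convergence
    - match:
        severity: warning
      receiver: 'warning-alerts'
      group_wait: 1m
      repeat_interval: 4h

receivers:
  - name: 'critical-alerts'
    # A receiver may only have one webhook_configs key, so both webhooks go in one list
    webhook_configs:
      # Phone notification (via Tencent Cloud voice service)
      - url: 'http://alert-phone.example.com/call'
        send_resolved: true
      # DingTalk notification (AlertManager already posts JSON)
      - url: 'http://dingtalk.example.com/webhook'
        send_resolved: true
        max_alerts: 10
    # Email notification
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true

  - name: 'warning-alerts'
    # DingTalk only, no phone or email
    webhook_configs:
      - url: 'http://dingtalk.example.com/webhook-warning'

  # Referenced by the routes above. Their notifier settings were not given,
  # so they are name-only placeholders (a name-only receiver sends nothing).
  - name: 'default'
  - name: 'order-team'
  - name: 'order-team-critical'

# Inhibition rules
inhibit_rules:
  # When a server is down, suppress every other alert from that server.
  # `equal` lists the labels that must match between source and target.
  - source_match:
      alertname: ServerDown
    target_match_re:
      alertname: '.+'
    equal: ['cluster', 'instance']
  # When a whole cluster is unavailable, suppress every other alert from that cluster
  - source_match:
      alertname: ClusterDown
    target_match_re:
      alertname: '.+'
    equal: ['cluster']
```

5. Pitfalls: Hard Lessons from Building an Alerting System

Pitfall 1: Alert storms cause the "cry wolf" effect

This is the biggest pitfall we hit. During one database primary/replica failover, several hundred alerts fired. The ops engineers were drowning in alerts and missed the one that really mattered: the disk on an application server was full.

Fix:

- Use AlertManager's group_by to aggregate alerts of the same kind
- Set a reasonable group_wait (30 seconds) so alerts are not fragmented
- Configure inhibition rules so that an upstream failure suppresses downstream alerts

Pitfall 2: False alerts trigger unnecessary emergency responses

Some metrics are inherently spiky (a momentary CPU surge, for example), but our alerts had no sensible `for` duration, so any short fluctuation fired an alert.

Fix:

- Every alert must have a `for` duration (usually 5 minutes) so that momentary fluctuations do not fire it
- Add progressive alerting for key metrics: warning first, then critical if the condition persists longer
- Review alert rules regularly and remove useless ones

Pitfall 3: Nobody handles alerts at night

Once, a P2 alert fired at 3 a.m., but only one person was notified, and that person was asleep. The problem was not discovered until the next morning.

Fix:

- Set up an on-call rotation with a named on-call engineer for every time slot
- P0/P1 alerts must go out by phone and must be acknowledged
- Outside working hours, P2 alerts go to the on-call engineer only, without disturbing anyone else

Pitfall 4: A single notification channel

At first we sent alerts by email only, and found that:

- Nobody reads email for urgent alerts
- Email delays lead to slow responses

Fix: set up multiple channels:

- Phone: P0, must be answered
- SMS: P0/P1
- DingTalk/Feishu group: all levels
- Email: P3/P4, for the record only

6. Case Study: How a Fintech Company Built a Complete Observability Stack

This company (call it Company A) is a consumer finance startup. In 2020 they built their observability stack from scratch.

Phase 1: Basic monitoring only (2020 Q1)

Their monitoring at the time:

- Only basic server monitoring (CPU, memory, disk)
- No application-level monitoring at all
- Alerts by email only
- Someone looked at the dashboard once every morning

Problem: users frequently complained that "the system is slow", but the development team had no idea what was happening.

Phase 2: Introducing APM (2020 Q2)

They introduced SkyWalking APM and gained distributed tracing:

- They could see the call relationships between services
- They found a large number of slow SQL queries
- They could see the response time distribution of every endpoint

Improvement: they could now locate problems, but they were still reactive and only found out after something had gone wrong.

Phase 3: A complete observability stack (2020 Q3-Q4)

They built out the full set of observability pillars, with alerting on top:

- Logs: ELK Stack
  - Structured logs (JSON format)
  - A TraceID carried through every log line
  - Log aggregation and search
- Metrics: Prometheus + Grafana
  - RED method for every HTTP service
  - USE method for all infrastructure
  - Custom business metrics
- Traces: SkyWalking
  - End-to-end tracing
  - Automatically generated topology maps
  - Slow-service analysis
- Alerting: AlertManager + DingTalk
  - Graded alerts (P0-P4)
  - Alert aggregation and convergence
  - On-call rotation

Results:

- MTTD (mean time to detect) dropped from 4 hours to 5 minutes
- MTTR (mean time to recovery) dropped from 2 hours to 30 minutes
- User complaints that "the system is slow" fell by 80%

7. Summary and Reflections

Key points for building a monitoring and alerting system:

- **Layered monitoring**: cover the infrastructure, application, and business layers
- **Golden signals**: latency, traffic, error rate, and saturation are the core
- **Alert grading**: not all alerts are equally important, so handle them by level
- **Alert convergence**: keep alert storms from burying the alerts that matter
- **On-call rotation**: make sure someone responds to every alert so none vanish unanswered
- **Continuous improvement**: review alert rules regularly and remove useless ones

The hard-won lesson: the point of an alerting system is not to "find every problem" but to "find the problems that really need human intervention". Too many alerts are just as harmful as too few.

Questions for you: does your team's alerting system have a "cry wolf" problem? If a P2 alert reached you in the middle of the night, what would you do?

Personal opinion, for reference only.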
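
As one way to make the grading table in 4.1 and the off-hours rule from Pitfall 3 concrete, here is a rough Python sketch of a level-to-channel lookup. It is an illustration, not our production code: the names `CHANNELS_BY_LEVEL`, `channels_for`, `needs_ack` and `recipients_for`, the 09:00-18:00 working-hours window, and the choice to send every non-P2 case to the whole team are all assumptions made for this example.

```python
from __future__ import annotations

from datetime import time

# Level -> notification channels, copied from the grading table in section 4.1.
CHANNELS_BY_LEVEL = {
    "P0": ["phone", "sms", "dingtalk"],
    "P1": ["sms", "dingtalk"],
    "P2": ["dingtalk"],
    "P3": ["email"],
    "P4": ["log"],
}

# Assumed working hours for this sketch; adjust to your own schedule.
WORK_START = time(9, 0)
WORK_END = time(18, 0)


def channels_for(level: str) -> list[str]:
    """Return the notification channels for a level (KeyError if unknown)."""
    return list(CHANNELS_BY_LEVEL[level])


def needs_ack(level: str) -> bool:
    """P0/P1 notifications must be acknowledged by a human (Pitfall 3)."""
    return level in ("P0", "P1")


def recipients_for(level: str, now: time, on_call: str, team: list[str]) -> list[str]:
    """Decide who gets notified.

    Outside working hours a P2 alert goes to the on-call engineer only
    (Pitfall 3). Sending every other case to the whole team is an
    assumption of this sketch, not something the article specifies.
    """
    off_hours = not (WORK_START <= now < WORK_END)
    if level == "P2" and off_hours:
        return [on_call]
    return list(team)
```

Under these assumptions, `recipients_for("P2", time(3, 0), "alice", ["alice", "bob"])` returns only the on-call engineer, and `channels_for("P2")` says to reach her through DingTalk, which is one possible answer to the midnight P2 question above.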