From Alibaba Cloud SLS to AWS CloudWatch: A Complete Hands-On Migration of K8s Log Collection + DingTalk Alerting

📅 2026/7/2 6:11:27
Background

We run our core workloads on three self-managed K8s servers. Logs had always been collected with Alibaba Cloud SLS, but a recent business change called for moving to AWS. Log collection looks like a small job until you actually do it: which collector to use, how to ship the logs, how to configure alarms, and how to notify DingTalk. After a round of pitfalls, trial and error, and assorted error messages, the full chain finally worked: K8s → Fluent Bit → CloudWatch → SNS → Lambda → DingTalk alert.

AWS CloudWatch Log Integration: K8s Cluster Log Collection + DingTalk Alerting, Complete Deployment Guide

1. Document overview

1.1 Purpose

Ship the container logs of a self-managed K8s cluster (3 servers, containerd runtime, Rocky Linux 9.7) to AWS CloudWatch and configure DingTalk alert notifications.

1.2 Architecture diagram

```text
┌────────────────────────────────────────────────────────────────────┐
│ K8s cluster (3 servers)                                            │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Fluent Bit (DaemonSet)                                         │ │
│ │ - Tails /var/log/containers/*.log                              │ │
│ │ - Parses Kubernetes metadata                                   │ │
│ └────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌────────────────────────────────────────────────────────────────────┐
│ Amazon CloudWatch                                                  │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Log group: /aws/containerinsights/{cluster-name}/application   │ │
│ │ - Log streams are named per Pod automatically                  │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Metric filter: ErrorCount                                      │ │
│ │ - Filter pattern: ERROR                                        │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Alarm: K8s-Error-DingTalk                                      │ │
│ │ - Condition: more than 5 ERROR lines in 5 minutes              │ │
│ └────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌────────────────────────────────────────────────────────────────────┐
│ SNS topic: K8s-DingTalk-Alarm                                      │
└────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌────────────────────────────────────────────────────────────────────┐
│ Lambda: DingTalkNotifier                                           │
└────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌────────────────────────────────────────────────────────────────────┐
│ DingTalk group robot                                               │
└────────────────────────────────────────────────────────────────────┘
```

2. Prerequisites

2.1 Environment

| Item | Value |
|------|-------|
| AWS account ID | Obtain by signing in to AWS |
| AWS region | Choose as needed |
| K8s version | v1.34.6 |
| Node count | 3 (1 control-plane + 2 worker) |
| Operating system | Rocky Linux 9.7 |
| Container runtime | containerd 2.2.3 |
| Log path | /var/log/pods///*.log |
| Collector | Fluent Bit (aws-for-fluent-bit) |

2.2 Required permissions

The AWS IAM user needs the following managed policies:

- CloudWatchAgentServerPolicy
- AmazonSNSFullAccess
- CloudWatchLogsFullAccess
- CloudWatchFullAccess
- AWSLambda_FullAccess
- IAMFullAccess (to create roles and policies)

2.3 Preparation

- AWS CLI installed and configured
- kubectl installed and able to reach the cluster
- Helm installed (optional)
- DingTalk group robot Webhook URL at hand

3. Step 1: AWS IAM role configuration

3.1 Create the IAM role

In the AWS console:

1. Go to IAM → Roles → Create role
2. Select AWS service → "EC2"
3. Attach the policy CloudWatchAgentServerPolicy
4. Role name: CloudWatchAgent-Role
5. Click Create role

3.2 Attach the IAM role to the EC2 instances

1. Go to the EC2 console → Instances
2. Select the instances to bind (your 3 K8s nodes)
3. Actions → Security → Modify IAM role
4. Select CloudWatchAgent-Role
5. Click Update IAM role

3.3 Verify that the IAM role is attached

```bash
curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/CloudWatchAgent-Role | head -20
```

Expected output: a JSON document containing AccessKeyId and SecretAccessKey.

3.4 Grant permissions to the sms-sender user

```bash
# Attach SNS permissions
aws iam attach-user-policy \
  --user-name sms-sender \
  --policy-arn arn:aws:iam::aws:policy/AmazonSNSFullAccess

# Attach CloudWatch Logs permissions
aws iam attach-user-policy \
  --user-name sms-sender \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchLogsFullAccess

# Attach CloudWatch alarm permissions
aws iam attach-user-policy \
  --user-name sms-sender \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchFullAccess

# Attach Lambda permissions
aws iam attach-user-policy \
  --user-name sms-sender \
  --policy-arn arn:aws:iam::aws:policy/AWSLambda_FullAccess
```

4. Step 2: Deploy the Fluent Bit collector

4.1 Create the deployment script

Create deploy-fluentbit.sh:

```bash
#!/bin/bash
# deploy-fluentbit.sh - deploy the Fluent Bit log collector
# Run on the K8s master node
set -e

echo "=========================================="
echo "Deploying the Fluent Bit log collector"
echo "=========================================="

# Variables
AWS_ACCOUNT_ID="997836553899"
REGION="ap-east-1"
ROLE_NAME="CloudWatchAgent-Role"
CLUSTER_NAME="k8s"

# 1. Create the namespace
echo "Creating namespace amazon-cloudwatch"
kubectl create namespace amazon-cloudwatch --dry-run=client -o yaml | kubectl apply -f -

# 2. Create the ServiceAccount
# Note: the eks.amazonaws.com/role-arn annotation only takes effect where
# IAM Roles for Service Accounts is set up. On this self-managed cluster the
# credentials come from the EC2 instance role attached in step 1.
echo "Creating ServiceAccount"
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit-sa
  namespace: amazon-cloudwatch
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::$AWS_ACCOUNT_ID:role/$ROLE_NAME
EOF

# 3. Create the ConfigMap
# The heredoc is unquoted, so the shell expands ${REGION} and ${CLUSTER_NAME}.
echo "Creating Fluent Bit configuration"
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: amazon-cloudwatch
  labels:
    k8s-app: fluent-bit
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         5
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            cri
        Tag               kube.*
        Refresh_Interval  10
        Mem_Buf_Limit     50MB
        Skip_Long_Lines   On

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        Merge_Log_Key       log_processed
        Keep_Log            On
        K8S-Logging.Parser  On
        K8S-Logging.Exclude Off
        Annotations         Off
        Labels              Off

    [OUTPUT]
        Name                cloudwatch_logs
        Match               kube.*
        region              ${REGION}
        log_group_name      /aws/containerinsights/${CLUSTER_NAME}/application
        log_stream_prefix   ${CLUSTER_NAME}-
        auto_create_group   true
        log_retention_days  30

  parsers.conf: |
    [PARSER]
        Name        cri
        Format      regex
        Regex       ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<message>.*)$
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L%z
EOF

# 4. Deploy the DaemonSet
echo "Deploying the Fluent Bit DaemonSet"
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: amazon-cloudwatch
  labels:
    k8s-app: fluent-bit
spec:
  selector:
    matchLabels:
      k8s-app: fluent-bit
  template:
    metadata:
      labels:
        k8s-app: fluent-bit
    spec:
      serviceAccountName: fluent-bit-sa
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
      - name: fluent-bit
        image: public.ecr.aws/aws-observability/aws-for-fluent-bit:3
        imagePullPolicy: Always
        env:
        - name: REGION
          value: "${REGION}"
        - name: CLUSTER_NAME
          value: "${CLUSTER_NAME}"
        resources:
          requests:
            memory: "128Mi"
            cpu: "50m"
          limits:
            memory: "256Mi"
            cpu: "100m"
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: dockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
        - name: runlog
          mountPath: /var/run
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: dockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config
      - name: runlog
        hostPath:
          path: /var/run
      tolerations:
      - operator: Exists
EOF

echo "=========================================="
echo "Deployment complete"
echo "=========================================="
echo ""
echo "Check status:"
echo "kubectl get pods -n amazon-cloudwatch"
echo ""
echo "Check logs:"
echo "kubectl logs -n amazon-cloudwatch -l k8s-app=fluent-bit --tail=30"
```

4.2 Run the deployment

```bash
chmod +x deploy-fluentbit.sh
./deploy-fluentbit.sh
```

4.3 Verify the deployment

```bash
# Check Pod status (all 3 nodes should be Running)
kubectl get pods -n amazon-cloudwatch

# Check the Fluent Bit logs
kubectl logs -n amazon-cloudwatch -l k8s-app=fluent-bit --tail=30
```

Expected output:

- 3 Pods in the Running state
- Logs showing [ info] [output:cloudwatch_logs] and [ info] [input:tail]

5. Step 3: Create the SNS topic

5.1 Create the SNS topic

```bash
aws sns create-topic \
  --name K8s-DingTalk-Alarm \
  --region ap-east-1
```

Sample output:

```json
{
    "TopicArn": "arn:aws:sns:ap-east-1:997836553899:K8s-DingTalk-Alarm"
}
```

5.2 Record the Topic ARN

Save the TopicArn from the output; later steps need it.
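Before wiring the topic to Lambda, it helps to see the shape of what a subscriber receives. The following is a minimal sketch, not the production handler: it assumes the usual SNS-to-Lambda envelope, where the CloudWatch alarm JSON arrives as a string under `Records[0].Sns.Message`. The helper names `make_sns_event` and `extract_alarm_fields` are hypothetical and exist only for local experiments.

```python
import json


def make_sns_event(alarm: dict) -> dict:
    """Wrap an alarm payload the way SNS hands it to Lambda (hypothetical test helper)."""
    return {
        "Records": [
            {
                "EventSource": "aws:sns",
                # SNS delivers the alarm as a JSON *string*, not a nested object.
                "Sns": {"Message": json.dumps(alarm)},
            }
        ]
    }


def extract_alarm_fields(event: dict) -> dict:
    """Pull out the alarm fields this guide uses from an SNS event."""
    alarm = json.loads(event["Records"][0]["Sns"]["Message"])
    return {
        "name": alarm.get("AlarmName", "unknown"),
        "state": alarm.get("NewStateValue", "unknown"),
        "reason": alarm.get("NewStateReason", ""),
        "time": alarm.get("StateChangeTime", ""),
    }


if __name__ == "__main__":
    sample_alarm = {
        "AlarmName": "K8s-Error-DingTalk",
        "NewStateValue": "ALARM",
        "NewStateReason": "Threshold Crossed",
    }
    print(extract_alarm_fields(make_sns_event(sample_alarm)))
```

Building events locally like this makes it possible to exercise the parsing logic without publishing to the real topic.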
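6. Step 4: Create the Lambda function (DingTalk notification)

6.1 Create the Lambda in the AWS console

1. Go to Lambda → Create function
2. Select Author from scratch
3. Function name: DingTalkNotifier
4. Runtime: Python 3.9
5. Click Create function

6.2 Paste the Lambda code

```python
import json
import os

import urllib3

http = urllib3.PoolManager()

# Read the Webhook URL from an environment variable
WEBHOOK_URL = os.environ.get("WEBHOOK_URL")


def lambda_handler(event, context):
    # Parse the SNS message
    try:
        sns_message = event["Records"][0]["Sns"]["Message"]
        alarm_data = json.loads(sns_message)
    except Exception as e:
        print(f"Failed to parse message: {e}")
        return {"statusCode": 400, "body": "Invalid message"}

    # Extract the alarm details
    alarm_name = alarm_data.get("AlarmName", "Unknown alarm")
    alarm_state = alarm_data.get("NewStateValue", "Unknown")
    alarm_reason = alarm_data.get("NewStateReason", "No details")
    timestamp = alarm_data.get("StateChangeTime", "Unknown time")

    # Build the DingTalk message. It contains the keyword ERROR so that it
    # passes the robot's keyword security policy.
    text = "\n".join([
        "### K8s Log Alert",
        f"- **Alarm name**: {alarm_name}",
        f"- **State**: **{alarm_state}**",
        f"- **Time**: {timestamp}",
        f"- **Details**: {alarm_reason}",
        "",
        "Keyword: ERROR",
    ])
    dingtalk_msg = {
        "msgtype": "markdown",
        "markdown": {"title": "K8s Log Alert", "text": text},
        "at": {"isAtAll": False},
    }

    try:
        response = http.request(
            "POST",
            WEBHOOK_URL,
            body=json.dumps(dingtalk_msg).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        print(f"DingTalk response status: {response.status}")
        print(f"DingTalk response body: {response.data}")
        return {
            "statusCode": response.status,
            "body": json.dumps({"message": "Message sent to DingTalk"}),
        }
    except Exception as e:
        print(f"Send failed: {e}")
        return {"statusCode": 500, "body": str(e)}
```

Important: after pasting, you must click the "Deploy" button to save.

6.3 Configure the environment variable

1. Lambda → Configuration → Environment variables
2. Add an environment variable:
   - Key: WEBHOOK_URL
   - Value: your DingTalk robot Webhook URL

6.4 Add the SNS trigger

1. Click Add trigger
2. Select "SNS"
3. Select the existing topic K8s-DingTalk-Alarm
4. Click Add

7. Step 5: Create the alarm rule

7.1 Create the alarm script

Create create-alarm.sh:

```bash
#!/bin/bash
# create-alarm.sh - create the ERROR log alarm

LOG_GROUP="/aws/containerinsights/k8s/application"
REGION="ap-east-1"
SNS_TOPIC_ARN="arn:aws:sns:ap-east-1:997836553899:K8s-DingTalk-Alarm"

echo "=========================================="
echo "Creating the CloudWatch log alarm"
echo "=========================================="

# 1. Create the metric filter
echo "Creating the ERROR metric filter..."
aws logs put-metric-filter \
  --log-group-name "$LOG_GROUP" \
  --filter-name "ErrorCount" \
  --filter-pattern "ERROR" \
  --metric-transformations metricName=ErrorCount,metricNamespace=K8sLogs,metricValue=1 \
  --region "$REGION"

if [ $? -eq 0 ]; then
  echo "✅ Metric filter created"
else
  echo "❌ Failed to create metric filter"
fi

# 2. Create the alarm
echo "Creating the alarm rule..."
aws cloudwatch put-metric-alarm \
  --alarm-name "K8s-Error-DingTalk" \
  --alarm-description "Trigger a DingTalk alert when more than 5 ERROR log lines occur within 5 minutes" \
  --metric-name "ErrorCount" \
  --namespace "K8sLogs" \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions "$SNS_TOPIC_ARN" \
  --region "$REGION"

if [ $? -eq 0 ]; then
  echo "✅ Alarm rule created"
else
  echo "❌ Failed to create alarm rule"
fi

echo "=========================================="
echo "Alarm configuration complete"
echo "=========================================="
```

Because the comparison operator is GreaterThanThreshold, exactly 5 ERROR lines in a window do not fire the alarm; it takes at least 6.

7.2 Run the alarm creation

```bash
chmod +x create-alarm.sh
./create-alarm.sh
```

7.3 Verify the alarm

```bash
# Verify the metric filter
aws logs describe-metric-filters \
  --log-group-name "/aws/containerinsights/k8s/application" \
  --filter-name-prefix "ErrorCount" \
  --region ap-east-1

# Verify the alarm
aws cloudwatch describe-alarms \
  --alarm-names "K8s-Error-DingTalk" \
  --region ap-east-1
```

8. Step 6: Test the alarm

8.1 Generate ERROR logs

```bash
# Generate 6 ERROR log lines (above the threshold of 5)
for i in {1..6}; do
  kubectl run test-error-$i --image=busybox --restart=Never --rm -it -- \
    sh -c "echo 'ERROR: test alarm message #$i'" 2>/dev/null || true
  sleep 2
done
```

8.2 Verify the metric data

```bash
aws cloudwatch get-metric-statistics \
  --namespace "K8sLogs" \
  --metric-name ErrorCount \
  --start-time "$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Sum \
  --region ap-east-1
```

Expected output: a Sum value greater than 0.

8.3 Verify that DingTalk receives the alert

After 3-5 minutes, the DingTalk group should receive a message like this:

```text
K8s Log Alert
- Alarm name: K8s-Error-DingTalk
- State: ALARM
- Time: 2026-07-01T03:12:10.404+0000
- Details: Threshold Crossed: 1 datapoint [26.0 ...] was greater than the threshold (5.0).

Keyword: ERROR
```

9. Logs Insights query guide

9.1 List all log streams (to see which services exist)

```text
SOURCE '/aws/containerinsights/k8s/application'
| stats count(*) as log_count by @logStream
| sort log_count desc
| limit 50
```

9.2 View the logs of a specific service

```text
SOURCE '/aws/containerinsights/k8s/application'
| filter @logStream like /gateway/
| fields @timestamp, @message
| sort @timestamp desc
| limit 100
```

9.3 View ERROR logs

```text
SOURCE '/aws/containerinsights/k8s/application'
| filter @message like /ERROR/
| fields @timestamp, @message, @logStream
| sort @timestamp desc
| limit 100
```

9.4 View the logs of a specific namespace

```text
SOURCE '/aws/containerinsights/k8s/application'
| filter @logStream like /k8s/
| fields @timestamp, @message, @logStream
| sort @timestamp desc
| limit 100
```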
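The ERROR query from the Logs Insights guide can also be run from a script instead of the console. The sketch below is one possible way to do that, under a few assumptions: boto3 is installed and has credentials, the log group is the one used throughout this guide, and the fields are the standard `@timestamp`, `@message`, and `@logStream` system fields. `build_error_query` and `run_query` are hypothetical helper names; the log group is passed through the API parameter, so no SOURCE clause is needed.

```python
import time

LOG_GROUP = "/aws/containerinsights/k8s/application"  # log group used in this guide


def build_error_query(keyword: str = "ERROR", limit: int = 100) -> str:
    """Build a Logs Insights query that lists recent lines containing keyword."""
    return (
        "fields @timestamp, @message, @logStream"
        f" | filter @message like /{keyword}/"
        " | sort @timestamp desc"
        f" | limit {limit}"
    )


def run_query(query: str, minutes: int = 60, region: str = "ap-east-1") -> list:
    """Run the query over the last `minutes` minutes and return the result rows."""
    import boto3  # imported lazily so build_error_query works without boto3

    logs = boto3.client("logs", region_name=region)
    end = int(time.time())
    query_id = logs.start_query(
        logGroupName=LOG_GROUP,
        startTime=end - minutes * 60,
        endTime=end,
        queryString=query,
    )["queryId"]

    # Poll until the query leaves the Scheduled/Running states.
    while True:
        result = logs.get_query_results(queryId=query_id)
        if result["status"] not in ("Scheduled", "Running"):
            return result["results"]
        time.sleep(1)


if __name__ == "__main__":
    print(build_error_query())
```

Calling `run_query(build_error_query())` returns a list of rows, each row being a list of `{"field": ..., "value": ...}` pairs.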
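10. Common issues and solutions

10.1 IAM permission issues

| Error message | Solution |
|---------------|----------|
| User is not authorized to perform: SNS:CreateTopic | Attach the AmazonSNSFullAccess policy to the user |
| User is not authorized to perform: logs:PutMetricFilter | Attach the CloudWatchLogsFullAccess policy |
| User is not authorized to perform: cloudwatch:PutMetricAlarm | Attach the CloudWatchFullAccess policy |
| User is not authorized to perform: cloudwatch:DescribeAlarms | Attach CloudWatchFullAccess or grant the cloudwatch:DescribeAlarms permission |

10.2 Fluent Bit deployment issues

| Problem | Solution |
|---------|----------|
| ErrImagePull / ImagePullBackOff | Use public.ecr.aws/aws-observability/aws-for-fluent-bit:3 instead of amazon/cloudwatch-agent:1.300042.0 |
| could not allocate key value pair | Fluent Bit 5.x is stricter about configuration syntax; use a simplified configuration without @INCLUDE sub-configuration files |
| could not get meta for POD | Seen as a warning on the control-plane node; it does not affect workload Pods and can be ignored. If it also appears for workload Pods, check whether fluent-bit-sa is allowed to read Pods through RBAC, since the deployment script in 4.1 does not create a ClusterRole |

10.3 CloudWatch query issues

| Error message | Solution |
|---------------|----------|
| MalformedQueryException | Check the Logs Insights syntax; use filter @logStream like /xxx/ rather than filter @logStream xxx |
| A log group must be selected | Select the correct log group in Logs Insights |
| The kubernetes.container_name field is empty | Use @logStream instead; the log stream name already contains the Pod and container information |

10.4 DingTalk alert issues

| Problem | Solution |
|---------|----------|
| DingTalk receives no message | 1. Check whether the Lambda was triggered (look at the invocation count). 2. Check that the Lambda environment variable WEBHOOK_URL is correct. 3. Check that the DingTalk robot's keyword setting includes ERROR |
| Lambda is triggered but DingTalk shows nothing | Check the Lambda's CloudWatch logs for urllib3 errors |
| Alarm does not fire | 1. Check that the metric filter was created. 2. Wait 3-5 minutes for the data to aggregate. 3. Inspect the metric data with get-metric-statistics |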
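For the "Lambda is triggered but DingTalk shows nothing" case, one thing worth checking is the response body, not just the HTTP status. The sketch below assumes that the DingTalk robot API reports application-level failures through an `errcode` field in an otherwise successful HTTP 200 response, with `0` meaning success; the rejected-keyword body in the usage example is illustrative rather than captured from this deployment. `dingtalk_delivery_ok` is a hypothetical helper, not part of the Lambda code above.

```python
import json


def dingtalk_delivery_ok(status: int, body: bytes) -> bool:
    """Return True only if the HTTP call and the DingTalk-level result both succeeded."""
    if status != 200:
        return False
    try:
        payload = json.loads(body.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        # A non-JSON body means we cannot confirm delivery.
        return False
    # Assumption: errcode == 0 signals that the robot accepted the message.
    return isinstance(payload, dict) and payload.get("errcode") == 0


if __name__ == "__main__":
    accepted = b'{"errcode":0,"errmsg":"ok"}'
    rejected = b'{"errcode":310000,"errmsg":"keywords not in content"}'  # illustrative
    print(dingtalk_delivery_ok(200, accepted))
    print(dingtalk_delivery_ok(200, rejected))
```

In the handler, a check like this could be applied to `response.status` and `response.data` so that a rejected message is logged as a failure instead of being reported as sent.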
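11. Appendix

11.1 Variable summary

| Variable | Value |
|----------|-------|
| AWS_ACCOUNT_ID | 997836553899 |
| AWS_REGION | ap-east-1 |
| CLUSTER_NAME | k8s |
| LOG_GROUP | /aws/containerinsights/k8s/application |
| SNS_TOPIC_ARN | arn:aws:sns:ap-east-1:997836553899:K8s-DingTalk-Alarm |
| IAM_ROLE | CloudWatchAgent-Role |

11.2 Key command quick reference

```bash
# List the Fluent Bit Pods
kubectl get pods -n amazon-cloudwatch

# View the Fluent Bit logs
kubectl logs -n amazon-cloudwatch -l k8s-app=fluent-bit --tail=30

# Restart Fluent Bit
kubectl rollout restart daemonset fluent-bit -n amazon-cloudwatch

# Check the alarm state
aws cloudwatch describe-alarms --alarm-names "K8s-Error-DingTalk" --region ap-east-1

# Check the metric data
aws cloudwatch get-metric-statistics \
  --namespace "K8sLogs" \
  --metric-name ErrorCount \
  --start-time "$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Sum \
  --region ap-east-1

# List the log streams
aws logs describe-log-streams \
  --log-group-name "/aws/containerinsights/k8s/application" \
  --region ap-east-1 \
  --max-items 10
```

11.3 Related documentation

- CloudWatch Logs documentation
- Fluent Bit Kubernetes filter
- CloudWatch Logs Insights query syntax
- DingTalk custom robot documentation

12. Version history

| Version | Date | Author | Change notes |
|---------|------|--------|--------------|
| v1.0 | 2026-07-01 | - | Initial version, compiled from hands-on deployment experience |

End of document. If anything is unclear or missing, feel free to point it out.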