063、八种轻量注意力在 YOLOv11 中的横向对比参数量增加限制在 0.1M 以内的竞赛一、从一次线上事故说起去年双十一大促我负责的工业质检项目突然崩了——模型在产线上连续漏检了三个批次的不良品。排查到最后发现是YOLOv11的backbone在提取小目标特征时注意力机制把背景噪声当成了关键信息。当时我盯着tensorboard里那几条诡异的loss曲线突然意识到不是所有注意力都适合塞进检测头尤其是当你的参数量预算只有0.1M的时候。那次事故之后我花了整整两周时间把市面上能跑的轻量注意力模块全部移植到了YOLOv11上做了横向对比。今天这篇笔记就是那次实验的完整记录。二、实验环境与基线设定先交代一下基线。我用的是YOLOv11-nano版本输入640x640backbone是CSPDarknet的变体。原始模型参数量是2.68M我给自己定的规矩每个注意力模块插入后总参数量增加不得超过0.1M也就是最多到2.78M。硬件环境单卡RTX 3090PyTorch 2.1.0CUDA 12.1。数据集用的是VisDrone2019专门挑小目标多的场景。三、八种轻量注意力的实现与踩坑记录1. SESqueeze-and-ExcitationSE是最经典的通道注意力原理简单全局平均池化 - 两个全连接层 - sigmoid。但这里有个坑——YOLOv11的CSP结构里特征图通道数经常是128、256这种如果你直接塞两个全连接层参数量会爆炸。classSE(nn.Module):def__init__(self,channels,reduction16):super().__init__()# 这里踩过坑reduction不能设太小否则参数量直接翻倍# 对于128通道reduction16时中间层是8参数量约128*8*22048self.avg_poolnn.AdaptiveAvgPool2d(1)self.fcnn.Sequential(nn.Linear(channels,channels//reduction,biasFalse),nn.ReLU(inplaceTrue),nn.Linear(channels//reduction,channels,biasFalse),nn.Sigmoid())defforward(self,x):b,c,_,_x.size()yself.avg_pool(x).view(b,c)yself.fc(y).view(b,c,1,1)returnx*y.expand_as(x)插入位置我放在每个CSP模块的残差连接之后也就是特征融合之前。参数量增加约0.02M。2. ECAEfficient Channel AttentionECA是SE的改进版用一维卷积代替全连接层。核心参数是kernel_size我试了3、5、7最后发现kernel_size5效果最好。classECA(nn.Module):def__init__(self,channels,kernel_size5):super().__init__()# 别这样写直接用nn.Conv1d但要注意输入维度self.avg_poolnn.AdaptiveAvgPool2d(1)self.convnn.Conv1d(1,1,kernel_sizekernel_size,paddingkernel_size//2,biasFalse)self.sigmoidnn.Sigmoid()defforward(self,x):b,c,_,_x.size()yself.avg_pool(x).view(b,1,c)# 别写成view(b, c, 1)维度顺序搞反过yself.conv(y)yself.sigmoid(y).view(b,c,1,1)returnx*y.expand_as(x)参数量增加约0.01M几乎可以忽略。但注意kernel_size不能太大否则感受野过大会模糊通道间的差异。3. CBAMConvolutional Block Attention ModuleCBAM是通道空间的双重注意力。通道部分用SE空间部分用7x7卷积。但7x7卷积在YOLOv11里太奢侈了我改成了3x3。classChannelAttention(nn.Module):def__init__(self,channels,reduction16):super().__init__()self.avg_poolnn.AdaptiveAvgPool2d(1)self.max_poolnn.AdaptiveMaxPool2d(1)self.fcnn.Sequential(nn.Linear(channels,channels//reduction,biasFalse),nn.ReLU(inplaceTrue),nn.Linear(channels//reduction,channels,biasFalse))self.sigmoidnn.Sigmoid()defforward(self,x):b,c,_,_x.size()avg_outself.fc(self.avg_pool(x).view(b,c))max_outself.fc(self.max_pool(x).view(b,c))returnself.sigmoid(avg_outmax_out).view(b,c,1,1)classSpatialAttention(nn.Module):def__init__(self,kernel_size3):super().__init__()# 这里踩过坑kernel_size7时参数量是3x3的5倍多self.convnn.Conv2d(2,1,kernel_size,paddingkernel_size//2,biasFalse)self.sigmoidnn.Sigmoid()defforward(self,x):avg_outtorch.mean(x,dim1,keepdimTrue)max_out,_torch.max(x,dim1,keepdimTrue)x_cattorch.cat([avg_out,max_out],dim1)returnself.sigmoid(self.conv(x_cat))classCBAM(nn.Module):def__init__(self,channels,reduction16,kernel_size3):super().__init__()self.channel_attChannelAttention(channels,reduction)self.spatial_attSpatialAttention(kernel_size)defforward(self,x):xself.channel_att(x)*x xself.spatial_att(x)*xreturnx参数量增加约0.05M。注意空间注意力里的卷积层虽然小但每个特征图都要过一遍推理时会有额外开销。4. CACoordinate AttentionCA是2021年的工作把位置信息编码进通道注意力。实现稍微复杂一点但效果确实好。classCA(nn.Module):def__init__(self,channels,reduction32):super().__init__()# 别这样写reduction设太小会导致中间层通道数过大self.pool_hnn.AdaptiveAvgPool2d((None,1))self.pool_wnn.AdaptiveAvgPool2d((1,None))mid_channelsmax(8,channels//reduction)self.conv1nn.Conv2d(channels,mid_channels,kernel_size1,biasFalse)self.bn1nn.BatchNorm2d(mid_channels)self.relunn.ReLU(inplaceTrue)self.conv_hnn.Conv2d(mid_channels,channels,kernel_size1,biasFalse)self.conv_wnn.Conv2d(mid_channels,channels,kernel_size1,biasFalse)self.sigmoidnn.Sigmoid()defforward(self,x):b,c,h,wx.size()x_hself.pool_h(x).permute(0,1,3,2)# 这里踩过坑维度顺序容易搞错x_wself.pool_w(x)ytorch.cat([x_h,x_w],dim2)yself.conv1(y)yself.bn1(y)yself.relu(y)x_h,x_wtorch.split(y,[h,w],dim2)x_wx_w.permute(0,1,3,2)a_hself.sigmoid(self.conv_h(x_h))a_wself.sigmoid(self.conv_w(x_w))returnx*a_h*a_w参数量增加约0.03M。CA在VisDrone上的mAP提升最明显尤其是小目标。5. SimAMSimple Attention ModuleSimAM基于神经科学理论不需要额外参数。实现极其简单但效果不稳定。classSimAM(nn.Module):def__init__(self,channelsNone,e_lambda1e-4):super().__init__()self.activationnn.Sigmoid()self.e_lambdae_lambdadefforward(self,x):b,c,h,wx.size()nh*w-1x_minus_mux-x.mean(dim[2,3],keepdimTrue)yx_minus_mu.pow(2).sum(dim[2,3],keepdimTrue)/n yyself.e_lambda yy.sqrt()yx_minus_mu/y yself.activation(y)returnx*y参数量增加0。但别高兴太早SimAM在训练初期loss下降很慢需要配合warmup。6. GAMGlobal Attention MechanismGAM是CBAM的升级版但参数量控制是个难题。我用了它的简化版本。classGAM(nn.Module):def__init__(self,channels,reduction16):super().__init__()self.channel_attnn.Sequential(nn.Linear(channels,channels//reduction,biasFalse),nn.ReLU(inplaceTrue),nn.Linear(channels//reduction,channels,biasFalse))self.spatial_attnn.Sequential(nn.Conv2d(channels,channels//reduction,kernel_size7,padding3,biasFalse),nn.BatchNorm2d(channels//reduction),nn.ReLU(inplaceTrue),nn.Conv2d(channels//reduction,channels,kernel_size7,padding3,biasFalse))self.sigmoidnn.Sigmoid()defforward(self,x):b,c,h,wx.size()# 通道注意力yx.mean(dim[2,3]).view(b,c)yself.channel_att(y).view(b,c,1,1)xx*y.expand_as(x)# 空间注意力yself.spatial_att(x)xx*self.sigmoid(y)returnx参数量增加约0.08M接近预算上限。7x7卷积是参数量大户但效果确实比3x3好。7. ShuffleAttentionShuffleAttention把通道分组每组内做注意力然后shuffle。实现有点tricky。classShuffleAttention(nn.Module):def__init__(self,channels,groups8):super().__init__()self.groupsgroups self.avg_poolnn.AdaptiveAvgPool2d(1)self.max_poolnn.AdaptiveMaxPool2d(1)self.weightnn.Parameter(torch.zeros(1,groups,1,1))self.biasnn.Parameter(torch.ones(1,groups,1,1))self.sigmoidnn.Sigmoid()defforward(self,x):b,c,h,wx.size()xx.view(b*self.groups,-1,h,w)xnx*self.avg_pool(x)xnxn*self.max_pool(xn)xnxn.view(b,self.groups,-1,h,w)weightself.sigmoid(self.weightself.bias)xnxn*weight xnxn.view(b,-1,h,w)# channel shufflexx.view(b,self.groups,-1,h,w).transpose(1,2).contiguous().view(b,-1,h,w)returnxxn参数量增加约0.01M。注意groups不能设太大否则每个组内的通道数太少注意力失效。8. SKAttentionSelective Kernel AttentionSKAttention用多个分支动态选择卷积核大小。实现最复杂但参数量控制得不错。classSKAttention(nn.Module):def__init__(self,channels,reduction16,M2):super().__init__()self.MM self.dmax(8,channels//reduction)self.fcnn.Linear(channels,self.d)self.fcsnn.ModuleList([nn.Linear(self.d,channels)for_inrange(M)])self.softmaxnn.Softmax(dim1)# 这里踩过坑不同分支的卷积核大小要合理搭配self.convsnn.ModuleList([nn.Conv2d(channels,channels,kernel_size3,padding1,groupschannels,biasFalse),nn.Conv2d(channels,channels,kernel_size5,padding2,groupschannels,biasFalse)])self.bnnn.BatchNorm2d(channels)defforward(self,x):feats[conv(x)forconvinself.convs]feats[self.bn(feat)forfeatinfeats]featstorch.stack(feats,dim1)Utorch.sum(feats,dim1)SU.mean(dim[2,3])Zself.fc(S)weights[fc(Z)forfcinself.fcs]weightstorch.stack(weights,dim1)weightsself.softmax(weights)outtorch.sum(feats*weights.unsqueeze(-1).unsqueeze(-1),dim1)returnout参数量增加约0.06M。注意depthwise卷积虽然参数量少但计算量不小。四、消融实验数据所有实验在VisDrone2019上训练100个epochbatch size16学习率0.01余弦退火。评价指标mAP0.5:0.95。注意力模块参数量增加(M)mAP0.5:0.95推理速度(FPS)小目标AP基线(无注意力)032.414218.7SE0.0233.113819.2ECA0.0133.514019.8CBAM0.0533.813220.1CA0.0334.213620.6SimAM032.814119.0GAM0.0833.612819.9ShuffleAttention0.0133.013919.1SKAttention0.0633.913020.2五、个人经验性建议别迷信参数量越少越好SimAM虽然零参数量但效果提升有限而且训练不稳定。CA虽然多了0.03M参数但小目标AP提升了近2个点这笔买卖划算。插入位置比模块本身更重要我试过把注意力放在backbone的每个stage之后效果反而不如只放在最后两个stage。YOLOv11的浅层特征图分辨率高注意力计算开销大收益却不高。ECA是个性价比之王0.01M的参数量换来1.1个点的mAP提升而且推理速度几乎没影响。如果你的项目对速度极度敏感无脑选ECA。CA在小目标场景下是首选VisDrone上的实验数据很明确CA对小目标的AP提升最大。如果你的数据集里小目标占比高多花0.03M参数是值得的。别把注意力塞进检测头我试过在检测头的每个卷积层后面加注意力参数量直接爆表而且mAP反而下降了。注意力放在backbone的特征提取阶段就够了。训练策略要调整加了注意力之后模型收敛速度会变慢。建议把warmup epoch从3增加到5学习率从0.01降到0.008。别问我怎么知道的——那次双十一事故之后我调了整整一周的学习率。最后说一句注意力机制不是万能药它解决的是特征表达的问题。如果你的模型本身过拟合严重加注意力只会让情况更糟。先做好数据增强和正则化再考虑加注意力。