Model Evaluation: Plotting a Lift Chart

Time: 2025/9/20 3:57:09 Source: https://blog.csdn.net/weixin_37522117/article/details/141716316

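A lift chart is a graphical tool for evaluating the performance of classification models such as logistic regression. It shows how the model performs at different thresholds, in particular how well it predicts the positive class (usually the minority class).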


⭐️ What is a lift chart?

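Lift:
Lift measures how much more accurately the model predicts the positive class (for example a disease or a purchase) than random guessing. It is usually computed as the ratio of the model's precision to the baseline precision, where the baseline precision is the probability of hitting a positive without using the model, that is, the overall positive rate.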
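As a quick worked example of this ratio, here is a minimal sketch. The function name lift and the customer numbers are made up for illustration and do not come from the original post.

```python
def lift(model_precision, baseline_precision):
    """Lift = precision achieved with the model / precision without it."""
    if baseline_precision <= 0:
        raise ValueError("baseline precision must be positive")
    return model_precision / baseline_precision


# Hypothetical numbers: 1000 customers, 100 of whom buy (baseline 10%).
# Suppose the model's 200 highest-scoring customers contain 60 buyers (30%).
baseline = 100 / 1000
precision_top = 60 / 200
top_lift = lift(precision_top, baseline)
print(f"lift = {top_lift:.1f}")
```

A lift of about 3 here would mean that contacting the model's top 200 customers finds roughly three times as many buyers as contacting 200 customers at random.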
Building a lift chart:

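First, sort the samples by predicted probability in descending order.
Then, split the data set into equal-sized bins (for example 10 or 100 groups).
For each bin, compute the proportion of samples that are actually positive (the positive rate within the bin) and divide it by the baseline precision to obtain the bin's lift.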
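The three steps above can be sketched with pandas as follows. This is only an illustration under assumptions: the helper name lift_by_bin, the position-based binning, and the synthetic scores are not from the original post.

```python
import numpy as np
import pandas as pd


def lift_by_bin(y_true, y_prob, n_bins=10):
    """Per-bin positive rate and lift, with bin 1 holding the highest scores."""
    df = pd.DataFrame({"y_true": y_true, "y_prob": y_prob})
    # Step 1: sort by predicted probability, highest first
    df = df.sort_values("y_prob", ascending=False).reset_index(drop=True)
    # Step 2: assign equal-sized bins by row position (1 = top scores)
    df["bin"] = np.arange(len(df)) * n_bins // len(df) + 1
    # Step 3: positive rate per bin, divided by the overall positive rate
    table = df.groupby("bin")["y_true"].agg(pos="sum", count="count").reset_index()
    table["pos_rate"] = table["pos"] / table["count"]
    baseline = df["y_true"].mean()
    table["lift"] = table["pos_rate"] / baseline
    return table


# Synthetic data (an assumption for illustration): positives tend to score higher
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_prob = 0.3 * y_true + 0.7 * rng.random(200)

table = lift_by_bin(y_true, y_prob)
print(table)
```

Because every bin has the same size, the average of the per-bin lifts is 1, so bins above 1 are always balanced by bins below 1.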

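Reading a lift chart:

The X axis usually shows the cumulative share of samples, and the Y axis shows the cumulative lift.
Ideally, the curve shows a pronounced lift in the first few bins (high Y values) and then gradually approaches 1, since the cumulative lift over 100% of the samples is 1 by definition. This indicates that the model ranks most of the positive samples near the top.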
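To see why the curve ends at 1, the cumulative lift can be computed directly. The sketch below is a hedged illustration: cumulative_lift is a hypothetical helper, and the 20 labels and scores are toy values chosen so the numbers are easy to check by hand.

```python
import numpy as np


def cumulative_lift(y_true, y_prob, n_points=10):
    """Cumulative lift at evenly spaced population fractions (10%, 20%, ...)."""
    y_true = np.asarray(y_true)
    # Rank samples by descending predicted probability
    order = np.argsort(-np.asarray(y_prob), kind="stable")
    sorted_true = y_true[order]
    baseline = y_true.mean()  # overall positive rate
    fractions = np.arange(1, n_points + 1) / n_points
    lifts = []
    for frac in fractions:
        n_top = max(1, int(round(frac * len(y_true))))
        # Positive rate among the top n_top samples, relative to the baseline
        lifts.append(sorted_true[:n_top].mean() / baseline)
    return fractions, np.array(lifts)


# Toy data: 20 samples, 5 positives (baseline 0.25), most of them ranked near the top
y_true = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_prob = np.linspace(0.95, 0.05, 20)

fractions, lifts = cumulative_lift(y_true, y_prob)
for frac, value in zip(fractions, lifts):
    print(f"top {frac:.0%}: cumulative lift = {value:.2f}")
```

In this toy data the top 10% contains only positives, so the curve starts at 1 / 0.25 = 4, and it falls to 1 once all samples are included.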

⭐️ Why a lift chart for logistic regression matters

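A lift chart for logistic regression can be used to evaluate the model's performance. It is particularly useful when the target classes are imbalanced, because it shows how well the model identifies the minority class.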

The chart helps us understand how the model behaves at different thresholds, especially its ability to identify the positive class.

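It shows how well the model identifies the target class (for example the positive class) in each quantile of the model output, where the quantiles are usually obtained by binning the predicted probabilities.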

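⭐️ A lift chart is generally drawn as follows:

Model training and prediction:
- Train logistic regression or another binary classifier on the data.
- Obtain the predicted probabilities on the test or validation set.
Population segmentation:
- Sort the samples by predicted probability from high to low.
- Split the samples into bins of a fixed share (for example 10 bins, each holding 10% of the samples).
- In each bin, count the samples that actually belong to the target class and divide by the bin size. This proportion is the height of the bar.
Computing cumulative coverage:
- Cumulative Coverage of Yes: the share of all target-class samples captured up to and including the current bin. It never decreases as bins accumulate and reaches 100% at the last bin.
- Correct in Confidence Segment: the precision within the bin, that is, the share of its samples that actually belong to the target class.
Drawing the chart:
- X axis: the cumulative percentage of samples (from 0% to 100%).
- Y axis: the share of the target class (precision) on the left and the cumulative coverage on the right.
- Bars: each bar shows the precision of one bin, that is, the proportion of samples in that bin that belong to the target class.
- Line: the cumulative coverage, showing how much of the target class the model has found as bins accumulate.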
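The two plotted quantities can be checked by hand on a tiny table. The counts below are hypothetical (5 bins of 20 samples), and the column names precision and cum_perc are chosen to match the full script in this post.

```python
import pandas as pd

# Hypothetical per-bin counts for 5 bins of 20 samples each (bin 1 = highest scores)
lift_table = pd.DataFrame({
    "bin": [1, 2, 3, 4, 5],
    "pos": [16, 10, 8, 4, 2],        # actual positives in each bin
    "count": [20, 20, 20, 20, 20],   # samples in each bin
})

# Bar height: share of actual positives inside each bin
lift_table["precision"] = lift_table["pos"] / lift_table["count"]
# Line: share of all positives captured up to and including each bin
lift_table["cum_perc"] = lift_table["pos"].cumsum() / lift_table["pos"].sum()
print(lift_table)
```

With 40 positives in total, the first bin alone covers 16 / 40 = 40% of them and the first two bins cover 26 / 40 = 65%, which is the kind of early rise a useful model should show.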

This post plots the Lift Chart for several classification models. The code is as follows:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Models to compare; SVC needs probability=True to provide predict_proba
model_name_dict = {
    'LogisticRegression': LogisticRegression(),
    'DecisionTreeClassifier': DecisionTreeClassifier(),
    'RandomForestClassifier': RandomForestClassifier(),
    'GaussianNB': GaussianNB(),
    'KNeighborsClassifier': KNeighborsClassifier(),
    'MLPClassifier': MLPClassifier(),
    'SVC': SVC(probability=True),
}


# Draw the lift chart
def draw_lift_chart(lift_table, model_name):
    fig, ax1 = plt.subplots(figsize=(10, 6))
    ax2 = ax1.twinx()
    ax1.bar(lift_table['bin'], lift_table['precision'], color='lightgreen', width=0.4,
            label='Correct in Confidence Segment')
    ax2.plot(lift_table['bin'], lift_table['cum_perc'], color='blue', marker='o',
             label='Cumulative Coverage of Yes')
    ax1.set_xlabel('Population %')
    ax1.set_ylabel('Target %')
    ax2.set_ylabel('Cumulative Coverage %')
    ax1.set_xticks(lift_table['bin'])
    ax1.set_xticklabels([f'{i * 10}%' for i in lift_table['bin']])
    plt.title(f'{model_name} - Lift Chart')
    fig.tight_layout()
    # Merge the legend entries of both axes; plt.legend() alone only shows the ax2 line
    handles1, labels1 = ax1.get_legend_handles_labels()
    handles2, labels2 = ax2.get_legend_handles_labels()
    ax2.legend(handles1 + handles2, labels1 + labels2, loc='center right')
    # plt.show()
    plt.savefig(f'./{model_name}_lift_chart.png', dpi=800)
    plt.close(fig)  # release the figure before drawing the next model


if __name__ == '__main__':
    # Prepare the data (binary classification)
    X, y = datasets.load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    for model_name, model in model_name_dict.items():
        # Train the model
        model.fit(X_train, y_train)
        # Predict the probability of the positive class
        y_probs = model.predict_proba(X_test)[:, 1]
        # Combine probabilities and labels
        true_probs = pd.DataFrame({'y_true': y_test, 'y_prob': y_probs})
        # Sort by predicted probability, highest first
        true_probs = true_probs.sort_values(by='y_prob', ascending=False).reset_index(drop=True)
        # Split the ranked rows into 10 bins (bin 1 = highest probabilities)
        idx = true_probs.index.values
        percentiles = np.percentile(idx.tolist() + [true_probs.shape[0]], np.arange(0, 101, 10))
        bins = np.digitize(idx, percentiles)
        true_probs['bin'] = bins
        # Compute the precision of each bin
        lift_table = true_probs.groupby('bin').agg({
            'y_true': ['sum', 'count'],  # number of positives and number of samples
            'y_prob': 'mean'  # mean predicted probability
        }).reset_index()
        lift_table.columns = ['bin', 'pos', 'count', 'mean_prob']
        lift_table['precision'] = lift_table['pos'] / lift_table['count']
        lift_table['cum_pos'] = lift_table['pos'].cumsum()
        lift_table['cum_perc'] = lift_table['cum_pos'] / lift_table['pos'].sum()
        draw_lift_chart(lift_table, model_name)

Running the script saves one lift chart per model as ./{model_name}_lift_chart.png in the working directory.

The author's knowledge is limited. If anything here is wrong, corrections in the comments are welcome!