Scikit-learn 1.5.0 Heart Disease Prediction in Practice: Tuning 5 Classification Algorithms and Model Ensembling Strategies

📅 2026/7/5 2:39:01
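1. Data Preprocessing and Feature Engineering

Before modeling, the heart disease dataset needs thorough preprocessing. High-quality preprocessing often improves model performance noticeably, and is frequently more effective than tuning model parameters alone.

Key data-cleaning steps

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Load the dataset
data = pd.read_csv("heart_disease.csv")

# Imputers for missing values
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

# Separate numeric and categorical features
numeric_features = ["age", "trestbps", "chol", "thalach", "oldpeak"]
categorical_features = ["sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal"]

# Build the preprocessing pipeline: impute first, then scale or encode
preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("impute", num_imputer),
                          ("scale", StandardScaler())]), numeric_features),
        ("cat", Pipeline([("impute", cat_imputer),
                          ("onehot", OneHotEncoder(handle_unknown="ignore",
                                                   sparse_output=False))]),
         categorical_features),
    ]
)

X = data.drop("target", axis=1)
y = data["target"]

# Apply preprocessing
X_processed = preprocessor.fit_transform(X)
```

Fitting the preprocessor on the full dataset, as done here for brevity, lets the imputation and scaling statistics see the validation folds during the cross-validation runs in later sections. Chaining the preprocessor and the classifier in a single Pipeline avoids this leakage.

Feature selection tips

Use a random forest to estimate feature importance:

```python
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_processed, y)

# Get feature importances
importances = rf.feature_importances_
# Names of the transformed columns, prefixed with "num__" or "cat__"
features = preprocessor.get_feature_names_out()

plt.figure(figsize=(12, 6))
plt.barh(features, importances)
plt.title("Feature Importances")
plt.show()
```

Tip: in real projects, keep the features whose importance is above the mean, or use SelectFromModel to select them automatically.

2. KNN Tuning in Practice

KNN is simple but performs respectably. Its performance depends heavily on parameter choice and on the distance metric.

Grid search for the best parameters

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_neighbors": range(3, 21),
    "weights": ["uniform", "distance"],
    "p": [1, 2],  # 1: Manhattan distance, 2: Euclidean distance
}

knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring="f1", n_jobs=-1)
grid_search.fit(X_processed, y)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best F1 score: {grid_search.best_score_:.3f}")
```

KNN optimization tips

- For high-dimensional data, consider cosine similarity instead of Euclidean distance.
- Use a KD tree or Ball Tree to speed up the neighbor search.
- For imbalanced data, use a weighted voting strategy.

3. SVM Parameter Optimization and Kernel Tricks

SVMs perform well on small-sample, high-dimensional data, but parameter choice is critical.

SVM tuning in practice

```python
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform

param_dist = {
    "C": loguniform(1e-2, 1e3),
    "gamma": loguniform(1e-4, 1e1),
    "kernel": ["linear", "rbf", "poly"],
}

# probability=True is needed later for soft voting and predict_proba
svm = SVC(probability=True)
random_search = RandomizedSearchCV(
    svm, param_dist, n_iter=50, cv=5,
    scoring="f1", random_state=42, n_jobs=-1
)
random_search.fit(X_processed, y)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best F1 score: {random_search.best_score_:.3f}")
```

When to use each kernel

| Kernel | Typical use case | Time complexity (rough) | Parameters |
|---|---|---|---|
| Linear | Many features, few samples | O(n) | 1 (C) |
| RBF | Not linearly separable | O(n²) | 2 (C, gamma) |
| Polynomial | Feature interactions | O(n^d) | 3 (C, gamma, degree) |

Treat the complexity column as a rough ordering only: scikit-learn's SVC is based on libsvm, and its fit time scales at least quadratically with the number of samples whichever kernel is used.

4. Advanced Tuning of Decision Trees and Random Forests

Tree-based models are intuitive but overfit easily, so they need careful tuning.

Decision tree tuning strategy

```python
from sklearn.tree import DecisionTreeClassifier

dt_params = {
    "max_depth": [None, 5, 10, 15],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2", None],
}

dt = DecisionTreeClassifier()
grid_dt = GridSearchCV(dt, dt_params, cv=5, scoring="f1", n_jobs=-1)
grid_dt.fit(X_processed, y)
```

Random forest optimization

```python
from sklearn.ensemble import RandomForestClassifier

rf_params = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
    "bootstrap": [True, False],
}

rf = RandomForestClassifier()
grid_rf = GridSearchCV(rf, rf_params, cv=5, scoring="f1", n_jobs=-1)
grid_rf.fit(X_processed, y)
```

Decision tree pruning techniques compared

- Pre-pruning: limit tree growth with parameters such as max_depth and min_samples_split.
- Post-pruning: train the full tree, then prune it with the ccp_alpha parameter.
- Cost-complexity pruning: use cost_complexity_pruning_path to find the best pruning point.

5. Naive Bayes and Model Ensembling Strategies

Naive Bayes is simple, but it performs well in certain scenarios, especially when combined with other models.

Comparing naive Bayes variants

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

models = {
    "GaussianNB": GaussianNB(),
    "BernoulliNB": BernoulliNB(),
    # MultinomialNB rejects negative inputs, and the standardized
    # features contain them, so rescale to [0, 1] first
    "MultinomialNB": make_pipeline(MinMaxScaler(), MultinomialNB()),
}

for name, model in models.items():
    scores = cross_val_score(model, X_processed, y, cv=5, scoring="f1")
    print(f"{name} mean F1 score: {scores.mean():.3f}")
```

Voting ensemble (VotingClassifier) in practice

```python
from sklearn.ensemble import VotingClassifier

# Define the base models
estimators = [
    ("knn", grid_search.best_estimator_),
    ("svm", random_search.best_estimator_),
    ("rf", grid_rf.best_estimator_),
    ("nb", GaussianNB()),
]

# Compare hard voting and soft voting
for voting in ["hard", "soft"]:
    voting_clf = VotingClassifier(estimators=estimators, voting=voting)
    scores = cross_val_score(voting_clf, X_processed, y, cv=5, scoring="f1")
    print(f"{voting} voting F1: {scores.mean():.3f}")
```

Ensembling methods compared

| Method | Advantages | Disadvantages | Best suited for |
|---|---|---|---|
| Voting | Simple and intuitive | Cannot learn weights | Base models that differ widely |
| Stacking | Clear performance gains | Complex to implement | Ample compute resources |
| Averaging | Reduces variance | May lower accuracy | Homogeneous models |

6. Model Evaluation and Deployment Advice

Choosing a cross-validation strategy

- StratifiedKFold: preserves class proportions; suited to classification problems.
- GroupKFold: ensures samples from the same group never appear in both the training and the test set.
- TimeSeriesSplit: designed for time-series data.

Key evaluation metrics

```python
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict

# voting_clf is the soft-voting ensemble left over from the loop above.
# Out-of-fold predictions are used for the report, because predicting on
# the training data itself would overstate performance.
y_pred = cross_val_predict(voting_clf, X_processed, y, cv=5)
print(classification_report(y, y_pred, target_names=["No Disease", "Disease"]))

# Refit on all the data for deployment
final_model = voting_clf.fit(X_processed, y)
```

Deployment notes

Save the preprocessing pipeline and the model:

```python
import joblib

joblib.dump(preprocessor, "preprocessor.pkl")
joblib.dump(final_model, "heart_disease_model.pkl")
```

Build a prediction API service:

```python
from flask import Flask, request, jsonify
import joblib
import pandas as pd

app = Flask(__name__)
model = joblib.load("heart_disease_model.pkl")
preprocessor = joblib.load("preprocessor.pkl")

@app.route("/predict", methods=["POST"])
def predict():
    data = request.json
    # The ColumnTransformer selects columns by name, so wrap the JSON
    # object in a one-row DataFrame
    processed = preprocessor.transform(pd.DataFrame([data]))
    proba = model.predict_proba(processed)[0][1]
    return jsonify({"probability": float(proba)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

In real medical applications, set a probability threshold and add an explanation feature to help doctors understand the basis for the model's decisions. Also monitor model performance regularly, and retrain promptly when the data distribution shifts.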
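One way to act on the probability-threshold advice is to pick the threshold from out-of-fold probabilities instead of using the default 0.5. The sketch below is only an illustration under stated assumptions: the data is synthetic (`make_classification` stands in for the heart disease dataset), `LogisticRegression` stands in for the voting ensemble, `pick_threshold` is a hypothetical helper, and the 0.9 recall target is an arbitrary example value, not a clinical recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import cross_val_predict


def pick_threshold(y_true, proba, min_recall=0.9):
    """Hypothetical helper: highest threshold whose recall is >= min_recall."""
    precision, recall, thresholds = precision_recall_curve(y_true, proba)
    # recall has one more entry than thresholds; the final point
    # (recall 0) has no threshold attached, so it is dropped
    meets_target = recall[:-1] >= min_recall
    return float(thresholds[meets_target].max())


# Synthetic stand-in data (assumption: 13 features, binary target)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

# Stand-in classifier; any estimator with predict_proba would work here
model = LogisticRegression(max_iter=1000)

# Out-of-fold probabilities of the positive class
oof_proba = cross_val_predict(
    model, X_demo, y_demo, cv=5, method="predict_proba"
)[:, 1]

threshold = pick_threshold(y_demo, oof_proba, min_recall=0.9)

# Predictions use the same ">=" rule that precision_recall_curve assumes
y_hat = (oof_proba >= threshold).astype(int)
recall_at_t = recall_score(y_demo, y_hat)
print(f"threshold={threshold:.3f}, recall={recall_at_t:.3f}")
```

A recall target is used here because a missed disease case is usually the more costly error in screening. The same helper could be adapted to a precision target if false alarms matter more.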