DAY 21 Common Dimensionality Reduction Algorithms
Contents
- DAY 21 Common Dimensionality Reduction Algorithms
- 1. LDA (Linear Discriminant Analysis)
- 2. PCA (Principal Component Analysis)
- 3. t-SNE dimensionality reduction
- Homework (open-ended): explore when dimensionality reduction is used and what its main applications are, or ask an AI to quiz you and discuss with classmates in the group. You could also compare t-SNE and PCA visualizations on a few specific datasets.
DAY 21 Common Dimensionality Reduction Algorithms
from sklearn.manifold import TSNE
from mpl_toolkits.mplot3d import Axes3D
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import Pipeline
import umap
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
import numpy as np
import warnings
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb
import xgboost as xgb
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import time
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')
plt.rcParams['font.sans-serif'] = ['SimHei']   # SimHei font so Chinese characters render in plots
plt.rcParams['axes.unicode_minus'] = False     # display minus signs correctly with the CJK font

data = pd.read_csv(r'data.csv')

# Encode the discrete (object-typed) columns
list_discrete = data.select_dtypes(include=['object']).columns.tolist()
home_ownership_mapping = {'Own Home': 1, 'Rent': 2, 'Have Mortgage': 3, 'Home Mortgage': 4}
data['Home Ownership'] = data['Home Ownership'].map(home_ownership_mapping)
years_in_job_mapping = {'< 1 year': 1, '1 year': 2, '2 years': 3, '3 years': 4, '4 years': 5,
                        '5 years': 6, '6 years': 7, '7 years': 8, '8 years': 9, '9 years': 10, '10+ years': 11}
data['Years in current job'] = data['Years in current job'].map(years_in_job_mapping)
data = pd.get_dummies(data, columns=['Purpose'])

# Compare with the raw file to find the newly created one-hot columns and cast them to int
data2 = pd.read_csv(r'data.csv')
list_new = []
for i in data.columns:
    if i not in data2.columns:
        list_new.append(i)
for i in list_new:
    data[i] = data[i].astype(int)

term_mapping = {'Short Term': 0, 'Long Term': 1}
data['Term'] = data['Term'].map(term_mapping)
data.rename(columns={'Term': 'Long Term'}, inplace=True)

# Fill missing values in the continuous columns with the median
list_continuous = data.select_dtypes(include=['int64', 'float64']).columns.tolist()
for i in list_continuous:
    median_value = data[i].median()
    data[i] = data[i].fillna(median_value)

data.drop(columns=['Id'], inplace=True)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 31 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Home Ownership                7500 non-null   int64
 1   Annual Income                 7500 non-null   float64
 2   Years in current job          7500 non-null   float64
 3   Tax Liens                     7500 non-null   float64
 4   Number of Open Accounts       7500 non-null   float64
 5   Years of Credit History       7500 non-null   float64
 6   Maximum Open Credit           7500 non-null   float64
 7   Number of Credit Problems     7500 non-null   float64
 8   Months since last delinquent  7500 non-null   float64
 9   Bankruptcies                  7500 non-null   float64
 10  Long Term                     7500 non-null   int64
 11  Current Loan Amount           7500 non-null   float64
 12  Current Credit Balance        7500 non-null   float64
 13  Monthly Debt                  7500 non-null   float64
 14  Credit Score                  7500 non-null   float64
 15  Credit Default                7500 non-null   int64
 16  Purpose_business loan         7500 non-null   int64
 17  Purpose_buy a car             7500 non-null   int64
 18  Purpose_buy house             7500 non-null   int64
 19  Purpose_debt consolidation    7500 non-null   int64
 20  Purpose_educational expenses  7500 non-null   int64
 21  Purpose_home improvements     7500 non-null   int64
 22  Purpose_major purchase        7500 non-null   int64
 23  Purpose_medical bills         7500 non-null   int64
 24  Purpose_moving                7500 non-null   int64
 25  Purpose_other                 7500 non-null   int64
 26  Purpose_renewable energy      7500 non-null   int64
 27  Purpose_small business        7500 non-null   int64
 28  Purpose_take a trip           7500 non-null   int64
 29  Purpose_vacation              7500 non-null   int64
 30  Purpose_wedding               7500 non-null   int64
dtypes: float64(13), int64(18)
memory usage: 1.8 MB
X = data.drop(['Credit Default'], axis=1)
Y = data['Credit Default']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

print('Random forest with default parameters (train -> test)')
start_time = time.time()
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, Y_train)
rf_pred = rf_model.predict(X_test)
end_time = time.time()
print(f'Training and prediction time: {end_time - start_time:.4f} s')
print('Classification report for the default random forest on the test set:')
print(classification_report(Y_test, rf_pred))
print('Confusion matrix for the default random forest on the test set:')
print(confusion_matrix(Y_test, rf_pred))
Random forest with default parameters (train -> test)
Training and prediction time: 3.0496 s
Classification report for the default random forest on the test set:
              precision    recall  f1-score   support

           0       0.77      0.97      0.86      1059
           1       0.79      0.30      0.44       441

    accuracy                           0.77      1500
   macro avg       0.78      0.63      0.65      1500
weighted avg       0.77      0.77      0.73      1500

Confusion matrix for the default random forest on the test set:
[[1023   36]
 [ 307  134]]
1. LDA (Linear Discriminant Analysis)
print('LDA dimensionality reduction + random forest')

# Standardize features before LDA
scaler_lda = StandardScaler()
X_train_scaled_lda = scaler_lda.fit_transform(X_train)
X_test_scaled_lda = scaler_lda.transform(X_test)

# LDA can produce at most min(n_features, n_classes - 1) components
n_features = X_train_scaled_lda.shape[1]
if hasattr(Y_train, 'nunique'):
    n_classes = Y_train.nunique()
elif isinstance(Y_train, np.ndarray):
    n_classes = len(np.unique(Y_train))
else:
    n_classes = len(set(Y_train))
max_lda_components = min(n_features, n_classes - 1)

n_components_lda_target = 10
if max_lda_components < 1:
    print(f'LDA is not applicable: the number of classes ({n_classes}) is too small to produce even one discriminant component')
    X_train_lda = X_train_scaled_lda.copy()
    X_test_lda = X_test_scaled_lda.copy()
    actual_n_components_lda = n_features
    print('The scaled original features will be used for the subsequent steps')
else:
    actual_n_components_lda = min(n_components_lda_target, max_lda_components)
    if actual_n_components_lda < 1:
        print(f'The computed number of LDA components ({actual_n_components_lda}) is less than 1, so LDA is not applicable')
        X_train_lda = X_train_scaled_lda.copy()
        X_test_lda = X_test_scaled_lda.copy()
        actual_n_components_lda = n_features
        print('The scaled original features will be used for the subsequent steps')
    else:
        print(f'Number of original features: {n_features}, number of classes: {n_classes}')
        print(f'LDA can reduce to at most {max_lda_components} dimension(s)')
        print(f'Target number of dimensions: {n_components_lda_target}')
        print(f'This run of LDA will actually reduce to {actual_n_components_lda} dimension(s)')
        lda_manual = LinearDiscriminantAnalysis(n_components=actual_n_components_lda, solver='svd')
        X_train_lda = lda_manual.fit_transform(X_train_scaled_lda, Y_train)
        X_test_lda = lda_manual.transform(X_test_scaled_lda)

print(f'After LDA, training set shape: {X_train_lda.shape}, test set shape: {X_test_lda.shape}')

start_time_lda_rf = time.time()
rf_model_lda = RandomForestClassifier(random_state=42)
rf_model_lda.fit(X_train_lda, Y_train)
rf_pred_lda_manual = rf_model_lda.predict(X_test_lda)
end_time_lda_rf = time.time()
print(f'Random forest training and prediction time on the LDA-reduced data: {end_time_lda_rf - start_time_lda_rf:.4f} s')

print('Classification report for manual LDA + random forest on the test set:')
print(classification_report(Y_test, rf_pred_lda_manual))
print('Confusion matrix for manual LDA + random forest on the test set:')
print(confusion_matrix(Y_test, rf_pred_lda_manual))
LDA dimensionality reduction + random forest
Number of original features: 30, number of classes: 2
LDA can reduce to at most 1 dimension(s)
Target number of dimensions: 10
This run of LDA will actually reduce to 1 dimension(s)
After LDA, training set shape: (6000, 1), test set shape: (1500, 1)
Random forest training and prediction time on the LDA-reduced data: 2.8816 s
Classification report for manual LDA + random forest on the test set:
              precision    recall  f1-score   support

           0       0.76      0.76      0.76      1059
           1       0.43      0.43      0.43       441

    accuracy                           0.67      1500
   macro avg       0.60      0.60      0.60      1500
weighted avg       0.67      0.67      0.67      1500

Confusion matrix for manual LDA + random forest on the test set:
[[808 251]
 [250 191]]
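Note on the result above: for a binary target, LDA can produce at most n_classes - 1 = 1 discriminant component, which is why the requested 10 dimensions collapse to a single one. As a minimal sketch of how this limit follows the class count, consider scikit-learn's bundled Iris data (3 classes, hence at most 2 components); the dataset choice and variable names below are illustrative and not part of the lesson code.

from sklearn.datasets import load_iris

X_iris, y_iris = load_iris(return_X_y=True)            # 4 features, 3 classes
lda_iris = LinearDiscriminantAnalysis(n_components=2)  # at most 3 - 1 = 2 components
X_iris_lda = lda_iris.fit_transform(X_iris, y_iris)
print(X_iris_lda.shape)                                # (150, 2)
print(lda_iris.explained_variance_ratio_)              # between-class variance captured per component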
2. PCA (Principal Component Analysis)
print('PCA dimensionality reduction + random forest (without a Pipeline)')

# Standardize features before PCA
scaler_pca = StandardScaler()
X_train_scaled_pca = scaler_pca.fit_transform(X_train)
X_test_scaled_pca = scaler_pca.transform(X_test)

# Fit PCA with all components to inspect the cumulative explained variance
pca_expl = PCA(random_state=42)
pca_expl.fit(X_train_scaled_pca)
cumsum_variance = np.cumsum(pca_expl.explained_variance_ratio_)
n_components_to_keep_95_var = np.argmax(cumsum_variance >= 0.95) + 1
print(f'Number of principal components needed to retain 95% of the variance: {n_components_to_keep_95_var}')
PCA dimensionality reduction + random forest (without a Pipeline)
Number of principal components needed to retain 95% of the variance: 26
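The cumulative-sum step above can also be delegated to scikit-learn itself: passing a float between 0 and 1 as n_components tells PCA to keep just enough components to explain that fraction of the variance. A minimal sketch reusing the scaled training matrix from above (pca_95 is an illustrative name, not from the lesson code):

pca_95 = PCA(n_components=0.95, random_state=42)           # keep enough components for 95% of the variance
X_train_pca_95 = pca_95.fit_transform(X_train_scaled_pca)
print(pca_95.n_components_)                                # expected to match n_components_to_keep_95_var (26 here)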
n_components_pca = 10
pca_manual = PCA(n_components=n_components_pca, random_state=42)
X_train_pca = pca_manual.fit_transform(X_train_scaled_pca)
X_test_pca = pca_manual.transform(X_test_scaled_pca)
print(f'After PCA, training set shape: {X_train_pca.shape}, test set shape: {X_test_pca.shape}')

start_time_pca_manual = time.time()
rf_model_pca = RandomForestClassifier(random_state=42)
rf_model_pca.fit(X_train_pca, Y_train)
rf_pred_pca_manual = rf_model_pca.predict(X_test_pca)
end_time_pca_manual = time.time()
print(f'Training and prediction time after manual PCA: {end_time_pca_manual - start_time_pca_manual:.4f} s')
print('Classification report for manual PCA + random forest on the test set:')
print(classification_report(Y_test, rf_pred_pca_manual))
print('Confusion matrix for manual PCA + random forest on the test set:')
print(confusion_matrix(Y_test, rf_pred_pca_manual))
After PCA, training set shape: (6000, 10), test set shape: (1500, 10)
Training and prediction time after manual PCA: 5.5574 s
Classification report for manual PCA + random forest on the test set:
              precision    recall  f1-score   support

           0       0.76      0.94      0.84      1059
           1       0.69      0.30      0.42       441

    accuracy                           0.76      1500
   macro avg       0.73      0.62      0.63      1500
weighted avg       0.74      0.76      0.72      1500

Confusion matrix for manual PCA + random forest on the test set:
[[1000   59]
 [ 308  133]]
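The heading of this section notes that no Pipeline is used. The already-imported Pipeline can wrap the same scaler + PCA + classifier chain, so scaling and PCA are fitted on the training data only and applied consistently at prediction (and cross-validation) time. A minimal sketch on the same split; the step names are illustrative:

pca_rf_pipeline = Pipeline([
    ('scaler', StandardScaler()),                      # fitted on the training data only
    ('pca', PCA(n_components=10, random_state=42)),    # same 10 components as the manual version
    ('rf', RandomForestClassifier(random_state=42)),
])
pca_rf_pipeline.fit(X_train, Y_train)
print(classification_report(Y_test, pca_rf_pipeline.predict(X_test)))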
3. t-SNE Dimensionality Reduction
print('t-SNE dimensionality reduction + random forest')
print('Standard t-SNE is mainly used for visualization; feeding its output directly to a classifier may not work well')

# Standardize features before t-SNE
scaler_tsne = StandardScaler()
X_train_scaled_tsne = scaler_tsne.fit_transform(X_train)
X_test_scaled_tsne = scaler_tsne.transform(X_test)

n_components_tsne = 2

# t-SNE has no transform() for new data, so the training and test sets are embedded separately
tsne_model_train = TSNE(n_components=n_components_tsne, perplexity=30, n_iter=1000,
                        init='pca', learning_rate='auto', random_state=42, n_jobs=-1)
print('Running t-SNE fit_transform on the training set')
start_tsne_fit_train = time.time()
X_train_tsne = tsne_model_train.fit_transform(X_train_scaled_tsne)
end_tsne_fit_train = time.time()
print(f'Training-set t-SNE fit_transform finished, time: {end_tsne_fit_train - start_tsne_fit_train:.2f} s')

tsne_model_test = TSNE(n_components=n_components_tsne, perplexity=30, n_iter=1000,
                       init='pca', learning_rate='auto', random_state=42, n_jobs=-1)
print('Running t-SNE fit_transform on the test set')
start_tsne_fit_test = time.time()
X_test_tsne = tsne_model_test.fit_transform(X_test_scaled_tsne)
end_tsne_fit_test = time.time()
print(f'Test-set t-SNE fit_transform finished, time: {end_tsne_fit_test - start_tsne_fit_test:.2f} s')

print(f'After t-SNE, training set shape: {X_train_tsne.shape}, test set shape: {X_test_tsne.shape}')

start_time_tsne_rf = time.time()
rf_model_tsne = RandomForestClassifier(random_state=42)
rf_model_tsne.fit(X_train_tsne, Y_train)
rf_pred_tsne_manual = rf_model_tsne.predict(X_test_tsne)
end_time_tsne_rf = time.time()
print(f'Random forest training and prediction time on the t-SNE-reduced data: {end_time_tsne_rf - start_time_tsne_rf:.4f} s')

total_tsne_time = (end_tsne_fit_train - start_tsne_fit_train) + \
                  (end_tsne_fit_test - start_tsne_fit_test) + \
                  (end_time_tsne_rf - start_time_tsne_rf)
print(f'Total t-SNE time (including both fit_transform calls and the RF): {total_tsne_time:.2f} s')

print('Classification report for manual t-SNE + random forest on the test set:')
print(classification_report(Y_test, rf_pred_tsne_manual))
print('Confusion matrix for manual t-SNE + random forest on the test set:')
print(confusion_matrix(Y_test, rf_pred_tsne_manual))
t-SNE dimensionality reduction + random forest
Standard t-SNE is mainly used for visualization; feeding its output directly to a classifier may not work well
Running t-SNE fit_transform on the training set
Training-set t-SNE fit_transform finished, time: 35.26 s
Running t-SNE fit_transform on the test set
Test-set t-SNE fit_transform finished, time: 7.46 s
After t-SNE, training set shape: (6000, 2), test set shape: (1500, 2)
Random forest training and prediction time on the t-SNE-reduced data: 2.1959 s
Total t-SNE time (including both fit_transform calls and the RF): 44.91 s
Classification report for manual t-SNE + random forest on the test set:
              precision    recall  f1-score   support

           0       0.70      0.90      0.79      1059
           1       0.24      0.07      0.11       441

    accuracy                           0.66      1500
   macro avg       0.47      0.49      0.45      1500
weighted avg       0.57      0.66      0.59      1500

Confusion matrix for manual t-SNE + random forest on the test set:
[[955 104]
 [408  33]]
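The poor class-1 scores above are partly a consequence of the workflow: t-SNE has no transform() method, so the training and test sets are embedded independently and the two embeddings do not live in a comparable space. The imported umap library does learn a reusable mapping; a minimal sketch reusing the scaled matrices from above (assuming the umap-learn package is installed, and the variable names are illustrative):

umap_reducer = umap.UMAP(n_components=2, random_state=42)
X_train_umap = umap_reducer.fit_transform(X_train_scaled_tsne)  # learn the embedding on the training set
X_test_umap = umap_reducer.transform(X_test_scaled_tsne)        # project the test set into the same space

rf_model_umap = RandomForestClassifier(random_state=42)
rf_model_umap.fit(X_train_umap, Y_train)
print(classification_report(Y_test, rf_model_umap.predict(X_test_umap)))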
Homework (open-ended): explore when dimensionality reduction is used and what its main applications are, or ask an AI to quiz you and discuss with classmates in the group. You could also compare t-SNE and PCA visualizations on a few specific datasets; a starting-point sketch follows.
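As a starting point for the visualization comparison in the homework, the sketch below projects scikit-learn's bundled handwritten-digits dataset to 2D with both PCA and t-SNE and plots them side by side; the dataset choice and variable names are only an example, not part of the lesson.

from sklearn.datasets import load_digits

X_digits, y_digits = load_digits(return_X_y=True)
X_digits_scaled = StandardScaler().fit_transform(X_digits)

X_pca_2d = PCA(n_components=2, random_state=42).fit_transform(X_digits_scaled)
X_tsne_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_digits_scaled)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(X_pca_2d[:, 0], X_pca_2d[:, 1], c=y_digits, cmap='tab10', s=5)
axes[0].set_title('PCA (2D)')      # linear projection: classes tend to overlap more
axes[1].scatter(X_tsne_2d[:, 0], X_tsne_2d[:, 1], c=y_digits, cmap='tab10', s=5)
axes[1].set_title('t-SNE (2D)')    # nonlinear embedding: local clusters separate more clearly
plt.tight_layout()
plt.show()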
@浙大疏锦行