Random Forest in Python: Either Data Conversion warning or weird result?

I'm currently trying to run a random forest on a house sales dataset.

Unfortunately, I'm struggling to get the X and y variables into the right dimensions.

To start with, I only want to include 4 features (such as bathrooms, bedrooms, square footage, ...) to predict the price (the first column).

If I run the code below, I get the following warning:

DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().

That's a pretty clear message, so I converted my y variable (Train_TargetVar) with ravel:

Train_TargetVar = np.ravel(Train_TargetVar, order='C')

The code runs now, but the result doesn't make any sense. The confusion matrix at the end looks like this:

 Confusion matrix  
[[1 0 0 ..., 0 0 0]
 [0 1 0 ..., 0 0 0]
 [0 0 1 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 1 0 0]
 [0 0 0 ..., 0 1 0]
 [0 0 0 ..., 0 1 0]]

I'm afraid it now has a few thousand rows/columns, which isn't a meaningful result...

It would be great if someone could give me a hint and/or tell me which part of my code needs to be changed.

# Load Libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.cross_validation import train_test_split
import pandas as pd
import numpy as np



# Load data
dataset = pd.read_csv('kc_house_data.csv')


# Drop columns that are not used
dataset = dataset.drop('id', axis=1)
dataset = dataset.drop('date', axis=1)
dataset = dataset.drop('zipcode', axis=1)
dataset = dataset.drop('long', axis=1)



# Cast the selected columns to integers
cols = ['price', 'bathrooms', 'floors', 'bedrooms', 'sqft_living', 'sqft_lot', 'waterfront', 'view', 'condition', 'grade', 'lat', 'sqft_above']
dataset[cols] = dataset[cols].applymap(np.int64)





# Splitting Dataset
Train,Test = train_test_split(dataset, test_size = 0.3, random_state = 176)



# Column 0 is the target (price); columns 1-4 are the features
Train_IndepentVars = Train.values[:, 1:5]
Train_TargetVar = Train.values[:, 0:1]


print(Train_IndepentVars.shape)
print(Train_TargetVar.shape)



##RF

rf_model = RandomForestClassifier(max_depth=30, n_estimators=5)
rf_model.fit(Train_IndepentVars, Train_TargetVar)

predictions = rf_model.predict(Train_IndepentVars)




###Confusion Matrix

from sklearn.metrics import confusion_matrix

print(" Confusion matrix ", confusion_matrix(Train_TargetVar, predictions))


importance =  rf_model.feature_importances_
importance = pd.DataFrame(importance, index=Train.columns[1:5], 
                          columns=["Importance"])

print(importance)

You don't need to use ravel; this gives you the same 1-D array right away:

Train_TargetVar = Train.values[:, 0]
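To see where the warning comes from in the first place, here is a minimal sketch (using the same Train frame as above) comparing the two slicing styles:

# 2-D slice of shape (n_samples, 1) -- this is what triggers the column-vector warning
y_2d = Train.values[:, 0:1]
print(y_2d.shape)

# 1-D slice of shape (n_samples,) -- this is what scikit-learn expects
y_1d = Train.values[:, 0]
print(y_1d.shape)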

First of all, to be honest, the warning you're seeing can be ignored; it has nothing to do with your results.

Your code works as expected, but you need to change your approach. What you're doing is fitting a classifier to a regression problem (given how your target variable currently looks), i.e. you're using RandomForestClassifier to predict price, where price can take on 4028 distinct values. The classifier treats these as 4028 classes [0, 1, 2, 3, 4, 5, ..., 4027]. That's why you see an insanely sized confusion matrix: it's a 4028x4028 matrix.
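If you want to verify that count yourself, here is a quick sketch (assuming the Train_TargetVar and rf_model from your code above; the exact number depends on the train/test split):

# how many distinct price values, i.e. classes, the classifier has to deal with
print(len(np.unique(Train_TargetVar)))

# equivalently, after fitting:
print(len(rf_model.classes_))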

Instead, you should either:

  1. Turn this into a regression problem, or
  2. Modify your target variable so it becomes a manageable classification problem

1. Regression problem

Here is your code, modified to use RandomForestRegressor, and then the R-squared metric to see how it performs:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, confusion_matrix
from sklearn.model_selection import train_test_split

df = pd.read_csv('data/kc_house_data.csv')

drop_cols = ['id', 'date', 'zipcode', 'long']
feature_cols = ['bathrooms', 'floors', 'bedrooms', 
             'sqft_living', 'sqft_lot', 'waterfront', 'view', 
             'condition', 'grade', 'lat', 'sqft_above']
target_col = ['price']

all_cols = feature_cols + target_col

dataset = df.drop(drop_cols, axis=1)
dataset[all_cols] = dataset[all_cols].applymap(np.int64)

# split dataset for cross-validation
train, test = train_test_split(dataset, test_size=0.3, random_state=176)

# set up our random forest model
rf = RandomForestRegressor(max_depth=30, n_estimators=5)

# fit our model
rf.fit(train[feature_cols].values, train[target_col].values.ravel())  # ravel() keeps y 1-D and avoids the warning

# look at our predictions
y_pred = rf.predict(test[feature_cols].values)
r2 = r2_score(test[target_col].values, y_pred)  # r2_score expects (y_true, y_pred)
print('R-squared: {}'.format(r2))

# look at the feature importance
importance =  rf.feature_importances_
importance = pd.DataFrame(importance, index=feature_cols, columns=['importance'])
print('Feature importance:\n {}'.format(importance))

This prints:

R-squared: 0.532308123273

Feature importance:
              importance
bathrooms      0.016268
floors         0.010330
bedrooms       0.017346
sqft_living    0.422269
sqft_lot       0.104096
waterfront     0.021439
view           0.037015
condition      0.025751
grade          0.279991
lat            0.000000
sqft_above     0.065496
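One detail worth noticing in that output: lat ends up with an importance of exactly 0. That is a side effect of the applymap(np.int64) cast, which truncates every latitude in this dataset (all roughly between 47 and 48) to the same integer, so the column carries no information. A quick check, assuming the dataset frame from above:

# after the int64 cast, latitude collapses to a single value
print(dataset['lat'].nunique())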

You may already know this, but for clarity's sake, you can read through the following as well:

2. Classification problem

You can create your own classes by binning the price into ranges, e.g. bucket 1: $0 - $100,000, bucket 2: $100,000 - $200,000, and so on. For now, I've only given it two classes:

# fake price classes, this is for later
target_binary = ['price_binary']
dataset[target_binary] = (dataset[target_col] > 221900).astype(int)

# re-split so that train and test also contain the new price_binary column
train, test = train_test_split(dataset, test_size=0.3, random_state=176)
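If you want more than two buckets, something along these lines should work (just a sketch; the bin edges and the price_bucket column name are arbitrary and only illustrate the idea):

# bin prices into labelled buckets: <100k, 100k-200k, 200k-500k, 500k-1M, >1M
bins = [0, 100000, 200000, 500000, 1000000, float('inf')]
labels = [0, 1, 2, 3, 4]
dataset['price_bucket'] = pd.cut(dataset['price'], bins=bins, labels=labels)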

Then use the RandomForestClassifier:

# Using a Random Forest Classifier
# fit the classifier
rfc = RandomForestClassifier(max_depth=30, n_estimators=5)
rfc.fit(train[feature_cols], train[target_binary].values.ravel())  # ravel() keeps y 1-D

# look at the feature importance
importance_c = rfc.feature_importances_
importance_c = pd.DataFrame(importance_c, index=feature_cols, columns=['importance'])
print('Feature importance:\n {}'.format(importance_c))

# look at our predictions
y_pred_c = rfc.predict(test[feature_cols])
cm = confusion_matrix(test[target_binary], y_pred_c)  # (y_true, y_pred), matching the sklearn convention
print('Confusion matrix:\n {}'.format(cm))

This prints the following:

Feature importance:
              importance
bathrooms      0.018511
floors         0.011572
bedrooms       0.019063
sqft_living    0.455199
sqft_lot       0.113200
waterfront     0.026671
view           0.030930
condition      0.021197
grade          0.235906
lat            0.000000
sqft_above     0.067749

Confusion matrix:
 [[  92  155]
 [ 324 5913]]
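If you also want a single summary number for the classifier (analogous to the R-squared above), here is a minimal sketch using the same test split and predictions:

from sklearn.metrics import accuracy_score

# fraction of test rows where the predicted class matches price_binary
print('Accuracy: {}'.format(accuracy_score(test[target_binary].values.ravel(), y_pred_c)))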