Random Forest in Python: Either Data Conversion warning or weird result?
I am currently trying to run a random forest algorithm on a house sales dataset.
Unfortunately, I am struggling to get the X and Y variables into the right dimensions.
To start, I only want to include four features (such as bathrooms, bedrooms, sqft, ...) to predict the price (the first column).
If I run code like the below, I get the following warning:
DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
That is a very clear message, so I converted my y variable (Train_TargetVar) via ravel:
Train_TargetVar = np.ravel(Train_TargetVar, order='C')
The code now runs, but the result makes no sense whatsoever.
The final confusion matrix looks like this:
Confusion matrix
[[1 0 0 ..., 0 0 0]
[0 1 0 ..., 0 0 0]
[0 0 1 ..., 0 0 0]
...,
[0 0 0 ..., 1 0 0]
[0 0 0 ..., 0 1 0]
[0 0 0 ..., 0 1 0]]
I'm afraid it now has several thousand rows/columns, so no meaningful result...
It would be great if someone could give me a hint and/or tell me which part of my code has to change.
# Load Libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.cross_validation import train_test_split
import pandas as pd
import numpy as np
dataset = pd.read_csv('kc_house_data.csv')
dataset = dataset.drop('id', axis=1)
dataset = dataset.drop('date', axis=1)
dataset = dataset.drop('zipcode', axis=1)
dataset = dataset.drop('long', axis=1)
cols = ['price', 'bathrooms', 'floors', 'bedrooms', 'sqft_living', 'sqft_lot', 'waterfront', 'view', 'condition', 'grade', 'lat', 'sqft_above']
dataset[cols] = dataset[cols].applymap(np.int64)
# Splitting Dataset
Train,Test = train_test_split(dataset, test_size = 0.3, random_state = 176)
Train_IndepentVars = Train.values[:, 1:5]
Train_TargetVar = Train.values[:, 0:1]
print(Train_IndepentVars.shape)
print(Train_TargetVar.shape)
##RF
rf_model = RandomForestClassifier(max_depth=30,n_estimators=5)
rf_model.fit(Train_IndepentVars, Train_TargetVar)
predictions = rf_model.predict(Train_IndepentVars)
###Confusion Matrix
from sklearn.metrics import confusion_matrix
print(" Confusion matrix ", confusion_matrix(Train_TargetVar, predictions))
importance = rf_model.feature_importances_
importance = pd.DataFrame(importance, index=Train.columns[1:5],
                          columns=["Importance"])
print(importance)
You don't need ravel; this gives you the same 1-D array right away:
Train_TargetVar = Train.values[:, 0]
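Indexing the column with a single integer instead of a slice is what drops the extra dimension. A quick sketch with a toy array (standing in for Train.values; the numbers are made up) shows the difference:

```python
import numpy as np

# Toy 2-D array standing in for Train.values: first column is the target.
values = np.array([[221900, 3, 1180],
                   [538000, 3, 2570],
                   [180000, 2, 770]])

col_vector = values[:, 0:1]   # a slice keeps both axes -> shape (3, 1)
flat = values[:, 0]           # a single index drops the axis -> shape (3,)

print(col_vector.shape)  # (3, 1) -- this shape triggers the warning
print(flat.shape)        # (3,)  -- this is what fit() expects

# ravel() on the column vector produces the same 1-D array
print(np.array_equal(np.ravel(col_vector), flat))  # True
```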
First of all, the warning you are seeing can honestly be ignored; it has nothing to do with your result.
Your code works as intended, but you need to rethink your approach. What you are doing is fitting a classifier to a regression problem (which is what this is, given the target variable as it currently stands): you are using RandomForestClassifier to predict price, and price takes on 4028 distinct values. The classifier treats these as 4028 classes [0, 1, 2, 3, 4, 5, ..., 4027]. That is why you see the insanely large confusion matrix: it is a 4028x4028 matrix.
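To see this class explosion concretely, here is a small sketch (on synthetic data, not the house dataset) showing that a classifier creates one class per distinct target value:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 samples, 3 made-up features
y = np.arange(50) * 1000 + 100000     # 50 distinct integer "prices"

clf = RandomForestClassifier(n_estimators=5, random_state=0)
clf.fit(X, y)

# One class per distinct price: 50 targets -> 50 classes,
# so the confusion matrix would be 50x50 (4028x4028 in your case).
print(len(clf.classes_))  # 50
```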
Instead, you should either:
- turn this into a regression problem, or
- modify your target variable so it becomes a manageable classification problem
1. Regression problem
Here is your code, modified to use RandomForestRegressor, and then evaluated with the R-squared metric:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
df = pd.read_csv('data/kc_house_data.csv')
drop_cols = ['id', 'date', 'zipcode', 'long']
feature_cols = ['bathrooms', 'floors', 'bedrooms',
'sqft_living', 'sqft_lot', 'waterfront', 'view',
'condition', 'grade', 'lat', 'sqft_above']
target_col = ['price']
all_cols = feature_cols + target_col
dataset = df.drop(drop_cols, axis=1)
dataset[all_cols] = dataset[all_cols].applymap(np.int64)
# split dataset for cross-validation
train, test = train_test_split(dataset, test_size=0.3, random_state=176)
# set up our random forest model
rf = RandomForestRegressor(max_depth=30, n_estimators=5)
# fit our model
rf.fit(train[feature_cols].values, train[target_col].values )
# look at our predictions
y_pred = rf.predict(test[feature_cols].values)
r2 = r2_score(test[target_col].values, y_pred)  # r2_score expects (y_true, y_pred)
print('R-squared: {}'.format(r2))
# look at the feature importance
importance = rf.feature_importances_
importance = pd.DataFrame(importance, index=feature_cols, columns=['importance'])
print('Feature importance:\n {}'.format(importance))
This prints out:
R-squared: 0.532308123273
Feature importance:
importance
bathrooms 0.016268
floors 0.010330
bedrooms 0.017346
sqft_living 0.422269
sqft_lot 0.104096
waterfront 0.021439
view 0.037015
condition 0.025751
grade 0.279991
lat 0.000000
sqft_above 0.065496
You may already know these, but for completeness you can read up on:
- RandomForestRegressor: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
- R-squared: https://en.wikipedia.org/wiki/Coefficient_of_determination
2. Classification problem
You can create your own classes by defining some price buckets, for example bucket 1: $0 - $100,000, bucket 2: $100,000 - $200,000, and so on. For now, I have only given it two classes:
# fake price classes, this is for later
target_binary = ['price_binary']
dataset[target_binary] = (dataset[target_col] > 221900).astype(int)
# re-split so that train/test pick up the new price_binary column
train, test = train_test_split(dataset, test_size=0.3, random_state=176)
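For more than two classes, the $100,000-wide buckets described above can be built with pandas' pd.cut; a minimal sketch with a few hypothetical prices:

```python
import pandas as pd

# A few hypothetical sale prices
prices = pd.Series([75000, 150000, 221900, 450000, 980000])

# $100k-wide buckets up to $1M; labels=False returns the bucket index,
# which serves directly as the class label for the classifier
bins = range(0, 1_100_000, 100_000)
buckets = pd.cut(prices, bins=bins, labels=False)
print(buckets.tolist())  # [0, 1, 2, 4, 9]
```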
Then use RandomForestClassifier:
# Using a Random Forest Classifier
# fit the classifier
rfc = RandomForestClassifier(max_depth=30, n_estimators=5)
rfc.fit(train[feature_cols], train[target_binary].values.ravel())  # ravel avoids the column-vector warning
# look at the feature importance
importance_c = rfc.feature_importances_
importance_c = pd.DataFrame(importance_c, index=feature_cols, columns=['importance'])
print('Feature importance:\n {}'.format(importance_c))
# look at our predictions
y_pred_c = rfc.predict(test[feature_cols])
cm = confusion_matrix(test[target_binary], y_pred_c)  # confusion_matrix expects (y_true, y_pred)
print('Confusion matrix:\n {}'.format(cm))
This prints the following:
Feature importance:
importance
bathrooms 0.018511
floors 0.011572
bedrooms 0.019063
sqft_living 0.455199
sqft_lot 0.113200
waterfront 0.026671
view 0.030930
condition 0.021197
grade 0.235906
lat 0.000000
sqft_above 0.067749
Confusion matrix:
[[ 92 155]
[ 324 5913]]