随机森林误差(样本数量不一致的输入变量)
Random Forest Error (input variables with inconsistent numbers of samples)
在阅读了这么多有 'inconsistent number of samples' 个错误的示例之后,我仍然看不出我的代码有什么问题。
在 excel 文件中,sheet 1 包含数据。 Sheet 2 包含入围变量列表。
我把sheet2中的变量保存到一个数组中。并将其提供给随机森林模型以评估其对 sheet 1 中参数的影响。
但我得到 "Found input variables with inconsistent numbers of samples: [54, 2016]"
54是sheet2中的变量个数。
2016是sheet1中的数据行数。
我想看看这 54 个变量如何影响 sheet 1 中的 'Target' 变量。
我应该如何操作我的数据才能使这项工作正常进行?
非常感谢。
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
df = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev2.xlsx',sheet_name=0)
df2 = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev2.xlsx',sheet_name=1)
df['DateTime']=pd.to_datetime(df['Time Stamp'], format='%Y-%m-%d %H:%M:%S')
df.set_index(df['DateTime'], inplace=True)
print(len(df2.columns))
allvar = list()
for each_var in df2.columns:
allvar.append(each_var)
allvar = np.array(allvar)
print(allvar)
target = df['(CUP) Chiller Optimization Plant Efficiency [kW/RT]']
target=target.values.reshape(len(target),1)
allvar_train,allvar_test,target_train,target_test= train_test_split(allvar,target, random_state=0, test_size=0.6)
clf = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
clf.fit(allvar_train, target_train)
for feature in zip(feat_labels, clf.feature_importances_):
print(feature)
Sheet 1(保存为df)看起来像这样
Sheet 1
Sheet 2(保存为df2)看起来像这样
Sheet2
错误日志如图
Error log
错误日志 2:未知标签类型:'continuous'Error Log 2
allvar_train
target train
问题出在 'train_test_spilt',您只传递特征列名称而不传递数据。像这样使用列列表从 DataFrame 中获取数据。
allvar_train,allvar_test,target_train,target_test= train_test_split(df[allvar],target, random_state=0, test_size=0.6)
你不一定需要将'allvar'和'target'转换为numpy数组它可以直接用于'train_test_split'。
注:此问题与随机森林无关
这是适合我的代码。
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
df = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev3.xlsx',sheet_name=0)
df2 = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev3.xlsx',sheet_name=1)
df['DateTime']=pd.to_datetime(df['Time Stamp'], format='%Y-%m-%d %H:%M:%S')
df.set_index(df['DateTime'], inplace=True)
print(len(df2.columns))
allvarlist = list()
for each_var in df2.columns:
allvarlist.append(each_var)
countvar = len(allvarlist)
allvar = df[allvarlist]
allvar = allvar.values.reshape(len(allvar),countvar)
target = df['(CUP) Chiller Optimization Plant Efficiency [kW/RT]']
target=target.values.reshape(len(target),1)
allvar_train,allvar_test,target_train,target_test= train_test_split(allvar,target, random_state=0, test_size=0.7)
clf = RandomForestRegressor(n_estimators=10000, random_state=0, n_jobs=-1)
#print(allvar_train)
#print(target_train)
clf.fit(allvar_train,np.ravel(target_train))
for feature in zip(allvarlist, clf.feature_importances_):
print(feature)
importances = clf.feature_importances_
#indices = np.argsort(importances)
plt.figure().set_size_inches(14,16)
plt.barh(range(allvar_train.shape[1]), importances, color="r")
plt.yticks(range(allvar_train.shape[1]),allvarlist)
在阅读了这么多有 'inconsistent number of samples' 个错误的示例之后,我仍然看不出我的代码有什么问题。
在 excel 文件中,sheet 1 包含数据。 Sheet 2 包含入围变量列表。
我把sheet2中的变量保存到一个数组中。并将其提供给随机森林模型以评估其对 sheet 1 中参数的影响。
但我得到 "Found input variables with inconsistent numbers of samples: [54, 2016]"
54是sheet2中的变量个数。 2016是sheet1中的数据行数。
我想看看这 54 个变量如何影响 sheet 1 中的 'Target' 变量。
我应该如何操作我的数据才能使这项工作正常进行?
非常感谢。
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
df = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev2.xlsx',sheet_name=0)
df2 = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev2.xlsx',sheet_name=1)
df['DateTime']=pd.to_datetime(df['Time Stamp'], format='%Y-%m-%d %H:%M:%S')
df.set_index(df['DateTime'], inplace=True)
print(len(df2.columns))
allvar = list()
for each_var in df2.columns:
allvar.append(each_var)
allvar = np.array(allvar)
print(allvar)
target = df['(CUP) Chiller Optimization Plant Efficiency [kW/RT]']
target=target.values.reshape(len(target),1)
allvar_train,allvar_test,target_train,target_test= train_test_split(allvar,target, random_state=0, test_size=0.6)
clf = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
clf.fit(allvar_train, target_train)
for feature in zip(feat_labels, clf.feature_importances_):
print(feature)
Sheet 1(保存为df)看起来像这样 Sheet 1
Sheet 2(保存为df2)看起来像这样 Sheet2
错误日志如图 Error log
错误日志 2:未知标签类型:'continuous'Error Log 2
allvar_train
target train
问题出在 'train_test_spilt',您只传递特征列名称而不传递数据。像这样使用列列表从 DataFrame 中获取数据。
allvar_train,allvar_test,target_train,target_test= train_test_split(df[allvar],target, random_state=0, test_size=0.6)
你不一定需要将'allvar'和'target'转换为numpy数组它可以直接用于'train_test_split'。
注:此问题与随机森林无关
这是适合我的代码。
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
df = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev3.xlsx',sheet_name=0)
df2 = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev3.xlsx',sheet_name=1)
df['DateTime']=pd.to_datetime(df['Time Stamp'], format='%Y-%m-%d %H:%M:%S')
df.set_index(df['DateTime'], inplace=True)
print(len(df2.columns))
allvarlist = list()
for each_var in df2.columns:
allvarlist.append(each_var)
countvar = len(allvarlist)
allvar = df[allvarlist]
allvar = allvar.values.reshape(len(allvar),countvar)
target = df['(CUP) Chiller Optimization Plant Efficiency [kW/RT]']
target=target.values.reshape(len(target),1)
allvar_train,allvar_test,target_train,target_test= train_test_split(allvar,target, random_state=0, test_size=0.7)
clf = RandomForestRegressor(n_estimators=10000, random_state=0, n_jobs=-1)
#print(allvar_train)
#print(target_train)
clf.fit(allvar_train,np.ravel(target_train))
for feature in zip(allvarlist, clf.feature_importances_):
print(feature)
importances = clf.feature_importances_
#indices = np.argsort(importances)
plt.figure().set_size_inches(14,16)
plt.barh(range(allvar_train.shape[1]), importances, color="r")
plt.yticks(range(allvar_train.shape[1]),allvarlist)