Scikit-Learn One-hot-encode before or after train/test split
I am looking at two scenarios for building a model using scikit-learn, and I cannot figure out why one of them returns a result so fundamentally different from the other. The only thing that differs between the two cases (as far as I can tell) is that in one case I one-hot encode the categorical variables all at once (on the whole data) and then split into training and test sets. In the second case I split into training and test sets first, and then one-hot encode both sets based on the training data.
The latter case is technically better for judging the generalization error of the process, but it returns a normalized gini that is dramatically different from the first case (and bad - essentially no model). I know the first case's gini (~0.33) is in line with a model built on this data.
Why is the second case returning such a different gini? FYI, the dataset contains a mix of numeric and categorical variables.
Method 1 (one-hot encode the entire data, then split) returns: Validation Sample Score: 0.3454355044 (normalized gini).
from sklearn.cross_validation import StratifiedKFold, KFold, ShuffleSplit, train_test_split, PredefinedSplit
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer as DV
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from scipy.stats import randint, uniform
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston

def gini(solution, submission):
    # Unnormalized Gini: sort by prediction (descending), accumulate the actual
    # values, and sum the gap between that cumulative curve and the diagonal
    df = zip(solution, submission, range(len(solution)))
    df = sorted(df, key=lambda x: (x[1], -x[2]), reverse=True)
    rand = [float(i + 1) / float(len(df)) for i in range(len(df))]
    totalPos = float(sum([x[0] for x in df]))
    cumPosFound = [df[0][0]]
    for i in range(1, len(df)):
        cumPosFound.append(cumPosFound[len(cumPosFound) - 1] + df[i][0])
    Lorentz = [float(x) / totalPos for x in cumPosFound]
    Gini = [Lorentz[i] - rand[i] for i in range(len(df))]
    return sum(Gini)

def normalized_gini(solution, submission):
    normalized_gini = gini(solution, submission) / gini(solution, solution)
    return normalized_gini

# Normalized Gini Scorer
gini_scorer = metrics.make_scorer(normalized_gini, greater_is_better=True)

if __name__ == '__main__':
    dat = pd.read_table('/home/jma/Desktop/Data/Kaggle/liberty/train.csv', sep=",")
    y = dat[['Hazard']].values.ravel()
    dat = dat.drop(['Hazard', 'Id'], axis=1)

    folds = train_test_split(range(len(y)), test_size=0.30, random_state=15)  # 30% test

    # First one hot and make a pandas df
    dat_dict = dat.T.to_dict().values()
    vectorizer = DV(sparse=False)
    vectorizer.fit(dat_dict)
    dat = vectorizer.transform(dat_dict)
    dat = pd.DataFrame(dat)

    train_X = dat.iloc[folds[0], :]
    train_y = y[folds[0]]
    test_X = dat.iloc[folds[1], :]
    test_y = y[folds[1]]

    rf = RandomForestRegressor(n_estimators=1000, n_jobs=1, random_state=15)
    rf.fit(train_X, train_y)
    y_submission = rf.predict(test_X)
    print("Validation Sample Score: {:.10f} (normalized gini).".format(normalized_gini(test_y, y_submission)))
Method 2 (split first, then one-hot encode) returns: Validation Sample Score: 0.0055124452 (normalized gini).
from sklearn.cross_validation import StratifiedKFold, KFold, ShuffleSplit, train_test_split, PredefinedSplit
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer as DV
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from scipy.stats import randint, uniform
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston

def gini(solution, submission):
    # Unnormalized Gini: sort by prediction (descending), accumulate the actual
    # values, and sum the gap between that cumulative curve and the diagonal
    df = zip(solution, submission, range(len(solution)))
    df = sorted(df, key=lambda x: (x[1], -x[2]), reverse=True)
    rand = [float(i + 1) / float(len(df)) for i in range(len(df))]
    totalPos = float(sum([x[0] for x in df]))
    cumPosFound = [df[0][0]]
    for i in range(1, len(df)):
        cumPosFound.append(cumPosFound[len(cumPosFound) - 1] + df[i][0])
    Lorentz = [float(x) / totalPos for x in cumPosFound]
    Gini = [Lorentz[i] - rand[i] for i in range(len(df))]
    return sum(Gini)

def normalized_gini(solution, submission):
    normalized_gini = gini(solution, submission) / gini(solution, solution)
    return normalized_gini

# Normalized Gini Scorer
gini_scorer = metrics.make_scorer(normalized_gini, greater_is_better=True)

if __name__ == '__main__':
    dat = pd.read_table('/home/jma/Desktop/Data/Kaggle/liberty/train.csv', sep=",")
    y = dat[['Hazard']].values.ravel()
    dat = dat.drop(['Hazard', 'Id'], axis=1)

    folds = train_test_split(range(len(y)), test_size=0.3, random_state=15)  # 30% test

    # first split
    train_X = dat.iloc[folds[0], :]
    train_y = y[folds[0]]
    test_X = dat.iloc[folds[1], :]
    test_y = y[folds[1]]

    # One hot encode the training X and transform the test X
    dat_dict = train_X.T.to_dict().values()
    vectorizer = DV(sparse=False)
    vectorizer.fit(dat_dict)
    train_X = vectorizer.transform(dat_dict)
    train_X = pd.DataFrame(train_X)

    dat_dict = test_X.T.to_dict().values()
    test_X = vectorizer.transform(dat_dict)
    test_X = pd.DataFrame(test_X)

    rf = RandomForestRegressor(n_estimators=1000, n_jobs=1, random_state=15)
    rf.fit(train_X, train_y)
    y_submission = rf.predict(test_X)
    print("Validation Sample Score: {:.10f} (normalized gini).".format(normalized_gini(test_y, y_submission)))
I couldn't get your code to run, but my guess is that in the test dataset either
- you are not seeing all the levels of some of your categorical variables, so if you compute the dummy variables on that data alone you actually end up with different columns (see the sketch below), or
- you have the same columns but in a different order?
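As a hedged illustration of the first bullet (toy data, not the actual Liberty Mutual columns): a DictVectorizer fitted only on training rows defines its columns from those rows and silently ignores any categorical level it first meets at transform time.

from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False)
train_rows = [{'color': 'red', 'size': 1}, {'color': 'blue', 'size': 2}]
test_rows = [{'color': 'green', 'size': 3}]  # 'green' never appears in the training rows

vec.fit(train_rows)
print(sorted(vec.vocabulary_))   # columns are defined by the training data only
print(vec.transform(test_rows))  # the unseen 'color=green' dummy is silently dropped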
While the previous comment correctly suggests that it is best to map over the entire feature space first, in your case both the training and test sets contain all of the feature values in all of the columns.
If you compare vectorizer.vocabulary_ between the two versions, they are exactly the same, so there is no difference in the mapping. Hence, that cannot be what is causing the problem.
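For reference, here is a quick, hedged way to make that comparison yourself; it reuses the data path and split settings from the scripts above.

import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction import DictVectorizer as DV

# Fit one vectorizer on the full raw data and one on the training split only,
# then compare the learned feature -> column mappings
raw = pd.read_table('/home/jma/Desktop/Data/Kaggle/liberty/train.csv', sep=",").drop(['Hazard', 'Id'], axis=1)
folds = train_test_split(range(len(raw)), test_size=0.3, random_state=15)
vec_full = DV(sparse=False).fit(raw.T.to_dict().values())
vec_train = DV(sparse=False).fit(raw.iloc[folds[0], :].T.to_dict().values())
print(vec_full.vocabulary_ == vec_train.vocabulary_)  # True here: the two mappings are identical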
The reason Method 2 fails is that your dat_dict gets re-sorted by the original index when you execute this command:
dat_dict=train_X.T.to_dict().values()
In other words, train_X has a shuffled index going into this line of code. When you turn it into a dict, the dict order re-sorts into the numerical order of the original index. This causes your training and test data to be completely decorrelated from y.
Method 1 doesn't suffer from this problem, because you shuffle the data after the mapping.
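If you want to see the reordering in isolation, here is a minimal toy sketch. Note that whether it actually scrambles depends on your pandas/Python version: newer Python dicts preserve insertion order, while the older versions used here made no such guarantee.

import pandas as pd

# A frame with a shuffled index, like train_X right after train_test_split
df = pd.DataFrame({'a': [10, 20, 30]}, index=[2, 0, 1])
rows = list(df.T.to_dict().values())
# If the dict comes back keyed in index order (0, 1, 2) rather than the frame's
# order (2, 0, 1), the rows are silently misaligned with y
print(rows)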
You can fix this by adding a .reset_index() both times you assign dat_dict in Method 2, e.g.,
dat_dict=train_X.reset_index(drop=True).T.to_dict().values()
This ensures the data order is preserved when converting to a dict.
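Put together, the vectorizing block of Method 2 with the fix applied to both assignments looks roughly like this (same variable names as above):

# One hot encode the training X and transform the test X, resetting the index
# so the row order survives the round-trip through a dict
dat_dict = train_X.reset_index(drop=True).T.to_dict().values()
vectorizer = DV(sparse=False)
vectorizer.fit(dat_dict)
train_X = pd.DataFrame(vectorizer.transform(dat_dict))

dat_dict = test_X.reset_index(drop=True).T.to_dict().values()
test_X = pd.DataFrame(vectorizer.transform(dat_dict))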
When I add that code, I get the following results:
- Method 1: Validation Sample Score: 0.3454355044 (normalized gini)
- Method 2: Validation Sample Score: 0.3438430991 (normalized gini)