scikit-learn: does the order of fit and predict inputs matter?
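I've just started using this library and ran into some trouble with RandomForestClassifier (I have read the documentation but couldn't figure it out).
My question is simple. Suppose I have a training dataset like
A B C
1 2 3
where A is the dependent variable (y) and B and C are the independent variables (x). Suppose the test set looks the same, but the column order is
B A C
1 2 3
When I call forest.fit(train_data[0:,1:],train_data[0:,0])
do I then need to reorder the test set to match this order before running predict? (Ignore the fact that I need to remove the already predicted y value (A); let's just say B and C are out of order...)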
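Yes, you need to reorder. Imagine a simpler case, linear regression. The algorithm computes a weight for each feature; if feature 1 is unimportant, for example, it is assigned a weight close to 0.
If the order is different at prediction time, an important feature gets multiplied by this near-zero weight, and the prediction will be completely off.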
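The linear regression argument above can be checked with a small sketch. It uses NumPy's least-squares solver rather than scikit-learn, and the data, the coefficient 5.0 and all variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: column 0 is irrelevant, column 1 drives the target.
X_train = rng.normal(size=(200, 2))
y_train = 5.0 * X_train[:, 1]

# Ordinary least squares (no intercept, to keep the sketch short).
# The fitted weights come out close to [0, 5].
weights, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

x_new = np.array([[0.3, 2.0]])   # same column order as in training
x_swapped = x_new[:, ::-1]       # columns accidentally swapped

pred_ok = (x_new @ weights)[0]        # close to 5.0 * 2.0
pred_bad = (x_swapped @ weights)[0]   # close to 5.0 * 0.3
print(weights, pred_ok, pred_bad)
```

With the columns swapped, the important value 2.0 is multiplied by the near-zero weight, so the prediction is driven by the irrelevant value instead.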
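The answer above is correct. scikit-learn simply takes the data in whatever order you give it, so you must make sure the columns are in the same order during training and prediction.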
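For plain NumPy arrays such as the train_data in the question, there are no column names attached, so the order has to be tracked by hand. A minimal sketch, assuming you keep the column names in separate lists (train_columns, test_columns and the sample values are hypothetical):

```python
import numpy as np

# Hypothetical column layouts, mirroring the question.
train_columns = ["A", "B", "C"]   # A is the target, B and C are features
test_columns = ["B", "A", "C"]

train_data = np.array([[1, 2, 3],
                       [0, 5, 6]])
test_data = np.array([[2, 1, 3],
                      [5, 0, 6]])

feature_columns = train_columns[1:]   # the columns passed to fit: B, C

# For each training feature, find where it sits in the test array.
positions = [test_columns.index(name) for name in feature_columns]

X_train = train_data[:, 1:]
X_test = test_data[:, positions]      # now in the same B, C order as X_train
```

Selecting by position list also drops the A column from the test array, which covers the "remove the y value" step the question sets aside.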
Here is a simple example.
Training time:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
x = pd.DataFrame({
'feature_1': [0, 0, 1, 1],
'feature_2': [0, 1, 0, 1]
})
y = [0, 0, 1, 1]
model.fit(x, y)
# we now have a model that
# (i) predicts 0 when x = [0, 0] or [0, 1], and
# (ii) predicts 1 when x = [1, 0] or [1, 1]
Prediction time:
# positive example
http_request_payload = {
'feature_1': 0,
'feature_2': 1
}
input_features = pd.DataFrame([http_request_payload])
model.predict(input_features) # this returns 0, as expected
# negative example
http_request_payload = {
'feature_2': 1, # notice that the order is jumbled up
'feature_1': 0
}
input_features = pd.DataFrame([http_request_payload])
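model.predict(input_features) # on older scikit-learn versions this returns 1, when it should have returned 0.
# those versions don't care about the key-value mapping of the features;
# they simply vectorize the dataframe in whatever order it comes in.
# (newer releases record the DataFrame column names at fit time and reject this input with an error instead.)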
This is how I cache the column order during training so that I can use it at prediction time.
# training
import joblib
import pandas as pd
x = pd.DataFrame([...])
column_order = list(x.columns) # plain list of column names, in training order
model = SomeModel().fit(x, y) # train model
# save the things that we need at prediction time. you can also use pickle if you don't want to pip install joblib
joblib.dump(model, 'my_model.joblib')
joblib.dump(column_order, 'column_order.joblib')
# prediction: load the artifacts from disk (same file names as above)
model = joblib.load('my_model.joblib')
column_order = joblib.load('column_order.joblib')
# imaginary http request payload
request_payload = { 'feature_1': ..., 'feature_2': ... }
# build a one-row dataframe and put its columns in the training order (using column_order);
# reindex also adds any column missing from the payload as NaN.
# DataFrame.append was removed in pandas 2.0, so it is not used here.
input_features = pd.DataFrame([request_payload]).reindex(columns=column_order)
input_features = input_features.fillna(0) # handle any missing data however you like
model.predict(input_features)