Feature mismatch with OneHotEncoder when predicting a single data instance

Question

How do I use OneHotEncoder when predicting on a single value?

Error message: ValueError: Number of features of the model must match the input. Model n_features is 1261 and input n_features is 16

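I am training a random forest classifier on text data. For each instance of this text data I compute 16 features. Since all 16 variables are categorical, I encode each of them with OneHotEncoder, which yields a training matrix with 1261 columns. I also apply feature scaling to these columns. I did an 80:20 train:test split of my data and applied the predictor to get the confusion matrix and the classification report. I also save the classifier, the standard scaler object and the OneHotEncoder object to my local disk in pickle format.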

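Now I want to create a (REST) service for the predictor in a new, separate file. The API will load the saved model in .pkl format and predict on a new single text value, basically returning its predicted class name and the corresponding confidence score.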

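The problem I am facing is this: when I encode this single text value, I get a vector with 16 features. It is not encoded into 1261 features. So when I run the classifier's predict() function on this new text, it gives me the following error: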

% (self.n_features_, n_features)) ValueError: Number of features of the model must match the input. Model n_features is 1261 and input n_features is 16

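How can I use the deserialized .pkl model to predict a single instance when the encoded matrix does not match the size the classifier was trained with? How can this be solved?

Edit: posting the code snippet and the exception stack as well: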

# Loading the .pkl files used in training
with open('model.pkl', 'rb') as f_model:
    classifier = pickle.load(f_model) # trained classifier model

with open('labelencoder_file.pkl', 'rb') as f_lblenc:
    label_encoder = pickle.load(f_lblenc) # label encoder object used in training

with open('encoder_file.pkl', 'rb') as f_onehotenc:
    onehotencoder = pickle.load(f_onehotenc) # onehotencoder object used in training

with open('sc_file.pkl', 'rb') as f_sc:
    scaler = pickle.load(f_sc) # standard scaler object used in training

X = df_features # df_features is the dataframe containing the computed feature values. It has 16 columns as 16 features have been computed for the new value
X.values[:, 0] = label_encoder.fit_transform(X.values[:, 0])
X.values[:, 1] = label_encoder.fit_transform(X.values[:, 1])
# This is repeated  till X.values[:, 15] as all features are categorical

X = onehotencoder.fit_transform(X).toarray()
X = scaler.fit_transform(X)
print(X.shape) # This prints (1, 16), thus showing that encoding has not worked properly

y_pred = classifier.predict(X) # This throws the exception

Traceback (most recent call last):
  File "/home/Test/api.py", line 256, in api_func()
    y_pred = classifier.predict(X)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/ensemble/forest.py", line 538, in predict
    proba = self.predict_proba(X)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/ensemble/forest.py", line 578, in predict_proba
    X = self._validate_X_predict(X)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/ensemble/forest.py", line 357, in _validate_X_predict
    return self.estimators_[0]._validate_X_predict(X, check_input=True)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/tree/tree.py", line 384, in _validate_X_predict
    % (self.n_features_, n_features))
ValueError: Number of features of the model must match the input. Model n_features is 1261 and input n_features is 16
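
The shape difference described above can be reproduced on a toy example. The sketch below is only an illustration: the data is made up, it uses 2 categorical columns rather than the 16 real features, and it assumes a scikit-learn version in which OneHotEncoder accepts string columns directly (0.20 or later). It compares the width of the encoded matrix when an encoder is refit on a single row with the width when the already fitted encoder is reused.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: 4 rows, 2 categorical features.
X_train = np.array(
    [
        ["red", "small"],
        ["green", "large"],
        ["blue", "medium"],
        ["red", "large"],
    ],
    dtype=object,
)

# Fit on the training data: 3 colours + 3 sizes = 6 encoded columns.
encoder = OneHotEncoder()
train_width = encoder.fit_transform(X_train).toarray().shape[1]

# A single new row, as it would arrive at prediction time.
X_new = np.array([["green", "small"]], dtype=object)

# Refitting on the single row learns exactly one category per column,
# so the output has as many columns as there are input features.
refit_width = OneHotEncoder().fit_transform(X_new).toarray().shape[1]

# Reusing the encoder fitted on the training data keeps the training layout.
reused_width = encoder.transform(X_new).toarray().shape[1]

print(train_width, refit_width, reused_width)  # 6 2 6
```

With 16 categorical features, the refit case gives 16 columns, which matches the (1, 16) shape printed in the snippet above.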

Answer 1

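Posting the modified code that solved the problem. The change is to call transform() on the encoders loaded from disk instead of fit_transform(), which refits them on the single new row. A separate LabelEncoder is persisted for each categorical column: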

import pickle

# Load the .pkl files that were persisted during training
with open('model.pkl', 'rb') as f_model:
    classifier = pickle.load(f_model) # trained classifier model

with open('labelencoder00.pkl', 'rb') as f_lblenc00:
    label_encoder00 = pickle.load(f_lblenc00) # LabelEncoder() object that was used for encoding the first categorical variable
with open('labelencoder01.pkl', 'rb') as f_lblenc01:
    label_encoder01 = pickle.load(f_lblenc01) # LabelEncoder() object that was used for encoding the second categorical variable

with open('onehotencoder.pkl', 'rb') as f_onehotenc:
    onehotencoder = pickle.load(f_onehotenc) # OneHotEncoder object that was used in training


X = df_features # df_features is the dataframe containing the computed feature values
X.values[:, 0] = label_encoder00.transform(X.values[:, 0])
X.values[:, 1] = label_encoder01.transform(X.values[:, 1])

X = onehotencoder.transform(X).toarray()

pred = classifier.predict(X)
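
One way to avoid juggling several .pkl files is to bundle the encoder and the classifier into a single Pipeline and pickle that one object. The following is a minimal sketch under stated assumptions, not the code used above: the data, labels and step names ("onehot", "forest") are invented for illustration, OneHotEncoder is assumed to accept string columns directly (scikit-learn 0.20 or later, so no LabelEncoder step is needed), and handle_unknown="ignore" is an optional choice so that a category never seen in training does not raise an error. It also shows how predict_proba can supply the confidence score asked for in the question.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: 2 categorical features and a class label.
X_train = np.array(
    [
        ["red", "small"],
        ["green", "large"],
        ["blue", "medium"],
        ["red", "large"],
        ["green", "small"],
        ["blue", "large"],
    ],
    dtype=object,
)
y_train = np.array(["a", "b", "a", "b", "a", "b"])

# Encoder and classifier are fitted together and travel as one object.
pipeline = Pipeline(
    [
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ("forest", RandomForestClassifier(n_estimators=10, random_state=0)),
    ]
)
pipeline.fit(X_train, y_train)

# Serialize and restore in memory; a service would use pickle.dump/load on a file.
loaded = pickle.loads(pickle.dumps(pipeline))

# Predict a single new instance: the pipeline only calls transform() on the encoder.
X_new = np.array([["green", "small"]], dtype=object)
proba = loaded.predict_proba(X_new)[0]
best = int(np.argmax(proba))
predicted_class = loaded.named_steps["forest"].classes_[best]
confidence = float(proba[best])

print(predicted_class, confidence)
```

The scaler is left out of this sketch because tree-based models such as random forests do not depend on feature scaling; if it is kept, it would be added as another pipeline step so that it is also applied with transform() at prediction time.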


machine-learning

random-forest

scikit-learn