Merging results from Prediction to Original Data frame?
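I have finished a machine learning algorithm that classifies categories based on text. I am 99% done, but I am not sure how to merge my predicted results back into the original data frame so that I can see a printed view of what I started with alongside what was predicted.
Below is my code.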
#imports data from excel file and shows first 5 rows of data
file_name = r'C:\Users\aac1928\Documents\Machine Learning\Training Data\RFP Training Data.xlsx'
sheet = 'Sheet1'
import pandas as pd
import numpy
import xlsxwriter
import sklearn
df = pd.read_excel(io=file_name,sheet_name=sheet)
#extracts specific columns from data
data = df.iloc[: , [0,2]]
print(data)
#Gets data ready for model
newdata = df.iloc[:,[1,2]]
newdata = newdata.rename(columns={'Label':'label'})
newdata = newdata.rename(columns={'RFP Question':'question'})
print(newdata)
# how to define X and y for use with CountVectorizer
X = newdata.question
y = newdata.label
print(X.shape)
print(y.shape)
# split X and y into training and testing sets
X_train = X
y_train = y
X_test = newdata.question[:50]
y_test = newdata.label[:50]
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
# equivalently: combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm
# import and instantiate a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
# train the model using X_train_dtm
%time logreg.fit(X_train_dtm, y_train)
# make class predictions for X_test_dtm
y_pred_class = logreg.predict(X_test_dtm)
y_pred_class
# calculate predicted probabilities for X_test_dtm (well calibrated)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
y_pred_prob
# calculate accuracy (metrics must be imported before use)
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)
This is the new data I added to predict on, with the same length as the array:
# split X and y into training and testing sets
X_train = X
y_train = y
X_testnew = dfpred.question
y_testnew = dfpred.label
print(X_train.shape)
print(X_testnew.shape)
print(y_train.shape)
print(y_testnew.shape)
(447,)
(168,)
(447,)
(168,)
# transform new testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm_new = vect.transform(X_testnew)
X_test_dtm_new
<168x1382 sparse matrix of type ''
with 2240 stored elements in Compressed Sparse Row format>
# make class predictions for new X_test_dtm
y_pred_class_new = nb.predict(X_test_dtm_new)
y_pred_class_new
array([ 3, 3, 19, 18, 5, 10, 10, 5, 19, 3, 3, 3, 5, 3, 3, 3, 3,
9, 19, 5, 5, 10, 9, 5, 18, 19, 9, 9, 19, 19, 18, 18, 18, 4,
18, 3, 9, 18, 19, 19, 18, 19, 5, 19, 19, 3, 3, 18, 18, 5, 18,
3, 4, 5, 6, 4, 5, 19, 19, 5, 5, 19, 19, 4, 5, 18, 5, 5,
19, 5, 18, 5, 19, 18, 19, 5, 7, 5, 9, 9, 9, 9, 10, 9, 9,
5, 5, 5, 5, 3, 18, 4, 9, 5, 3, 6, 9, 18, 7, 5, 9, 5,
5, 19, 5, 5, 19, 5, 6, 5, 5, 6, 9, 21, 10, 9, 18, 9, 9,
3, 18, 5, 6, 18, 6, 3, 6, 5, 18, 6, 5, 18, 5, 6, 7, 7,
5, 7, 19, 18, 6, 5, 5, 5, 5, 5, 19, 16, 5, 19, 5, 5, 5,
5, 19, 5, 7, 19, 6, 7, 3, 18, 18, 18, 6, 19, 19, 7],
dtype=int64)
# calculate predicted probabilities for X_test_dtm (well calibrated)
y_pred_prob_new = logreg.predict_proba(X_test_dtm_new)[:, 1]
y_pred_prob_new
df['prediction'] = pd.Series(y_pred_class_new)
dfout = pd.merge(dfpred,df['prediction'].dropna() .to_frame(),how = 'left',left_index = True, right_index = True)
print(dfout)
I hope this makes things as clear as possible.
Since your predictions are just an array, I think you're better off simply using:
df['predictions'] = y_pred_class
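I think your problem is that your predictions array is shorter than the original df, because you split the data into training and test sets.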
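To illustrate the length mismatch, here is a minimal sketch with made-up data (the small `df` and `y_pred` below are hypothetical stand-ins, not the data from the question). Assigning a plain array of the wrong length raises an error, while wrapping it in a `pd.Series` aligns on the index and quietly leaves NaN in the unmatched rows:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: a 5-row frame and predictions for only 3 rows.
df = pd.DataFrame({"question": ["a", "b", "c", "d", "e"]})
y_pred = np.array([3, 5, 9])

try:
    # A plain array must have exactly one value per row.
    df["predictions"] = y_pred
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# A Series is aligned on its index (0, 1, 2 here) instead of raising,
# so rows without a matching index label are filled with NaN.
df["predictions"] = pd.Series(y_pred)
print(df)
```

This may be why the `pd.Series(...)` assignment in the question runs without an error but needs a `dropna()` afterwards.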
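Your X_test array, which you defined as newdata.question[:50], takes the first 50 rows of that column (the last 50 rows would be newdata.question[-50:]).
What I would do is create a prediction_df that is the same length as your predictions array. In your case, the rows you need are the first 50 rows of the original df.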
prediction_df = df.iloc[:50].copy()  # copy so the new column is not set on a slice of df
prediction_df['predictions'] = y_pred_class
Just make sure the rows of your prediction_df match the rows you used to create X_test!
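One way to keep the rows matched is to give the predictions the same index as the rows they were made from, and then join on that index. The sketch below uses a small made-up `dfpred` and a hard-coded array in place of a real `predict()` call; both are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the frame that was predicted on.
dfpred = pd.DataFrame(
    {"question": ["q1", "q2", "q3"], "label": [3, 5, 9]},
    index=[10, 11, 12],  # deliberately not 0..n-1
)

# Hypothetical stand-in for the array returned by model.predict(...).
y_pred_class_new = np.array([3, 5, 18])

# Carry the index of the predicted rows along with the predictions,
# so the join matches row to row even when the index is not 0..n-1.
predictions = pd.Series(y_pred_class_new, index=dfpred.index, name="prediction")
dfout = dfpred.join(predictions)
print(dfout)
```

Because the Series reuses `dfpred.index`, this works the same way for a slice of the original frame, as long as the predictions are in the same order as the rows passed to the vectorizer.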