How should I predict the target variable if it is not included in the test data for a binary classification task

Question

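I have a binary classification task with 2 datasets (train.csv and test.csv). The train data contains the independent variables (x1, x2, x3) and the target variable (y), whereas the test data contains only the independent variables. I want to make predictions (logistic regression) on both datasets. The only problem is that my test data has no target variable. I'm not sure how to approach this task, since my data are already split and the two sets have different numbers of rows. How can I make predictions on the test set if the target variable is missing? Sample data is below. You can use any module to demonstrate this, e.g. sklearn; it doesn't matter. I just want to see the approach.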

import pandas as pd

data1 = {'x1': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Male', 'Female', 'Male', 'Female'],
    'x2': [13, 20, 21, 19, 18, 78, 22, 33, 56, 10],
    'x3': [335.5, 455.3, 109.4, 228.0, 220.9, -1.223, 700.4, 446.9, 499.1, 776.4],
    'y': [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
   }

train = pd.DataFrame(data1)
train

data2 = {'x1': ['Female', 'Female', 'Male', 'Male', 'Male'],
    'x2': [16, 20, 33, 29, 18],
    'x3': [235.1, 395.0, 290.3, 118.6, 345.1]
   }

test = pd.DataFrame(data2)
test

Answer 1

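As the name suggests, the test dataset is only used for evaluating/testing your model. Your task is to learn a model from the training dataset and use it to generate predictions for the test data. During training, you fit the model using the annotations/labels given in the training dataset (the target variable y in your case).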

You can read more about this concept, for example here.

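In your case, where the goal is to learn a logistic regression model, you use the pairs ((x1, x2, x3), y) given in the training dataset to learn the model parameters. Once the model is trained, you can create predictions for new data. So for each data point (x1, x2, x3) in your test dataset, you feed the features into the model and get back a predicted class y.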

Using sklearn and the sample data you provided:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# data1 and data2 are the dictionaries defined in the question
train = pd.DataFrame(data1)
test = pd.DataFrame(data2)

# create np.arrays from the training data; x1 is encoded as 1 for 'Male', 0 for 'Female'
X_train = np.array([(train['x1'] == 'Male').astype(int), train['x2'], train['x3']]).T
y_train = np.array(train['y'])  # labels

# train the model = fit the logistic function to the training data
model = LogisticRegression().fit(X_train, y_train)

# create predictions on the test set, using the same encoding as for training
X_test = np.array([(test['x1'] == 'Male').astype(int), test['x2'], test['x3']]).T
y_pred = model.predict(X_test)  # predicted y-labels from the learned model
print(y_pred)
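The same train-then-predict flow can also be written with a scikit-learn pipeline, so that the encoding of x1 and the scaling of x2 and x3 are learned on the training data and reused unchanged on the test data. This is only a minimal sketch under my own assumptions: the one-hot encoding, the standard scaling, and the column names y_pred and p_y1 are illustrative choices, not something the question asks for.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Sample data copied from the question
train = pd.DataFrame({
    'x1': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Male', 'Female', 'Male', 'Female'],
    'x2': [13, 20, 21, 19, 18, 78, 22, 33, 56, 10],
    'x3': [335.5, 455.3, 109.4, 228.0, 220.9, -1.223, 700.4, 446.9, 499.1, 776.4],
    'y': [1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
})
test = pd.DataFrame({
    'x1': ['Female', 'Female', 'Male', 'Male', 'Male'],
    'x2': [16, 20, 33, 29, 18],
    'x3': [235.1, 395.0, 290.3, 118.6, 345.1],
})

features = ['x1', 'x2', 'x3']

# Assumption: one-hot encode the categorical column, standardize the numeric ones
preprocess = ColumnTransformer([
    ('categorical', OneHotEncoder(handle_unknown='ignore'), ['x1']),
    ('numeric', StandardScaler(), ['x2', 'x3']),
])

# Fit preprocessing and logistic regression on the labeled training rows only
model = make_pipeline(preprocess, LogisticRegression())
model.fit(train[features], train['y'])

# The test set has no y column, so the model supplies one
test_with_pred = test.copy()
test_with_pred['y_pred'] = model.predict(test[features])             # hard 0/1 labels
test_with_pred['p_y1'] = model.predict_proba(test[features])[:, 1]  # estimated P(y = 1)
print(test_with_pred)
```

The differing row counts of train and test do not matter here: the model only requires that both frames have the same feature columns.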

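Typically, the predictions for the test dataset are then submitted somewhere, where they are scored to evaluate how well your model classifies the data.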
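Because the test labels are not available locally, one common way to get a rough performance estimate before submitting is to cross-validate on the training data. The sketch below assumes 3 stratified folds, accuracy as the metric, and standard scaling; all of these are my choices for illustration. With only 10 rows the resulting numbers are not meaningful and only show the mechanics.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Training data copied from the question
train = pd.DataFrame({
    'x1': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Male', 'Female', 'Male', 'Female'],
    'x2': [13, 20, 21, 19, 18, 78, 22, 33, 56, 10],
    'x3': [335.5, 455.3, 109.4, 228.0, 220.9, -1.223, 700.4, 446.9, 499.1, 776.4],
    'y': [1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Same encoding as in the answer: 1 for 'Male', 0 for 'Female'
X = train[['x1', 'x2', 'x3']].assign(x1=lambda df: (df['x1'] == 'Male').astype(int))
y = train['y']

# Stratified folds keep at least one positive example in every validation fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
estimator = make_pipeline(StandardScaler(), LogisticRegression())

# Each fold trains on part of the labeled data and scores on the held-out part
scores = cross_val_score(estimator, X, y, cv=cv, scoring='accuracy')
print(scores, scores.mean())
```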

In short: the training dataset is used to learn/fit the model, and the test dataset is used to evaluate its performance.


python

pandas

data-science