What am I doing wrong when training a model?

Question

I am working on the following problem:

We have collected more data on cats and dogs, and are ready to train our robot to classify them! Download a training dataset https://stepik.org/media/attachments/course/4852/dogs_n_cats.csv and train the Decision Tree on it. After that, download the dataset from the assignment and predict which observations belong to whom. Enter the number of dogs in your dataset. A certain error is allowed in the assignment.

I trained the model:

import sklearn
import pandas as pd
import numpy as nm
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import tree
from sklearn.model_selection import train_test_split, cross_val_score

df = pd.read_csv('dogs_n_cats.csv')

X = df.drop(['Вид', 'Шерстист'], axis=1)
y = df['Вид']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.67, random_state=42)

clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=4)
clf.fit(X_train, y_train)

After that, I downloaded the dataset from the assignment (https://stepik.org/api/attempts/540562013/file) and started counting the dogs in it:

df2 = pd.read_json('we.txt')

X2 = df.drop(['Вид', 'Шерстист'], axis=1)
y2 = df['Вид']
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, train_size=0.67, random_state=42)

df2_predict = clf.predict(X2)
l = list(df2_predict)
l.count('собачка')

The number of dogs in the assignment should be 49, but after running l.count('собачка') I get 500. What am I doing wrong when training the model?

Answer 1

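This looks like a typo. In your snippet you build X2 from the first dataframe (df) instead of the second one (df2), so the model predicts on the training data again.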

I don't have access to the second file, but changing this line should fix the problem:

X2 = df.drop(['Вид', 'Шерстист'], axis=1)
-->
X2 = df2.drop(['Вид', 'Шерстист'], axis=1)

Apart from that, you are already given separate training and test datasets, so none of the calls to train_test_split should be necessary.
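To make the idea concrete, here is a minimal sketch of the corrected flow: fit on the whole training file, then predict on the second dataframe. Since I can't see your files, it uses small synthetic dataframes in place of dogs_n_cats.csv and the assignment file. The column names feature_a and feature_b and the label 'котик' are my own placeholders, not taken from your data. I also select the features with new_df[X.columns] instead of drop, on the assumption that the assignment file may not contain the 'Вид' column at all (in which case drop would raise a KeyError).

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for dogs_n_cats.csv.
# 'feature_a', 'feature_b' and the label 'котик' are hypothetical placeholders.
train_df = pd.DataFrame({
    'feature_a': [1, 1, 0, 0, 1, 0],
    'feature_b': [0, 0, 1, 1, 0, 1],
    'Шерстист': [1, 0, 1, 0, 1, 1],
    'Вид': ['собачка', 'собачка', 'котик', 'котик', 'собачка', 'котик'],
})

# Synthetic stand-in for the assignment file (assumed to have no label column).
new_df = pd.DataFrame({
    'feature_a': [1, 0, 1],
    'feature_b': [0, 1, 0],
    'Шерстист': [1, 1, 0],
})

# Same feature selection as in the question, applied to the training data.
X = train_df.drop(['Вид', 'Шерстист'], axis=1)
y = train_df['Вид']

# Fit on all of the training data; no train_test_split is needed here.
clf = DecisionTreeClassifier(criterion='entropy', max_depth=4, random_state=42)
clf.fit(X, y)

# Build the prediction features from the NEW dataframe,
# using exactly the columns the model was trained on.
X2 = new_df[X.columns]
predictions = clf.predict(X2)

# Count how many rows were classified as dogs.
n_dogs = int((predictions == 'собачка').sum())
print(n_dogs)
```

With your real files you would replace the two synthetic dataframes with pd.read_csv('dogs_n_cats.csv') and pd.read_json('we.txt'). The count should then be based on as many rows as the assignment file contains, not on the 1000 rows of the training set.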
