How to stratify the training and testing data in Scikit-Learn?

Question

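I am trying to implement a classification algorithm for the Iris dataset (downloaded from Kaggle). In the Species column, the classes (Iris-setosa, Iris-versicolor, Iris-virginica) appear in sorted order. How can I stratify the training and testing data using Scikit-Learn?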

Answer 1

Use sklearn.model_selection.train_test_split with its shuffle parameter.

shuffle: boolean, optional (default=True). Whether or not to shuffle the data before splitting. If shuffle=False, then stratify must be None.
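As a rough illustration of why shuffling matters for label-sorted data like this, the sketch below uses a toy stand-in for the Iris labels (three classes of 50 samples each, in sorted order; the arrays are made up for the example). It shows that an unshuffled split puts only the last class in the test set, and that combining shuffle=False with stratify raises a ValueError.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the sorted Iris labels: 50 samples for each of 3 classes.
X = np.arange(150).reshape(-1, 1)
y = np.repeat([0, 1, 2], 50)

# Without shuffling, the test set is simply the tail of the sorted data.
_, _, _, y_test_unshuffled = train_test_split(X, y, test_size=0.2, shuffle=False)
unshuffled_classes = sorted(set(y_test_unshuffled.tolist()))
print(unshuffled_classes)  # only the last class appears

# Stratification requires shuffling, so this combination is rejected.
try:
    train_test_split(X, y, test_size=0.2, shuffle=False, stratify=y)
    stratify_without_shuffle_ok = True
except ValueError:
    stratify_without_shuffle_ok = False
print(stratify_without_shuffle_ok)
```

Note that shuffle=True on its own only randomizes the order; it does not guarantee that each class keeps its share in both splits.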

Answer 2

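To make sure all three classes are proportionally represented in both your training and test sets, you can use the stratify parameter of the train_test_split function.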

from sklearn.model_selection import train_test_split
# train_test_split returns the splits in the order X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

This keeps the class proportions in the training and test sets the same as in y.
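A minimal way to check this, assuming the copy of Iris bundled with scikit-learn (load_iris) as a stand-in for the Kaggle CSV, and with test_size and random_state chosen only for the example:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Bundled Iris data: 150 rows, 50 per class, labels in sorted order.
X, y = load_iris(return_X_y=True)

# Stratify on y so each split keeps the 1:1:1 class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

train_counts = np.bincount(y_train).tolist()
test_counts = np.bincount(y_test).tolist()
print(train_counts, test_counts)
```

With 150 balanced samples and a 20% test split, each class should contribute 40 training rows and 10 test rows.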

Answer 3

If you want to shuffle and split your data with a test ratio of 0.3, you can use

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)

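where X is your data, y is the corresponding labels, test_size is the fraction of the data to hold out for testing, and shuffle=True shuffles the data before splitting.

To make sure the data is split proportionally with respect to a particular column, you can pass that column to the stratify parameter: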

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    shuffle=True,
                                                    stratify=X['YOUR_COLUMN_LABEL'])
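For a concrete picture of stratifying on a DataFrame column, here is a small sketch with a made-up frame. The column names "feature" and "group", the target values, and random_state are all hypothetical and only serve the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: "group" is the column whose proportions should be preserved.
X = pd.DataFrame({
    "feature": range(20),
    "group": ["a"] * 10 + ["b"] * 10,
})
y = pd.Series([0, 1] * 10, name="target")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, stratify=X["group"], random_state=0
)

# Count how many rows of each group landed in each split.
train_group_counts = X_train["group"].value_counts().to_dict()
test_group_counts = X_test["group"].value_counts().to_dict()
print(train_group_counts, test_group_counts)
```

Both groups make up half of the original frame, so they should also make up half of the 14 training rows and half of the 6 test rows. For a classification task like Iris, the column to stratify on would usually be the label itself, that is stratify=y.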

python

machine-learning

pandas

scikit-learn

multiclass-classification