sklearn.cross_validation.StratifiedShuffleSplit - error: "indices are out-of-bounds"
I am trying to split a sample dataset using Scikit-learn's StratifiedShuffleSplit. I followed the example shown in the Scikit-learn documentation.
import pandas as pd
import numpy as np
# UCI's wine dataset
wine = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
# separate target variable from dataset
target = wine['quality']
data = wine.drop('quality',axis = 1)
# Stratified Split of train and test data
from sklearn.cross_validation import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(target, n_iter=3, test_size=0.2)
for train_index, test_index in sss:
    xtrain, xtest = data[train_index], data[test_index]
    ytrain, ytest = target[train_index], target[test_index]
# Check target series for distribution of classes
ytrain.value_counts()
ytest.value_counts()
However, when I run this script, I get the following error:
IndexError: indices are out-of-bounds
Can someone point out what I am doing wrong here? Thanks!
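You are running into the different conventions for Pandas DataFrame indexing versus NumPy ndarray indexing. The arrays train_index and test_index are collections of row indices. But data is a Pandas DataFrame object, and when you use a single index on that object, as in data[train_index], Pandas expects train_index to contain column labels rather than row indices. You can either convert the DataFrame to a NumPy array, using .values: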
data_array = data.values
for train_index, test_index in sss:
    xtrain, xtest = data_array[train_index], data_array[test_index]
    ytrain, ytest = target[train_index], target[test_index]
Or use the Pandas .iloc accessor:
for train_index, test_index in sss:
    xtrain, xtest = data.iloc[train_index], data.iloc[test_index]
    ytrain, ytest = target[train_index], target[test_index]
I tend to favour the second approach, since it gives xtrain and xtest of type DataFrame rather than ndarray, and so keeps the column labels.
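For readers on a newer scikit-learn: the sklearn.cross_validation module was removed in version 0.20, and StratifiedShuffleSplit now lives in sklearn.model_selection, where it takes n_splits and yields indices from a split(X, y) call. The following is a minimal sketch of the .iloc approach under that API. The small synthetic DataFrame and its column names are assumptions standing in for the wine data, and .iloc is also applied to the target so the code does not rely on the Series having a default integer index.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the wine data (hypothetical columns and values).
rng = np.random.RandomState(0)
data = pd.DataFrame({"alcohol": rng.rand(50), "pH": rng.rand(50)})
target = pd.Series(np.repeat([5, 6], 25), name="quality")

# Bracket indexing with a list of integers is a column-label lookup,
# so it fails here: the columns are named "alcohol" and "pH".
try:
    data[[0, 1, 2]]
    bracket_lookup_failed = False
except (KeyError, IndexError):  # exception type depends on the pandas version
    bracket_lookup_failed = True

# Positional row selection with .iloc works for both features and target.
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
for train_index, test_index in sss.split(data, target):
    xtrain, xtest = data.iloc[train_index], data.iloc[test_index]
    ytrain, ytest = target.iloc[train_index], target.iloc[test_index]
```

After the loop, xtrain and xtest are still DataFrames with their column labels, and ytest.value_counts() can be used to check that each class is represented in proportion.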