Can I use a numpy array to generate folds for cross validation?
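I want to use a numpy array to build folds for a k-fold cross-validation task. Taking out the test slice is easy, but I can't figure out how to return the remainder of the array with the test slice omitted. Is there an efficient way to do this?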
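import numpy as np

examples = range(50)
classes = range(50)
data = np.array(list(zip(classes, examples)))  # list() is needed on Python 3, where zip is lazy
test_slice = data[5:10]
# train_on_remainder = ??  (everything in data except rows 5..9)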
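You can set it up like this:
test_size = 5  # rows per fold (the test slice in the question has 5 rows)
# One full copy of the data, split into a test block and the remainder
test_slice, remainder = np.split(data.copy(), [test_size], axis=0)
# run test
# Swap the test block with the first segment of the remainder (partial copy)
remainder[:test_size], test_slice = test_slice, remainder[:test_size].copy()
# run test
# Swap the test block with the second segment
remainder[test_size:2*test_size], test_slice = test_slice, remainder[test_size:2*test_size].copy()
# etc.
I don't think you can get away with less copying.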
How it works:
full set:                | 0 | 1 | 2 | 3 | 4 | 5 |
split (full copy)          /   \
tst / rem       | 0 |    | 1 | 2 | 3 | 4 | 5 |
run trial
                         | 1 | 2 | 3 | 4 | 5 |
swap tst and               ^ |
first segment:             | v
(partial copy)           | 0 |
tst / rem       | 1 |    | 0 | 2 | 3 | 4 | 5 |
run trial
                         | 0 | 2 | 3 | 4 | 5 |
swap tst and                   ^ |
second segment:                | v
(partial copy)               | 1 |
tst / rem       | 2 |    | 0 | 1 | 3 | 4 | 5 |
run trial
                         | 0 | 1 | 3 | 4 | 5 |
swap tst and                       ^ |
third segment:                     | v
(partial copy)                   | 2 |
And so on. As you can see, it almost literally shifts the fold around, which saves many full copies.
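The swapping scheme above can be wrapped in a generator. This is only a sketch under stated assumptions: the name rotating_folds is hypothetical, the number of rows is assumed to be a multiple of test_size, and the remainder array is reused between folds, so each pair should be consumed before asking for the next one.

```python
import numpy as np

def rotating_folds(data, test_size):
    """Yield (test, remainder) pairs, swapping one segment per fold.

    Hypothetical helper: assumes len(data) is a multiple of test_size.
    The remainder array is modified in place between folds.
    """
    n_folds = len(data) // test_size
    work = data.copy()                     # the single full copy
    test = work[:test_size].copy()         # first test block
    remainder = work[test_size:]           # view into the working copy
    yield test, remainder
    for i in range(n_folds - 1):
        seg = slice(i * test_size, (i + 1) * test_size)
        next_test = remainder[seg].copy()  # partial copy of the next test block
        remainder[seg] = test              # put the old test block back
        test = next_test
        yield test, remainder

# Usage on the small example from the diagram
for test, rem in rotating_folds(np.arange(6), test_size=1):
    print(test, rem)
```

The original array is never modified; only the working copy is shuffled around.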
It's a slightly odd question, since people would normally use sklearn's train_test_split() if it is available.
Edit: another approach might be
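r = np.arange(len(data))
# Parentheses are required because | binds tighter than the comparisons;
# r >= 10 keeps row 10, which is not part of the test slice data[5:10]
trainX = data[(r < 5) | (r >= 10)]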
I'm not sure this is an efficient solution, but try this: build the indexer with a list comprehension.
def indx(n, test_slice):
    # Indices 0..n-1 that are not in the test slice
    return [x for x in range(n) if x not in test_slice]

test_slice = set(range(5, 10))
trainX = data[indx(len(data), test_slice)]
Of course, you should use something like sklearn's train_test_split() if it is available.
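For larger arrays, the list comprehension can be swapped for a vectorized set difference. The sketch below uses np.setdiff1d, which is not part of the answer above, and the variable names are only illustrative.

```python
import numpy as np

data = np.array(list(zip(range(50), range(50))))

test_idx = np.arange(5, 10)
# Sorted indices that are in 0..len(data)-1 but not in test_idx
train_idx = np.setdiff1d(np.arange(len(data)), test_idx)
trainX = data[train_idx]
```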
split = np.vsplit(data, np.array([5, 10]))
# This gives a list with 3 elements: rows 0-4, rows 5-9, and rows 10 onward
test_slice = split[1]
train_slice = np.vstack((split[0], split[2]))
[[5 5] [6 6] [7 7] [8 8] [9 9]]
[[ 0 0] [ 1 1] [ 2 2] [ 3 3] [ 4 4] [10 10] [11 11] [12 12] [13 13] [14 14] [15 15] [16 16] [17 17] [18 18] … [47 47] [48 48] [49 49]]
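The same split-and-stack idea can be extended to every fold rather than just rows 5 to 9. This is a rough sketch, assuming k = 10 folds and using np.array_split (instead of np.vsplit) so that the row count does not have to divide evenly; the folds list is only there for illustration.

```python
import numpy as np

data = np.array(list(zip(range(50), range(50))))
k = 10  # assumed number of folds

# Split the rows into k blocks, then leave one block out at a time
parts = np.array_split(data, k, axis=0)
folds = []
for i in range(k):
    test_slice = parts[i]
    train_slice = np.vstack(parts[:i] + parts[i + 1:])  # every block except block i
    folds.append((train_slice, test_slice))
```

Each pass builds a fresh training array, so this costs roughly one full copy per fold.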
Two approaches, demonstrated on a 1-D array:
In [64]: data = np.arange(20)
In [65]: test = data[5:10]
In [66]: rest = np.concatenate((data[:5],data[10:]),axis=0)
In [67]: rest
Out[67]: array([ 0, 1, 2, 3, 4, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
In [68]:
In [68]: mask = np.zeros(data.shape[0], dtype=bool)
In [69]: mask[5:10] = True
In [70]: test = data[mask]
In [71]: rest = data[~mask]
In [72]: rest
Out[72]: array([ 0, 1, 2, 3, 4, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
There is also an np.delete function:
In [75]: np.delete(data, np.arange(5,10))
Out[75]: array([ 0, 1, 2, 3, 4, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
Internally it uses one of the two approaches I demonstrated.
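One caveat when moving from the 1-D demonstration to the 2-D array in the question: without an axis argument, np.delete works on the flattened array. A small sketch of the difference, with illustrative variable names:

```python
import numpy as np

data = np.array(list(zip(range(50), range(50))))  # shape (50, 2)

flat = np.delete(data, np.arange(5, 10))           # no axis: deletes 5 elements of the flattened array
rows = np.delete(data, np.arange(5, 10), axis=0)   # axis=0: drops rows 5..9

print(flat.shape)  # (95,)
print(rows.shape)  # (45, 2)
```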
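If you have to implement the k-fold approach by hand for an arbitrary number of splits, I used the following solution (it builds the training and validation sets for cross-validation):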
# Generate indices over the row-wise length of the whole array
# (x is the data array, k the number of folds)
fold_indices = np.arange(x.shape[0])
# Shuffle the indices (optional)
np.random.shuffle(fold_indices)
# Split the indices into k parts (returns a list of numpy arrays)
eval_indices = np.array_split(fold_indices, k)
for e in eval_indices:
    # Define the evaluation set for the current fold
    eval_set_x = x[e]
    # Exclude the indices of the evaluation part from the
    # whole array (as in the answers above)
    mask_eval = np.ones(x.shape[0], bool)
    # Set the indices of the eval set to False
    mask_eval[e] = False
    # Subset by the boolean array:
    train_set_x = x[mask_eval]
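If the features and labels live in separate arrays, it may be easier to yield index arrays and apply them to both. The following is a sketch only: kfold_indices, its parameters, and the example x and y are hypothetical names, not taken from the answer above.

```python
import numpy as np

def kfold_indices(n_samples, k, shuffle=True, seed=None):
    """Yield (train_idx, eval_idx) index arrays for each of k folds."""
    indices = np.arange(n_samples)
    if shuffle:
        np.random.default_rng(seed).shuffle(indices)
    for eval_idx in np.array_split(indices, k):
        mask = np.ones(n_samples, dtype=bool)
        mask[eval_idx] = False              # everything outside the eval fold
        yield np.flatnonzero(mask), eval_idx

# Usage with made-up features x and labels y
x = np.arange(100).reshape(50, 2)
y = np.arange(50)
for train_idx, eval_idx in kfold_indices(len(x), k=5, seed=0):
    train_x, train_y = x[train_idx], y[train_idx]
    eval_x, eval_y = x[eval_idx], y[eval_idx]
```

Because np.array_split is used, the fold sizes may differ by one when the number of samples is not a multiple of k.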