How to create a subset of samples from the original MNIST data, while keeping all 10 classes
Suppose X, Y = load_mnist()
where X and Y are tensors containing the whole MNIST dataset. I want a smaller fraction of the data so my code runs faster, but I need all 10 classes to stay in it, in a balanced way. Is there a simple way to do this?
scikit-learn's train_test_split
is meant to split your data into train and test sets, but you can use it to create a "balanced" subset of your dataset via the stratify parameter. Just specify the train/test size proportion you want, and you get a smaller, stratified sample of your data. In your case:
from sklearn.model_selection import train_test_split
X_1, X_2, Y_1, Y_2 = train_test_split(X, Y, stratify=Y, test_size=0.5)
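For example, to keep only 10% of MNIST as a balanced working set, you can discard the larger half of the split (a minimal sketch; X_small and Y_small are names introduced here, and X, Y are assumed to be array-likes as above):

from sklearn.model_selection import train_test_split

# Keep 10% of the data; stratify=Y makes all 10 digits appear
# in (roughly) the same proportions as in the full dataset.
_, X_small, _, Y_small = train_test_split(X, Y, stratify=Y, test_size=0.1)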
If you want more control over this, you can use numpy.random.choice
to generate indices of the subset size and index into the original arrays with them, as in the following snippet:
In [76]: import numpy as np
# input data; assume you have 10K samples
In [77]: total_samples = 10000
In [78]: X, Y = np.random.random_sample((total_samples, 784)), np.random.randint(0, 10, total_samples)
# out of these 10K, we want to pick only 500 samples as a subset
In [79]: subset_size = 500
# generate uniformly distributed indices of size `subset_size`;
# `replace=False` avoids picking the same sample twice
In [80]: subset_idx = np.random.choice(total_samples, subset_size, replace=False)
# simply index into the original arrays to obtain the subsets
In [81]: X_subset, Y_subset = X[subset_idx], Y[subset_idx]
In [82]: X_subset.shape, Y_subset.shape
Out[82]: ((500, 784), (500,))
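Note that plain uniform indexing does not guarantee class balance; with a small subset, some digits may be over- or under-represented. If you want an exactly balanced subset without sklearn, you can sample a fixed number of indices per class. A minimal sketch (balanced_subset is a hypothetical helper, and Y is assumed to be a 1-D NumPy array of integer labels):

import numpy as np

def balanced_subset(X, Y, per_class=50, seed=0):
    # Pick `per_class` indices from each class, without replacement,
    # then shuffle so the classes are interleaved.
    rng = np.random.default_rng(seed)
    idx = np.concatenate([
        rng.choice(np.flatnonzero(Y == c), per_class, replace=False)
        for c in np.unique(Y)
    ])
    rng.shuffle(idx)
    return X[idx], Y[idx]

X_subset, Y_subset = balanced_subset(X, Y, per_class=50)  # 10 classes -> 500 samples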
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)
Stratification guarantees that the class proportions are preserved.
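A quick way to see this (a sketch, assuming y is a 1-D array of labels as above):

from collections import Counter
print(Counter(y))        # class counts in the full label array
print(Counter(y_train))  # same relative proportions in the stratified split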
If you want to do K-fold-style splitting, then:
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
# X and y are assumed to be a pandas DataFrame/Series here (hence .iloc)
for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
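If you only need one balanced subset rather than all the splits, you can take just the first split from the generator (a sketch under the same pandas assumption; X_small and y_small are names introduced here):

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train_index, test_index = next(sss.split(X, y))
X_small, y_small = X.iloc[test_index], y.iloc[test_index]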
Check the sklearn documentation here.