Intuition behind the weighted random sampler in PyTorch

Question

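I am trying to use WeightedRandomSampler to handle the imbalance in my dataset (class 1: 2555, class 2: 227, class 3: 621, class 4: 2552 images). I have stepped through the process in a debugger, but the intuition behind it is still unclear to me. My target labels are one-hot encoded vectors, as shown below.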

train_labels.head(5)

I converted the labels to class indices as follows:

labels = np.argmax(train_labels.loc[:, 'none':'both'].values, axis=1)
train_labels = torch.from_numpy(labels)
train_labels
tensor([0, 0, 1,  ..., 1, 0, 0])

Below are the steps I used to compute the weights for the weighted random sampler. Please correct me if I have misunderstood any step.

Count the number of samples per class in the dataset

class_sample_count = np.array(train_labels.value_counts()) 
class_sample_count
array([2555, 2552,  621,  227])

Compute the weight associated with each class

weight = 1. / class_sample_count 
weight
array([0.00039139, 0.00039185, 0.00161031, 0.00440529])

Compute the weight for each sample in the dataset.

samples_weight = np.array(weight[train_labels])
print(samples_weight[1], samples_weight[2] )
0.0003913894324853229 0.00039184952978056425

Convert the np.array to a tensor

     tensor([0.0004, 0.0004, 0.0004,  ..., 0.0004, 0.0004, 0.0004],
     dtype=torch.float64)

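After converting to a tensor, every sample seems to have the same value in all four entries. How, then, does weighted random sampling help with an imbalanced dataset?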

Any help would be appreciated. Thanks.

Answer 1

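This is because you are computing the weights on the one-hot encodings, and since there are four components (four classes), each instance ends up with four identical weights after indexing with weight[train_labels]. The fact that the weights are identical is perfectly fine, because each instance should be assigned a single weight. For the sampler, this weight is proportional to the probability of picking that instance. If a given class is prominent in the dataset, its inverse frequency (i.e., the weight) will be low, so instances of that class will have a lower probability of being sampled from the dataset.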

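Given a sufficiently large number of draws, the goal of this weighting scheme is balanced sampling across classes, even though the class representation in the dataset is imbalanced.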
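The arithmetic behind that claim can be checked by hand. The following is a minimal sketch using only the class sizes quoted in the question; the variable names are illustrative and not taken from the original code. Each class holds n samples of weight 1/n, so every class carries the same total weight.

```python
from fractions import Fraction

# Class sizes quoted in the question (class 1 to class 4).
class_counts = [2555, 227, 621, 2552]

# Per-sample weight for each class: the inverse of the class size.
class_weights = [Fraction(1, n) for n in class_counts]

# Total weight of the dataset: each class contributes n * (1 / n) = 1.
total_weight = sum(n * w for n, w in zip(class_counts, class_weights))

# Probability that a single draw lands in each class.
class_probability = [
    n * w / total_weight for n, w in zip(class_counts, class_weights)
]

print(class_probability)  # every entry is Fraction(1, 4)
```

So although an individual sample from the largest class is about eleven times less likely to be drawn than one from the smallest class, each class as a whole is drawn with probability 1/4.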

If you insist on keeping the one-hot encoding, you can simply select the first column:

>>> sample_weights = np.array(weight[train_labels])[:,0]
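If the labels are already integer class indices, as after the np.argmax step in the question, indexing the per-class weights gives a one-dimensional array with exactly one weight per sample, and no column selection is needed. Here is a minimal sketch, assuming hypothetical toy labels and four classes. It uses np.bincount so that the counts are ordered by class index; pandas' value_counts sorts by frequency by default, which may not match the index order.

```python
import numpy as np

# Hypothetical stand-in for the integer class indices from the question.
labels = np.array([0, 0, 1, 3, 2, 0, 3, 3])

# Counts ordered by class index: class_sample_count[k] is the size of class k.
class_sample_count = np.bincount(labels, minlength=4)

# Inverse-frequency weight per class, then one weight per sample.
weight = 1.0 / class_sample_count
samples_weight = weight[labels]

print(class_sample_count)    # [3 1 1 3]
print(samples_weight.shape)  # (8,)
```

As a side note, PyTorch prints tensors with four decimal places by default, so weights such as 0.00039139 and 0.00039185 both display as 0.0004 even though they differ.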

Then construct the sampler with WeightedRandomSampler:

>>> sampler = WeightedRandomSampler(sample_weights, len(train_labels))
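To see what the sampler actually produces, here is a small sketch on hypothetical toy labels (six samples of class 0 and two of class 1, not the data from the question). It assumes integer class indices and draws with replacement, which is the default.

```python
import torch
from torch.utils.data import WeightedRandomSampler

torch.manual_seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical toy data: six samples of class 0 and two samples of class 1.
labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1])
class_sample_count = torch.bincount(labels)  # tensor([6, 2])
sample_weights = 1.0 / class_sample_count[labels].double()

# The weights do not need to sum to one.
sampler = WeightedRandomSampler(sample_weights, num_samples=1000, replacement=True)

# Iterating the sampler yields dataset indices, not the samples themselves.
drawn = torch.tensor(list(sampler))
drawn_class_count = torch.bincount(labels[drawn], minlength=2)
print(drawn_class_count)  # roughly 500 draws per class
```

The two minority samples are each drawn about three times as often as any single majority sample, which is what evens out the class totals.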

Finally, you can plug it into a data loader:

>>> DataLoader(dataset, batch_size, sampler=sampler)
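Putting the pieces together, the following is an end-to-end sketch rather than the asker's actual pipeline. It assumes a stand-in TensorDataset with random placeholder features and the class sizes from the question; the names features, seen and the batch size of 64 are illustrative choices.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Hypothetical stand-in dataset with the class sizes from the question.
class_counts = torch.tensor([2555, 227, 621, 2552])
labels = torch.repeat_interleave(torch.arange(4), class_counts)
features = torch.randn(len(labels), 8)  # placeholder features, not real images
dataset = TensorDataset(features, labels)

# One inverse-frequency weight per sample.
sample_weights = 1.0 / class_counts[labels].double()
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels))

# sampler and shuffle=True are mutually exclusive, so shuffle keeps its default.
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

# Count how often each class shows up over one pass through the loader.
seen = torch.zeros(4, dtype=torch.long)
for _, batch_labels in loader:
    seen += torch.bincount(batch_labels, minlength=4)
print(seen)  # each class should appear roughly len(labels) / 4 times
```

Because sampling is done with replacement, minority-class samples are drawn repeatedly within an epoch, while many majority-class samples are not seen at all in that epoch.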
