Pandas sample with weights

Question

I have a df and I want to draw a sample from it that follows the distribution of some variable. Suppose df['type'].value_counts(normalize=True) returns:

A 0.3
B 0.5
C 0.2

I would like to do something like sampledf = df.sample(weights=df['type'].value_counts(normalize=True)) so that sampledf['type'].value_counts(normalize=True) returns almost the same distribution. How can I pass a dict of frequencies here?

Answer 1

Weights must be a Series of the same length as the original df, so it is best to add it as a column:

df['freq'] = df.groupby('type')['type'].transform('count')
sampledf = df.sample(weights = df.freq)

Or, without adding a column:

sampledf = df.sample(weights = df.groupby('type')['type'].transform('count'))
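To address the "dict of frequencies" part of the question directly, here is a minimal sketch that maps a dict onto the rows to get one weight per row. The toy data, the names freqs and row_weights, and the sample size n=20 are assumptions made for illustration and are not from the original answer.

```python
import pandas as pd

# Hypothetical toy data: 30% A, 50% B, 20% C, as in the question.
df = pd.DataFrame({"type": ["A"] * 30 + ["B"] * 50 + ["C"] * 20})

# A dict of per-type frequencies (here computed from the data itself).
freqs = df["type"].value_counts(normalize=True).to_dict()

# Map each row's type to its frequency: one weight per row,
# aligned with the index of df.
row_weights = df["type"].map(freqs)

# pandas normalizes the weights to sum to 1 before sampling.
# n=20 and random_state=0 are arbitrary illustrative choices.
sampledf = df.sample(n=20, weights=row_weights, random_state=0)
```

Each row's chance of being drawn is proportional to its own weight, so the total weight of a type is its row count times its per-row weight.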

Answer 2

In addition to the answer above, note that if you want to sample each type evenly, you should adjust your code to:

df['freq'] = 1./df.groupby('type')['type'].transform('count')
sampledf = df.sample(weights = df.freq)

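This covers the case of two classes. If you have more than two classes, you can generalize the weight calculation with the following formula: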

w_j=n_samples / (n_classes * n_samples_j)
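As a rough sketch, the formula can be computed per row as follows. The toy data and the name class_totals are assumptions made for illustration; n_samples, n_classes and n_samples_j follow the symbols in the formula.

```python
import pandas as pd

# Hypothetical toy data with three unbalanced classes.
df = pd.DataFrame({"type": ["A"] * 30 + ["B"] * 50 + ["C"] * 20})

n_samples = len(df)                    # total number of rows
n_classes = df["type"].nunique()       # number of distinct types
# Size of each row's own class, repeated for every row.
n_samples_j = df.groupby("type")["type"].transform("count")

# w_j = n_samples / (n_classes * n_samples_j), one weight per row
df["freq"] = n_samples / (n_classes * n_samples_j)

# Sanity check: the total weight per type is the same for every type.
class_totals = df.groupby("type")["freq"].sum()

# n=30 and random_state=0 are arbitrary illustrative choices.
sampledf = df.sample(n=30, weights=df["freq"], random_state=0)
```

These weights differ from 1./count only by the constant factor n_samples / n_classes. Because df.sample normalizes the weights to sum to 1, both versions should give the same sampling probabilities.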

Answer 3

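There is no need to create "a series of the same length as the original df". Instead, you can sample from each group by passing the scaled output of value_counts (the per-group counts multiplied by a sampling factor), like this: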

col = 'type'
sample_factor = .3
# sample size per group
weights = (df[col].value_counts() * sample_factor).astype(int)
df.groupby(col).apply(lambda g: g.sample(n=weights[g.name]))
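A possible alternative, assuming pandas 1.1 or later, is the built-in GroupBy.sample method, which draws the same fraction from every group. This is a sketch and not part of the original answer; the toy data, the value column and the name counts are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: 30 rows of A, 50 of B, 20 of C.
df = pd.DataFrame({
    "type": ["A"] * 30 + ["B"] * 50 + ["C"] * 20,
    "value": range(100),
})

# Draw 30% of the rows from each type, so the sample keeps
# roughly the same type distribution as df.
sampledf = df.groupby("type").sample(frac=0.3, random_state=0)

# Rows drawn per type.
counts = sampledf["type"].value_counts()
```

With this toy data the sample contains 9 rows of A, 15 of B and 6 of C, which matches the 0.3 / 0.5 / 0.2 distribution in the question.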
