Comparing compressed distributions per cohort

Question

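How can I easily compare the distributions of multiple cohorts?

Usually, seaborn.distplot (https://seaborn.pydata.org/generated/seaborn.distplot.html) is a good tool for comparing distributions visually. However, because of the size of my dataset, I need to compress it and keep only the counts.

The compressed data is created as: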

SELECT age, gender, compress_distributionUDF(collect_list(struct(target_y_n, count, distribution_value))) GROUP BY age, gender

where compress_distributionUDF simply takes a list of tuples and returns the counts per group.

This leaves me with a list of

Row(distribution_value=60.0, count=314251, target_y_n=0)

nested in a pandas.Series, one list per cohort.

Basically, it is similar to:

pd.DataFrame({'foo':[1,2], 'bar':['first', 'second'], 'baz':[{'target_y_n': 0, 'value': 0.5, 'count':1000},{'target_y_n': 1, 'value': 1, 'count':10000}]})

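I would like to know how to compare the distributions:

for target_y_n 0 vs. 1

across multiple cohorts

in a way that is still visually understandable and not just a mess.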

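Edit

For a single cohort there may already be an answer, but how can I compare multiple cohorts (not just in a loop), since that results in too many plots to compare?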

Answer 1

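I am still a bit confused, but we can start here and see where it goes. From your example, I focus on baz, because it is not clear to me what foo and bar are (I assume they identify the cohorts).
So let's focus on baz and plot the different distributions according to target_y_n. The snippets below assume df is a flat DataFrame with the columns value, count and target_y_n.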
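Since the question keeps the records nested per cohort, here is a minimal sketch of how such a flat df could be built. The helper name flatten_cohorts and the sample numbers are made up for illustration, and I am assuming each baz cell holds a list of dicts (Spark Row objects could be converted with row.asDict() first).

```python
import pandas as pd

# Toy data shaped like the example in the question, but with a list of
# records per cohort (the numbers are invented for illustration).
nested = pd.DataFrame({
    'foo': [1, 2],
    'bar': ['first', 'second'],
    'baz': [
        [{'target_y_n': 0, 'value': 0.5, 'count': 1000},
         {'target_y_n': 1, 'value': 1.0, 'count': 10000}],
        [{'target_y_n': 0, 'value': 0.5, 'count': 200}],
    ],
})


def flatten_cohorts(frame, nested_col='baz'):
    """Hypothetical helper: one output row per nested record."""
    # explode() repeats the cohort columns once per record in the list
    exploded = frame.explode(nested_col).reset_index(drop=True)
    # turn the dicts into regular columns (target_y_n, value, count)
    records = pd.DataFrame(exploded[nested_col].tolist())
    return pd.concat([exploded.drop(columns=nested_col), records], axis=1)


df = flatten_cohorts(nested)
```

The result has one row per (cohort, target_y_n, value) combination, which is the long format that seaborn's hue, col and row arguments expect.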

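import matplotlib.pyplot as plt
import seaborn as sns

# Bar plot of count per value, colored by target_y_n
# (errorbar=None replaces the deprecated ci=None in seaborn >= 0.12)
sns.catplot(x='value', y='count', data=df, kind='bar', hue='target_y_n', dodge=False, errorbar=None)

# Box plot of the counts per value, colored by target_y_n
sns.catplot(x='value', y='count', data=df, kind='box', hue='target_y_n', dodge=False)

# Plain matplotlib: overlay one bar series per target class
plt.bar(df[df['target_y_n'] == 0]['value'], df[df['target_y_n'] == 0]['count'], width=1)
plt.bar(df[df['target_y_n'] == 1]['value'], df[df['target_y_n'] == 1]['count'], width=1)
plt.legend(['Target=0', 'Target=1'])

# Axes-level seaborn equivalent of the first plot
sns.barplot(x='value', y='count', data=df, hue='target_y_n', dodge=False, errorbar=None)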

Finally, take a look at seaborn's FacetGrid class to extend your comparison.

# One panel per target class (errorbar=None replaces ci=None in seaborn >= 0.12)
g = sns.FacetGrid(df, col='target_y_n', hue='target_y_n')
g = g.map(sns.barplot, 'value', 'count', errorbar=None)

In your case you would get something like:

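# One row of panels per cohort, one column per target class
# (assumes df has a 'cohort' column identifying each cohort)
g = sns.FacetGrid(df, col='target_y_n', row='cohort', hue='target_y_n')
g = g.map(sns.barplot, 'value', 'count', errorbar=None)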
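The grid above needs a single cohort column, and raw counts are hard to compare when cohorts differ a lot in size. Here is a rough sketch of one way to prepare the data, assuming the cohort is defined by age and gender as in the question's GROUP BY. The helper name add_cohort_and_share, the share column and the sample numbers are my own invention.

```python
import pandas as pd

# Invented sample in the flat layout: one row per cohort, target and value
df = pd.DataFrame({
    'age': [30, 30, 30, 30, 40, 40],
    'gender': ['f', 'f', 'f', 'f', 'm', 'm'],
    'target_y_n': [0, 0, 1, 1, 0, 1],
    'value': [0.5, 1.0, 0.5, 1.0, 0.5, 0.5],
    'count': [100, 300, 10, 30, 50, 5],
})


def add_cohort_and_share(frame, cohort_cols=('age', 'gender')):
    """Hypothetical helper: add a cohort label and a normalized count."""
    out = frame.copy()
    # build one label per cohort, e.g. '30_f'
    out['cohort'] = out[list(cohort_cols)].astype(str).agg('_'.join, axis=1)
    # total count within each (cohort, target) group
    group_total = out.groupby(['cohort', 'target_y_n'])['count'].transform('sum')
    # share of the group's observations that fall on each value
    out['share'] = out['count'] / group_total
    return out


shares = add_cohort_and_share(df)
```

Passing shares to the FacetGrid and plotting 'share' instead of 'count' should then put all panels on a comparable scale. Whether that normalization is appropriate depends on what you want to compare.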

There is also a qqplot option:

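import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

def qqplot(x, y, **kwargs):
    # probplot with fit=False returns (theoretical quantiles, ordered values)
    _, xr = stats.probplot(x, fit=False)
    _, yr = stats.probplot(y, fit=False)
    # plot the ordered values of x against the ordered values of y
    plt.scatter(xr, yr, **kwargs)

# One panel per cohort (assumes df has a 'cohort' column)
g = sns.FacetGrid(df, col='cohort', hue='target_y_n')
g = g.map(qqplot, 'value', 'count')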
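Because the data is compressed to (value, count) pairs, a QQ-style comparison of target 0 against target 1 can also be computed from count-weighted quantiles instead of treating the counts as a sample. This is only a sketch of that idea, not something taken from the question. The function weighted_quantiles and the sample numbers are hypothetical.

```python
import numpy as np


def weighted_quantiles(values, counts, probs):
    """Quantiles of a distribution given as distinct values with counts.

    Returns, for each probability, the smallest value whose cumulative
    share of the total count reaches that probability.
    """
    values = np.asarray(values, dtype=float)
    counts = np.asarray(counts, dtype=float)
    order = np.argsort(values)
    values, counts = values[order], counts[order]
    cumulative = np.cumsum(counts) / counts.sum()
    idx = np.searchsorted(cumulative, probs, side='left')
    # guard against rounding pushing an index past the last value
    return values[np.clip(idx, 0, len(values) - 1)]


# Invented compressed distributions for the two target classes
probs = np.linspace(0.05, 0.95, 19)
q_target0 = weighted_quantiles([50, 60, 70, 80], [100, 400, 300, 200], probs)
q_target1 = weighted_quantiles([50, 60, 70, 80], [10, 20, 50, 20], probs)
```

Plotting q_target0 against q_target1 with plt.scatter gives one QQ curve per cohort. Points close to the diagonal suggest the two classes have similar distributions in that cohort.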
