Plotting a quadrant chart to differentiate a population into 4 groups based on the mean values of X & Y and find the final count

I'm just starting to learn how to plot data in Python, and I need help achieving the following:

I have the following sample df6:

df6 = pd.DataFrame({
                   'emails': [50, 60 ,30, 40, 90, 10, 0,85 ],
                   'delivered': [20, 16 ,6, 15, 66, 6, 0,55 ]
                   })

df6

which looks like:

   emails  delivered
0      50         20
1      60         16
2      30          6
3      40         15
4      90         66
5      10          6
6       0          0
7      85         55

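I need to plot emails vs. delivered in a 4-quadrant chart. The X and Y ranges should extend slightly beyond the maximum values, and the cross-hair should sit at the mean of each column.

What I have done so far is use describe() to get the values for df6, and then: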

fig, ax = plt.subplots()
fig.set_size_inches(7, 5)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)

plt.axhline(y=45.6, color="black", linestyle="--")
plt.axvline(x=23, color="black", linestyle="--")

plt.plot(df6['delivered'],df6['emails'],"o")
plt.xlim([0, df6['delivered'].max()+20])
plt.ylim([0, df6['emails'].max()+20])
plt.show()

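So far I get the following output:

What I'm looking for is to split the chart into 4 scatter groups and label each group with the total count for that quadrant: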

You are just missing the code that sets the position of the left/bottom spines:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df6 = pd.DataFrame({'emails': [50, 60, 30, 40, 90, 10, 0, 85],
                    'delivered': [20, 16, 6, 15, 66, 6, 0, 55]})

plt.plot(df6['delivered'], df6['emails'], "o")

# Count the points in the upper-left quadrant:
# emails above its own mean, delivered below its own mean
count = np.count_nonzero(
            (df6['emails'] > df6['emails'].mean()) &
            (df6['delivered'] < df6['delivered'].mean()))
plt.annotate('count: %s' % count, (5, 60))

# Hide the top/right spines and move the left/bottom spines to the means
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_position(('data', df6['delivered'].mean()))
plt.gca().spines['bottom'].set_position(('data', df6['emails'].mean()))
plt.show()

To use the means in your own plot, you can start by simply changing these two lines:

plt.axhline(y=df6['emails'].mean(), color="black", linestyle="--")
plt.axvline(x=df6['delivered'].mean(), color="black", linestyle="--")

Then we can use DataFrame.value_counts to compute the counts:

counts = df6.transform(lambda s: s >= s.mean()).value_counts()
pos = df6.agg(['min', 'max'])

Here counts holds the number of points for each above/below-the-mean combination:

emails  delivered
False   False        4
True    False        2
        True         2
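Note that the (False, True) combination does not appear above, because value_counts only lists combinations that actually occur in the data. If an explicit zero is wanted for the empty quadrant, one possible approach is to reindex against all four combinations. This is a minimal sketch; the names all_quadrants and counts_full are made up for illustration:

```python
import pandas as pd

df6 = pd.DataFrame({
    "emails": [50, 60, 30, 40, 90, 10, 0, 85],
    "delivered": [20, 16, 6, 15, 66, 6, 0, 55],
})

# True where a value is at or above its column mean
counts = df6.transform(lambda s: s >= s.mean()).value_counts()

# Every (emails, delivered) above/below combination, observed or not
all_quadrants = pd.MultiIndex.from_product(
    [[False, True], [False, True]], names=counts.index.names
)

# Combinations missing from counts are filled with 0
counts_full = counts.reindex(all_quadrants, fill_value=0)
print(counts_full)
```

With this in place, a loop over counts_full also visits the empty quadrant, so it can be labeled "count: 0" as well.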

pos holds the coordinates (x = delivered, y = emails) at which to place the labels:

     emails  delivered
min       0          0
max      90         66

So you can adjust pos to change where the annotations are placed.
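As one example of such an adjustment, the labels could be moved from the corners of the data range to roughly the middle of each quadrant by averaging each extreme with the column mean. This is only a sketch of one option; pos_centered is a hypothetical name, not part of the answer above:

```python
import pandas as pd

df6 = pd.DataFrame({
    "emails": [50, 60, 30, 40, 90, 10, 0, 85],
    "delivered": [20, 16, 6, 15, 66, 6, 0, 55],
})

pos = df6.agg(["min", "max"])
means = df6.mean()

# Midpoint between each extreme (min/max) and the column mean;
# the Series of means is aligned to the columns of pos
pos_centered = (pos + means) / 2
print(pos_centered)
```

If pos_centered is used in place of pos, ha='center' and va='center' would probably be the more natural alignment for the text.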

Finally, to annotate the plot:

for (eml, dlv), num in counts.items():
    ax.text(s=f'count: {num}',
        x=pos.loc['max' if dlv else 'min', 'delivered'],
        y=pos.loc['max' if eml else 'min', 'emails'],
        ha='right' if dlv else 'left',
        va='top' if eml else 'bottom',
    )
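For reference, the fragments of this answer can be assembled into one script. The following is a sketch under a few assumptions: the wrapper function quadrant_plot and its x/y parameters are invented here for illustration, and the axis limits reuse the "+20" padding from the question.

```python
import matplotlib.pyplot as plt
import pandas as pd

df6 = pd.DataFrame({
    "emails": [50, 60, 30, 40, 90, 10, 0, 85],
    "delivered": [20, 16, 6, 15, 66, 6, 0, 55],
})


def quadrant_plot(df, x="delivered", y="emails"):
    """Scatter x against y, split at the column means, label each quadrant."""
    fig, ax = plt.subplots(figsize=(7, 5))
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)
    ax.plot(df[x], df[y], "o")

    # Cross-hair at the column means
    ax.axhline(y=df[y].mean(), color="black", linestyle="--")
    ax.axvline(x=df[x].mean(), color="black", linestyle="--")
    ax.set_xlim(0, df[x].max() + 20)
    ax.set_ylim(0, df[y].max() + 20)

    # Count points per (x above mean, y above mean) combination
    counts = df[[x, y]].transform(lambda s: s >= s.mean()).value_counts()
    pos = df[[x, y]].agg(["min", "max"])

    # One label per non-empty quadrant, anchored at the data extremes
    for (x_hi, y_hi), num in counts.items():
        ax.text(
            x=pos.loc["max" if x_hi else "min", x],
            y=pos.loc["max" if y_hi else "min", y],
            s=f"count: {num}",
            ha="right" if x_hi else "left",
            va="top" if y_hi else "bottom",
        )
    return fig, ax, counts


fig, ax, counts = quadrant_plot(df6)
```

Calling plt.show() afterwards displays the figure. Note that the columns are selected in (x, y) order here, so the tuples in counts are (delivered, emails) pairs rather than (emails, delivered) as in the fragments above.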

Here is another solution, which gives a more symmetric figure:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(
    {
        "emails": [50, 60, 30, 40, 90, 10, 0, 85],
        "delivered": [20, 16, 6, 15, 66, 6, 0, 55],
    }
)

plt.plot(df["delivered"], df["emails"], "o")
plt.gca().spines["top"].set_visible(False)
plt.gca().spines["right"].set_visible(False)
plt.gca().spines["left"].set_position(("data", df["delivered"].mean()))
plt.gca().spines["bottom"].set_position(("data", df["emails"].mean()))


def get_lims(df, column, w=0.1):
    mean = df[column].mean()
    max_diff = max(
        abs(df[column].max() - mean),
        abs(df[column].min() - mean),
    )
    return [mean - max_diff - max_diff * w, mean + max_diff + max_diff * w]


plt.xlim(get_lims(df, "delivered"))
plt.ylim(get_lims(df, "emails"))
plt.show()
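To see why this produces a symmetric figure, it may help to look at the limits on their own: get_lims extends the axis by the same distance on both sides of the mean, so the mean always ends up at the midpoint of the axis. A small check, reusing the function from the answer (the loop and the variable names lo, hi and limits are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    "emails": [50, 60, 30, 40, 90, 10, 0, 85],
    "delivered": [20, 16, 6, 15, 66, 6, 0, 55],
})


def get_lims(df, column, w=0.1):
    # Largest distance from the mean to either extreme, padded by a factor w
    mean = df[column].mean()
    max_diff = max(
        abs(df[column].max() - mean),
        abs(df[column].min() - mean),
    )
    return [mean - max_diff - max_diff * w, mean + max_diff + max_diff * w]


limits = {}
for column in ["delivered", "emails"]:
    lo, hi = get_lims(df, column)
    limits[column] = (lo, hi)
    # The midpoint of the limits coincides with the column mean
    print(column, round(lo, 4), round(hi, 4), "midpoint:", round((lo + hi) / 2, 4))
```

The trade-off is that the axes can extend into negative values even though the data itself is non-negative.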

I find it easier to normalize the data before plotting. Update: an earlier version of this code mixed up the counts; the labels below are placed in the quadrant each count was computed for (x = delivered, y = emails).

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# df6 is the DataFrame from the question
scaler = StandardScaler()
scale = scaler.fit(df6)

# standardize both columns (zero mean, unit variance)
norm_df = pd.DataFrame(scale.transform(df6), columns=df6.columns)

# emails below mean, delivered below mean: lower left
quadrant_1 = sum(np.logical_and(norm_df['emails'] < 0, norm_df['delivered'] < 0))
print(quadrant_1)

# emails above mean, delivered below mean: upper left
quadrant_2 = sum(np.logical_and(norm_df['emails'] > 0, norm_df['delivered'] < 0))
print(quadrant_2)

# emails below mean, delivered above mean: lower right
quadrant_3 = sum(np.logical_and(norm_df['emails'] < 0, norm_df['delivered'] > 0))
print(quadrant_3)

# emails above mean, delivered above mean: upper right
quadrant_4 = sum(np.logical_and(norm_df['emails'] > 0, norm_df['delivered'] > 0))
print(quadrant_4)

fig, ax = plt.subplots()
fig.set_size_inches(7, 5)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)

# after standardization the means sit at 0, so the cross-hair is at the origin
plt.axhline(y=0, color="black", linestyle="--")
plt.axvline(x=0, color="black", linestyle="--")

plt.plot(norm_df['delivered'], norm_df['emails'], "o")
plt.gca().spines['bottom'].set_visible(False)
plt.gca().spines['left'].set_visible(False)
plt.gca().axes.get_xaxis().set_visible(False)
plt.gca().axes.get_yaxis().set_visible(False)
plt.text(0, -2.1, 'Delivered', horizontalalignment='center', verticalalignment='center')
plt.text(-2.1, 0, 'Emails', horizontalalignment='center', verticalalignment='center', rotation=90)

# each label goes in the quadrant whose points it counts
plt.text(1, 1, 'Count: ' + str(quadrant_4), horizontalalignment='center', verticalalignment='center')
plt.text(-1, 1, 'Count: ' + str(quadrant_2), horizontalalignment='center', verticalalignment='center')
plt.text(-1, -1, 'Count: ' + str(quadrant_1), horizontalalignment='center', verticalalignment='center')
plt.text(1, -1, 'Count: ' + str(quadrant_3), horizontalalignment='center', verticalalignment='center')

plt.xlim([-2, 2])
plt.ylim([-2, 2])
plt.show()
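If pulling in scikit-learn only for this step feels heavy, the same standardization can be done with pandas alone. The sketch below assumes the population standard deviation (ddof=0), which is what StandardScaler uses; the dictionary quadrant_counts and its keys are illustrative names, not part of the answer above.

```python
import pandas as pd

df6 = pd.DataFrame({
    "emails": [50, 60, 30, 40, 90, 10, 0, 85],
    "delivered": [20, 16, 6, 15, 66, 6, 0, 55],
})

# z-scores: subtract the column mean, divide by the population std
norm_df = (df6 - df6.mean()) / df6.std(ddof=0)

# A positive z-score means the value lies above its column mean
x_hi = norm_df["delivered"] > 0
y_hi = norm_df["emails"] > 0

# Values exactly at the mean (z-score 0) would count as "not above" here
quadrant_counts = {
    "upper right": int((x_hi & y_hi).sum()),
    "upper left": int((~x_hi & y_hi).sum()),
    "lower left": int((~x_hi & ~y_hi).sum()),
    "lower right": int((x_hi & ~y_hi).sum()),
}
print(quadrant_counts)
```

Since the sign of a z-score only depends on whether the value is above or below the mean, these counts agree with comparing the raw columns against their means directly.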