Repeat summary for all columns in Pandas Python

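I have a pandas DataFrame with more than 100 categorical columns and two numeric columns. For example, in the data below I include only four categorical columns for simplicity: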

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Gender': ['M','M','F','M','F','M','F','M','F','F'],
    'Class' : ['A','B','B','C','A','C','B','A','A','C'],
    'Class_2': ['A1','B2','B3','C5','B1','C2','B1','B1','C3','D1'],
    'District' : ['N','N','E','S','S','N','N','E','S','S']
})

df['X1'] = np.random.normal(1000, 55, 10)
df['X2'] = np.random.normal(100, 10, 10)

For each categorical column (i.e. Gender, Class, Class_2, District), I need to produce the following summary:

# Show the distribution of the column, both count and percent
print(df["Gender"].value_counts(sort=False, normalize=False))
print(df["Gender"].value_counts(sort=False, normalize=True) * 100)

# Plot the histogram
plt.figure(figsize=(9, 8))
plt.hist(df['Gender'], color='blue', edgecolor='black', bins=30)
plt.xlabel("Gender")
plt.ylabel("Count")
plt.title("Gender distribution")

# Aggregate the sums of X1 and X2 by Gender, then compute the ratio X2/X1.
# Select the columns with a list ([['X2', 'X1']]); recent pandas versions
# reject the tuple form ['X2', 'X1'].
var1 = df.groupby('Gender')[['X2', 'X1']].sum().reset_index()
var1['ratio'] = var1['X2'] / var1['X1']
print(var1)

var1.plot(x='Gender', y='ratio', kind='bar',
          colormap='Paired',
          title='Ratio by Gender')

First, parameterize the plotting and statistics, for example by wrapping them in a function:

def plot_stats(column):
    # Show the distribution of the column, both count and percent
    print(df[column].value_counts(sort=False, normalize=False))
    print(df[column].value_counts(sort=False, normalize=True) * 100)

    # Plot the histogram
    plt.figure(figsize=(9, 8))
    plt.hist(df[column], color='blue', edgecolor='black', bins=30)
    plt.xlabel(column)
    plt.ylabel("Count")
    plt.title(f"{column} distribution")

    # Aggregate the sums of X1 and X2 by the given column, then compute
    # the ratio X2/X1. The columns are selected with a list, since recent
    # pandas versions reject the tuple form ['X2', 'X1'].
    var1 = df.groupby(column)[['X2', 'X1']].sum().reset_index()
    var1['ratio'] = var1['X2'] / var1['X1']
    print(var1)

    var1.plot(x=column, y='ratio', kind='bar',
              colormap='Paired',
              title=f'Ratio by {column}')
    # Display the plots for this column right after its printed output
    plt.show()

Then run it in a loop:

for col in ['Gender','Class','Class_2','District']:
    plot_stats(col)

When working in a Jupyter Notebook, note that plt.show() is needed to display each plot right after its printed output, as done in the plot_stats function above.
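
With 100+ categorical columns, typing the column list by hand becomes impractical. The following is a minimal sketch of one way to build that list automatically, assuming that every non-numeric column in the frame is a categorical column to summarize (an assumption, not something stated in the question). The helper name ratio_by and the variables cat_cols and ratios are hypothetical, introduced only for illustration; the helper takes the frame as an argument instead of relying on a global df and skips the plotting so the per-column tables can be inspected directly.

```python
import numpy as np
import pandas as pd

# Small reproducible stand-in for the question's data (seeded for repeatability).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Gender': ['M', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'F'],
    'Class': ['A', 'B', 'B', 'C', 'A', 'C', 'B', 'A', 'A', 'C'],
    'X1': rng.normal(1000, 55, 10),
    'X2': rng.normal(100, 10, 10),
})

# Assumption: every non-numeric column is a categorical column to summarize.
cat_cols = df.select_dtypes(exclude='number').columns.tolist()


def ratio_by(frame, column):
    """Sum X1 and X2 within each level of `column` and add the ratio X2/X1."""
    sums = frame.groupby(column)[['X2', 'X1']].sum()
    sums['ratio'] = sums['X2'] / sums['X1']
    return sums


# One summary table per categorical column, keyed by column name.
ratios = {col: ratio_by(df, col) for col in cat_cols}

print(cat_cols)
print(ratios['Gender'])
```

The same cat_cols list could be passed to the loop over plot_stats in place of the hand-written list of column names.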