Repeat summary for all columns in Pandas Python

Question

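I have a pandas DataFrame with more than 100 categorical columns and two numeric columns. For simplicity, the sample data below includes only four of the categorical columns: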

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Gender': ['M','M','F','M','F','M','F','M','F','F'],
    'Class' : ['A','B','B','C','A','C','B','A','A','C'],
    'Class_2': ['A1','B2','B3','C5','B1','C2','B1','B1','C3','D1'],
    'District' : ['N','N','E','S','S','N','N','E','S','S']
})

df['X1'] = np.random.normal(1000, 55, 10)
df['X2'] = np.random.normal(100, 10, 10)

For each categorical column (i.e. Gender, Class, Class_2 and District), I need to produce the following summary:

# Show the distribution of the column, both count and percent
print(df["Gender"].value_counts(sort=False, normalize=False))
print(df["Gender"].value_counts(sort=False, normalize=True) * 100)

# Plot the histogram
plt.figure(figsize=(9, 8))
plt.hist(df['Gender'], color='blue', edgecolor='black', bins=30)
plt.xlabel("Gender")
plt.ylabel("Count")
plt.title("Gender distribution")

# Aggregate sum of X1 and X2 by Gender, and find the ratio by Gender
# (columns are selected with a list; the tuple form ['X2', 'X1'] is rejected by pandas 2.x)
var1 = df.groupby('Gender')[['X2', 'X1']].sum().reset_index()
var1['ratio'] = var1['X2'] / var1['X1']
print(var1)

var1.plot(x='Gender', y='ratio', kind='bar',
          colormap='Paired',
          title='Ratio by Gender')

Answer 1

First, parameterize the plotting and statistics by wrapping them in a function that takes the column name:

def plot_stats(column):
    # Show the distribution of the column, both count and percent
    print(df[column].value_counts(sort=False, normalize=False))
    print(df[column].value_counts(sort=False, normalize=True) * 100)

    # Plot the histogram
    plt.figure(figsize=(9, 8))
    plt.hist(df[column], color='blue', edgecolor='black', bins=30)
    plt.xlabel(column)
    plt.ylabel("Count")
    plt.title(f"{column} distribution")

    # Aggregate sum of X1 and X2 by the given column, and find the ratio
    # (columns are selected with a list; the tuple form ['X2', 'X1'] is rejected by pandas 2.x)
    var1 = df.groupby(column)[['X2', 'X1']].sum().reset_index()
    var1['ratio'] = var1['X2'] / var1['X1']
    print(var1)

    var1.plot(x=column, y='ratio', kind='bar',
              colormap='Paired',
              title=f'Ratio by {column}')
    # Display this column's plots right after its printed output
    plt.show()

Then run it in a loop:

for col in ['Gender','Class','Class_2','District']:
    plot_stats(col)
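
With 100+ categorical columns, typing the list by hand gets tedious. As a minimal sketch, assuming every non-numeric column in the frame is one of the categorical columns to summarize, the list could be built with select_dtypes. The small frame and the name cat_cols below are illustrative only:

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's frame: categorical columns plus X1 and X2.
df = pd.DataFrame({
    'Gender': ['M', 'M', 'F', 'M', 'F'],
    'Class': ['A', 'B', 'B', 'C', 'A'],
    'District': ['N', 'N', 'E', 'S', 'S'],
})
rng = np.random.default_rng(0)
df['X1'] = rng.normal(1000, 55, len(df))
df['X2'] = rng.normal(100, 10, len(df))

# Assumption: anything that is not numeric is a categorical column.
cat_cols = df.select_dtypes(exclude='number').columns.tolist()
print(cat_cols)

# The loop would then become:
# for col in cat_cols:
#     plot_stats(col)
```

If some non-numeric columns (IDs, free text, dates) should be skipped, they would need to be filtered out of cat_cols first.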

When working in a Jupyter Notebook, note that plt.show() is needed to display each plot right after its printed output, as done in the plot_stats function above.
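
If a table is easier to work with than printed output, the same per-column numbers (count, percent, and the X2/X1 ratio of sums) could be collected into one DataFrame. This is only a sketch; the function name summarize and its num/den parameters are hypothetical and not part of the answer above:

```python
import pandas as pd

def summarize(df, column, num='X2', den='X1'):
    """Return count, percent and sum(num)/sum(den) for each level of `column`."""
    grouped = df.groupby(column)
    out = pd.DataFrame({
        'count': grouped.size(),
        num: grouped[num].sum(),
        den: grouped[den].sum(),
    })
    out['percent'] = out['count'] / out['count'].sum() * 100
    out['ratio'] = out[num] / out[den]
    return out

# Tiny hand-made frame so the numbers are easy to check by eye.
demo = pd.DataFrame({
    'Gender': ['M', 'M', 'F', 'F'],
    'X1': [100.0, 300.0, 200.0, 200.0],
    'X2': [10.0, 30.0, 20.0, 60.0],
})
result = summarize(demo, 'Gender')
print(result)
```

Calling summarize once per categorical column and keeping the results in a dict keyed by column name would give one table per column without any plotting.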
