Repeat summary for all columns in Pandas Python
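I have a pandas DataFrame with more than 100 categorical columns and two numeric columns. In the sample data below I include only four categorical columns for simplicity: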
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({
    'Gender': ['M','M','F','M','F','M','F','M','F','F'],
    'Class' : ['A','B','B','C','A','C','B','A','A','C'],
    'Class_2': ['A1','B2','B3','C5','B1','C2','B1','B1','C3','D1'],
    'District' : ['N','N','E','S','S','N','N','E','S','S']
})
df['X1'] = np.random.normal(1000, 55, 10)
df['X2'] = np.random.normal(100, 10, 10)
For each categorical column (i.e. Gender, Class, Class_2 and District), I need to produce the following summary:
#Show the distribution of the column, both count and percent
print((df["Gender"].value_counts(sort=False, normalize=False)))
print((df["Gender"].value_counts(sort=False, normalize=True))*100)
#Plot the histogram
plt.figure(figsize=(9, 8))
plt.hist(df['Gender'], color='blue', edgecolor='black',
         bins=30)
plt.xlabel("Gender")
plt.ylabel("Count")
plt.title("Gender distribution")
#Aggregate sum of X1 and X2 by Gender, and find the X2/X1 ratio by Gender
#Select both columns with a list ([[...]]); recent pandas versions reject
#the tuple-style ['X2', 'X1'] selection on a groupby
var1 = df.groupby('Gender')[['X2', 'X1']].sum().reset_index()
var1['ratio'] = var1['X2']/var1['X1']
print(var1)
var1.plot('Gender', 'ratio', kind='bar',
          colormap='Paired',
          title='Ratio by Gender')
First parameterize the plotting and statistics code, for example by wrapping it in a function that takes the column name:
def plot_stats(column):
    #Show the distribution of the column, both count and percent
    print(df[column].value_counts(sort=False, normalize=False))
    print(df[column].value_counts(sort=False, normalize=True)*100)
    #Plot the histogram
    plt.figure(figsize=(9, 8))
    plt.hist(df[column], color='blue', edgecolor='black',
             bins=30)
    plt.xlabel(column)
    plt.ylabel("Count")
    plt.title(f"{column} distribution")
    #Aggregate sum of X1 and X2 by the given column, and find the X2/X1 ratio
    #Select both columns with a list ([[...]]); recent pandas versions reject
    #the tuple-style ['X2', 'X1'] selection on a groupby
    var1 = df.groupby(column)[['X2', 'X1']].sum().reset_index()
    var1['ratio'] = var1['X2']/var1['X1']
    print(var1)
    var1.plot(column, 'ratio', kind='bar',
              colormap='Paired',
              title=f'Ratio by {column}')
    #Display the plots for this column right after its printed output
    plt.show()
Then run it in a loop:
for col in ['Gender','Class','Class_2','District']:
    plot_stats(col)
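With 100+ categorical columns, typing the list by hand is impractical. The following is a minimal sketch of building the list automatically with select_dtypes. It assumes that every non-numeric column in the frame is categorical, which may not hold for your real data (dates or free-text columns would need to be excluded). The name categorical_cols is only illustrative.

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's data frame
df = pd.DataFrame({
    'Gender': ['M', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'F'],
    'Class': ['A', 'B', 'B', 'C', 'A', 'C', 'B', 'A', 'A', 'C'],
    'Class_2': ['A1', 'B2', 'B3', 'C5', 'B1', 'C2', 'B1', 'B1', 'C3', 'D1'],
    'District': ['N', 'N', 'E', 'S', 'S', 'N', 'N', 'E', 'S', 'S'],
})
df['X1'] = np.random.normal(1000, 55, 10)
df['X2'] = np.random.normal(100, 10, 10)

# Assumption: anything that is not numeric is a categorical column
categorical_cols = df.select_dtypes(exclude='number').columns.tolist()

for col in categorical_cols:
    print(col)  # replace this line with plot_stats(col)
```

Excluding numeric dtypes, rather than including object or category, keeps the selection independent of how a given pandas version stores strings.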
When working in a Jupyter Notebook, note that plt.show() is needed to display each plot right after its printed output, as in the plot_stats function above.
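Across 100+ columns, three separate printouts per column can be hard to scan. As a possible variation (not part of the answer above), the counts, percentages, sums and ratio can be collected into one table per column. This is a minimal sketch; the helper name summarize and its num/den parameters are hypothetical, and the tiny data frame uses fixed numbers so the result is easy to check by hand.

```python
import pandas as pd

def summarize(df, column, num='X2', den='X1'):
    """Return count, percent, summed num/den and their ratio per category."""
    grouped = df.groupby(column)
    out = grouped[[num, den]].sum()            # one row per category
    out.insert(0, 'count', grouped.size())     # rows in each category
    out.insert(1, 'percent', out['count'] / len(df) * 100)
    out['ratio'] = out[num] / out[den]         # ratio of the summed columns
    return out

# Fixed illustrative data (not the random data from the question)
df = pd.DataFrame({
    'Gender': ['M', 'M', 'F', 'F'],
    'X1': [1000.0, 1000.0, 500.0, 1500.0],
    'X2': [100.0, 300.0, 50.0, 150.0],
})

summary = summarize(df, 'Gender')
print(summary)
```

Because the function returns a DataFrame instead of printing, the per-column tables can be stored in a dict keyed by column name and plotted or exported later. Note that groupby drops rows whose key is missing by default, so the percentages would not add up to 100 if the column contains NaN values.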