Difficulties using seaborn barplot with categorical data

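I keep running into the same problem when I use seaborn's "categorical" plotting functions to actually plot rates of categorical data.

I've put together a simple example here of something I could swear used to work with seaborn. I managed to find a workaround using dummy variables, but that isn't always convenient. Does anyone know why my "Version 2" barplot use case doesn't work?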

import pandas as pd
from pandas import DataFrame
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Generate some example data of labels and associated values
outcomes = ['A'] * 50 + ['B'] * 20 + ['C'] * 5
trial = range(len(outcomes))

df = DataFrame({'Trial': trial, 'Outcome': outcomes})

plt.close('all')

# Version 1: This works but is a non-ideal workaround

# Generate separate boolean columns for each outcome
df2 = pd.get_dummies(df.Outcome).astype(bool)

plt.figure()
sns.barplot(data=df2, estimator=lambda x: 100 * np.mean(x))
plt.title('Outcomes V1')
plt.ylabel('Percent Trials')
plt.ylim([0,100])
plt.show()

# Version 2: This doesn't work and results in the following error
# unsupported operand type(s) for /: 'str' and 'int' 
plt.figure()
sns.barplot(x='Outcome', data=df, estimator=lambda x: 100 * np.mean(x))
plt.title('Outcomes V2')
plt.ylabel('Percent Trials')
plt.ylim([0,100])
plt.show()

Adding a y argument makes the call work:

sns.barplot(x='Outcome', y='Trial', data=df, estimator=lambda x: 100 * np.mean(x))

However, in your case it makes more sense to plot with sns.countplot (since you want trial 10 to count as one occurrence, not as the actual number 10):

sns.countplot(x='Outcome', data=df)

Of course, if you want percentages, you can do this:

sns.barplot(x='Outcome', y='Trial', data=df, estimator=lambda x: len(x) / len(df) * 100)  
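An alternative sketch, not part of the original answer: compute the percentages with pandas first and hand seaborn a small summary table, so no custom estimator is needed. The column name 'Percent Trials' and the helper plot_percentages are made up for this illustration.

```python
import pandas as pd

# Same example data as in the question
outcomes = ['A'] * 50 + ['B'] * 20 + ['C'] * 5
df = pd.DataFrame({'Trial': range(len(outcomes)), 'Outcome': outcomes})

# Share of trials per outcome, expressed as a percentage
pct = (df['Outcome']
       .value_counts(normalize=True)
       .mul(100)
       .rename_axis('Outcome')
       .reset_index(name='Percent Trials'))


def plot_percentages(table):
    """Draw one bar per outcome from the precomputed table (hypothetical helper)."""
    import seaborn as sns
    import matplotlib.pyplot as plt

    ax = sns.barplot(x='Outcome', y='Percent Trials', data=table)
    ax.set_ylim(0, 100)
    plt.show()
```

Calling plot_percentages(pct) should draw the chart. Since each outcome has exactly one row in the table, there is nothing for seaborn to aggregate.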

Explanation

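For a wide-form DataFrame (e.g. df2), you can pass just the DataFrame to the data parameter, and seaborn will automatically plot each numeric column along the x axis.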

For a long-form DataFrame (e.g. df), you need to pass arguments to the x and y parameters.
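To make the wide-form versus long-form distinction concrete, here is a small sketch that builds both frames from the question's data and prints them. It uses only pandas, so it says nothing about seaborn's behaviour beyond what is described above.

```python
import pandas as pd

# Same example data as in the question
outcomes = ['A'] * 50 + ['B'] * 20 + ['C'] * 5
df = pd.DataFrame({'Trial': range(len(outcomes)), 'Outcome': outcomes})

# Long form: one row per trial, the category lives in a single column
print(df.head(3))

# Wide form: one boolean column per outcome (the Version 1 workaround)
df2 = pd.get_dummies(df['Outcome']).astype(bool)
print(df2.head(3))

# The mean of each boolean column is the share of trials with that outcome
print(df2.mean() * 100)
```

In the wide frame every column is already a numeric quantity that can be averaged, which is why Version 1 works with nothing but data=df2.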

From the sns.barplot docstring:

Input data can be passed in a variety of formats, including:

  • Vectors of data represented as lists, numpy arrays, or pandas Series objects passed directly to the x, y, and/or hue parameters.
  • A "long-form" DataFrame, in which case the x, y, and hue variables will determine how the data are plotted.
  • A "wide-form" DataFrame, such that each numeric column will be plotted.
  • Anything accepted by plt.boxplot (e.g. a 2d array or list of vectors)
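As for the Version 2 error itself, a plausible reading (an assumption, not something the docstring states) is that with only x given, the string labels are what reaches the estimator. The sketch below reproduces that situation with plain NumPy: the same estimator fails on labels and works on boolean flags. The name percent_true is made up for this illustration.

```python
import numpy as np


def percent_true(x):
    """Same estimator as in the question: 100 times the mean."""
    return 100 * np.mean(x)


# String labels stored as objects, as pandas would hold them
labels = np.array(['A', 'A', 'B'], dtype=object)

try:
    percent_true(labels)
    failed = False
except TypeError as exc:
    # Averaging strings is not defined, so NumPy raises a TypeError
    failed = True
    print(exc)

# Boolean flags average cleanly, which is what the dummy-variable workaround provides
flags = labels == 'A'
print(percent_true(flags))
```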