How to iteratively plot different data as boxplots in seaborn (without them overlapping)?
Is there a way to iteratively plot data using seaborn's sns.boxplot() without having the boxplots overlap (and without combining the datasets into a single pd.DataFrame())?
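Background
Sometimes, when comparing datasets that differ (e.g. in size or shape), a side-by-side comparison is useful. It can be made by binning each dataset by a shared variable (via pd.cut() and df.groupby(), as shown below).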
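Previously, I plotted the binned data as boxplots on the same axis by looping over the separate DataFrames with matplotlib's ax.boxplot(), providing y-axis location values through the positions argument to ensure the boxplots do not overlap.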
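For reference, the matplotlib approach described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the original code: the two DataFrames are synthetic stand-ins (column names borrowed from the tips dataset), and the offsets of ±0.15 are arbitrary values chosen so the two sets of boxes sit either side of each bin's left edge.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-ins for two datasets of different size and range
# (assumption: column names mirror the tips dataset used later).
rng = np.random.default_rng(0)
df_a = pd.DataFrame({"tip": rng.uniform(0.01, 4, 200),
                     "total_bill": rng.uniform(5, 50, 200)})
df_b = pd.DataFrame({"tip": rng.uniform(0.01, 4, 1000),
                     "total_bill": rng.uniform(5, 250, 1000)})

bins = [0, 1, 2, 3, 4]
offsets = [-0.15, 0.15]   # arbitrary shift per dataset so boxes do not overlap
colors = ["red", "black"]

fig, ax = plt.subplots()
for df, offset, color in zip([df_a, df_b], offsets, colors):
    # Label every row with the left edge of its bin, then group by that label.
    bin_left = pd.cut(df["tip"], bins, labels=bins[:-1])
    grouped = df.groupby(bin_left, observed=False)["total_bill"]
    data = [values.to_numpy() for _, values in grouped]
    positions = [left + offset for left in bins[:-1]]
    # vert=False draws horizontal boxes; newer matplotlib releases also
    # accept orientation="horizontal" for the same purpose.
    ax.boxplot(data, positions=positions, widths=0.2, vert=False,
               manage_ticks=False,
               boxprops={"color": color}, medianprops={"color": color})

ax.set_yticks(bins[:-1])
ax.set_ylabel("tip")
ax.set_xlabel("total_bill")
```

Call plt.show() afterwards to display the figure. Because every call receives its own positions, each dataset keeps its own DataFrame and nothing has to be merged.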
Example
Below is a simplified example that shows the overlapping plots produced when using sns.boxplot():
import seaborn as sns
import random
import pandas as pd
import matplotlib.pyplot as plt

# Get the tips dataset and select a subset as an example
tips = sns.load_dataset("tips")
variable_to_bin_by = 'tip'
binned_variable = 'total_bill'
df = tips[[binned_variable, variable_to_bin_by]]

# Create a second DataFrame with different values and shape
df2 = pd.concat([df.copy()] * 5)
# Use pseudo-random numbers to convey that df2 is different to df
scale = [random.uniform(0, 2) for i in range(len(df2[binned_variable]))]
df2[binned_variable] = df2[binned_variable].values * scale * 5
dfs = [df, df2]

# Group the data by a list of bins
bins = [0, 1, 2, 3, 4]
for n, df in enumerate(dfs):
    gdf = df.groupby(pd.cut(df[variable_to_bin_by].values, bins))
    data = [i[1][binned_variable].values for i in gdf]
    dfs[n] = pd.DataFrame(data, index=bins[:-1])

# Create an axis for both DataFrames to be plotted on
fig, ax = plt.subplots()

# Loop the DataFrames and plot
colors = ['red', 'black']
for n in range(2):
    ax = sns.boxplot(data=dfs[n].T, ax=ax, width=0.2, orient='h',
                     color=colors[n])

plt.ylabel(variable_to_bin_by)
plt.xlabel(binned_variable)
plt.show()
More detail
I realise the simplified example above could be resolved by combining the DataFrames and providing the hue argument to sns.boxplot().
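For comparison, the combined-DataFrame route mentioned above could be sketched as follows. This is the approach the question wants to avoid, shown only to make the contrast concrete. The data are synthetic stand-ins, and the column names dataset and tip_bin are hypothetical names introduced for this illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-ins for two datasets of different size and range.
rng = np.random.default_rng(0)
df_a = pd.DataFrame({"tip": rng.uniform(0.01, 4, 200),
                     "total_bill": rng.uniform(5, 50, 200)})
df_b = pd.DataFrame({"tip": rng.uniform(0.01, 4, 1000),
                     "total_bill": rng.uniform(5, 250, 1000)})
bins = [0, 1, 2, 3, 4]

# Tag each dataset with a label, then stack them into one long-form DataFrame.
combined = pd.concat(
    [df_a.assign(dataset="a"), df_b.assign(dataset="b")],
    ignore_index=True,
)
# Label every row with the left edge of the tip bin it falls into.
combined["tip_bin"] = pd.cut(combined["tip"], bins, labels=bins[:-1])

# hue nesting draws the two datasets side by side within each bin.
fig, ax = plt.subplots()
sns.boxplot(data=combined, x="total_bill", y="tip_bin", hue="dataset",
            orient="h", palette={"a": "red", "b": "black"}, ax=ax)
```

Call plt.show() afterwards to display the figure. The cost is that both datasets must first be reshaped into one frame with a shared set of columns, which is what the question hopes to sidestep.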
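Updating the index of the DataFrame provided does not help either, because the y-values of the last DataFrame provided are then used.
Providing the kwargs argument (e.g. kwargs={'positions': dfs[n].T.index}) will not work, as this raises a TypeError.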
TypeError: boxplot() got multiple values for keyword argument 'positions'
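One plausible reading of this error is that the plotting function already passes its own positions to the underlying boxplot call, so a second positions forwarded through the extra keyword arguments collides with it. The sketch below reproduces that mechanism in plain Python. Both functions are hypothetical stand-ins written for this illustration, not seaborn's or matplotlib's actual code.

```python
def draw_boxes(data, positions=None, **style):
    """Hypothetical stand-in for a low-level drawing call."""
    return positions


def wrapper(data, **kwargs):
    """Hypothetical stand-in for a high-level function that picks positions itself."""
    computed = list(range(len(data)))
    # positions is passed explicitly here, so kwargs must not contain it too.
    return draw_boxes(data, positions=computed, **kwargs)


# Without a user-supplied positions the call goes through.
default_positions = wrapper([[1, 2], [3, 4]])

# Supplying positions as well makes Python see the keyword twice.
message = ""
try:
    wrapper([[1, 2], [3, 4]], positions=[0.2, 1.2])
except TypeError as err:
    message = str(err)
    print(message)
```

If that reading is right, the positions cannot be overridden from the caller's side, which is why a different way of shifting the boxes is needed.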
Setting the dodge argument of sns.boxplot() to True does not solve this.
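Funnily enough, the "hack" I proposed can be applied here: give each dataset a dummy hue column and reverse hue_order for the second dataset, so that the dodging used for hue nesting places the two sets of boxes side by side.
It complicates the code a bit, because seaborn needs a long-form DataFrame rather than a wide-form one to use hue nesting.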
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Get the tips dataset and select a subset as an example
tips = sns.load_dataset("tips")
df = tips[['total_bill', 'tip']]

# Group the data by a list of bins
bins = [0, 1, 2, 3, 4]
gdf = df.groupby(pd.cut(df['tip'].values, bins))
data = [i[1]['total_bill'].values for i in gdf]
df = pd.DataFrame(data, index=bins[:-1]).T
dfm = df.melt()  # create a long-form DataFrame
dfm.loc[:, 'dummy'] = 'dummy'

# Create a second, slightly different, DataFrame
dfm2 = dfm.copy()
dfm2['value'] = dfm['value'] * 2
dfs = [dfm, dfm2]
colors = ['red', 'black']
# 'other' never occurs in the data; it only reserves the empty hue slot
hue_orders = [['dummy', 'other'], ['other', 'dummy']]

# Create an axis for both DataFrames to be plotted on
fig, ax = plt.subplots()

# Loop the DataFrames and plot
for n in range(2):
    ax = sns.boxplot(data=dfs[n], x='value', y='variable', hue='dummy',
                     hue_order=hue_orders[n], ax=ax, width=0.2, orient='h',
                     color=colors[n])
    # Remove the legend if one was drawn for this call
    if ax.legend_ is not None:
        ax.legend_.remove()

plt.show()