How to plot boxplots for two groups of data

I am plotting two different boxplots with pandas:

plt.figure()
# medianprops is assumed to be a dict of line properties defined earlier
df['mean_train_score_error'] = 1 - df['mean_train_score']
df.boxplot(column=['mean_train_score_error'], by='modelo',
           medianprops=medianprops,
           autorange=True, showfliers=False, patch_artist=True,
           vert=True, showmeans=True, meanline=True)
plt.ylabel('Error: 1-F1 Score')
plt.title('Error de entrenamiento')
plt.suptitle('')



df['mean_test_score_error'] = 1 - df['mean_test_score']
df.boxplot(column=['mean_test_score_error'], by='modelo',
           medianprops=medianprops,
           autorange=True, showfliers=False, patch_artist=True,
           vert=True, showmeans=True, meanline=True)

plt.ylabel('Error: 1-F1 Score')
plt.title('Error de validación')
plt.suptitle('')

I get the following two plots:

Is it possible to draw all 6 boxplots on the same plot, using a different color for each group of three boxplots?

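  • The easiest way is to convert the data from wide to long format and then plot with seaborn, using the hue parameter.
  • pandas.wide_to_long
    • There must be a unique id, hence the added id column.
    • The columns being converted must share a common stub name, which is why I moved error to the front of the column names.
      • The error column names will end up in one column, and the values in a separate column.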
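
To make the stub-name requirement concrete, here is a minimal sketch of pandas.wide_to_long on a toy frame. The frame and the names wide, long_df, and split are made up for illustration and are not part of the data used in this answer.

```python
import pandas as pd

# Hypothetical toy frame: two value columns that share the stub "error"
wide = pd.DataFrame({
    "id": [0, 1, 2],
    "error_train": [0.10, 0.20, 0.30],
    "error_test": [0.15, 0.25, 0.35],
})

# The part after "error_" becomes the new "split" column,
# and the values are collected in a single "error" column.
long_df = pd.wide_to_long(
    wide, stubnames="error", i="id", j="split", sep="_", suffix=r"\D+"
).reset_index()

print(long_df)
```

Each original row now appears once per suffix, so the three-row frame becomes six rows with the columns id, split, and error.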

Imports and test data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# setup data and dataframe
np.random.seed(365)
data = {'mod_lg': np.random.normal(0.3, .1, size=(30,)),
        'mod_rf': np.random.normal(0.05, .01, size=(30,)),
        'mod_bg': np.random.normal(0.02, 0.002, size=(30,)),
        'mean_train_score': np.random.normal(0.95, 0.3, size=(30,)),
        'mean_test_score': np.random.normal(0.86, 0.5, size=(30,))}

df = pd.DataFrame(data)
df['error_mean_test_score'] = 1 - df['mean_test_score']
df['error_mean_train_score'] = 1 - df['mean_train_score']
df["id"] = df.index

# raw string avoids the invalid-escape warning for \D
df = pd.wide_to_long(df, stubnames='mod', i='id', j='mode', sep='_', suffix=r'\D+').reset_index()
df["id"] = df.index

# display dataframe: this is probably what your dataframe looks like to generate your current plots
   id mode  mean_train_score  error_mean_test_score  mean_test_score  error_mean_train_score       mod
0   0   lg          0.663855              -0.343961         1.343961                0.336145  0.316792
1   1   lg          0.990114               0.472847         0.527153                0.009886  0.352351
2   2   lg          1.179775               0.324748         0.675252               -0.179775  0.381738
3   3   lg          0.693155               0.519526         0.480474                0.306845  0.470385
4   4   lg          1.191048              -0.128033         1.128033               -0.191048  0.085305

Transform and plot

  • The error_score_name column contains the suffixes from error_mean_test_score and error_mean_train_score.
  • The error_score_value column contains the values.

# convert df error columns to long format
dfl = pd.wide_to_long(df, stubnames='error', i='id', j='score', sep='_', suffix=r'\D+').reset_index(level=1)
dfl.rename(columns={'score': 'error_score_name', 'error': 'error_score_value'}, inplace=True)

# display dfl

   error_score_name  mean_train_score       mod  mean_test_score mode  error_score_value
id                                                                                      
0   mean_test_score          0.663855  0.316792         1.343961   lg          -0.343961
1   mean_test_score          0.990114  0.352351         0.527153   lg           0.472847
2   mean_test_score          1.179775  0.381738         0.675252   lg           0.324748
3   mean_test_score          0.693155  0.470385         0.480474   lg           0.519526
4   mean_test_score          1.191048  0.085305         1.128033   lg          -0.128033

# plot dfl
sns.boxplot(x='mode', y='error_score_value', data=dfl, hue='error_score_name')
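
As an alternative sketch (a different technique from the wide_to_long approach above), DataFrame.melt produces the same long layout without needing a shared stub name, so the column names from the question can stay as they are. The frame df_q below is a hypothetical stand-in for the question's data, with made-up random scores.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(365)

# Hypothetical stand-in for the question's dataframe (3 models, 10 rows each)
df_q = pd.DataFrame({
    "modelo": np.repeat(["lg", "rf", "bg"], 10),
    "mean_train_score": rng.uniform(0.8, 1.0, size=30),
    "mean_test_score": rng.uniform(0.7, 0.95, size=30),
})
df_q["mean_train_score_error"] = 1 - df_q["mean_train_score"]
df_q["mean_test_score_error"] = 1 - df_q["mean_test_score"]

# Stack the two error columns: names go to one column, values to another
long_q = df_q.melt(
    id_vars="modelo",
    value_vars=["mean_train_score_error", "mean_test_score_error"],
    var_name="error_score_name",
    value_name="error_score_value",
)

print(long_q.head())
```

The result can be passed to the same seaborn call, with x='modelo', y='error_score_value', and hue='error_score_name', to get the train and test boxes side by side for each model.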