Median imputation by groups in pandas (handling group medians that are NaN)

I have the following DataFrame (shown here in dict form):

train = {'NAME_EDUCATION_TYPE': {5: 'Secondary / secondary special',
  6: 'Higher education',
  7: 'Higher education',
  8: 'Secondary / secondary special',
  9: 'Secondary / secondary special',
  10: 'Higher education',
  11: 'Secondary / secondary special',
  12: 'Secondary / secondary special',
  13: 'Secondary / secondary special',
  14: 'Secondary / secondary special'},
 'OCCUPATION_TYPE': {5: 'Laborers',
  6: 'Accountants',
  7: 'Managers',
  8: nan,
  9: 'Laborers',
  10: 'Core staff',
  11: nan,
  12: 'Laborers',
  13: 'Drivers',
  14: 'Laborers'},
 'AGE_GROUP': {5: '45-60',
  6: '21-45',
  7: '45-60',
  8: '45-60',
  9: '21-45',
  10: '21-45',
  11: '45-60',
  12: '21-45',
  13: '21-45',
  14: '21-45'},
 'DAYS_EMPLOYED': {5: -1588.0,
  6: -3130.0,
  7: -449.0,
  8: nan,
  9: -2019.0,
  10: -679.0,
  11: nan,
  12: -2717.0,
  13: -3028.0,
  14: -203.0},
 'DAYS_EMPLOYED_ANOM': {5: False,
  6: False,
  7: False,
  8: True,
  9: False,
  10: False,
  11: True,
  12: False,
  13: False,
  14: False},
 'DAYS_LAST_PHONE_CHANGE': {5: -2536.0,
  6: -1562.0,
  7: -1070.0,
  8: 0.0,
  9: -1673.0,
  10: -844.0,
  11: -2396.0,
  12: -2370.0,
  13: -4.0,
  14: -188.0}}

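I have several NaNs in the DAYS_EMPLOYED column. They are flagged as True in the DAYS_EMPLOYED_ANOM column. I want to impute these NaNs with the median of DAYS_EMPLOYED, grouped by the following columns: NAME_EDUCATION_TYPE, OCCUPATION_TYPE and AGE_GROUP.

I'm sure this can be done in a few lines of pandas, but I can't figure it out. I tried the following code, which I found in an example of mean imputation on a Series, but the NaN values were not imputed.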

fill_median = lambda g: g.fillna(g.median())
train.loc[train['DAYS_EMPLOYED_ANOM'] == True,'DAYS_EMPLOYED'] = train.groupby(['NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'AGE_GROUP'])['DAYS_EMPLOYED'].apply(fill_median)

I also tried applying the code from this post, without success:

You can do it like this:

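train['DAYS_EMPLOYED'] =  (train.groupby(['NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'AGE_GROUP'],
                                         dropna=False,      # keep rows whose group keys contain NaN
                                         group_keys=False)  # keep the original row index (needed on pandas >= 2.0)
                                ['DAYS_EMPLOYED']
                                .apply(lambda x: x.fillna(x.median()))
                          )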
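To illustrate why `dropna=False` is passed above, here is a minimal sketch on a hypothetical four-row frame (the column names `occ` and `days` are made up for the example). By default, `groupby` silently drops rows whose key is NaN, so those rows never reach the function passed to `apply`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two rows have a NaN grouping key.
df = pd.DataFrame({
    "occ": ["Laborers", np.nan, "Laborers", np.nan],
    "days": [-1588.0, np.nan, -2019.0, np.nan],
})

# Default behaviour (dropna=True): rows with a NaN key belong to no group.
default_groups = df.groupby("occ")["days"].size()

# dropna=False: NaN is treated as a group key of its own.
with_nan_groups = df.groupby("occ", dropna=False)["days"].size()

print(default_groups)
print(with_nan_groups)
```

In the question's data, the rows to impute are exactly the ones with a NaN `OCCUPATION_TYPE`, so without `dropna=False` they would be skipped entirely.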

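Note, however, that this will not work on your particular dataset, because each group needs at least one non-NaN value to compute a median. In your sample, both NaN rows (8 and 11) fall into the group ('Secondary / secondary special', NaN, '45-60'), which has no other members, so that group's median is itself NaN.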
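One way to check this before imputing is to count the non-NaN values per group. The sketch below rebuilds rows 5, 8, 9, 11 and 12 of the question's data by hand (only the columns needed here); the variable names are illustrative:

```python
import numpy as np
import pandas as pd

# Rows 5, 8, 9, 11, 12 from the question, restricted to the relevant columns.
df = pd.DataFrame(
    {
        "NAME_EDUCATION_TYPE": ["Secondary / secondary special"] * 5,
        "OCCUPATION_TYPE": ["Laborers", np.nan, "Laborers", np.nan, "Laborers"],
        "AGE_GROUP": ["45-60", "45-60", "21-45", "45-60", "21-45"],
        "DAYS_EMPLOYED": [-1588.0, np.nan, -2019.0, np.nan, -2717.0],
    },
    index=[5, 8, 9, 11, 12],
)

keys = ["NAME_EDUCATION_TYPE", "OCCUPATION_TYPE", "AGE_GROUP"]
grouped = df.groupby(keys, dropna=False)["DAYS_EMPLOYED"]

# count() ignores NaN, so a 0 marks a group whose median cannot be computed.
n_valid = grouped.count()
group_medians = grouped.median()

print(n_valid)
print(group_medians)
```

Any group with a count of 0 has a NaN median, so `fillna` with that median leaves its rows unchanged.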

You can use the population (overall) median instead:

train['DAYS_EMPLOYED'] =  (train.groupby(['NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'AGE_GROUP'],
                                         dropna=False,
                                         group_keys=False)  # keep the original row index (needed on pandas >= 2.0)
                                ['DAYS_EMPLOYED']
                                .apply(lambda x: x.fillna(train['DAYS_EMPLOYED'].median()))
                          )

Here is a hybrid approach that tries the group median first and falls back to the population median if the group median is NaN:

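import numpy as np

def median(s):
    # group median, or the population median if the group has no non-NaN values
    m = s.median()
    if np.isnan(m):
        m = train['DAYS_EMPLOYED'].median()
    return m

train['DAYS_EMPLOYED'] =  (train.groupby(['NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'AGE_GROUP'],
                                         dropna=False,
                                         group_keys=False   # keep the original row index (needed on pandas >= 2.0)
                                        )
                                ['DAYS_EMPLOYED'].apply(lambda x: x.fillna(median(x)))
                          )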
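As a possible alternative to the `apply`-based hybrid, the same fallback logic can be sketched with `transform("median")` followed by two chained `fillna` calls. This is only an illustration on a hypothetical six-row frame (the columns `edu`, `occ`, `days` and `days_imputed` are made up), and it assumes a reasonably recent pandas, where `transform` handles `dropna=False` groups correctly:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group (A, x) has valid values, group (B, NaN) has none.
df = pd.DataFrame({
    "edu": ["A", "A", "A", "B", "B", "B"],
    "occ": ["x", "x", "x", np.nan, np.nan, "y"],
    "days": [10.0, np.nan, 30.0, np.nan, np.nan, 50.0],
})

keys = ["edu", "occ"]

# Per-row group median, aligned to df's index (NaN where the group is all-NaN).
group_median = df.groupby(keys, dropna=False)["days"].transform("median")

# Fallback value computed over the whole column.
overall_median = df["days"].median()

# First try the group median, then fall back to the overall median.
df["days_imputed"] = df["days"].fillna(group_median).fillna(overall_median)

print(df)
```

Because `transform` returns a result aligned to the original index, no `group_keys` handling is needed. Row 1 gets the (A, x) group median, while rows 3 and 4 fall back to the overall median.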