Median imputation by groups in pandas (handling group medians that are NaN)
I have the following DataFrame (shown here in dict form):
train = {'NAME_EDUCATION_TYPE': {5: 'Secondary / secondary special',
6: 'Higher education',
7: 'Higher education',
8: 'Secondary / secondary special',
9: 'Secondary / secondary special',
10: 'Higher education',
11: 'Secondary / secondary special',
12: 'Secondary / secondary special',
13: 'Secondary / secondary special',
14: 'Secondary / secondary special'},
'OCCUPATION_TYPE': {5: 'Laborers',
6: 'Accountants',
7: 'Managers',
8: nan,
9: 'Laborers',
10: 'Core staff',
11: nan,
12: 'Laborers',
13: 'Drivers',
14: 'Laborers'},
'AGE_GROUP': {5: '45-60',
6: '21-45',
7: '45-60',
8: '45-60',
9: '21-45',
10: '21-45',
11: '45-60',
12: '21-45',
13: '21-45',
14: '21-45'},
'DAYS_EMPLOYED': {5: -1588.0,
6: -3130.0,
7: -449.0,
8: nan,
9: -2019.0,
10: -679.0,
11: nan,
12: -2717.0,
13: -3028.0,
14: -203.0},
'DAYS_EMPLOYED_ANOM': {5: False,
6: False,
7: False,
8: True,
9: False,
10: False,
11: True,
12: False,
13: False,
14: False},
'DAYS_LAST_PHONE_CHANGE': {5: -2536.0,
6: -1562.0,
7: -1070.0,
8: 0.0,
9: -1673.0,
10: -844.0,
11: -2396.0,
12: -2370.0,
13: -4.0,
14: -188.0}}
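I have several NaNs in the DAYS_EMPLOYED column. They are flagged as True in the DAYS_EMPLOYED_ANOM column.
I would like to impute these NaNs with the median of DAYS_EMPLOYED, grouped by the following columns: NAME_EDUCATION_TYPE, OCCUPATION_TYPE and AGE_GROUP.
I'm sure this can be done in a few lines of pandas, but I can't figure it out. I tried the following code, which I found in an example of mean imputation for a Series, but the NaN values are not imputed.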
fill_median = lambda g: g.fillna(g.median())
train.loc[train['DAYS_EMPLOYED_ANOM'] == True,'DAYS_EMPLOYED'] = train.groupby(['NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'AGE_GROUP'])['DAYS_EMPLOYED'].apply(fill_median)
I also tried applying the code from this post, without success.
You can do it like this:
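train['DAYS_EMPLOYED'] = (train.groupby(['NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'AGE_GROUP'],
                                        dropna=False,      # keep rows whose group key is NaN
                                        group_keys=False)  # keep the original index so the assignment aligns
                          ['DAYS_EMPLOYED']
                          .apply(lambda x: x.fillna(x.median()))
                         )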
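Note, however, that this does not work on your particular dataset: a group needs at least one non-NaN value for its median to be computed, and both of your NaN rows (8 and 11) fall in a group that contains no other rows.
You can use the population median instead: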
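To see why the group median is unavailable here, the following is a minimal sketch on a reduced copy of the question's data (only the three key columns and DAYS_EMPLOYED). The names `SEC`, `HIGH`, `keys`, `medians_default` and `medians_keep` are introduced for illustration only. It compares the default `dropna=True` grouping with `dropna=False`, assuming a reasonably recent pandas (the `dropna` argument needs 1.1 or later).

```python
import numpy as np
import pandas as pd

SEC = "Secondary / secondary special"
HIGH = "Higher education"

# Reduced copy of the question's data: grouping keys and target column only.
train = pd.DataFrame(
    {
        "NAME_EDUCATION_TYPE": [SEC, HIGH, HIGH, SEC, SEC, HIGH, SEC, SEC, SEC, SEC],
        "OCCUPATION_TYPE": ["Laborers", "Accountants", "Managers", np.nan, "Laborers",
                            "Core staff", np.nan, "Laborers", "Drivers", "Laborers"],
        "AGE_GROUP": ["45-60", "21-45", "45-60", "45-60", "21-45",
                      "21-45", "45-60", "21-45", "21-45", "21-45"],
        "DAYS_EMPLOYED": [-1588.0, -3130.0, -449.0, np.nan, -2019.0,
                          -679.0, np.nan, -2717.0, -3028.0, -203.0],
    },
    index=range(5, 15),
)
keys = ["NAME_EDUCATION_TYPE", "OCCUPATION_TYPE", "AGE_GROUP"]

# Default (dropna=True): rows whose key contains NaN (8 and 11) belong to no group.
medians_default = train.groupby(keys)["DAYS_EMPLOYED"].median()

# dropna=False: NaN is treated as a key value, so rows 8 and 11 form a group of their own.
medians_keep = train.groupby(keys, dropna=False)["DAYS_EMPLOYED"].median()

print(len(medians_default), len(medians_keep))
# The extra group holds only NaN values, so its median is NaN as well.
print(medians_keep[medians_keep.isna()])
```

This would also explain why the attempt in the question left the values untouched: with the default grouping, the two NaN rows are never part of any group.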
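train['DAYS_EMPLOYED'] = (train.groupby(['NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'AGE_GROUP'],
                                        dropna=False,      # keep rows whose group key is NaN
                                        group_keys=False)  # keep the original index so the assignment aligns
                          ['DAYS_EMPLOYED']
                          .apply(lambda x: x.fillna(train['DAYS_EMPLOYED'].median()))
                         )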
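Since every group receives the same fill value in this variant, the grouping does not change the result. A rough equivalent without `groupby` is sketched below on the DAYS_EMPLOYED values from the question; `days`, `population_median` and `filled` are names made up for this example.

```python
import numpy as np
import pandas as pd

# DAYS_EMPLOYED values from the question, with the original index 5..14.
days = pd.Series(
    [-1588.0, -3130.0, -449.0, np.nan, -2019.0,
     -679.0, np.nan, -2717.0, -3028.0, -203.0],
    index=range(5, 15),
    name="DAYS_EMPLOYED",
)

# Series.median() skips NaN by default, so this is the median of the 8 known values.
population_median = days.median()

# Fill every missing entry with that single value.
filled = days.fillna(population_median)
print(population_median)
print(filled.loc[[8, 11]])
```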
Here is a hybrid approach that tries the group median first and falls back to the population median otherwise:
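import numpy as np

def median(s):
    # Group median, or the population median when the whole group is NaN
    m = s.median()
    if np.isnan(m):
        m = train['DAYS_EMPLOYED'].median()
    return m

train['DAYS_EMPLOYED'] = (train.groupby(['NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'AGE_GROUP'],
                                        dropna=False,      # keep rows whose group key is NaN
                                        group_keys=False)  # keep the original index so the assignment aligns
                          ['DAYS_EMPLOYED']
                          .apply(lambda x: x.fillna(median(x)))
                         )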
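The same hybrid idea can probably be written without `apply`, using `transform('median')` to broadcast each group's median back onto its rows and then chaining two `fillna` calls. This is a sketch rather than a drop-in replacement: it assumes a recent pandas (I would not rely on `transform` together with `dropna=False` on old 1.x releases), it works on a reduced copy of the question's data, and the names `SEC`, `HIGH`, `keys` and `group_median` are illustrative.

```python
import numpy as np
import pandas as pd

SEC = "Secondary / secondary special"
HIGH = "Higher education"

# Reduced copy of the question's data: grouping keys and target column only.
train = pd.DataFrame(
    {
        "NAME_EDUCATION_TYPE": [SEC, HIGH, HIGH, SEC, SEC, HIGH, SEC, SEC, SEC, SEC],
        "OCCUPATION_TYPE": ["Laborers", "Accountants", "Managers", np.nan, "Laborers",
                            "Core staff", np.nan, "Laborers", "Drivers", "Laborers"],
        "AGE_GROUP": ["45-60", "21-45", "45-60", "45-60", "21-45",
                      "21-45", "45-60", "21-45", "21-45", "21-45"],
        "DAYS_EMPLOYED": [-1588.0, -3130.0, -449.0, np.nan, -2019.0,
                          -679.0, np.nan, -2717.0, -3028.0, -203.0],
    },
    index=range(5, 15),
)
keys = ["NAME_EDUCATION_TYPE", "OCCUPATION_TYPE", "AGE_GROUP"]

# One median per row, aligned with train's index; NaN where the whole group is NaN.
group_median = train.groupby(keys, dropna=False)["DAYS_EMPLOYED"].transform("median")

# First try the group median, then fall back to the population median.
# The right-hand side is evaluated on the original column before the assignment.
train["DAYS_EMPLOYED"] = (
    train["DAYS_EMPLOYED"]
    .fillna(group_median)
    .fillna(train["DAYS_EMPLOYED"].median())
)
print(train.loc[[8, 11], "DAYS_EMPLOYED"])
```

On this particular data only the fallback is used, because the two missing rows sit in a group with no known values.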