Fill np.nan conditionally within a for/if-else loop

I have been working on this for a while, but I can't seem to find the answer I need. Suppose I have the data frame below.

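What I want to do is fill the last three rows of df['gender'] based on the values in the df['home_work'] column: specifically, 'm' if home_work > 9 and 'f' otherwise. Bear in mind that this is a made-up dataset, and I promise no offense is intended!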

import numpy as np
import pandas as pd

enr = pd.DataFrame({'name_id':[1254, 1359, 1254, 1296, 1353, 2656],
                   'enrollment_term':['spring 2018', 'spring 2018', 'fall 2018', 'spring 2018', 'spring 2018', 'fall 2020'],
                   'gpa_term': [2.93, np.nan, 1.65, 4.00, 3.95, 2.92],
                   'dog_owner':[0,1,1,1, 1, 0],
                   'salary':[50657, 90658, np.nan, 104352, np.nan, 102043],
                   'home_work':[34, np.nan, 12, 9, 8, 27],
                   'gender':['m','f','f',np.nan, np.nan, np.nan]})

enr

Below is the code I tried, followed by the error it raises:

for i in df['gender'].isna():
    if df['home_work'][i] > 9:
        df['gender'][i].fillna('m')
    else:
        df['gender'][i].fillna('f')

KeyError: False

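Any help is much appreciated, as I have been working on this for a while. I have a dataset of 90K+ rows that I want to adapt this to, and I would like to create a function to streamline the process, but I have hit a roadblock!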

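The problem I run into is that np.nan falls through to the default branch, so gender gets filled with a value even when the condition is not met. Thoughts?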


# Edited

Suppose I have the following df:

enr = pd.DataFrame({'name_id':[1254, 1359, 1254, 1296, 1353, 2656], 
                   'enrollment_term':['spring 2018', 'spring 2018', 'fall 2018', 'spring 2018', 'spring 2018', 'fall 2020'],
                   'gpa_term': [2.93, np.nan, 1.65, 4.00, 3.95, 2.92],
                   'dog_owner':[0,1,1,1, 1, 0],
                   'salary':[50657, 90658, np.nan, 104352, np.nan, 102043],
                   'home_work':[np.nan, np.nan, 0.7, 0.3, 0.64, 0.49],
                   'gender':[0, 1, 1,np.nan, np.nan, np.nan]})

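I want to impute enr['gender'] based on home_work. If enr['home_work'] >= 0.5, then enr['gender'] should be 0; otherwise (as long as enr['home_work'] is not np.nan) enr['gender'] should be 1.

What I don't want is for values in enr['gender'] to be imputed where enr['home_work'] is np.nan. I have tried many different techniques, but they all seem to impute 1. Any ideas?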

Use numpy.where with Series.fillna:

enr['gender'] = np.where(enr['home_work'] > 9,  
                         enr['gender'].fillna('m'),
                         enr['gender'].fillna('f'))

Or build a mask of the missing values and apply it to both sides of the assignment:

m = enr['gender'].isna()
enr.loc[m, 'gender'] = np.where(enr['home_work'] > 9,  'm',  'f')[m]

print (enr)
   name_id enrollment_term  gpa_term  dog_owner    salary  home_work gender
0     1254     spring 2018      2.93          0   50657.0       34.0      m
1     1359     spring 2018       NaN          1   90658.0        NaN      f
2     1254       fall 2018      1.65          1       NaN       12.0      f
3     1296     spring 2018      4.00          1  104352.0        9.0      f
4     1353     spring 2018      3.95          1       NaN        8.0      f
5     2656       fall 2020      2.92          0  102043.0       27.0      m
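
As a side note on the KeyError: False in the question, the sketch below shows what the loop variable actually holds, next to the mask-based fill used above. It is a minimal illustration on a cut-down frame (only the two relevant columns are kept, which is an assumption made for brevity), not a replacement for the code above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'home_work': [34, np.nan, 12, 9, 8, 27],
    'gender': ['m', 'f', 'f', np.nan, np.nan, np.nan],
})

# Iterating over the result of isna() yields the boolean values themselves,
# not the row labels, so df['home_work'][i] ends up looking for a label False.
flags = list(df['gender'].isna())
print(flags)

# Using the boolean Series as a mask selects the missing rows without a loop.
missing = df['gender'].isna()
filled = df['gender'].copy()
filled.loc[missing] = np.where(df.loc[missing, 'home_work'] > 9, 'm', 'f')
print(filled.tolist())
```

Because the fill is a single vectorized assignment, it should also scale to the 90K+ row dataset mentioned in the question far better than a Python-level loop.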

EDIT:

m = enr['gender'].isna() & enr['home_work'].notna()
enr.loc[m, 'gender'] = np.where(enr['home_work'] >= 0.5, 0, 1)[m]
print (enr)
   name_id enrollment_term  gpa_term  dog_owner    salary  home_work  gender
0     1254     spring 2018      2.93          0   50657.0        NaN     0.0
1     1359     spring 2018       NaN          1   90658.0        NaN     1.0
2     1254       fall 2018      1.65          1       NaN       0.70     1.0
3     1296     spring 2018      4.00          1  104352.0       0.30     1.0
4     1353     spring 2018      3.95          1       NaN       0.64     0.0
5     2656       fall 2020      2.92          0  102043.0       0.49     1.0
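
Since the question asks for a function, the combined mask can be wrapped up as follows. This is only a sketch: the name impute_by_threshold and its parameters are hypothetical, and the demo frame is invented so that it contains a row where both columns are missing, a case the sample data does not have.

```python
import numpy as np
import pandas as pd

def impute_by_threshold(df, target, source, threshold, high, low):
    """Hypothetical helper: fill missing `target` values from `source`.

    A row is filled only where `target` is missing and `source` is present.
    `high` is used when source >= threshold, `low` otherwise.
    """
    out = df.copy()  # leave the caller's frame untouched
    fillable = out[target].isna() & out[source].notna()
    out.loc[fillable, target] = np.where(
        out.loc[fillable, source] >= threshold, high, low
    )
    return out

# Invented demo data: row 3 is missing in both columns and must stay missing.
demo = pd.DataFrame({
    'home_work': [np.nan, 0.7, 0.3, np.nan, 0.64],
    'gender': [0, 1, np.nan, np.nan, np.nan],
})
result = impute_by_threshold(demo, 'gender', 'home_work', 0.5, high=0, low=1)
print(result)
```

Returning a copy rather than modifying the frame in place is a design choice made here to keep the helper easy to test; assigning the result back to the original name gives the same effect as the in-place version above.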

Let's try mapping the values and using where:

df.gender = df.gender.where(df.gender.notna(), df.home_work.gt(9).map({True: 'm', False: 'f'}))


df
   name_id enrollment_term  gpa_term  dog_owner    salary  home_work gender
0     1254     spring 2018      2.93          0   50657.0       34.0      m
1     1359     spring 2018       NaN          1   90658.0        NaN      f
2     1254       fall 2018      1.65          1       NaN       12.0      f
3     1296     spring 2018      4.00          1  104352.0        9.0      f
4     1353     spring 2018      3.95          1       NaN        8.0      f
5     2656       fall 2020      2.92          0  102043.0       27.0      m
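
One caveat with this approach: a comparison against NaN evaluates to False, so a row where home_work is missing would be labelled 'f'. If such rows should stay missing, as the edited question asks, the mapped labels can be blanked out first. The following is a sketch of that variation on a cut-down, slightly altered frame (row 3 has home_work set to NaN purely for illustration).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'home_work': [34, np.nan, 12, np.nan, 8, 27],
    'gender': ['m', 'f', 'f', np.nan, np.nan, np.nan],
})

# NaN > 9 evaluates to False, so map() alone would label a missing home_work as 'f'.
labels = df['home_work'].gt(9).map({True: 'm', False: 'f'})

# Blank out the labels that were derived from a missing home_work.
labels = labels.where(df['home_work'].notna())

# Keep existing gender values; take the label only where gender is missing.
df['gender'] = df['gender'].where(df['gender'].notna(), labels)
print(df['gender'].tolist())
```

Row 3 remains missing because neither gender nor home_work is available for it, while rows 4 and 5 are filled from their home_work values.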