How to look up values for the NaN rows in a column without overwriting the other values (Python 3.7)

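My goal is to look up the "Team" value from a master dataset using Year + Month + Name as the key. If the result contains NaN, only those NaN rows should be filled, using "Year" + "Name" as a second key.

Goal: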

# dataset with the looked-up column "Team"
   Name  Year  Month        KEY    KEY_ND Team
0  Paul  2019      2  20192Paul  2019Paul    A
1  Paul  2019      1  20191Paul  2019Paul    A
2  Paul  2018      2  20182Paul  2018Paul    C
3  Paul  2018      1  20181Paul  2018Paul    B
4   Sue  2019      1   20191Sue   2019Sue    A

Sample data and the script I have tried so far:

import pandas as pd

Master = pd.DataFrame({"Name": ["Paul","Paul","Paul","Sue"],
                   "Team": ["A","B","C", "A"],
                   "Year": ["2019","2018","2018","2019"],
                   "Month": [1,1,2,1]
                  })

xx = pd.DataFrame({"Name": ["Paul","Paul","Paul","Paul","Sue"],
                   "Year": ["2019","2019","2018","2018","2019"],
                   "Month": [2,1,2,1,1]
                  })


# Make First Key
Master_KEY = Master.assign(KEY = Master['Year'].astype(str) + 
Master['Month'].astype(str) + Master['Name'].astype(str))

# Make First Key
xx['KEY'] = xx['Year'] + xx['Month'].astype(str) + xx['Name']

# Make Second Key
Master_KEY = Master_KEY.assign(KEY_ND = Master['Year'].astype(str) + Master['Name'].astype(str))

# Make Second Key
xx['KEY_ND'] = xx['Year'] + xx['Name']

# First LOOKUP with first Key : Year + Month + Name 
xx = pd.merge(xx, Master_KEY[['KEY', 'Team']], on = 'KEY', how = 'left')

# MASK for NaN
x_mask = xx['Team'].isnull()

# Second LOOKUP with second Key : Year + Name 
xx.loc[x_mask, 'Team'] = pd.merge(xx,Master_KEY[['KEY_ND','Team']],
                 on = 'KEY_ND', how = 'left')

Problem:

The final second lookup does not return the expected result: the NaN value is still there.

xx
   Name  Year  Month        KEY    KEY_ND Team
0  Paul  2019      2  20192Paul  2019Paul  NaN
1  Paul  2019      1  20191Paul  2019Paul    A
2  Paul  2018      2  20182Paul  2018Paul    C
3  Paul  2018      1  20181Paul  2018Paul    B
4   Sue  2019      1   20191Sue   2019Sue    A

This is the part of the script that has the problem:

# Second LOOKUP with second Key : Year + Name 
xx.loc[x_mask, 'Team'] = pd.merge(xx,Master_KEY[['KEY_ND','Team']],
                 on = 'KEY_ND', how = 'left')

*Obviously this code is long and inefficient. I would appreciate any better suggestion that is clean and fast.

You can use DataFrame.merge twice with different key columns. For the fallback merge, first remove duplicate keys with DataFrame.drop_duplicates, then replace the missing values with DataFrame.fillna:

Master1 = Master[['Name','Year', 'Team']].drop_duplicates(subset=['Name','Year'])
df1 = xx[['Name','Year']].merge(Master1, how='left')
df2 = xx.merge(Master, on=['Name','Year', 'Month'], how='left').fillna({'Team': df1['Team']})
print (df2)
   Name  Year  Month Team
0  Paul  2019      2    A
1  Paul  2019      1    A
2  Paul  2018      2    C
3  Paul  2018      1    B
4   Sue  2019      1    A
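
The de-duplication step matters because the fallback key is not unique in Master: (Paul, 2018) appears with both team B and team C. The sketch below, which rebuilds the sample frames from the question, shows what happens without it. The `validate="many_to_one"` argument is an optional safeguard that is not part of the answer above; it asks pandas to raise rather than silently duplicate rows.

```python
import pandas as pd

Master = pd.DataFrame({"Name": ["Paul", "Paul", "Paul", "Sue"],
                       "Team": ["A", "B", "C", "A"],
                       "Year": ["2019", "2018", "2018", "2019"],
                       "Month": [1, 1, 2, 1]})
xx = pd.DataFrame({"Name": ["Paul", "Paul", "Paul", "Paul", "Sue"],
                   "Year": ["2019", "2019", "2018", "2018", "2019"],
                   "Month": [2, 1, 2, 1, 1]})

fallback = Master[["Name", "Year", "Team"]]

# Without drop_duplicates, each (Paul, 2018) row in xx matches two Master rows,
# so the merged frame has more rows than xx and no longer lines up with it.
naive = xx[["Name", "Year"]].merge(fallback, how="left")
rows_naive = len(naive)

# validate="many_to_one" turns that silent duplication into an error.
try:
    xx[["Name", "Year"]].merge(fallback, how="left", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

# After de-duplication the merge keeps exactly one row per row of xx.
dedup = fallback.drop_duplicates(subset=["Name", "Year"])
safe = xx[["Name", "Year"]].merge(dedup, how="left", validate="many_to_one")
```

Note that drop_duplicates keeps the first row per key by default, so which team wins for (Paul, 2018) depends on the row order of Master.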

Your solution should be changed to use Series.map on the key columns, replacing the missing values with Series.fillna:

Master = Master.assign(K1 =  Master['Year'].astype(str) + 
                             Master['Month'].astype(str) + 
                             Master['Name'].astype(str),
                       K2 =  Master['Year'].astype(str) + 
                             Master['Name'].astype(str))
xx = xx.assign(K1 =  xx['Year'].astype(str) + 
                     xx['Month'].astype(str) + 
                     xx['Name'].astype(str),
               K2 =  xx['Year'].astype(str) + 
                     xx['Name'].astype(str))

s1 = xx['K1'].map(Master.set_index('K1')['Team'])
s2 = xx['K2'].map(Master.drop_duplicates('K2').set_index('K2')['Team'])
xx['Team'] = s1.fillna(s2)
print (xx)
   Name  Year  Month         K1        K2 Team
0  Paul  2019      2  20192Paul  2019Paul    A
1  Paul  2019      1  20191Paul  2019Paul    A
2  Paul  2018      2  20182Paul  2018Paul    C
3  Paul  2018      1  20181Paul  2018Paul    B
4   Sue  2019      1   20191Sue   2019Sue    A
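
One caveat with concatenated string keys is that different field values can, in principle, produce the same string. A minimal sketch of an alternative that keeps the fields separate by using tuples as dictionary keys follows. The names `full`, `fallback` and `teams` are introduced here for illustration only, and this plain-Python lookup is a swapped-in technique, not the Series.map approach shown above.

```python
import pandas as pd

Master = pd.DataFrame({"Name": ["Paul", "Paul", "Paul", "Sue"],
                       "Team": ["A", "B", "C", "A"],
                       "Year": ["2019", "2018", "2018", "2019"],
                       "Month": [1, 1, 2, 1]})
xx = pd.DataFrame({"Name": ["Paul", "Paul", "Paul", "Paul", "Sue"],
                   "Year": ["2019", "2019", "2018", "2018", "2019"],
                   "Month": [2, 1, 2, 1, 1]})

# Illustration of a collision: month 11 + "A" and month 1 + "1A" give the same string.
collision = ("2019" + str(11) + "A") == ("2019" + str(1) + "1A")

# First key: (Year, Month, Name) -> Team.
full = {(y, m, n): t
        for y, m, n, t in zip(Master["Year"], Master["Month"],
                              Master["Name"], Master["Team"])}

# Second key: (Year, Name) -> first Team seen, mirroring drop_duplicates.
fallback = {}
for y, n, t in zip(Master["Year"], Master["Name"], Master["Team"]):
    fallback.setdefault((y, n), t)

# Try the full key first, then fall back to the shorter key (None if both miss).
teams = [full.get((y, m, n), fallback.get((y, n)))
         for y, m, n in zip(xx["Year"], xx["Month"], xx["Name"])]
result = xx.assign(Team=teams)
```

For large frames the vectorized Series.map version is likely to be faster; this variant only trades speed for unambiguous keys.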

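A slightly cleaner and more readable solution follows. Define a transform function that adds the Team column to each row according to your conditions, and apply it to the DataFrame. It stays readable and is easy to extend to more complex conditions.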

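def transform(x):
    # Candidate Master rows sharing this row's Name and Year (the fallback key)
    master_row = Master[(Master.Name==x.Name) & (Master.Year==x.Year)]
    if len(master_row)>1:
        # Several candidates: prefer the one whose Month also matches
        temp_rows = master_row[master_row.Month == x.Month]
        master_row = temp_rows if len(temp_rows)>0 else master_row

    # With no match at all, leave Team missing instead of raising IndexError
    x["Team"] = master_row.iloc[0].Team if len(master_row)>0 else None
    return x

# apply returns a new DataFrame, so assign the result back
xx = xx.apply(transform, axis=1)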
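
A row-wise apply filters Master once per row, which can become slow on large data. If more fallback levels are needed, the same idea can be sketched in vectorized form by looping over lists of key columns, from most to least specific. The helper name `lookup_with_fallback` and its parameters are hypothetical and introduced here only for illustration.

```python
import pandas as pd

def lookup_with_fallback(df, master, key_sets, value):
    """Return `value` looked up from `master`, trying each key list in order."""
    result = pd.Series(index=df.index, dtype=object)  # starts as all-missing
    for keys in key_sets:
        # One row per key, so the left merge keeps df's row count and order.
        table = master[keys + [value]].drop_duplicates(subset=keys)
        found = df[keys].merge(table, on=keys, how="left")[value]
        found.index = df.index  # merge resets the index; realign with df
        # Keep values already found, fill the rest from this key level.
        result = result.where(result.notna(), found)
    return result

Master = pd.DataFrame({"Name": ["Paul", "Paul", "Paul", "Sue"],
                       "Team": ["A", "B", "C", "A"],
                       "Year": ["2019", "2018", "2018", "2019"],
                       "Month": [1, 1, 2, 1]})
xx = pd.DataFrame({"Name": ["Paul", "Paul", "Paul", "Paul", "Sue"],
                   "Year": ["2019", "2019", "2018", "2018", "2019"],
                   "Month": [2, 1, 2, 1, 1]})

xx["Team"] = lookup_with_fallback(
    xx, Master,
    key_sets=[["Name", "Year", "Month"], ["Name", "Year"]],
    value="Team",
)
```

Adding a third level, for example Name only, would then be a matter of appending another list to `key_sets`.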