How to look up values for the NaN rows of a column without overwriting the other values (Python 3.7)
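My goal is to look up the "Team" value from a master dataset, using Year + Month + Name as the key.
If the result is NaN, only those NaN rows should be filled using a second key, "Year" + "Name".
Goal: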
# dataset with the looked-up column "Team"
   Name  Year  Month        KEY    KEY_ND  Team
0  Paul  2019      2  20192Paul  2019Paul     A
1  Paul  2019      1  20191Paul  2019Paul     A
2  Paul  2018      2  20182Paul  2018Paul     C
3  Paul  2018      1  20181Paul  2018Paul     B
4   Sue  2019      1   20191Sue   2019Sue     A
Sample data and the script I have tried so far:
import pandas as pd

Master = pd.DataFrame({"Name": ["Paul","Paul","Paul","Sue"],
                       "Team": ["A","B","C", "A"],
                       "Year": ["2019","2018","2018","2019"],
                       "Month": [1,1,2,1]
                       })
xx = pd.DataFrame({"Name": ["Paul","Paul","Paul","Paul","Sue"],
                   "Year": ["2019","2019","2018","2018","2019"],
                   "Month": [2,1,2,1,1]
                   })
# Make first key on Master: Year + Month + Name
Master_KEY = Master.assign(KEY = Master['Year'].astype(str) +
                           Master['Month'].astype(str) + Master['Name'].astype(str))
# Make first key on xx
xx['KEY'] = xx['Year'] + xx['Month'].astype(str) + xx['Name']
# Make second key on Master: Year + Name
Master_KEY = Master_KEY.assign(KEY_ND = Master['Year'].astype(str) + Master['Name'].astype(str))
# Make second key on xx
xx['KEY_ND'] = xx['Year'] + xx['Name']
# First LOOKUP with first key: Year + Month + Name
xx = pd.merge(xx, Master_KEY[['KEY', 'Team']], on = 'KEY', how = 'left')
# MASK for NaN
x_mask = xx['Team'].isnull()
# Second LOOKUP with second key: Year + Name
xx.loc[x_mask, 'Team'] = pd.merge(xx, Master_KEY[['KEY_ND','Team']],
                                  on = 'KEY_ND', how = 'left')
Problem:
The final "Second LOOKUP" does not return the expected result; the NaN value is still there.
xx
   Name  Year  Month        KEY    KEY_ND  Team
0  Paul  2019      2  20192Paul  2019Paul   NaN
1  Paul  2019      1  20191Paul  2019Paul     A
2  Paul  2018      2  20182Paul  2018Paul     C
3  Paul  2018      1  20181Paul  2018Paul     B
4   Sue  2019      1   20191Sue   2019Sue     A
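This is the part of the script that goes wrong:
# Second LOOKUP with second key: Year + Name
xx.loc[x_mask, 'Team'] = pd.merge(xx, Master_KEY[['KEY_ND','Team']],
                                  on = 'KEY_ND', how = 'left')
*Clearly this code is long and inefficient; I would appreciate any suggestion that is cleaner and faster.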
You can use DataFrame.merge twice with different on parameters. For the second merge, remove duplicates with DataFrame.drop_duplicates, then replace the missing values with DataFrame.fillna:
Master1 = Master[['Name','Year', 'Team']].drop_duplicates(subset=['Name','Year'])
df1 = xx[['Name','Year']].merge(Master1, how='left')
df2 = xx.merge(Master, on=['Name','Year', 'Month'], how='left').fillna({'Team': df1['Team']})
print(df2)
   Name  Year  Month  Team
0  Paul  2019      2     A
1  Paul  2019      1     A
2  Paul  2018      2     C
3  Paul  2018      1     B
4   Sue  2019      1     A
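For reuse, the two-merge idea can be wrapped in a small helper. This is only a sketch: the function name lookup_team is hypothetical, and it assumes each (Name, Year, Month) combination appears at most once in the master table, so that both left merges keep one output row per input row and the two results line up by position.

```python
import pandas as pd

Master = pd.DataFrame({"Name": ["Paul", "Paul", "Paul", "Sue"],
                       "Team": ["A", "B", "C", "A"],
                       "Year": ["2019", "2018", "2018", "2019"],
                       "Month": [1, 1, 2, 1]})
xx = pd.DataFrame({"Name": ["Paul", "Paul", "Paul", "Paul", "Sue"],
                   "Year": ["2019", "2019", "2018", "2018", "2019"],
                   "Month": [2, 1, 2, 1, 1]})


def lookup_team(df, master):
    """Hypothetical helper: match on Name+Year+Month, fall back to Name+Year."""
    # Exact lookup on all three key columns; unmatched rows get NaN in Team.
    exact = df.merge(master, on=["Name", "Year", "Month"], how="left")
    # Fallback table with one Team per (Name, Year); the first row seen wins.
    per_year = master[["Name", "Year", "Team"]].drop_duplicates(subset=["Name", "Year"])
    fallback = df[["Name", "Year"]].merge(per_year, on=["Name", "Year"], how="left")
    # Fill only the rows that the exact lookup left empty.
    exact["Team"] = exact["Team"].fillna(fallback["Team"])
    return exact


result = lookup_team(xx, Master)
print(result)
```

Note that when several teams share the same Name and Year, the fallback takes whichever row comes first in the master table, which is the same behaviour as drop_duplicates in the answer above.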
Your own solution should be changed to use Series.map on the key columns, replacing the missing values with Series.fillna:
Master = Master.assign(K1 = Master['Year'].astype(str) +
                            Master['Month'].astype(str) +
                            Master['Name'].astype(str),
                       K2 = Master['Year'].astype(str) +
                            Master['Name'].astype(str))
xx = xx.assign(K1 = xx['Year'].astype(str) +
                    xx['Month'].astype(str) +
                    xx['Name'].astype(str),
               K2 = xx['Year'].astype(str) +
                    xx['Name'].astype(str))
s1 = xx['K1'].map(Master.set_index('K1')['Team'])
s2 = xx['K2'].map(Master.drop_duplicates('K2').set_index('K2')['Team'])
xx['Team'] = s1.fillna(s2)
print(xx)
   Name  Year  Month         K1        K2  Team
0  Paul  2019      2  20192Paul  2019Paul     A
1  Paul  2019      1  20191Paul  2019Paul     A
2  Paul  2018      2  20182Paul  2018Paul     C
3  Paul  2018      1  20181Paul  2018Paul     B
4   Sue  2019      1   20191Sue   2019Sue     A
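If you would rather keep the mask from the question, the likely reason the original assignment left the NaN in place is that its right-hand side is a whole merged DataFrame, not a Series aligned to the masked rows. A minimal sketch of the masked variant follows, keeping the question's KEY and KEY_ND column names. It assumes the first key is unique in the master table, since Series.map needs a unique index on the lookup Series.

```python
import pandas as pd

Master = pd.DataFrame({"Name": ["Paul", "Paul", "Paul", "Sue"],
                       "Team": ["A", "B", "C", "A"],
                       "Year": ["2019", "2018", "2018", "2019"],
                       "Month": [1, 1, 2, 1]})
xx = pd.DataFrame({"Name": ["Paul", "Paul", "Paul", "Paul", "Sue"],
                   "Year": ["2019", "2019", "2018", "2018", "2019"],
                   "Month": [2, 1, 2, 1, 1]})

# Build both keys on both frames.
for frame in (Master, xx):
    frame["KEY"] = frame["Year"].astype(str) + frame["Month"].astype(str) + frame["Name"]
    frame["KEY_ND"] = frame["Year"].astype(str) + frame["Name"]

# First lookup: Year + Month + Name (assumes KEY is unique in Master).
xx["Team"] = xx["KEY"].map(Master.set_index("KEY")["Team"])

# Second lookup: Year + Name, applied only to the rows still missing a Team.
x_mask = xx["Team"].isna()
fallback = Master.drop_duplicates("KEY_ND").set_index("KEY_ND")["Team"]
xx.loc[x_mask, "Team"] = xx.loc[x_mask, "KEY_ND"].map(fallback)

print(xx)
```

Because the mapped Series keeps the index of the masked rows, the .loc assignment lines up row by row and the rows that already had a Team are never touched.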
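A slightly cleaner and more readable solution follows. Define a transform function that adds the Team column to each row according to your conditions, and apply it to the DataFrame. It stays readable and is easy to extend to more complex conditions.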
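def transform(x):
    # Candidate rows in Master with the same Name and Year
    master_row = Master[(Master.Name == x.Name) & (Master.Year == x.Year)]
    if len(master_row) > 1:
        # Several candidates: prefer the one whose Month also matches
        temp_rows = master_row[master_row.Month == x.Month]
        # Fall back to the Name + Year candidates when no Month matches
        master_row = temp_rows if len(temp_rows) > 0 else master_row
    x["Team"] = master_row.iloc[0].Team
    return x

# apply returns a new DataFrame, so assign the result back
xx = xx.apply(transform, axis=1)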
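One thing to watch with the row-wise approach: if a Name and Year pair is missing from the master table entirely, .iloc[0] has nothing to select and raises an IndexError. Below is a sketch of a more defensive variant; the function name find_team is hypothetical, and returning None for rows with no candidate is my assumption about the desired behaviour.

```python
import pandas as pd

Master = pd.DataFrame({"Name": ["Paul", "Paul", "Paul", "Sue"],
                       "Team": ["A", "B", "C", "A"],
                       "Year": ["2019", "2018", "2018", "2019"],
                       "Month": [1, 1, 2, 1]})
xx = pd.DataFrame({"Name": ["Paul", "Paul", "Paul", "Paul", "Sue"],
                   "Year": ["2019", "2019", "2018", "2018", "2019"],
                   "Month": [2, 1, 2, 1, 1]})


def find_team(row, master):
    """Hypothetical helper: return the Team for one row, or None if nothing matches."""
    same_year = master[(master["Name"] == row["Name"]) & (master["Year"] == row["Year"])]
    if same_year.empty:
        # No candidate at all: report a missing value instead of raising.
        return None
    same_month = same_year[same_year["Month"] == row["Month"]]
    # Prefer the exact Month match, otherwise use the first Name + Year row.
    chosen = same_month if not same_month.empty else same_year
    return chosen.iloc[0]["Team"]


# Extra keyword arguments to DataFrame.apply are passed on to the function.
xx["Team"] = xx.apply(find_team, axis=1, master=Master)
print(xx)
```

This still filters the master table once per row, so on large data the merge or map versions should be noticeably faster.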