Pandas conditional fillna based on another column values
I'm working on the bigmart dataset and I would like to replace the missing values of one column based on the values of another column. In practice:
Outlet_Size sales_bin
0 Medium 3000-4000
1 Medium 0-1000
2 Medium 2000-3000
3 NaN 0-1000
4 High 0-1000
... ... ...
8518 High 2000-3000
8519 NaN 0-1000
8520 Small 1000-2000
8521 Medium 1000-2000
8522 Small 0-1000
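So if train["Outlet_Size"] is NaN and train["sales_bin"] is "0-1000",
train["Outlet_Size"] should become "Small",
otherwise it should become "Medium".
But I really don't know how to write this, and all the information I've found has confused me.
Is it possible? How?
Thank you very much.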
Use Series.isna to create a boolean mask, then use np.where + Series.eq to select between Small and Medium depending on whether sales_bin equals 0-1000:
m = df['Outlet_Size'].isna()
df.loc[m, 'Outlet_Size'] = np.where(df.loc[m, 'sales_bin'].eq('0-1000'), 'Small', 'Medium')
Result:
print(df)
Outlet_Size sales_bin
0 Medium 3000-4000
1 Medium 0-1000
2 Medium 2000-3000
3 Small 0-1000
4 High 0-1000
8518 High 2000-3000
8519 Small 0-1000
8520 Small 1000-2000
8521 Medium 1000-2000
8522 Small 0-1000
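To try the mask + np.where idea without the full dataset, here is a minimal self-contained sketch. The four-row frame is made-up toy data, not the real bigmart file; only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real bigmart file.
df = pd.DataFrame({
    "Outlet_Size": ["Medium", np.nan, "High", np.nan],
    "sales_bin": ["3000-4000", "0-1000", "0-1000", "1000-2000"],
})

# True only where Outlet_Size is missing.
m = df["Outlet_Size"].isna()

# Decide the fill value for the masked rows only, then write it back.
df.loc[m, "Outlet_Size"] = np.where(
    df.loc[m, "sales_bin"].eq("0-1000"), "Small", "Medium"
)

print(df)
```

Because the assignment goes through the mask, rows that already had a size (such as the "High" row with sales_bin "0-1000") are left untouched.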
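You can use pandas.Series.map instead of numpy.where.
pandas.Series.map seems handier for simple cases like this one, because a dictionary (e.g. {'0-1000': 'Small', '2000-3000': 'High'}) makes several imputation rules easy to write and explicit.
numpy.where is meant for more general logic (e.g. if a < 5 then a^2), which is not very useful in the OP's use case and comes at a cost: several imputation rules become tricky to handle (nested if-else).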
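To make the comparison concrete, here is a small sketch that applies the same two rules both ways. The Series `bins` is hypothetical data standing in for the sales_bin values of the rows whose Outlet_Size is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical sales_bin values for rows whose Outlet_Size is missing.
bins = pd.Series(["0-1000", "2000-3000", "1000-2000", "0-1000"])

# Two rules with numpy.where: each extra rule adds one nesting level.
nested = np.where(
    bins.eq("0-1000"), "Small",
    np.where(bins.eq("2000-3000"), "High", "Medium"),
)

# The same rules as a dictionary: bins without a rule map to NaN,
# and fillna then supplies the default.
rules = {"0-1000": "Small", "2000-3000": "High"}
mapped = bins.map(rules).fillna("Medium")

print(nested.tolist())
print(mapped.tolist())
```

Both produce the same labels; adding a third rule means one more dictionary entry in the map version, but one more nested call in the numpy.where version.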
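Steps:
- Use pandas.Series.isna() to generate a mask marking the subset of the pandas.DataFrame where 'Outlet_Size' is missing;
- Define a dictionary with the mappings, e.g. from '0-1000' to 'Small';
- Use pandas.Series.map, with the defined dictionary as the arg parameter, to replace the 'Outlet_Size' values in that subset of the pandas.DataFrame;
- Use pandas.Series.fillna() to catch the missing 'Outlet_Size' values that were not mapped and impute them with a default value.
Example: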
import pandas as pd
import numpy as np
fake_dataframe = pd.DataFrame({
'Outlet_Size' : ['Medium', 'Medium', 'Medium', np.nan, 'High', 'High', np.nan, 'Small', 'Medium', 'Small', np.nan, np.nan],
'sales_bin': ['3000-4000', '0-1000', '2000-3000', '0-1000', '0-1000', '2000-3000', '0-1000', '1000-2000', '1000-2000', '0-1000', '2000-3000', '1000-2000']
})
missing_mask = fake_dataframe['Outlet_Size'].isna()
mapping_dict = dict({'0-1000': 'Small'})
fake_dataframe.loc[missing_mask, 'Outlet_Size'] = fake_dataframe.loc[missing_mask, 'sales_bin'].map(mapping_dict)
fake_dataframe['Outlet_Size'] = fake_dataframe['Outlet_Size'].fillna('Medium')
print(fake_dataframe)
Outlet_Size sales_bin
0 Medium 3000-4000
1 Medium 0-1000
2 Medium 2000-3000
3 Small 0-1000
4 High 0-1000
5 High 2000-3000
6 Small 0-1000
7 Small 1000-2000
8 Medium 1000-2000
9 Small 0-1000
10 Medium 2000-3000
11 Medium 1000-2000
Example with multiple imputation rules:
import pandas as pd
import numpy as np
fake_dataframe = pd.DataFrame({
'Outlet_Size' : ['Medium', 'Medium', 'Medium', np.nan, 'High', 'High', np.nan, 'Small', 'Medium', 'Small', np.nan, np.nan],
'sales_bin': ['3000-4000', '0-1000', '2000-3000', '0-1000', '0-1000', '2000-3000', '0-1000', '1000-2000', '1000-2000', '0-1000', '2000-3000', '1000-2000']
})
missing_mask = fake_dataframe['Outlet_Size'].isna()
mapping_dict = dict({'0-1000': 'Small', '2000-3000': 'High'})
fake_dataframe.loc[missing_mask, 'Outlet_Size'] = fake_dataframe.loc[missing_mask, 'sales_bin'].map(mapping_dict)
fake_dataframe['Outlet_Size'] = fake_dataframe['Outlet_Size'].fillna('Medium')
print(fake_dataframe)
Outlet_Size sales_bin
0 Medium 3000-4000
1 Medium 0-1000
2 Medium 2000-3000
3 Small 0-1000
4 High 0-1000
5 High 2000-3000
6 Small 0-1000
7 Small 1000-2000
8 Medium 1000-2000
9 Small 0-1000
10 High 2000-3000
11 Medium 1000-2000
Following Shubham Sharma's suggestion (using np.select), and using the feature "Item_Outlet_Sales" instead of "sales_bin".
So:
Outlet_Size Item_Outlet_Sales
0 Medium 3735.1380
1 Medium 443.4228
2 Medium 2097.2700
3 NaN 732.3800
4 High 994.7052
... ... ...
8518 High 2778.3834
8519 NaN 549.2850
8520 Small 1193.1136
8521 Medium 1845.5976
8522 Small 765.6700
missing = train["Outlet_Size"].isna()
# Compare the numeric sales of the rows with a missing size (not the NaN sizes themselves)
sales = train.loc[missing, "Item_Outlet_Sales"]
condlist = [sales <= 1000, sales > 1000]
# PS: if I got it right, you can add as many conditions as you want,
# as long as condlist and choicelist have the same length
choicelist = ["Small", "Medium"]
# A string default keeps the result a string array (np.select defaults to 0)
train.loc[missing, "Outlet_Size"] = np.select(condlist, choicelist, default="Medium")
train["Outlet_Size"].value_counts(dropna=False)
Small 4798
Medium 2793
High 932
Thank you very much for your suggestions, and for the existence of this wonderful forum :)
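For anyone who wants to experiment with np.select on numeric sales, here is a minimal sketch on a made-up frame. The five rows and the 1000 / 2000 thresholds (including the "High" fallback) are assumptions chosen for illustration, not rules taken from the bigmart data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real bigmart file.
train = pd.DataFrame({
    "Outlet_Size": ["Medium", np.nan, "High", np.nan, np.nan],
    "Item_Outlet_Sales": [3735.138, 732.38, 994.7052, 1845.5976, 2778.3834],
})

missing = train["Outlet_Size"].isna()
sales = train.loc[missing, "Item_Outlet_Sales"]

# Conditions are checked in order; the first one that is True wins.
condlist = [sales <= 1000, sales <= 2000]
choicelist = ["Small", "Medium"]

# Rows matching no condition get the default (illustrative threshold rule).
train.loc[missing, "Outlet_Size"] = np.select(condlist, choicelist, default="High")

print(train)
```

Because the conditions are evaluated in order, the second one only needs an upper bound: a value of 732.38 also satisfies `sales <= 2000`, but it has already been assigned "Small" by the first condition.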