How to take a slice of a DataFrame and "fillna" only within that slice using Python Pandas?
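Question: Let's take the Titanic dataset from Kaggle.
I have a DataFrame with the columns "Pclass", "Sex" and "Age".
I need to fill the NaN values in the "Age" column with the median of a specific group.
If the passenger is a woman in first class, I want to fill her age with the median age of first-class women, not the median of the whole Age column.
The question is how to make this change within a particular slice.
I tried: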
data['Age'][(data['Sex'] == 'female')&(data['Pclass'] == 1)&(data['Age'].isnull())].fillna(median)
where "median" is my value, but nothing changes, and "inplace=True" does not help.
Thanks a lot!
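I think you need to filter by a boolean mask and assign the result back:
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': list('aaaddd'),
                     'Sex': ['female', 'female', 'male', 'female', 'female', 'male'],
                     'Pclass': [1, 2, 1, 2, 1, 1],
                     'Age': [40, 20, 30, 20, np.nan, np.nan]})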
print (data)
Age Pclass Sex a
0 40.0 1 female a
1 20.0 2 female a
2 30.0 1 male a
3 20.0 2 female d
4 NaN 1 female d
5 NaN 1 male d
#boolean mask
mask1 = (data['Sex'] == 'female')&(data['Pclass'] == 1)
#get median by mask without NaNs
med = data.loc[mask1, 'Age'].median()
print (med)
40.0
#replace NaNs
data.loc[mask1, 'Age'] = data.loc[mask1, 'Age'].fillna(med)
print (data)
Age Pclass Sex a
0 40.0 1 female a
1 20.0 2 female a
2 30.0 1 male a
3 20.0 2 female d
4 40.0 1 female d
5 NaN 1 male d
This is the same as:
mask2 = mask1 &(data['Age'].isnull())
data.loc[mask2, 'Age'] = med
print (data)
Age Pclass Sex a
0 40.0 1 female a
1 20.0 2 female a
2 30.0 1 male a
3 20.0 2 female d
4 40.0 1 female d
5 NaN 1 male d
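To make the difference from the attempt in the question concrete, here is a minimal sketch. It assumes the same toy data as above (without the "a" column) and hard-codes 40.0 as the fill value; the names `filled_copy`, `still_missing` and `missing_after` are only illustrative. Boolean indexing on a column produces a new Series, so calling fillna on it leaves the original frame unchanged, while assignment through .loc writes into the frame itself.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Sex": ["female", "female", "male", "female", "female", "male"],
    "Pclass": [1, 2, 1, 2, 1, 1],
    "Age": [40, 20, 30, 20, np.nan, np.nan],
})

# Rows that are first-class women with a missing age
mask = (data["Sex"] == "female") & (data["Pclass"] == 1) & data["Age"].isnull()

# Boolean indexing returns a new Series; fillna fills that new object only
filled_copy = data["Age"][mask].fillna(40.0)
still_missing = int(data["Age"].isnull().sum())  # original frame is untouched

# Assigning through .loc modifies the original DataFrame
data.loc[mask, "Age"] = 40.0
missing_after = int(data["Age"].isnull().sum())

print(still_missing, missing_after)
```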
EDIT:
If you need to replace the NaNs in every group with that group's median:
# transform keeps the original index (apply may prepend the group keys in newer pandas)
data['Age'] = data.groupby(["Sex","Pclass"])["Age"].transform(lambda x: x.fillna(x.median()))
print (data)
Age Pclass Sex a
0 40.0 1 female a
1 20.0 2 female a
2 30.0 1 male a
3 20.0 2 female d
4 40.0 1 female d
5 30.0 1 male d
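As a possible variant of the per-group fill, the sketch below computes each row's group median with transform("median") and passes the resulting Series to fillna. This is one way to write it, not necessarily what the answer's author would use; it assumes the same toy data as above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Sex": ["female", "female", "male", "female", "female", "male"],
    "Pclass": [1, 2, 1, 2, 1, 1],
    "Age": [40, 20, 30, 20, np.nan, np.nan],
})

# One median per row, aligned to the original index
group_median = data.groupby(["Sex", "Pclass"])["Age"].transform("median")

# fillna accepts a Series and fills each NaN with the value at the same index
data["Age"] = data["Age"].fillna(group_median)

print(data)
```

Because the transformed Series has the same index as the frame, no merge or re-alignment step is needed.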
If you want to do the same for every group, you can use this trick:
data = pd.DataFrame({'a':list('aaaddd'),
'Sex':['female','female','male','female','female','male'],
'Pclass':[1,2,1,2,1,1],
'Age':[40,20,30,20, np.nan, np.nan]})
df = data.groupby(["Sex","Pclass"])["Age"].median().to_frame().reset_index()
df.rename(columns={"Age":"Med"}, inplace=True)
data = pd.merge(left=data,right=df, how='left', on=["Sex", "Pclass"])
data["Age"] = np.where(data["Age"].isnull(), data["Med"], data["Age"])
UPDATE:
# dummy dataframe
n = int(1e7)
data = pd.DataFrame({"Age":np.random.choice([10,20,20,30,30,40,np.nan], n),
"Pclass":np.random.choice([1,2,3], n),
"Sex":np.random.choice(["male","female"], n),
"a":np.random.choice(["a","b","c","d"], n)})
Running this on my machine (same as before, without the rename):
df = data.groupby(["Sex","Pclass"])["Age"].agg(['median']).reset_index()
data = pd.merge(left=data,right=df, how='left', on=["Sex", "Pclass"])
data["Age"] = np.where(data["Age"].isnull(), data["median"], data["Age"])
CPU times: user 1.98 s, sys: 216 ms, total: 2.2 s
Wall time: 2.2 s
While the mask solution takes:
for sex in ["male", "female"]:
    for pclass in range(1,4):
        mask1 =(data['Sex'] == sex)&(data['Pclass'] == pclass)
        med = data.loc[mask1, 'Age'].median()
        data.loc[mask1, 'Age'] = data.loc[mask1, 'Age'].fillna(med)
CPU times: user 5.13 s, sys: 60 ms, total: 5.19 s
Wall time: 5.19 s
@jezrael's solution is faster:
data['Age'] = data.groupby(["Sex","Pclass"])["Age"].apply(lambda x: x.fillna(x.median()))
CPU times: user 1.34 s, sys: 92 ms, total: 1.44 s
Wall time: 1.44 s
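For anyone who wants to repeat the comparison, here is a rough sketch that wraps the three approaches in functions, checks that they agree, and times them with time.perf_counter. The function names, the seed and the smaller n are my own assumptions, the groupby variant is written with transform("median") rather than apply, and timings will differ by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd

KEYS = ["Sex", "Pclass"]

def fill_with_merge(data):
    # Per-group medians joined back onto every row (assumes a default RangeIndex)
    medians = data.groupby(KEYS)["Age"].median().rename("Med").reset_index()
    merged = data.merge(medians, how="left", on=KEYS)
    filled = np.where(merged["Age"].isnull(), merged["Med"], merged["Age"])
    return pd.Series(filled, index=data.index)

def fill_with_masks(data):
    # One boolean mask per (Sex, Pclass) combination
    age = data["Age"].copy()
    for sex in data["Sex"].unique():
        for pclass in data["Pclass"].unique():
            mask = (data["Sex"] == sex) & (data["Pclass"] == pclass)
            age.loc[mask] = age.loc[mask].fillna(age.loc[mask].median())
    return age

def fill_with_transform(data):
    # Group medians aligned to the original index, then a plain fillna
    return data["Age"].fillna(data.groupby(KEYS)["Age"].transform("median"))

rng = np.random.default_rng(0)
n = 10_000  # much smaller than the 1e7 rows used above
data = pd.DataFrame({
    "Age": rng.choice([10, 20, 20, 30, 30, 40, np.nan], n),
    "Pclass": rng.choice([1, 2, 3], n),
    "Sex": rng.choice(["male", "female"], n),
})

results = {}
for func in (fill_with_merge, fill_with_masks, fill_with_transform):
    start = time.perf_counter()
    results[func.__name__] = func(data)
    print(func.__name__, f"{time.perf_counter() - start:.4f} s")
```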
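I'd like to add a more efficient answer here, in the sense that it involves less code. Essentially, if you are slicing a DataFrame with boolean conditions and want to fill NaNs under those specific conditions, just use assignment.
I'll use an example from a different Kaggle competition: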
# Use a mask as suggested by jezrael. It's just neater:
mask1 = (test_df.Neighborhood == 'IDOTRR') & (test_df.MSZoning.isna())
mask2 = (test_df.Neighborhood == 'Mitchel') & (test_df.MSZoning.isna())
# Use the mask and assign the desired value
test_df.loc[mask1, 'MSZoning'] = 'RM'
test_df.loc[mask2, 'MSZoning'] = 'RL'
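This differs from jezrael's answer, because he/she uses .fillna() and assigns the result back to the masked DataFrame. If you are going to use a mask and already have a specific value in mind, there is no need for .fillna().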
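A self-contained sketch of the same idea follows. The four-row `test_df` and the `fill_values` dictionary are made up for illustration (the real competition data is not reproduced here); the loop simply builds one mask per neighborhood and assigns the chosen value.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the competition's test set
test_df = pd.DataFrame({
    "Neighborhood": ["IDOTRR", "Mitchel", "IDOTRR", "Mitchel"],
    "MSZoning": [np.nan, np.nan, "C (all)", "RL"],
})

# Value to use for missing MSZoning in each neighborhood
fill_values = {"IDOTRR": "RM", "Mitchel": "RL"}

for neighborhood, zoning in fill_values.items():
    # The mask already restricts to missing rows, so plain assignment is enough
    mask = (test_df["Neighborhood"] == neighborhood) & test_df["MSZoning"].isna()
    test_df.loc[mask, "MSZoning"] = zoning

print(test_df)
```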