How to replace NaN in a pandas DataFrame with data from a crosstab
Good morning, I'm new to pandas. I have a DataFrame named df with 4 columns: Age, Survived, Pclass and Sex (PassengerId is the index). Some of the Age values are NaN:
             Age  Survived  Pclass     Sex
PassengerId
6            NaN         0       3    male
18           NaN         1       2    male
20           NaN         1       3  female
27           NaN         0       3    male
29           NaN         1       3  female
I want to replace the NaN values in Age with data from a crosstab:
mean_val = pd.crosstab(index=df['Survived'], columns=[df['Sex'], df['Pclass']], values=df['Age'], aggfunc=np.mean)
which produces the following result:
Sex          female                          male
Pclass            1          2          3          1          2          3
Survived
0         25.666667  36.000000  23.818182  44.581967  33.369048  27.255814
1         34.939024  28.080882  19.329787  36.248000  16.022000  22.274211
What I would like to do is something like:
df['Age'] = mean_val[[df['Sex']][df['Pclass']][df['Survived']]]
where I use the crosstab to look up the value for each specific passenger. The result would look like this:
                   Age  Survived  Pclass     Sex
PassengerId
6            27.255814         0       3    male
18           16.022000         1       2    male
20           19.329787         1       3  female
27           27.255814         0       3    male
29           19.329787         1       3  female
Thanks in advance for your help!
I think you need groupby with transform, replacing the NaN values in each group with that group's mean:
df['Age'] = (df.groupby(['Survived', 'Sex', 'Pclass'])['Age']
               .transform(lambda x: x.fillna(x.mean())))
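An equivalent form, as far as I can tell, is to ask transform for the group mean by name and pass the result to fillna, which avoids calling a Python lambda for every group. A minimal sketch on a small made-up frame (the values are illustrative, not the real Titanic data, and group_mean is just a name chosen here):

```python
import numpy as np
import pandas as pd

# Small illustrative frame (made-up values, not the real Titanic data).
df = pd.DataFrame({
    'Age': [np.nan, 2.0, 3.0, np.nan, 4.0, 6.0],
    'Survived': [0, 0, 0, 1, 1, 1],
    'Pclass': [3, 3, 3, 2, 2, 2],
    'Sex': ['male', 'male', 'male', 'female', 'female', 'female'],
})

# Group mean broadcast back to every row, aligned with df's index.
# NaN ages are skipped when the mean is computed.
group_mean = df.groupby(['Survived', 'Sex', 'Pclass'])['Age'].transform('mean')

# Fill only the missing ages; known ages are left untouched.
df['Age'] = df['Age'].fillna(group_mean)
print(df)
```

The two missing ages become 2.5 and 5.0, the means of the known ages in their respective groups.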
If you want to use mean_val as the input instead:
df = df.join(mean_val.unstack().rename('tmp'), ['Sex','Pclass','Survived'])
df['Age'] = df['Age'].combine_first(df['tmp'])
df = df.drop('tmp', axis=1)
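If you would rather not add and drop a temporary column, the same lookup can probably be done by reindexing the unstacked crosstab with one (Sex, Pclass, Survived) key per passenger, which is closer to the per-passenger lookup the question asks for. This is only a sketch on made-up data; lookup, keys and looked_up are names chosen here for illustration:

```python
import numpy as np
import pandas as pd

# Small illustrative frame (made-up values, not the real Titanic data).
df = pd.DataFrame({
    'Age': [np.nan, np.nan, 2.0, 3.0, 4.0, 3.0],
    'Survived': [0, 1, 0, 1, 1, 0],
    'Pclass': [3, 2, 3, 2, 3, 3],
    'Sex': ['male', 'male', 'male', 'male', 'female', 'male'],
})

mean_val = pd.crosstab(index=df['Survived'],
                       columns=[df['Sex'], df['Pclass']],
                       values=df['Age'], aggfunc='mean')

# Series indexed by (Sex, Pclass, Survived), one entry per crosstab cell.
lookup = mean_val.unstack()

# One (Sex, Pclass, Survived) key per passenger, in row order.
keys = pd.MultiIndex.from_frame(df[['Sex', 'Pclass', 'Survived']])

# Look up each passenger's cell; to_numpy() drops the MultiIndex so the
# values line up with df by position.
looked_up = pd.Series(lookup.reindex(keys).to_numpy(), index=df.index)

# Fill only the missing ages with the looked-up cell means.
df['Age'] = df['Age'].fillna(looked_up)
print(df)
```

The column order in keys has to match the level order of mean_val.unstack(), which is the crosstab's column levels followed by its index.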
Sample:
import numpy as np
import pandas as pd

# Column order for the sample frame
c = ['PassengerId', 'Age', 'Survived', 'Pclass', 'Sex']
df = pd.DataFrame({'PassengerId': [6, 18, 20, 27, 29, 16, 118, 120, 127, 129],
                   'Age': [np.nan, np.nan, np.nan, np.nan, np.nan,
                           2.0, 3.0, 4.0, 3.0, 4.0],
                   'Survived': [0, 1, 1, 0, 1, 0, 1, 1, 0, 1],
                   'Pclass': [3, 2, 3, 3, 3, 3, 2, 3, 3, 3],
                   'Sex': ['male', 'male', 'female', 'male', 'female',
                           'male', 'male', 'female', 'male', 'female']},
                  columns=c)
print(df)
   PassengerId  Age  Survived  Pclass     Sex
0            6  NaN         0       3    male
1           18  NaN         1       2    male
2           20  NaN         1       3  female
3           27  NaN         0       3    male
4           29  NaN         1       3  female
5           16  2.0         0       3    male
6          118  3.0         1       2    male
7          120  4.0         1       3  female
8          127  3.0         0       3    male
9          129  4.0         1       3  female
mean_val = pd.crosstab(index=df['Survived'], columns=[df['Sex'], df['Pclass']],
                       values=df['Age'], aggfunc=np.mean)
print(mean_val)
Sex       female  male
Pclass         3     2    3
Survived
0            NaN   NaN  2.5
1            4.0   3.0  NaN
df = df.join(mean_val.unstack().rename('tmp'), ['Sex','Pclass','Survived'])
df['Age'] = df['Age'].combine_first(df['tmp'])
df = df.drop('tmp', axis=1)
print(df)
   PassengerId  Age  Survived  Pclass     Sex
0            6  2.5         0       3    male
1           18  3.0         1       2    male
2           20  4.0         1       3  female
3           27  2.5         0       3    male
4           29  4.0         1       3  female
5           16  2.0         0       3    male
6          118  3.0         1       2    male
7          120  4.0         1       3  female
8          127  3.0         0       3    male
9          129  4.0         1       3  female
df['Age'] = (df.groupby(['Survived', 'Sex', 'Pclass'])['Age']
               .transform(lambda x: x.fillna(x.mean())))
print(df)
   PassengerId  Age  Survived  Pclass     Sex
0            6  2.5         0       3    male
1           18  3.0         1       2    male
2           20  4.0         1       3  female
3           27  2.5         0       3    male
4           29  4.0         1       3  female
5           16  2.0         0       3    male
6          118  3.0         1       2    male
7          120  4.0         1       3  female
8          127  3.0         0       3    male
9          129  4.0         1       3  female
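One caveat: the sample mean_val contains NaN cells for combinations with no known ages, and a passenger whose group has no known age at all stays NaN with either approach. If that matters for your data, one option is a second fillna with a coarser value. The sketch below falls back to the overall mean of the known ages; that fallback is my assumption, not something the question asks for, and the data is made up:

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the last row is alone in its group
# and has no known age.
df = pd.DataFrame({
    'Age': [np.nan, 2.0, 4.0, np.nan],
    'Survived': [0, 0, 1, 1],
    'Pclass': [3, 3, 3, 1],
    'Sex': ['male', 'male', 'female', 'female'],
})

# Mean of the ages that were actually observed, taken before any filling.
overall_mean = df['Age'].mean()

# First pass: fill from the (Survived, Sex, Pclass) group mean.
group_mean = df.groupby(['Survived', 'Sex', 'Pclass'])['Age'].transform('mean')
df['Age'] = df['Age'].fillna(group_mean)

# Rows whose whole group lacks ages are still NaN at this point.
still_missing = int(df['Age'].isna().sum())

# Second pass (assumed fallback): use the overall mean for what is left.
df['Age'] = df['Age'].fillna(overall_mean)
print(still_missing)
print(df)
```

A mean over a coarser grouping, for example Sex and Pclass only, would work the same way in the second pass.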