Replacing NaN values with values from another dataframe
I am dealing with missing data.
I have a table with 2 million rows, like this:
main_category_en eco_score
mustard 60
mustard 62
mustard NaN
cheese 20
NaN 1
cheese NaN
I created a new dataframe, df_mean, holding the mean eco_score for each category:
main_category_en eco_score
mustard 61.5
cheese 20
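For reference, a table like df_mean can be built with a groupby. This is only a minimal sketch on the six sample rows shown above, not necessarily how the original df_mean was produced. On these sample rows the mustard mean comes out as 61.0, so the 61.5 shown above presumably reflects the full data set.

```python
import numpy as np
import pandas as pd

# Sample rows from the question (assumed to be representative of the full table)
df = pd.DataFrame({
    "main_category_en": ["mustard", "mustard", "mustard", "cheese", np.nan, "cheese"],
    "eco_score": [60, 62, np.nan, 20, 1, np.nan],
})

# groupby skips rows whose key is NaN, and mean() ignores NaN scores
df_mean = df.groupby("main_category_en", as_index=False)["eco_score"].mean()
print(df_mean)
```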
Wherever possible, I need to replace the NaN values in the original df with the mean eco_score from df_mean. This is the result I need:
main_category_en eco_score
mustard 60
mustard 62
mustard 61.5
cheese 20
NaN 1
cheese 20
I have already tried .fillna() and merge(), but the rows where main_category_en is NaN get dropped.
I ended up with this loop:
import pandas as pd

def replace_mean(df, cat_col='main_category_en', val_col='eco_score'):
    # Per-category mean, computed from rows where both columns are present
    dt = df[[cat_col, val_col]].dropna()
    dmeancat = dt.groupby(cat_col)[val_col].mean()
    col_pos = df.columns.get_loc(val_col)
    for i in range(len(df)):
        cat = df[cat_col].iloc[i]
        # Fill only when the category is known and the score is missing
        if pd.notna(cat) and pd.isna(df[val_col].iloc[i]) and cat in dmeancat.index:
            df.iloc[i, col_pos] = dmeancat.loc[cat]
    return df
This loop works, but it is very slow.
You can fill the NaN values directly in the original dataframe df:
df['eco_score'] = df.groupby('main_category_en', group_keys=False)['eco_score'].apply(lambda x: x.fillna(x.mean()))
main_category_en eco_score
0 mustard 60.0
1 mustard 62.0
2 mustard 61.0
3 cheese 20.0
4 NaN NaN
5 cheese 20.0
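Note that in the output above, row 4 loses its original score of 1, because groupby skips rows whose key is NaN. If that value should be kept, as in the desired result in the question, one option is to fill from a row-aligned Series instead. The sketch below shows two variants under that assumption: one using a group-wise transform, and one looking values up in an existing df_mean. The helper names fill_with_group_mean and fill_from_mean_table are made up for illustration.

```python
import numpy as np
import pandas as pd

def fill_with_group_mean(df, cat_col="main_category_en", val_col="eco_score"):
    """Return a copy of df with missing val_col filled by its category mean."""
    out = df.copy()
    # transform("mean") is aligned with the original rows; rows with a NaN
    # category have no group mean, so fillna leaves them untouched
    group_means = out.groupby(cat_col)[val_col].transform("mean")
    out[val_col] = out[val_col].fillna(group_means)
    return out

def fill_from_mean_table(df, df_mean, cat_col="main_category_en", val_col="eco_score"):
    """Same idea, but looking the means up in a precomputed df_mean."""
    out = df.copy()
    lookup = df_mean.set_index(cat_col)[val_col]
    # map() gives NaN for categories missing from df_mean (including NaN itself)
    out[val_col] = out[val_col].fillna(out[cat_col].map(lookup))
    return out

# Sample data from the question
df = pd.DataFrame({
    "main_category_en": ["mustard", "mustard", "mustard", "cheese", np.nan, "cheese"],
    "eco_score": [60, 62, np.nan, 20, 1, np.nan],
})
df_mean = pd.DataFrame({
    "main_category_en": ["mustard", "cheese"],
    "eco_score": [61.5, 20],
})

result = fill_with_group_mean(df)
result_from_table = fill_from_mean_table(df, df_mean)
print(result)
print(result_from_table)
```

Both variants are vectorized, so they avoid the per-row Python loop from the question. Neither modifies df in place; assign the returned frame back if that is what you want.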