Replace NaN with a random value in every row
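I have a dataset with a column 'Self_Employed'. The column contains the values 'Yes', 'No' and NaN. I want to replace the NaN values with a value computed in calc(). I tried several approaches I found here, but none of them worked for me.
Here is my code; the things I tried are left in the comments: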
# Handling missing data - Self_employed
from random import randint

SEyes = (df['Self_Employed'] == 'Yes').sum()
SEno = (df['Self_Employed'] == 'No').sum()

def calc():
    rand_SE = randint(0, (SEno + SEyes))
    if rand_SE > 81:
        return 'No'
    else:
        return 'Yes'

# df['Self_Employed'] = df['Self_Employed'].fillna(randint(0,100))
# df['Self_Employed'].isnull().apply(lambda v: calc())

# df[df['Self_Employed'].isnull()] = df[df['Self_Employed'].isnull()].apply(lambda v: calc())
# df[df['Self_Employed']]

# df_nan['Self_Employed'] = df_nan['Self_Employed'].isnull().apply(lambda v: calc())
# df_nan

# for i in range(df['Self_Employed'].isnull().sum()):
#     print(df.Self_Employed[i])

# This is the line that raises the error:
df[df['Self_Employed'].isnull()] = df[df['Self_Employed'].isnull()].apply(lambda v: calc())
df
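The line I tried with df_nan seems to work, but that leaves me with a separate set containing only the rows that had missing values, whereas I want to fill the missing values in the whole dataset. The last line raises an error; I have linked a screenshot of it.
Do you understand my problem? If so, can you help?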
This is the set with only the rows where Self_Employed is NaN
This is the original dataset
This is the error
What about
df['Self_Employed'] = df['Self_Employed'].fillna(calc())
?
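One thing to keep in mind with this suggestion: Python evaluates calc() once, before fillna is called, so every missing row receives the same value rather than one draw per row. A minimal sketch on a made-up column (the toy data, the 50/50 stand-in for calc() and the call counter are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, not the asker's data.
df = pd.DataFrame({'Self_Employed': ['Yes', np.nan, 'No', np.nan, np.nan]})

calls = []  # records every invocation of calc()

def calc():
    # Stand-in for the asker's calc(): returns 'Yes' or 'No' at random.
    calls.append(1)
    return 'Yes' if np.random.rand() < 0.5 else 'No'

was_missing = df['Self_Employed'].isna()

# calc() is evaluated here, once, and its single result is passed to fillna.
filled = df['Self_Employed'].fillna(calc())

print(len(calls))                      # number of times calc() ran
print(filled[was_missing].nunique())   # distinct values among the filled rows
```

To get an independent draw per row, calc() has to be called once for each missing row, which is what the answers do.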
Make sure SEno + SEyes != null.
Use the .loc method to set the value where Self_Employed is null:
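import numpy as np

SEyes = (df['Self_Employed'] == 'Yes').sum() + 1
SEno = (df['Self_Employed'] == 'No').sum()

def calc():
    # np.random.randint(low, high) excludes high and requires high > low
    rand_SE = np.random.randint(0, (SEno + SEyes))
    if rand_SE >= 81:
        return 'No'
    else:
        return 'Yes'

# Select only the NaN rows of the column and assign one calc() result per row
df.loc[df['Self_Employed'].isna(), 'Self_Employed'] = df.loc[df['Self_Employed'].isna(), 'Self_Employed'].apply(lambda x: calc())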
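The same .loc assignment can also be written without the hard-coded threshold 81 by drawing all values in one call. The sketch below assumes the goal is to keep the Yes/No ratio observed among the non-missing rows; that reading, the toy data and the seed are assumptions, not something stated in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
df = pd.DataFrame(
    {'Self_Employed': ['Yes', 'No', np.nan, 'No', np.nan, 'No', np.nan]}
)

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

mask = df['Self_Employed'].isna()

# Share of each category among the non-missing rows (NaN is excluded by default).
probs = df['Self_Employed'].value_counts(normalize=True)

# One independent draw per missing row, weighted by the observed shares.
draws = rng.choice(probs.index.to_numpy(), size=mask.sum(), p=probs.to_numpy())

df.loc[mask, 'Self_Employed'] = draws
print(df)
```

Because the draws are generated as one array of length mask.sum(), no Python-level function is called per row, and the existing 'Yes'/'No' entries are left untouched.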
You can first determine where the NaN values are, e.g.
na_loc = df.index[df['Self_Employed'].isnull()]
Count the number of NaN values in your column, e.g.
num_nas = len(na_loc)
Then generate that many random values, indexed by na_loc so they are easy to assign
import random
import pandas as pd
fill_values = pd.DataFrame({'Self_Employed': [random.randint(0,100) for i in range(num_nas)]}, index = na_loc)
Finally, replace these values in your dataframe
df.loc[na_loc, 'Self_Employed'] = fill_values['Self_Employed']
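Put together, the steps above can be sketched end to end as follows. This version draws 'Yes'/'No' labels rather than integers, since that is what the question asks for; the toy data, the non-default index and the 50/50 choice are assumptions for illustration.

```python
import random
import pandas as pd

# Hypothetical toy data with a non-default index.
df = pd.DataFrame(
    {'Self_Employed': ['Yes', None, 'No', None]},
    index=[10, 11, 12, 13],
)

# Step 1: index labels of the rows where the column is missing.
na_loc = df.index[df['Self_Employed'].isnull()]

# Step 2: how many values need to be generated.
num_nas = len(na_loc)

# Step 3: one random label per missing row, aligned on the same index.
fill_values = pd.Series(
    [random.choice(['Yes', 'No']) for _ in range(num_nas)],
    index=na_loc,
)

# Step 4: assign with a single .loc[rows, column] call.
df.loc[na_loc, 'Self_Employed'] = fill_values
print(df)
```

The assignment uses one .loc call with both the row labels and the column name. Chaining it as df.loc[na_loc]['Self_Employed'] = ... would write into a temporary copy and leave df unchanged.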