Unable to impute missing numerical values
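I want to impute missing values for both numerical and nominal columns. My code for finding missing numerical values does not return anything, even though one of the columns, HDI for year, actually has null values. What is wrong with my code?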
import numpy as np
import pandas as pd
import seaborn as sns; sns.set(style="ticks", color_codes=True)
from pandas.api.types import is_numeric_dtype
df.head()
country year sex age suicides_no population suicides/100k pop country-year HDI for year gdp_for_year ($) gdp_per_capita ($) generation year_label sex_label age_label generation_label is_duplicate
0 Albania 1987 male 15-24 years 21 312900 6.71 Albania1987 NaN 2156624900 796 Generation X 2 1 0 2 False
1 Albania 1987 male 35-54 years 16 308000 5.19 Albania1987 NaN 2156624900 796 Silent 2 1 2 5 False
2 Albania 1987 female 15-24 years 14 289700 4.83 Albania1987 NaN 2156624900 796 Generation X 2 0 0 2 False
3 Albania 1987 male 75+ years 1 21800 4.59 Albania1987 NaN 2156624900 796 G.I. Generation 2 1 5 1 False
4 Albania 1987 male 25-34 years 9 274300 3.28 Albania1987 NaN 2156624900 796 Boomers 2 1 1 0 False
The problematic code:
# Check if we have NaN numerical values
if is_numeric_dtype(df) is True:
    if df.isnull().any() is True:
        # Mean values of columns
        print(f"mean-df[i].column = {np.mean(df.column)}")
        # Impute
        df = df.fillna(df.mean())
The rest of the code seems to work fine:
# Check for missing nominal data
else:
    for col in df.columns:
        if df[col].dtype == object:
            print(col, df[col].unique())
            print(f"mode-df[i], df[i].value_counts().index[0]")
            # Replace '?' with mode - value/level with highest frequency in the feature
            df[col] = df[col].replace({'?': 'df[i].value_counts().index[0]'})
Desired output for the numerical imputation:
# Do we have NaN in our dataset?
df.isnull().any()
country False
year False
sex False
age False
suicides_no False
population False
suicides/100k pop False
country-year False
HDI for year True
gdp_for_year ($) False
gdp_per_capita ($) False
generation False
year_label False
sex_label False
age_label False
generation_label False
is_duplicate False
dtype: bool
print(f"mean-HDI-for-year= {np.mean(df['HDI for year'])}")
> mean-HDI-for-year= 0.7766011477761785
# Impute
df['HDI for year'] = df['HDI for year'].fillna(df['HDI for year'].mean())
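As mentioned in the comments, there is no point in using a for loop to iterate over your columns. You can use select_dtypes to impute your numeric and categorical columns separately.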
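As for why the original check printed nothing, here is a minimal sketch. The small `demo` frame is a hypothetical stand-in for the question's data, not the real dataset. It suggests that both outer conditions in the question evaluate to False, so the numeric branch is never entered.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in for the question's data: one numeric column with a NaN
# and one text column.
demo = pd.DataFrame({"HDI for year": [0.7, np.nan, 0.8], "country": ["A", "B", "C"]})

# is_numeric_dtype expects a single column or dtype; a whole mixed DataFrame
# is not reported as numeric.
whole_frame_numeric = is_numeric_dtype(demo)

# any() on a DataFrame returns a Series (one bool per column), and a Series
# object is never identical to the singleton True.
any_is_true = demo.isnull().any() is True

# Checking column by column gives the intended answer.
numeric_with_nan = [
    col for col in demo.columns
    if is_numeric_dtype(demo[col]) and demo[col].isnull().any()
]
print(whole_frame_numeric, any_is_true, numeric_with_nan)
```

Because the first condition is False, control always falls through to the else branch, which is why only the nominal part of the question's code appears to run.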
Assume this DataFrame:
>>> df
year sex age income
0 2020 M 27.0 50000.0
1 2020 F NaN NaN
2 2020 M 29.0 20000.0
3 2020 F NaN NaN
4 2020 NaN 23.0 30000.0
5 2020 NaN 24.0 NaN
6 2020 M NaN 100000.0
>>> df.isnull().sum()
year 0
sex 2
age 3
income 3
Impute your numeric columns:
>>> df.fillna(df.select_dtypes(include='number').mean(), inplace=True)
>>> df
year sex age income
0 2020 M 27.00 50000.0
1 2020 F 25.75 50000.0
2 2020 M 29.00 20000.0
3 2020 F 25.75 50000.0
4 2020 NaN 23.00 30000.0
5 2020 NaN 24.00 50000.0
6 2020 M 25.75 100000.0
Then your object columns:
year sex age income
0 2020 M 27.00 50000.0
1 2020 F 25.75 50000.0
2 2020 M 29.00 20000.0
3 2020 F 25.75 50000.0
4 2020 M 23.00 30000.0
5 2020 M 24.00 50000.0
6 2020 M 25.75 100000.0
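The statement that produced the last output is not shown above. One way to arrive at it, as a sketch only, is to fill the non-numeric columns with their most frequent value (the mode). The use of `mode().iloc[0]`, the `num_cols` and `cat_cols` names, and selecting text columns with `exclude="number"` are assumptions made for illustration, not necessarily what was originally run.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame shown above.
df = pd.DataFrame({
    "year": [2020] * 7,
    "sex": ["M", "F", "M", "F", np.nan, np.nan, "M"],
    "age": [27.0, np.nan, 29.0, np.nan, 23.0, 24.0, np.nan],
    "income": [50000.0, np.nan, 20000.0, np.nan, 30000.0, np.nan, 100000.0],
})

# Numeric columns: fill NaN with each column's mean.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Non-numeric columns: fill NaN with each column's most frequent value.
# mode() returns a DataFrame (ties produce several rows); iloc[0] takes the first row.
cat_cols = df.select_dtypes(exclude="number").columns
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])

print(df)
```

Assigning the result back instead of using inplace=True keeps each step explicit and limits every fill to the columns it is meant for.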