Unable to impute missing numerical values

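I want to impute missing values for both numerical and nominal columns. The code I use to find missing numerical values does not return anything, even though one of the columns, HDI for year, actually contains null values. What is wrong with my code?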

import numpy as np
import pandas as pd
import seaborn as sns; sns.set(style="ticks", color_codes=True)
from pandas.api.types import is_numeric_dtype

df.head()
country year    sex age suicides_no population  suicides/100k pop   country-year    HDI for year    gdp_for_year ($)    gdp_per_capita ($)  generation  year_label  sex_label   age_label   generation_label    is_duplicate
0   Albania 1987    male    15-24 years 21  312900  6.71    Albania1987 NaN 2156624900  796 Generation X    2   1   0   2   False
1   Albania 1987    male    35-54 years 16  308000  5.19    Albania1987 NaN 2156624900  796 Silent  2   1   2   5   False
2   Albania 1987    female  15-24 years 14  289700  4.83    Albania1987 NaN 2156624900  796 Generation X    2   0   0   2   False
3   Albania 1987    male    75+ years   1   21800   4.59    Albania1987 NaN 2156624900  796 G.I. Generation 2   1   5   1   False
4   Albania 1987    male    25-34 years 9   274300  3.28    Albania1987 NaN 2156624900  796 Boomers 2   1   1   0   False

The problematic code:

# Check if we have NaN numerical values
if is_numeric_dtype(df) is True:
    if df.isnull().any() is True:
        # Mean values of columns
        print(f"mean-df[i].column = {np.mean(df.column)}")
        # Impute
        df = df.fillna(df.mean())

The rest of the code seems to work fine:

# Check for missing nominal data
else:
    for col in df.columns:
        if df[col].dtype == object:
            print(col, df[col].unique())
            print(f"mode-df[i], df[i].value_counts().index[0]")
            # Replace '?' with mode - value/level with highest frequency in the feature
            df[col] = df[col].replace({'?': 'df[i].value_counts().index[0]'})

Desired output for the numerical imputation:

# Do we have NaN in our dataset?
df.isnull().any()

country               False
year                  False
sex                   False
age                   False
suicides_no           False
population            False
suicides/100k pop     False
country-year          False
HDI for year           True
gdp_for_year ($)      False
gdp_per_capita ($)    False
generation            False
year_label            False
sex_label             False
age_label             False
generation_label      False
is_duplicate          False
dtype: bool

print(f"mean-HDI-for-year= {np.mean(df['HDI for year'])}")

> mean-HDI-for-year= 0.7766011477761785

# Impute
df['HDI for year'] = df['HDI for year'].fillna(df['HDI for year'].mean())

As mentioned in the comments, there is no point in using a for loop to iterate over your columns. You can use select_dtypes instead.
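It may also help to see why the numeric branch in the question never runs. The sketch below uses a small hypothetical frame (its values are made up for illustration, not taken from the dataset) and assumes a reasonably recent pandas version. There, is_numeric_dtype applied to a whole DataFrame evaluates to False, and df.isnull().any() returns a Series, which is never the same object as True, so both `is True` tests fail silently.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in for the question's data:
# one text column and one numeric column containing NaN.
df = pd.DataFrame({
    "country": ["Albania", "Albania", "Albania"],
    "HDI for year": [np.nan, 0.7, 0.8],
})

# A whole DataFrame is neither a dtype nor a single array,
# so the check does not report it as numeric.
frame_is_numeric = is_numeric_dtype(df)

# DataFrame.any() returns a boolean Series (one entry per column).
# A Series object is never the singleton True, so `is True` is always False.
any_result = df.isnull().any()
identity_check = any_result is True

# Checks that behave as intended: test a single column,
# and reduce the per-column result to one plain bool.
column_is_numeric = is_numeric_dtype(df["HDI for year"])
has_any_nan = bool(df.isnull().any().any())

print(frame_is_numeric, identity_check, column_is_numeric, has_any_nan)
```

Testing one column at a time with is_numeric_dtype(df[col]) would make the original structure work, but the select_dtypes approach avoids the per-column branching altogether.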

Impute your numeric and categorical columns separately.

Assume this DataFrame:

>>> df
   year  sex   age    income
0  2020    M  27.0   50000.0
1  2020    F   NaN       NaN
2  2020    M  29.0   20000.0
3  2020    F   NaN       NaN
4  2020  NaN  23.0   30000.0
5  2020  NaN  24.0       NaN
6  2020    M   NaN  100000.0

>>> df.isnull().sum()
year      0
sex       2
age       3
income    3

Impute your numeric columns:

>>> df.fillna(df.select_dtypes(include='number').mean(), inplace=True)
>>> df
   year  sex    age    income
0  2020    M  27.00   50000.0
1  2020    F  25.75   50000.0
2  2020    M  29.00   20000.0
3  2020    F  25.75   50000.0
4  2020  NaN  23.00   30000.0
5  2020  NaN  24.00   50000.0
6  2020    M  25.75  100000.0

Then your object columns:

   year sex    age    income
0  2020   M  27.00   50000.0
1  2020   F  25.75   50000.0
2  2020   M  29.00   20000.0
3  2020   F  25.75   50000.0
4  2020   M  23.00   30000.0
5  2020   M  24.00   50000.0
6  2020   M  25.75  100000.0
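The command for the object-column step is not shown above. One way to arrive at the same table is sketched below, under the assumption that the most frequent value (the mode) is the intended fill for non-numeric columns. The example frame is rebuilt from the values shown earlier; the names num_means and cat_modes are introduced here only for illustration. exclude='number' is used instead of include='object' so that the sketch does not depend on how a given pandas version labels string columns.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the answer.
df = pd.DataFrame({
    "year": [2020] * 7,
    "sex": ["M", "F", "M", "F", np.nan, np.nan, "M"],
    "age": [27.0, np.nan, 29.0, np.nan, 23.0, 24.0, np.nan],
    "income": [50000.0, np.nan, 20000.0, np.nan, 30000.0, np.nan, 100000.0],
})

# Numeric columns: fill NaN with each column's mean.
# fillna with a Series matches its index against the column names.
num_means = df.select_dtypes(include="number").mean()
df = df.fillna(num_means)

# Non-numeric columns: fill NaN with each column's most frequent value.
# mode() returns a DataFrame; its first row holds one mode per column.
cat_modes = df.select_dtypes(exclude="number").mode().iloc[0]
df = df.fillna(cat_modes)

print(df)
```

If a column has several equally frequent values, mode() lists them in sorted order and iloc[0] picks the first, so ties are resolved by sort order rather than by any property of the data.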