ValueError: Input contains NaN, infinity or a value too large for dtype('float64') while preprocessing Data

I have two CSV files (a training set and a test set). A few columns (status, hedge_value, indicator_code, portfolio_id, desk_id, office_id) have visible NaN values.

I started the process by replacing the NaN values with a large placeholder value appropriate to each column. Then I applied LabelEncoding to get rid of the text data and convert it to numeric data. Now, when I try to perform OneHotEncoding on the categorical data, I get an error. I tried passing the columns to the OneHotEncoder constructor one by one, but I get the same error for every column.

Basically, my end goal is to predict the return value, but I am stuck at the data preprocessing step. How can I solve this?

I am using Python 3.6 with Pandas and Sklearn for the data processing.

Code

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

test_data = pd.read_csv('test.csv')
train_data = pd.read_csv('train.csv')

# Replacing Nan values here
train_data['status']=train_data['status'].fillna(2.0)
train_data['hedge_value']=train_data['hedge_value'].fillna(2.0)
train_data['indicator_code']=train_data['indicator_code'].fillna(2.0)
train_data['portfolio_id']=train_data['portfolio_id'].fillna('PF99999999')
train_data['desk_id']=train_data['desk_id'].fillna('DSK99999999')
train_data['office_id']=train_data['office_id'].fillna('OFF99999999')

x_train = train_data.iloc[:, :-1].values
y_train = train_data.iloc[:, 17].values

# =============================================================================
# from sklearn.preprocessing import Imputer
# imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)
# imputer.fit(x_train[:, 15:17])
# x_train[:, 15:17] = imputer.fit_transform(x_train[:, 15:17])
# 
# imputer.fit(x_train[:, 12:13])
# x_train[:, 12:13] = imputer.fit_transform(x_train[:, 12:13])
# =============================================================================


# Encoding categorical data, i.e. text data, since calculations happen on numbers only;
# having text like country names or purchased status would cause trouble
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
x_train[:, 0] = labelencoder_X.fit_transform(x_train[:, 0])
x_train[:, 1] = labelencoder_X.fit_transform(x_train[:, 1])
x_train[:, 2] = labelencoder_X.fit_transform(x_train[:, 2])
x_train[:, 3] = labelencoder_X.fit_transform(x_train[:, 3])
x_train[:, 6] = labelencoder_X.fit_transform(x_train[:, 6])
x_train[:, 8] = labelencoder_X.fit_transform(x_train[:, 8])
x_train[:, 14] = labelencoder_X.fit_transform(x_train[:, 14])


# =============================================================================
# import numpy as np
# x_train[:, 3] = x_train[:, 3].reshape(x_train[:, 3].size,1)
# x_train[:, 3] = x_train[:, 3].astype(np.float64, copy=False)
# np.isnan(x_train[:, 3]).any()
# =============================================================================


# =============================================================================
# from sklearn.preprocessing import StandardScaler
# sc_X = StandardScaler()
# x_train = sc_X.fit_transform(x_train)
# =============================================================================

onehotencoder = OneHotEncoder(categorical_features=[0,1,2,3,6,8,14])
x_train = onehotencoder.fit_transform(x_train).toarray() # Replace Country Names with One Hot Encoding.

Error

Traceback (most recent call last):

  File "<ipython-input-4-4992bf3d00b8>", line 58, in <module>
    x_train = onehotencoder.fit_transform(x_train).toarray() # Replace Country Names with One Hot Encoding.

  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 2019, in fit_transform
    self.categorical_features, copy=True)

  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 1809, in _transform_selected
    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)

  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 453, in check_array
    _assert_all_finite(array)

  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 44, in _assert_all_finite
    " or a value too large for %r." % X.dtype)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The error comes from the other features, the ones you are treating as non-categorical.

Columns such as 'hedge_value' and 'indicator_code' contain data of mixed types, e.g. TRUE/FALSE from the original csv and the 2.0 from the fillna() call. OneHotEncoder cannot handle them.

As described in the documentation of OneHotEncoder fit():

 fit(X, y=None)

    Fit OneHotEncoder to X.
    Parameters: 

    X : array-like, shape [n_samples, n_feature]

        Input array of type int.

As you can see, it requires all of X to be numeric (int, though float works as well).
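For instance, one way to make a mixed column like hedge_value uniformly numeric before encoding is to map it explicitly. This is just a sketch and assumes pandas parsed the TRUE/FALSE values as booleans; if they arrive as strings, map 'TRUE'/'FALSE' instead:

# Sketch: convert the booleans to floats and keep 2.0 as the missing-value marker,
# so the column ends up with a single numeric dtype instead of a mix of bool and float.
train_data['hedge_value'] = train_data['hedge_value'].map({True: 1.0, False: 0.0}).fillna(2.0)
train_data['indicator_code'] = train_data['indicator_code'].map({True: 1.0, False: 0.0}).fillna(2.0)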

As a workaround, you can do the following to encode only your categorical features:

X_train_categorical = x_train[:, [0,1,2,3,6,8,14]]
onehotencoder = OneHotEncoder()
X_train_categorical = onehotencoder.fit_transform(X_train_categorical).toarray()

Then concatenate it with your non-categorical features, for example as sketched below.
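A minimal sketch of that concatenation, continuing from the code above (the remaining columns are kept as-is here; they still need to be numeric before you fit a model):

import numpy as np

categorical_idx = [0, 1, 2, 3, 6, 8, 14]
non_categorical_idx = [i for i in range(x_train.shape[1]) if i not in categorical_idx]

# Stack the one-hot encoded block next to the remaining columns.
x_train_encoded = np.hstack((X_train_categorical, x_train[:, non_categorical_idx]))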

After posting the question, I went through the dataset once more and found more columns containing NaN. I can't believe I wasted so much time on this when I could simply have used a Pandas function to get the list of columns that contain NaN. Using the code below, I found that I had missed three columns. I was visually scanning for NaN when I could have used this one-liner. After handling these additional NaN values, the code works fine.

pd.isnull(train_data).sum() > 0

Result

portfolio_id      False
desk_id           False
office_id         False
pf_category       False
start_date        False
sold               True
country_code      False
euribor_rate      False
currency          False
libor_rate         True
bought             True
creation_date     False
indicator_code    False
sell_date         False
type              False
hedge_value       False
status            False
return            False
dtype: bool
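The same check can also be narrowed down to just the offending column names, for example:

# Columns that contain at least one NaN -- here: sold, libor_rate, bought
train_data.columns[train_data.isnull().any()]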

To use this in production, the best practice is to use an Imputer and then save the model in a pkl file.
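A rough sketch of that idea, reusing the 0.19-era Imputer from the commented-out block above and assuming the NaN columns flagged earlier (sold, libor_rate, bought) sit at indices 5, 9 and 10 of x_train (newer sklearn versions replace Imputer with SimpleImputer):

import pickle
from sklearn.preprocessing import Imputer

imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)
x_train[:, [5, 9, 10]] = imputer.fit_transform(x_train[:, [5, 9, 10]])

# Persist the fitted imputer so the exact same column means are reused later.
with open('imputer.pkl', 'wb') as f:
    pickle.dump(imputer, f)

At prediction time, load the saved imputer with pickle.load and apply it to the corresponding test-set columns with transform() rather than fit_transform().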

Here is a workaround:

df[df==np.inf]=np.nan
df.fillna(df.mean(), inplace=True)

This works better.