Inconsistent results from OrdinalEncoder with np.nan in input array
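I want to use OrdinalEncoder to encode some ordinal data of the form ["6-10","11-15","1-5",...,np.nan], with the encoding order given through the categories parameter as ["1-5","6-10","11-15",...] and with np.nan ignored (I want to encode the feature first and impute the NaNs afterwards).
According to the user guide, sklearn's OrdinalEncoder should ignore np.nan in the input array:
[From https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features][1]
However, I get inconsistent results from a plain list, from an np.array, and from an encoder with the categories parameter specified: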
!pip install -U scikit-learn
!pip install -U numpy
import sklearn
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
print(sklearn.__version__)
dummy_array = [["1-5"],["6-10"],["10-15"],["6-10"],["10-15"],["10-15"],["1-5"],[np.nan]]
dummy_array2 = np.array(["1-5","6-10","10-15","6-10","10-15","10-15","1-5",np.nan])
enc_order = ["1-5","6-10","10-15"]
enc1 = OrdinalEncoder()
enc2 = OrdinalEncoder()
enc3 = OrdinalEncoder(categories=[enc_order])
print(enc1.fit_transform(dummy_array))
print(enc2.fit_transform(dummy_array2.reshape(-1,1)))
print(enc3.fit_transform(dummy_array))
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist-packages (1.0.1)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.0.1)
Requirement already satisfied: numpy>=1.14.6 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.21.3)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (3.0.0)
Requirement already satisfied: scipy>=1.1.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.4.1)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (1.21.3)
1.0.1
[[ 0.]
[ 2.]
[ 1.]
[ 2.]
[ 1.]
[ 1.]
[ 0.]
[nan]]
[[0.]
[2.]
[1.]
[2.]
[1.]
[1.]
[0.]
[3.]]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-c460949a3bd3> in <module>()
16 print(enc1.fit_transform(dummy_array))
17 print(enc2.fit_transform(dummy_array2.reshape(-1,1)))
---> 18 print(enc3.fit_transform(dummy_array))
2 frames
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
845 if y is None:
846 # fit method of arity 1 (unsupervised transformation)
--> 847 return self.fit(X, **fit_params).transform(X)
848 else:
849 # fit method of arity 2 (supervised transformation)
/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_encoders.py in fit(self, X, y)
884
885 # `_fit` will only raise an error when `self.handle_unknown="error"`
--> 886 self._fit(X, handle_unknown=self.handle_unknown, force_all_finite="allow-nan")
887
888 if self.handle_unknown == "use_encoded_value":
/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_encoders.py in _fit(self, X, handle_unknown, force_all_finite)
114 " during fit".format(diff, i)
115 )
--> 116 raise ValueError(msg)
117 self.categories_.append(cats)
118
ValueError: Found unknown categories [nan] in column 0 during fit
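Since I don't have much experience with numpy and sklearn, I'm not sure why the three cases give different results. As I understand it, the first two cases should both give the result below, and the third case should not raise an error: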
[[ 0.]
[ 2.]
[ 1.]
[ 2.]
[ 1.]
[ 1.]
[ 0.]
[nan]]
Any help would be appreciated, thanks!
[1]: https://i.stack.imgur.com/Gba8X.png
You need to state explicitly how unknown (missing) values should be handled:
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
dummy_array = [["1-5"],["6-10"],["10-15"],["6-10"],["10-15"],["10-15"],["1-5"],[np.nan]]
enc_order = ["1-5","6-10","10-15"]
# unknown_value is mandatory when handle_unknown='use_encoded_value'
enc3 = OrdinalEncoder(categories=[enc_order],
                      handle_unknown='use_encoded_value',
                      unknown_value=np.nan)
enc3.fit_transform(dummy_array)
which yields
array([[ 0.],
[ 1.],
[ 2.],
[ 1.],
[ 2.],
[ 2.],
[ 0.],
[nan]])
The default value of handle_unknown is "error", which is why you got the error.
handle_unknown : {'error', 'use_encoded_value'}, default='error'
When set to 'error' an error will be raised in case an unknown categorical feature is present during transform. When set to 'use_encoded_value', the encoded value of unknown categories will be set to the value given for the parameter unknown_value.
The documentation for unknown_value says:
unknown_value : int or np.nan, default=None
When the parameter handle_unknown is set to 'use_encoded_value', this parameter is required and will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If set to np.nan, the dtype parameter must be a float dtype.
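The reason your dummy_array2 produces an encoded value for every entry (including NaN) is that the input is a NumPy string array: np.nan is converted to the string 'nan' because the other elements are strings and a NumPy array needs a single dtype. In this case the dtype is "U32". As a result, every value, 'nan' included, is treated as an ordinary category and encoded as an integer (well, a float).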
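The coercion can be checked with NumPy alone. This is a small sketch with a shortened version of your data; the variable names are mine. It also shows that passing dtype=object keeps np.nan as a float, which is the form the encoder needs in order to recognise it as a missing value.

```python
import numpy as np

values = ["1-5", "6-10", "10-15", np.nan]

# Default construction: NumPy picks a single string dtype,
# so np.nan is stored as the text 'nan'.
as_str = np.array(values)
print(as_str.dtype.kind)      # 'U' means a Unicode string dtype
print(as_str[-1] == "nan")    # True: it is now an ordinary string

# dtype=object keeps every element as the original Python object.
as_obj = np.array(values, dtype=object)
print(type(as_obj[-1]))       # float: still a real NaN
print(np.isnan(as_obj[-1]))   # True
```

So if you want the array input to behave like the plain list, build it with dtype=object before reshaping it and passing it to the encoder.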