Why was the result converted to numpy.ndarray automatically?
I am splitting my datasets into predictors and the class, and I realized that I first have to apply LabelEncoder and then OneHotEncoder. For the first dataset I did it like this:
label_encoder_workclass = LabelEncoder()
label_encoder_education = LabelEncoder()
label_encoder_marital = LabelEncoder()
label_encoder_occupation = LabelEncoder()
label_encoder_relationship = LabelEncoder()
label_encoder_race = LabelEncoder()
label_encoder_sex = LabelEncoder()
label_encoder_country = LabelEncoder()
X_census[:,1] = label_encoder_workclass.fit_transform(X_census[:,1])
X_census[:,3] = label_encoder_education.fit_transform(X_census[:,3])
X_census[:,5] = label_encoder_marital.fit_transform(X_census[:,5])
X_census[:,6] = label_encoder_occupation.fit_transform(X_census[:,6])
X_census[:,7] = label_encoder_relationship.fit_transform(X_census[:,7])
X_census[:,8] = label_encoder_race.fit_transform(X_census[:,8])
X_census[:,9] = label_encoder_sex.fit_transform(X_census[:,9])
X_census[:,13] = label_encoder_country.fit_transform(X_census[:, 13])
onehotenconder_census = ColumnTransformer(transformers=[('OneHot', OneHotEncoder(), [1,3,5,6,7,8,9,13])], remainder='passthrough')
X_census = onehotenconder_census.fit_transform(X_census).toarray()
And for the second dataset like this:
label_encoder_personHomeOwnership = LabelEncoder()
label_encoder_loanIntent = LabelEncoder()
label_encoder_loanGrade = LabelEncoder()
label_encoder_cbPersonDefaultOnFile = LabelEncoder()
X_credit[:,2] = label_encoder_personHomeOwnership.fit_transform(X_credit[:,2])
X_credit[:,4] = label_encoder_loanIntent.fit_transform(X_credit[:,4])
X_credit[:,5] = label_encoder_loanGrade.fit_transform(X_credit[:,5])
X_credit[:,9] = label_encoder_cbPersonDefaultOnFile.fit_transform(X_credit[:,9])
oneHotEncoder_credit = ColumnTransformer(transformers=[('OneHot', OneHotEncoder(), [2,4,5,9])], remainder='passthrough')
X_credit = oneHotEncoder_credit.fit_transform(X_credit)
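What puzzles me is why in the first case I had to call toarray() to convert the result into a numpy.ndarray object, while in the second I did not: it was converted automatically.
Could someone help me with this? Am I doing something wrong?
Thank you very much.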
From the help page of ColumnTransformer:
sparse_threshold : float, default=0.3
If the output of the different transformers contains sparse matrices,
these will be stacked as a sparse matrix if the overall density is
lower than this value. Use sparse_threshold=0 to always return dense.
When the transformed output consists of all dense data, the stacked
result will be dense, and this keyword will be ignored.
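The rule quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions, not scikit-learn's actual code: `estimated_density` and `one_hot_block` are hypothetical helpers, and the one-hot block is built by hand with scipy so the density is easy to see.

```python
import numpy as np
from scipy import sparse

def estimated_density(blocks):
    """Share of stored (non-zero) entries over all entries in the stacked output.
    Hypothetical helper; a dense block counts every entry as stored."""
    nnz = sum(b.nnz if sparse.issparse(b) else b.size for b in blocks)
    total = sum(b.shape[0] * b.shape[1] for b in blocks)
    return nnz / total

def one_hot_block(codes, n_categories):
    """Hand-built one-hot encoding of integer codes (hypothetical helper).
    Every input column gets n_categories output columns with a single 1 per row."""
    n_rows, n_cols = codes.shape
    rows = np.repeat(np.arange(n_rows), n_cols)
    cols = (codes + n_categories * np.arange(n_cols)).ravel()
    data = np.ones(rows.size)
    return sparse.csr_matrix((data, (rows, cols)),
                             shape=(n_rows, n_cols * n_categories))

rng = np.random.default_rng(0)
sparse_threshold = 0.3  # the documented default

# 10 columns with 10 categories each: 10 ones per row out of 100 columns
density_many = estimated_density([one_hot_block(rng.integers(0, 10, (100, 10)), 10)])
# 10 columns with 2 categories each: 10 ones per row out of 20 columns
density_few = estimated_density([one_hot_block(rng.integers(0, 2, (100, 10)), 2)])

stays_sparse_many = density_many < sparse_threshold  # below threshold: sparse output
stays_sparse_few = density_few < sparse_threshold    # above threshold: dense output
```

With many categories per column each row is mostly zeros, so the density drops below the threshold; with few categories it does not.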
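In your case, the encoded output of the first example has a lower overall density than that of the second (its one-hot columns contain proportionally more zeros), so it falls below the threshold and is returned as a sparse matrix. The .toarray() method converts it from sparse to dense.
If memory is not a concern, setting sparse_threshold=0 ensures that you get a dense array every time.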
For example, if we have columns with many categories:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import numpy as np
np.random.seed(111)
X = np.random.randint(0,10,(100,10))
ct = ColumnTransformer(transformers=[('OneHot', OneHotEncoder(),
np.arange(10))], remainder='passthrough')
type(ct.fit_transform(X))
scipy.sparse.csr.csr_matrix
ct = ColumnTransformer(transformers=[('OneHot', OneHotEncoder(),
np.arange(10))], remainder='passthrough',sparse_threshold=0)
type(ct.fit_transform(X))
numpy.ndarray
Compared with columns that have fewer categories:
X = np.random.randint(0,2,(100,10))
ct = ColumnTransformer(transformers=[('OneHot', OneHotEncoder(),
np.arange(10))], remainder='passthrough')
type(ct.fit_transform(X))
numpy.ndarray
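If you would rather not change sparse_threshold, another option is to convert only when needed. Below is a minimal sketch, assuming a hypothetical helper named `ensure_dense`; it relies only on `scipy.sparse.issparse` and `numpy.asarray`, and the two small matrices stand in for the two kinds of ColumnTransformer output.

```python
import numpy as np
from scipy import sparse

def ensure_dense(X):
    """Return X as a numpy.ndarray whether it arrived sparse or dense.
    Hypothetical helper, not part of scikit-learn."""
    return X.toarray() if sparse.issparse(X) else np.asarray(X)

# Stand-ins for the two possible outputs of fit_transform
sparse_result = sparse.csr_matrix(np.eye(3))
dense_result = np.eye(3)

from_sparse = ensure_dense(sparse_result)
from_dense = ensure_dense(dense_result)
```

Calling it on the result of fit_transform would let the same line work for both of your datasets, at the cost of materializing the full dense array in memory.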