After binarizing a column of my dataset using sklearn, the result is not correct. Where is the code wrong?

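I am preprocessing a dataset and have binarized one of its columns. After binarizing, I think the values are incorrect. The data has 303 observations (rows) and 14 features (columns). The column I am binarizing is the last one.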

Here is part of my code:

    import pandas as pd
    import numpy as np

    #importing the dataset
    header_names = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']
    dataset = pd.read_csv('E:/HCU proj doc/EHR dataset/cleveland_data.csv', names= header_names)


    array = dataset.values

    # binarize num
    from sklearn.preprocessing import Binarizer
    x = array[:,13:]
    binarize = Binarizer(threshold=0.0).fit(x)
    transform_binarize = binarize.transform(x)

    array[:,13:]=transform_binarize
    print(transform_binarize)

This is what the original data column looks like:

     0,2,1,0,0.........1,0,3,1,1,2

This is the output of the code above:

[[0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [1.]
 [0.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]]

I think the last few values are incorrect, and I don't understand why.

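If I am right in assuming that this is the heart disease dataset taken from this UCI repository and the csv file is this one, then these are the correct values for the binarizer. The original data column you are using has a 0 in its last row, which I think you missed. Try this code: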

    # Print each row index, its original value, and its binarized value
    for idx in range(len(x)):
        print(idx, x[idx], transform_binarize[idx])

Output (last rows only; it was produced under Python 2, hence the `L` suffix on the integers):

    278 [1L] [1.]
    279 [0L] [0.]
    280 [2L] [1.]
    281 [0L] [0.]
    282 [3L] [1.]
    283 [0L] [0.]
    284 [2L] [1.]
    285 [4L] [1.]
    286 [2L] [1.]
    287 [0L] [0.]
    288 [0L] [0.]
    289 [0L] [0.]
    290 [1L] [1.]
    291 [0L] [0.]
    292 [2L] [1.]
    293 [2L] [1.]
    294 [1L] [1.]
    295 [0L] [0.]
    296 [3L] [1.]
    297 [1L] [1.]
    298 [1L] [1.]
    299 [2L] [1.]
    300 [3L] [1.]
    301 [1L] [1.]
    302 [0L] [0.]     #<--- I think you missed this row while reading your dataset

If you run this code, you will see that the binarizer is working correctly.
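If you would rather not compare the printed rows by eye, a programmatic cross-check may help. The sketch below is only an illustration: the `num` array is a made-up sample (not the contents of your file) that ends in 0 like the last row above, and `expected` is a name I introduce here. It applies the same rule that `Binarizer(threshold=0.0)` uses, namely that values greater than 0 become 1 and everything else becomes 0, using plain NumPy.

```python
import numpy as np

# Hypothetical sample of the 'num' column (NOT the real file contents),
# shaped as a single column and ending in 0 like the last row shown above.
num = np.array([[0], [2], [1], [0], [3], [1], [0]])

# Same rule that Binarizer(threshold=0.0) applies:
# values strictly greater than 0 map to 1, all others map to 0.
expected = (num > 0).astype(float)

# Print each row index next to its original and binarized value.
for idx, (raw, binary) in enumerate(zip(num.ravel(), expected.ravel())):
    print(idx, raw, binary)

# Count how many rows fall into each class as a quick sanity check.
values, counts = np.unique(expected, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
```

On your real data, something like `np.array_equal(transform_binarize, (x > 0).astype(float))` should presumably return `True` if the binarizer is behaving as described, assuming `x` holds only numeric values.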