After binarizing a column of my dataset using sklearn, the result is not correct. Where is the code wrong?
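I am preprocessing a dataset and have binarized one of its columns. After binarization, I think the values are not correct. The data has 303 observations (rows) and 14 features (columns). The column I am binarizing is the last one.
Here is part of my code: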
import pandas as pd
import numpy as np
#importing the dataset
header_names = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']
dataset = pd.read_csv('E:/HCU proj doc/EHR dataset/cleveland_data.csv', names= header_names)
array = dataset.values
# binarize num
from sklearn.preprocessing import Binarizer
x = array[:,13:]
binarize = Binarizer(threshold=0.0).fit(x)
transform_binarize = binarize.transform(x)
array[:,13:]=transform_binarize
print(transform_binarize)
This is what the original data column looks like:
0,2,1,0,0.........1,0,3,1,1,2
This is the output of the above code:
[[0.]
[1.]
[1.]
[0.]
[0.]
[0.]
[1.]
[0.]
[1.]
[1.]
[0.]
[0.]
[1.]
[0.]
[0.]
[0.]
[1.]
[0.]
[0.]
[0.]
[0.]
[0.]
[1.]
[1.]
[1.]
[0.]
[0.]
[0.]
[0.]
[1.]
[0.]
[1.]
[1.]
[0.]
[0.]
[0.]
[1.]
[1.]
[1.]
[0.]
[1.]
[0.]
[0.]
[0.]
[1.]
[1.]
[0.]
[1.]
[0.]
[0.]
[0.]
[0.]
[1.]
[0.]
[1.]
[1.]
[1.]
[1.]
[0.]
[0.]
[1.]
[0.]
[1.]
[0.]
[1.]
[1.]
[1.]
[0.]
[1.]
[1.]
[0.]
[1.]
[1.]
[1.]
[1.]
[0.]
[1.]
[0.]
[0.]
[1.]
[0.]
[0.]
[0.]
[1.]
[0.]
[0.]
[0.]
[0.]
[0.]
[0.]
[0.]
[1.]
[0.]
[0.]
[0.]
[1.]
[1.]
[1.]
[0.]
[0.]
[0.]
[0.]
[0.]
[0.]
[1.]
[0.]
[1.]
[1.]
[1.]
[1.]
[1.]
[1.]
[0.]
[1.]
[1.]
[0.]
[0.]
[0.]
[1.]
[1.]
[1.]
[1.]
[0.]
[1.]
[1.]
[0.]
[1.]
[1.]
[0.]
[0.]
[0.]
[0.]
[0.]
[0.]
[0.]
[0.]
[1.]
[1.]
[1.]
[0.]
[0.]
[1.]
[0.]
[1.]
[0.]
[1.]
[1.]
[0.]
[0.]
[0.]
[0.]
[0.]
[0.]
[1.]
[1.]
[1.]
[1.]
[1.]
[1.]
[0.]
[0.]
[1.]
[0.]
[0.]
[0.]
[0.]
[0.]
[0.]
[1.]
[0.]
[1.]
[0.]
[1.]
[0.]
[1.]
[1.]
[0.]
[1.]
[0.]
[0.]
[1.]
[1.]
[0.]
[0.]
[1.]
[0.]
[0.]
[1.]
[1.]
[1.]
[0.]
[1.]
[1.]
[1.]
[0.]
[1.]
[0.]
[0.]
[0.]
[1.]
[0.]
[0.]
[0.]
[0.]
[0.]
[1.]
[1.]
[1.]
[0.]
[1.]
[0.]
[1.]
[0.]
[1.]
[1.]
[0.]
[0.]
[0.]
[0.]
[0.]
[0.]
[0.]
[0.]
[1.]
[1.]
[0.]
[0.]
[0.]
[1.]
[1.]
[0.]
[1.]
[1.]
[0.]
[0.]
[1.]
[1.]
[1.]
[0.]
[0.]
[0.]
[0.]
[0.]
[1.]
[0.]
[1.]
[1.]
[1.]
[1.]
[0.]
[0.]
[1.]
[0.]
[0.]
[0.]
[0.]
[0.]
[0.]
[0.]
[1.]
[0.]
[1.]
[0.]
[0.]
[1.]
[1.]
[1.]
[1.]
[1.]
[0.]
[1.]
[0.]
[1.]
[0.]
[1.]
[0.]
[0.]
[0.]
[1.]
[0.]
[1.]
[0.]
[1.]
[0.]
[1.]
[1.]
[1.]
[0.]
[0.]
[0.]
[1.]
[0.]
[1.]
[1.]
[1.]
[0.]
[1.]
[1.]
[1.]
[1.]
[1.]
[1.]
[0.]]
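I think the last few values are not correct. I don't understand why this is happening.
If I am right in assuming that this is the heart disease dataset taken from this UCI repository and that the csv file is this one, then these are the correct values from the Binarizer. The original data column you are using has a 0 in the last row, which I think you missed. Try this code: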
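# Print each row index with its original and binarized value
# (the output below came from Python 2, hence the L suffix on integers).
for idx in range(len(x)):
    print(idx, x[idx], transform_binarize[idx])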
Output (last rows):
278 [1L] [1.]
279 [0L] [0.]
280 [2L] [1.]
281 [0L] [0.]
282 [3L] [1.]
283 [0L] [0.]
284 [2L] [1.]
285 [4L] [1.]
286 [2L] [1.]
287 [0L] [0.]
288 [0L] [0.]
289 [0L] [0.]
290 [1L] [1.]
291 [0L] [0.]
292 [2L] [1.]
293 [2L] [1.]
294 [1L] [1.]
295 [0L] [0.]
296 [3L] [1.]
297 [1L] [1.]
298 [1L] [1.]
299 [2L] [1.]
300 [3L] [1.]
301 [1L] [1.]
302 [0L] [0.] #<--- I think you missed this row while reading your dataset
If you try this code, you will see that the Binarizer is working correctly.
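As a quick sanity check that does not depend on the CSV file, here is a minimal sketch that binarizes a small array and compares the result against a plain `> 0` comparison. The array `num` is a hypothetical stand-in for the column; its values are copied from rows 296 to 302 of the output above, not read from the dataset.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical stand-in for the tail of the 'num' column (values copied from
# rows 296-302 of the output above), shaped (n_samples, 1) for Binarizer.
num = np.array([3, 1, 1, 2, 3, 1, 0], dtype=float).reshape(-1, 1)

# Values strictly greater than the threshold become 1, everything else 0.
binarized = Binarizer(threshold=0.0).fit_transform(num)

# The result should agree with an ordinary element-wise comparison.
expected = (num > 0).astype(float)
assert np.array_equal(binarized, expected)

# Print each original value next to its binarized value.
for original, flag in zip(num.ravel(), binarized.ravel()):
    print(original, "->", flag)
```

If the assertion passes, the only row mapped to 0 is the last one, which matches the 0 in row 302 of the original column.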