Python index is out of bounds for axis 0 with size 3

Question

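I have the following mock test DataFrame. All columns have object dtype, except the 'Defect' column, which is int and is the target feature.

I perform the following steps:

Create the DataFrame
Split it into X and y
Build a pipeline that one-hot encodes the categorical columns
Use cross-validation to measure the model's accuracy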

import pandas as pd

data = {1 : ['test', '2222', '1111', '3333', '1111'],
        2 : ['aaa', 'aaa', 'bbbb', 'ccccc', 'aaa'],
        3 : ['x', 'y', 'z', 't', 'x'],
        'Defect': [0, 1, 0, 1, 0]
        }

data = pd.DataFrame(data)

X = data.drop('Defect', axis = 'columns')
y = data['Defect']


from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

ohe = OneHotEncoder(handle_unknown='ignore')
cat_cols = make_column_selector(dtype_include = 'object')

preprocessor = make_column_transformer((make_pipeline(ohe), cat_cols))
pipe = make_pipeline(preprocessor, LogisticRegression())

from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipe, X, y, cv=3, scoring='accuracy')
print(scores)

Unfortunately, the scores are printed as [nan nan nan], and below the output I get this error message:

... The above exception was the direct cause of the following exception: ... 
ValueError: all features must be in [0, 2] or [-3, 0]...

Why does this happen? If I change the dtype of one column, the code seems to work.

Answer 1

It doesn't seem to like column names that start at 1. Try this:

#       V...look here
data = {0 : ['test', '2222', '1111', '3333', '1111'],
        1 : ['aaa', 'aaa', 'bbbb', 'ccccc', 'aaa'],
        2 : ['x', 'y', 'z', 't', 'x'],
        'Defect': [0, 1, 0, 1, 0]
        }
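
A likely explanation, offered here as an assumption rather than something established above: the selector returns the column labels, and because those labels are the integers 1, 2 and 3, the column transformer appears to read them as positions. Position 3 does not exist in a three-column X, which matches the "[0, 2] or [-3, 0]" message. This would also explain why changing one dtype seems to help: if the last column became numeric, for example, the selector would return [1, 2], which are valid positions but point at the second and third columns rather than the ones labelled 1 and 2. If you want to keep the original numbering, a minimal sketch of an alternative is to store the labels as strings so they are looked up by name. Here `dtype_exclude='number'` and `error_score='raise'` are my own choices, not part of the question's code.

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({
    1: ['test', '2222', '1111', '3333', '1111'],
    2: ['aaa', 'aaa', 'bbbb', 'ccccc', 'aaa'],
    3: ['x', 'y', 'z', 't', 'x'],
    'Defect': [0, 1, 0, 1, 0],
})

X = data.drop('Defect', axis='columns')
y = data['Defect']

# Picks every non-numeric column, which here is all three feature columns.
cat_cols = make_column_selector(dtype_exclude='number')

# With the original labels the selector hands back integers.
int_labels = cat_cols(X)
print(int_labels)

# Keep the same numbering, but as strings, so columns are matched by name.
X.columns = X.columns.astype(str)
str_labels = cat_cols(X)
print(str_labels)

ohe = OneHotEncoder(handle_unknown='ignore')
preprocessor = make_column_transformer((ohe, cat_cols))
pipe = make_pipeline(preprocessor, LogisticRegression())

# error_score='raise' surfaces the underlying exception instead of nan scores.
# With only five rows, scikit-learn may warn that one class has fewer
# members than there are folds.
scores = cross_val_score(pipe, X, y, cv=3, scoring='accuracy',
                         error_score='raise')
print(scores)
```

Setting `error_score='raise'` is optional, but it makes a failure like the one in the question stop at the first fold with the real traceback rather than filling the result with nan.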

python

scikit-learn