Python index is out of bounds for axis 0 with size 3

I have the following mock test DataFrame. All columns have object dtype except the 'Defect' column, which is int and is the target feature.

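I perform the following steps:

  1. Create the DataFrame
  2. Split it into X and y
  3. Build a pipeline that one-hot encodes the categorical columns
  4. Use cross-validation to measure the model's accuracy
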
import pandas as pd

data = {1 : ['test', '2222', '1111', '3333', '1111'],
        2 : ['aaa', 'aaa', 'bbbb', 'ccccc', 'aaa'],
        3 : ['x', 'y', 'z', 't', 'x'],
        'Defect': [0, 1, 0, 1, 0]
        }

data = pd.DataFrame(data)

X = data.drop('Defect', axis = 'columns')
y = data['Defect']


from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

ohe = OneHotEncoder(handle_unknown='ignore')
cat_cols = make_column_selector(dtype_include = 'object')

preprocessor = make_column_transformer((make_pipeline(ohe), cat_cols))
pipe = make_pipeline(preprocessor, LogisticRegression())

from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipe, X, y, cv=3, scoring='accuracy')
print(scores)

Unfortunately, the scores printed are [nan nan nan], and below the output I get this error message:

... The above exception was the direct cause of the following exception: ... 
ValueError: all features must be in [0, 2] or [-3, 0]...

Why does this happen? If I change the dtype of one column, the code seems to work.

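It doesn't seem to like column names that start from 1. make_column_selector returns the column labels [1, 2, 3], and because those labels are integers, the column transformer treats them as positional indices rather than names. X has only three columns (positions 0 to 2), so index 3 is out of bounds, which matches the error message. Try this: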

#       V...look here
data = {0 : ['test', '2222', '1111', '3333', '1111'],
        1 : ['aaa', 'aaa', 'bbbb', 'ccccc', 'aaa'],
        2 : ['x', 'y', 'z', 't', 'x'],
        'Defect': [0, 1, 0, 1, 0]
        }
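
If renumbering the columns is not an option, another possible workaround is to turn the integer labels into strings so that they are looked up by name instead of by position. The following is a minimal sketch under that assumption, not a definitive fix. It reuses the data from the question, but passes the column names to make_column_transformer as an explicit list (a simplification introduced here in place of make_column_selector) and uses the encoder directly instead of wrapping it in its own pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Same data as in the question, with integer column labels 1, 2, 3
data = pd.DataFrame({
    1: ['test', '2222', '1111', '3333', '1111'],
    2: ['aaa', 'aaa', 'bbbb', 'ccccc', 'aaa'],
    3: ['x', 'y', 'z', 't', 'x'],
    'Defect': [0, 1, 0, 1, 0],
})

X = data.drop('Defect', axis='columns')
y = data['Defect']

# Assumption: renaming the feature columns is acceptable.
# Integer labels become the strings '1', '2', '3', so they are
# resolved by name rather than interpreted as positions.
X.columns = [str(c) for c in X.columns]
cat_cols = list(X.columns)

preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), cat_cols)
)
pipe = make_pipeline(preprocessor, LogisticRegression())

scores = cross_val_score(pipe, X, y, cv=3, scoring='accuracy')
print(scores)
```

With only five rows, scikit-learn may warn that the smallest class has fewer members than the number of folds, but the scores should no longer be nan.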