Why does a column of 1s impact the results of a decision tree classifier?

Question

I'm testing out sklearn's Pipeline on a randomly generated classification problem:

import numpy as np
import pandas as pd

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

x, y = make_classification(n_samples=100, n_features=5, random_state=10)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2, random_state=0)

model = DecisionTreeClassifier(random_state=0)
pipe = Pipeline(steps=[('scale', StandardScaler()),
                       ('poly', PolynomialFeatures(degree=2, include_bias=False)),
                       ('model', model)])
pipe.fit(x_train, y_train)

pipe_pred = pipe.predict(x_test)
accuracy_score(y_test, pipe_pred)

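This gives an accuracy score of .85. However, when I change the PolynomialFeatures parameter include_bias to True, which simply inserts a column of 1s into the array, the accuracy score becomes .90. To visualize this, I plotted the individual trees fitted with include_bias set to True and to False: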

With `include_bias=True`: [image of the fitted tree]

With `include_bias=False`: [image of the fitted tree]

These images were generated with plot_tree(pipe['model']).

The datasets are identical except that include_bias=True inserts an extra column of 1s at column 0. So column i + 1 of the include_bias=True data corresponds to column i of the include_bias=False data (e.g. with_bias[:, 5] == without_bias[:, 4]).
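The column relationship described above can be checked with a short sketch. This is only an illustration under a simplifying assumption: it scales the full x rather than only x_train as the pipeline does, which does not change how the columns line up. The names with_bias and without_bias follow the example indices used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

x, _ = make_classification(n_samples=100, n_features=5, random_state=10)

# Assumption: scale all of x here (the pipeline scales x_train only).
x_scaled = StandardScaler().fit_transform(x)

with_bias = PolynomialFeatures(degree=2, include_bias=True).fit_transform(x_scaled)
without_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x_scaled)

# The bias variant has exactly one extra column.
print(with_bias.shape, without_bias.shape)

# Column 0 of the bias variant is constant 1.
first_column_is_ones = bool(np.all(with_bias[:, 0] == 1.0))
print(first_column_is_ones)

# Dropping column 0 recovers the no-bias features.
remaining_columns_match = bool(np.allclose(with_bias[:, 1:], without_bias))
print(remaining_columns_match)
```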

As I understand it, a column of 1s should have no effect on a decision tree. What am I missing?

Answer 1

From the documentation for DecisionTreeClassifier:

random_state : int, RandomState instance, default=None
Controls the randomness of the estimator. The features are always randomly permuted at each split, even if splitter is set to "best". When max_features < n_features, the algorithm will select max_features at random at each split before finding the best split among them. But the best found split may vary across different runs, even if max_features=n_features. That is the case, if the improvement of the criterion is identical for several splits and one split has to be selected at random. To obtain a deterministic behaviour during fitting, random_state has to be fixed to an integer. See Glossary for details.

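You have set random_state, but a different number of columns still makes those random feature permutations come out differently. Note that the gini values are the same in both trees at every node, even though different features are being split on.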
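One way to probe this explanation is to refit the question's pipeline with both settings and inspect which features each tree splits on. The sketch below is a rough illustration, not a definitive diagnosis. fit_tree and split_features are hypothetical helpers introduced here, and the exact splits printed may depend on the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.tree import DecisionTreeClassifier

x, y = make_classification(n_samples=100, n_features=5, random_state=10)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2, random_state=0)


def fit_tree(include_bias, seed=0):
    """Hypothetical helper: fit the question's pipeline and return the fitted tree."""
    pipe = Pipeline(steps=[
        ('scale', StandardScaler()),
        ('poly', PolynomialFeatures(degree=2, include_bias=include_bias)),
        ('model', DecisionTreeClassifier(random_state=seed)),
    ])
    pipe.fit(x_train, y_train)
    return pipe['model']


def split_features(tree):
    """Hypothetical helper: feature indices used at internal nodes (leaves are negative)."""
    feats = tree.tree_.feature
    return sorted(set(int(f) for f in feats[feats >= 0]))


tree_with = fit_tree(include_bias=True)
tree_without = fit_tree(include_bias=False)

# A constant column offers no valid threshold, so it should never be split on.
used_with = split_features(tree_with)
bias_column_used = 0 in used_with
print("bias column used for a split:", bias_column_used)

# Shift the bias-variant indices by one so both lists refer to the same features.
print("with bias (shifted):", [f - 1 for f in used_with])
print("without bias:       ", split_features(tree_without))

# Both trees start from the same training labels, so the root impurity matches.
root_impurity_with = tree_with.tree_.impurity[0]
root_impurity_without = tree_without.tree_.impurity[0]
print(root_impurity_with, root_impurity_without)

# Changing only the seed can also change which features are chosen.
for seed in range(3):
    print(seed, split_features(fit_tree(include_bias=False, seed=seed)))
```

If the two feature lists differ while the bias column itself is never used, that is consistent with tie-breaking among equally good splits rather than with the column of 1s carrying any information.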
