Why does a column of 1s impact the results of a decision tree classifier?
I am testing sklearn's Pipeline on a randomly generated classification problem:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
x, y = make_classification(n_samples=100, n_features=5, random_state=10)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2, random_state=0)
model = DecisionTreeClassifier(random_state=0)
pipe = Pipeline(steps=[('scale', StandardScaler()),
                       ('poly', PolynomialFeatures(degree=2, include_bias=False)),
                       ('model', model)])
pipe.fit(x_train, y_train)
pipe_pred = pipe.predict(x_test)
accuracy_score(y_test, pipe_pred)
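This gives an accuracy score of .85. However, when I change the PolynomialFeatures parameter include_bias to True, which only inserts a column of 1s into the array, the accuracy score becomes .90. To visualize the difference, I plotted the fitted tree for each setting:
With include_bias=True: [image: tree plot]
With include_bias=False: [image: tree plot]
These images were generated with plot_tree(pipe['model']).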
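The two datasets are identical except that include_bias=True inserts an extra column of 1s at column 0. So column index i + 1 in the include_bias=True data corresponds to column index i in the include_bias=False data (e.g. with_bias[:, 5] == without_bias[:, 4]).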
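The index shift described above can be checked directly. This is a minimal sketch that assumes a small random array in place of the scaled training data; the names `with_bias` and `without_bias` follow the example in the text, while `first_col_is_ones` and `rest_matches` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Stand-in data: 8 samples, 5 features (same feature count as the question).
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))

without_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
with_bias = PolynomialFeatures(degree=2, include_bias=True).fit_transform(x)

# The bias variant has exactly one extra column, and it sits at index 0.
print(without_bias.shape, with_bias.shape)

first_col_is_ones = bool(np.all(with_bias[:, 0] == 1.0))
# Dropping column 0 should recover the no-bias matrix, i.e. column i + 1 matches column i.
rest_matches = bool(np.allclose(with_bias[:, 1:], without_bias))
print(first_col_is_ones, rest_matches)
```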
As I understand it, a column of 1s should have no effect on a decision tree. What am I missing?
From the documentation for DecisionTreeClassifier:
random_state : int, RandomState instance, default=None
Controls the randomness of the estimator. The features are always randomly permuted at each split, even if splitter is set to "best". When max_features < n_features, the algorithm will select max_features at random at each split before finding the best split among them. But the best found split may vary across different runs, even if max_features=n_features. That is the case, if the improvement of the criterion is identical for several splits and one split has to be selected at random. To obtain a deterministic behaviour during fitting, random_state has to be fixed to an integer. See Glossary for details.
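The quoted behaviour can be probed with a small experiment. This is a rough sketch, assuming the same synthetic data as the question but without the pipeline; `split_features` is a hypothetical helper. It checks that a fixed seed reproduces the same tree, and counts how many distinct trees appear across seeds. That count depends on whether ties occur in the data, so no particular value is claimed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

x, y = make_classification(n_samples=100, n_features=5, random_state=10)

def split_features(seed):
    """Fit a tree with the given seed and return the feature index used at each node."""
    tree = DecisionTreeClassifier(random_state=seed).fit(x, y).tree_
    return tree.feature.copy()

# Same data and same seed: the fit is deterministic.
same_seed_identical = np.array_equal(split_features(0), split_features(0))

# Different seeds may or may not pick different splits, depending on ties.
distinct = {tuple(split_features(seed).tolist()) for seed in range(20)}
print(same_seed_identical, len(distinct))
```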
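You have set random_state, but a different number of columns still makes those random shuffles come out differently. Note that the gini value is the same at every node in both trees, even though different features are being split on.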
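One way to examine this is to fit both pipelines and compare the fitted trees. The sketch below reuses the question's setup; `fit_tree` and the `used_*` names are illustrative helpers, not part of sklearn. It confirms that the constant column is never chosen for a split (a constant feature offers no threshold), and prints which features each tree used so the two can be compared.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.tree import DecisionTreeClassifier

x, y = make_classification(n_samples=100, n_features=5, random_state=10)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

def fit_tree(include_bias):
    """Fit the question's pipeline with the given include_bias setting."""
    pipe = Pipeline(steps=[
        ("scale", StandardScaler()),
        ("poly", PolynomialFeatures(degree=2, include_bias=include_bias)),
        ("model", DecisionTreeClassifier(random_state=0)),
    ])
    pipe.fit(x_train, y_train)
    return pipe

pipe_bias = fit_tree(True)
pipe_plain = fit_tree(False)
tree_bias = pipe_bias["model"].tree_
tree_plain = pipe_plain["model"].tree_

# Leaves carry a negative feature index, so keep only real split nodes.
used_bias = set(tree_bias.feature[tree_bias.feature >= 0].tolist())
used_plain = set(tree_plain.feature[tree_plain.feature >= 0].tolist())

# Column 0 of the bias variant is constant, so it cannot be split on.
bias_column_used = 0 in used_bias
print("bias column used for a split:", bias_column_used)

# The root impurity depends only on y_train, so it matches in both trees.
print("root impurity:", tree_bias.impurity[0], tree_plain.impurity[0])

# Shift the bias indices down by one to compare in the no-bias numbering.
print("split features:", sorted(f - 1 for f in used_bias), sorted(used_plain))
print("test accuracy:", pipe_bias.score(x_test, y_test), pipe_plain.score(x_test, y_test))
```

If the two lists of split features differ, the extra column changed the order in which candidate features were visited, and with it the winner among equally good splits. Both trees can fit the training data equally well and still generalize differently on 20 test samples.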