Why do identical paths in an XGBoost tree give 2 different predictions?

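I am trying to investigate XGBoost predictions.

It seems that 2 inputs that follow the same path through the tree get 2 different predictions.

My data is as follows: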

f1,f2,f3,f4,f5,f6,f7,f8,y
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31.0,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
2,197,70,45,543,30.5,0.158,53,1
8,125,96,0,0,0.0,0.232,54,1
4,110,92,0,0,37.6,0.191,30,0
10,168,74,0,0,38.0,0.537,34,1
10,139,80,0,0,27.1,1.441,57,0
1,189,60,23,846,30.1,0.398,59,1
5,166,72,19,175,25.8,0.587,51,1
7,100,0,0,0,30.0,0.484,32,1
0,118,84,47,230,45.8,0.551,31,1
7,107,74,0,0,29.6,0.254,31,1
1,103,30,38,83,43.3,0.183,33,0
1,115,70,30,96,34.6,0.529,32,1
3,126,88,41,235,39.3,0.704,27,0
8,99,84,0,0,35.4,0.388,50,0
7,196,90,0,0,39.8,0.451,41,1
9,119,80,35,0,29.0,0.263,29,1
11,143,94,33,146,36.6,0.254,51,1
10,125,70,26,115,31.1,0.205,41,1
7,147,76,0,0,39.4,0.257,43,1
1,97,66,15,140,23.2,0.487,22,0
13,145,82,19,110,22.2,0.245,57,0
5,117,92,0,0,34.1,0.337,38,0
5,109,75,26,0,36.0,0.546,60,0
3,158,76,36,245,31.6,0.851,28,1
3,88,58,11,54,24.8,0.267,22,0
6,92,92,0,0,19.9,0.188,28,0
10,122,78,31,0,27.6,0.512,45,0
4,103,60,33,192,24.0,0.966,33,0
11,138,76,0,0,33.2,0.420,35,0
9,102,76,37,0,32.9,0.665,46,1
2,90,68,42,0,38.2,0.503,27,1

Prediction and tree creation code:

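import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier, plot_tree

# Load the data and separate the features from the target
df = pd.read_csv("input.csv")
x = df[['f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8']]
y = df[['y']]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)

# Fit a classifier with default settings and predict on the held-out rows
model = XGBClassifier(n_jobs=-1)
model.fit(X_train, y_train)
res = model.predict(X_test)
print("X_test (first 2 rows:")
print(X_test.head(2))
print("Predictions (first 2 rows:")
print(res[0:2])

# Plot the tree
plot_tree(model)
plt.show()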

Output:

X_test (first 2 rows:
    f1   f2  f3  f4  f5    f6     f7  f8
33   6   92  92   0   0  19.9  0.188  28
36  11  138  76   0   0  33.2  0.420  35
Predictions (first 2 rows:
[0 1]

Both inputs have f2 < 146.5 and f4 = 0, so they end up in the same leaf (-0.34). So why are the predictions for these two different (0 and 1)?

What you have plotted is not the whole XGBoost model; it is only its first tree.

To see why, look at the plot_tree source code:

def plot_tree(booster, fmap='', num_trees=0, rankdir=None, ax=None, **kwargs):
    """Plot specified tree.

and at the documentation:

num_trees (int, default 0) – Specify the ordinal number of target tree

From this it is clear that when you do not specify the num_trees argument, as here, it takes its default value of 0, i.e. the first tree of the ensemble.
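To make the consequence concrete, here is a minimal sketch in plain Python of how a boosted classifier with a logistic objective combines its trees: the leaf values from every tree are summed into a raw score, which is then passed through a sigmoid. Apart from the -0.34 leaf taken from the question, the leaf values below are hypothetical, and the base margin of 0 is an assumption rather than something read from the actual model.

```python
import math

def sigmoid(z):
    """Map a raw score (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(leaf_values, base_margin=0.0):
    """Sum the leaf value reached in each tree, then threshold the probability.

    base_margin=0.0 is an assumption made for this illustration.
    """
    margin = base_margin + sum(leaf_values)
    prob = sigmoid(margin)
    return int(prob > 0.5), prob

# Hypothetical leaf values for two samples in a 3-tree ensemble.
# Both reach the same leaf (-0.34) in tree 0, but different leaves afterwards.
leaf_values_a = [-0.34, -0.30, -0.25]
leaf_values_b = [-0.34, 0.45, 0.40]

label_a, prob_a = predict(leaf_values_a)
label_b, prob_b = predict(leaf_values_b)
print(label_a, round(prob_a, 3))
print(label_b, round(prob_b, 3))
```

Sharing a leaf in tree 0 fixes only one term of the sum, so the final classes can still differ.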

With different values of num_trees you will get different trees, and therefore a different decision path for each sample.

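You cannot plot all the trees of a boosted ensemble at once (and even if you could, it would be of little practical use). plot_tree is just a utility function for inspecting the individual trees of a model. You can see its usage in How to Visualize Gradient Boosting Decision Trees With XGBoost in Python.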