Equivalent of predict_proba for DecisionTreeRegressor
scikit-learn's DecisionTreeClassifier supports predicting the probability of each class via the predict_proba() function. This is absent from DecisionTreeRegressor:
AttributeError: 'DecisionTreeRegressor' object has no attribute 'predict_proba'
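My understanding is that the underlying mechanics of decision tree classifiers and regressors are very similar, the main difference being that a regressor's prediction is calculated as the mean of the training targets in the leaf it lands in. So I would expect it to be possible to extract the probability of each value.
Is there another way to simulate this, e.g. by processing the tree structure? The code for DecisionTreeClassifier's predict_proba was not directly transferable.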
You can get that data from the tree structure:
import sklearn
import numpy as np
import pandas as pd  # used below for the DataFrame and Series steps
import graphviz
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.datasets import make_regression
# Generate a simple dataset
X, y = make_regression(n_features=2, n_informative=2, random_state=0)
clf = DecisionTreeRegressor(random_state=0, max_depth=2)
clf.fit(X, y)
# Visualize the tree
graphviz.Source(sklearn.tree.export_graphviz(clf)).view()
>>> pd.Series(clf.predict(X[:5]))
0 184.005667
1 53.017289
2 184.005667
3 -20.603498
4 -97.414461
If you call clf.apply(X), you get the ID of the node each instance belongs to:
array([6, 5, 6, 3, 2, 5, 5, 3, 6, ... 5, 5, 6, 3, 2, 2, 5, 2, 2], dtype=int64)
Merge it with the target variable:
df = pd.DataFrame(np.vstack([y, clf.apply(X)]), index=['y','node_id']).T
y node_id
0 190.370562 6.0
1 13.339570 5.0
2 141.772669 6.0
3 -3.069627 3.0
4 -26.062465 2.0
5 54.922541 5.0
6 25.952881 5.0
...
Now if you do a groupby on node_id followed by a mean, you get the same values as clf.predict(X):
>>> df.groupby('node_id').mean()
y
node_id
2.0 -97.414461
3.0 -20.603498
5.0 53.017289
6.0 184.005667
These are the values stored in the leaves of our tree:
>>> clf.tree_.value[6]
array([[184.00566679]])
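The equivalence between the per-leaf mean, the stored leaf value, and the prediction can be checked directly. This is a minimal sketch, assuming the same make_regression setup as above and the default squared-error criterion; the names reg, leaf_ids, leaf_means, and stored are introduced here for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_features=2, n_informative=2, random_state=0)
reg = DecisionTreeRegressor(random_state=0, max_depth=2).fit(X, y)

# Leaf index for every training row.
leaf_ids = reg.apply(X)

# Mean of the training targets inside each leaf.
leaf_means = pd.Series(y).groupby(leaf_ids).mean()

# Mapping each row's leaf to its mean should reproduce predict().
reconstructed = leaf_means.loc[leaf_ids].to_numpy()
matches_predict = np.allclose(reconstructed, reg.predict(X))

# The same numbers are stored in tree_.value (shape: nodes x outputs x 1).
stored = reg.tree_.value[leaf_means.index.to_numpy(), 0, 0]
matches_stored = np.allclose(stored, leaf_means.to_numpy())

print(matches_predict, matches_stored)
```

Both checks are expected to hold only for criteria whose leaf value is the mean; with an absolute-error criterion the leaf stores the median instead.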
To get the node IDs for a new dataset, you need to call
clf.decision_path(X[:5]).toarray()
which returns an array like this
array([[1, 0, 0, 0, 1, 0, 1],
[1, 0, 0, 0, 1, 1, 0],
[1, 0, 0, 0, 1, 0, 1],
[1, 1, 0, 1, 0, 0, 0],
[1, 1, 1, 0, 0, 0, 0]], dtype=int64)
where you need to take the last non-zero element of each row (i.e. the leaf)
>>> pd.DataFrame(clf.decision_path(X[:5]).toarray()).apply(lambda x: x.to_numpy().nonzero()[0].max(), axis=1)
0 6
1 5
2 6
3 3
4 2
dtype: int64
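As a cross-check, the leaf found from the decision path should agree with what apply() returns for the same rows, since apply() reports the leaf index directly. A small sketch under the same setup as above; reg, X_new, leaf_from_path, and leaf_from_apply are illustrative names, not part of the original answer.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_features=2, n_informative=2, random_state=0)
reg = DecisionTreeRegressor(random_state=0, max_depth=2).fit(X, y)

X_new = X[:5]

# Leaf taken as the last non-zero column of each decision-path row.
path = reg.decision_path(X_new).toarray()
leaf_from_path = np.array([row.nonzero()[0].max() for row in path])

# apply() returns the index of the leaf each row ends up in.
leaf_from_apply = reg.apply(X_new)

same = np.array_equal(leaf_from_path, leaf_from_apply)
print(same)
```

If the two agree, apply() is the shorter route when only the leaf is needed, while decision_path() remains useful when the intermediate nodes matter.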
So if you want to predict the median instead of the mean, you would do
>>> pd.DataFrame(clf.decision_path(X[:5]).toarray()).apply(lambda x: x.to_numpy().nonzero()[0].max(
), axis=1).to_frame(name='node_id').join(df.groupby('node_id').median(), on='node_id')['y']
0 181.381106
1 54.053170
2 181.381106
3 -28.591188
4 -93.891889
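The same median lookup can be written a little more compactly. This is only a sketch of one possible variant, assuming the setup above and using apply() to find the leaves; leaf_medians and median_pred are names made up for this example.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_features=2, n_informative=2, random_state=0)
reg = DecisionTreeRegressor(random_state=0, max_depth=2).fit(X, y)

# Median of the training targets in each leaf, indexed by leaf ID.
leaf_medians = pd.Series(y).groupby(reg.apply(X)).median()

# Look up the median of the leaf that each new row lands in.
X_new = X[:5]
median_pred = pd.Series(reg.apply(X_new)).map(leaf_medians)
print(median_pred)
```

Swapping median() for another aggregation such as quantile(0.9) would give other per-leaf summaries in the same way.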
This function adapts the code from the answer above to provide the probability of each outcome:
from sklearn.tree import DecisionTreeRegressor
import pandas as pd

def decision_tree_regressor_predict_proba(X_train, y_train, X_test, **kwargs):
    """Trains DecisionTreeRegressor model and predicts probabilities of each y.

    Args:
        X_train: Training features.
        y_train: Training labels.
        X_test: New data to predict on.
        **kwargs: Other arguments passed to DecisionTreeRegressor.

    Returns:
        DataFrame with columns for record_id (row of X_test), y
        (predicted value), and prob (of that y value).
        The sum of prob equals 1 for each record_id.
    """
    # Train model.
    m = DecisionTreeRegressor(**kwargs).fit(X_train, y_train)
    # Get y values corresponding to each node.
    node_ys = pd.DataFrame({'node_id': m.apply(X_train), 'y': y_train})
    # Calculate probability as 1 / number of y values per node.
    # Selecting the 'y' column keeps the result a Series rather than a DataFrame.
    node_ys['prob'] = 1 / node_ys.groupby('node_id')['y'].transform('count')
    # Aggregate per node-y, in case of multiple training records with the same y.
    node_ys_dedup = node_ys.groupby(['node_id', 'y']).prob.sum().to_frame()\
        .reset_index()
    # Extract predicted leaf node for each new observation.
    leaf = pd.DataFrame(m.decision_path(X_test).toarray()).apply(
        lambda x: x.to_numpy().nonzero()[0].max(), axis=1).to_frame(
            name='node_id')
    leaf['record_id'] = leaf.index
    # Merge with y values and drop node_id.
    return leaf.merge(node_ys_dedup, on='node_id').drop(
        'node_id', axis=1).sort_values(['record_id', 'y'])
Example (see this notebook):
# Note: load_boston was removed in scikit-learn 1.2, so this example needs an older release.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Works better with min_samples_leaf > 1.
res = decision_tree_regressor_predict_proba(X_train, y_train, X_test,
                                            random_state=0, min_samples_leaf=5)
res[res.record_id == 2]
# record_id y prob
# 25 2 20.6 0.166667
# 26 2 22.3 0.166667
# 27 2 22.7 0.166667
# 28 2 23.8 0.333333
# 29 2 25.0 0.166667
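Two properties of this kind of output are easy to verify: the probabilities for each record sum to 1, and the probability-weighted mean of y equals the regressor's ordinary prediction. The sketch below re-derives the per-leaf distribution on load_diabetes, which is an assumption made here because that dataset ships with current scikit-learn releases; the names reg, counts, and res are illustrative and the code is a simplified variant, not the function above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = DecisionTreeRegressor(random_state=0, min_samples_leaf=5).fit(X_train, y_train)

# Count how often each y value occurs inside each leaf of the training data.
train = pd.DataFrame({"node_id": reg.apply(X_train), "y": y_train})
counts = train.groupby(["node_id", "y"]).size().rename("n").reset_index()

# Turn counts into an empirical probability per leaf.
counts["prob"] = counts["n"] / counts.groupby("node_id")["n"].transform("sum")

# Attach the leaf distribution to every test row.
test_leaves = pd.DataFrame(
    {"record_id": np.arange(len(X_test)), "node_id": reg.apply(X_test)}
)
res = test_leaves.merge(counts[["node_id", "y", "prob"]], on="node_id")

# Check 1: probabilities sum to 1 for each record.
prob_sums = res.groupby("record_id")["prob"].sum()

# Check 2: the expected value under the leaf distribution equals predict().
expected = (res["y"] * res["prob"]).groupby(res["record_id"]).sum().sort_index()
agrees = np.allclose(expected.to_numpy(), reg.predict(X_test))
print(np.allclose(prob_sums, 1.0), agrees)
```

Note that these "probabilities" are just the empirical frequencies of training targets in a leaf, so they are only as informative as the number of samples per leaf allows.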