How to extract sklearn decision tree rules to pandas boolean conditions?
There are many posts on how to extract sklearn decision tree rules, like this one, but I could not find any about using pandas.
Take this data and model as an example:
from sklearn.tree import DecisionTreeClassifier
# Create Decision Tree classifier object
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)
# Train Decision Tree classifier
clf = clf.fit(X_train, y_train)
Result:
Expected:
This example has 8 rules.
From left to right, with the dataframe named df:
r1 = (df['glucose']<=127.5) & (df['bmi']<=26.45) & (df['bmi']<=9.1)
……
r8 = (df['glucose']>127.5) & (df['bmi']>28.15) & (df['glucose']>158.5)
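I'm not an expert at extracting sklearn decision tree rules. Getting pandas boolean conditions would help me count the samples and compute other metrics for each rule, so I want to extract each rule as a pandas boolean condition.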
First, let's follow the scikit-learn documentation on the decision tree structure to get information about the tree that was built:
n_nodes = clf.tree_.node_count
children_left = clf.tree_.children_left
children_right = clf.tree_.children_right
feature = clf.tree_.feature
threshold = clf.tree_.threshold
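To make these parallel arrays concrete, here is a minimal sketch that fits a small tree and prints what each array holds. It assumes the iris dataset and a depth-2 tree purely for illustration; they are not the data or model from the question. In `tree_`, a child index of -1 marks a leaf.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Illustrative data and model (assumption: not the pima data from the question)
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0, max_depth=2).fit(X, y)

tree = clf.tree_
is_leaf = tree.children_left == -1      # a leaf has no children
leaf_ids = np.flatnonzero(is_leaf)
split_ids = np.flatnonzero(~is_leaf)

# Each split node tests one feature against one threshold
for node in split_ids:
    print(f"node {node}: X[:, {tree.feature[node]}] <= {tree.threshold[node]:.3f} "
          f"-> left {tree.children_left[node]}, right {tree.children_right[node]}")
print("leaves:", leaf_ids.tolist())
```

Enumerating leaves this way covers every leaf of the tree, whether or not any sample happens to reach it.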
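Then we define two functions. The first, which is recursive, finds the path from the root of the tree to a given node (in our case, each leaf). The second uses that path to write the rule that leads to the node: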
def find_path(node_numb, path, x):
    path.append(node_numb)
    if node_numb == x:
        return True
    left = False
    right = False
    if (children_left[node_numb] != -1):
        left = find_path(children_left[node_numb], path, x)
    if (children_right[node_numb] != -1):
        right = find_path(children_right[node_numb], path, x)
    if left or right:
        return True
    path.remove(node_numb)
    return False
def get_rule(path, column_names):
    mask = ''
    for index, node in enumerate(path):
        # We check that we are not at the leaf
        if index != len(path) - 1:
            # Do we go under or over the threshold?
            if (children_left[node] == path[index+1]):
                mask += "(df['{}']<= {}) \t ".format(column_names[feature[node]], threshold[node])
            else:
                mask += "(df['{}']> {}) \t ".format(column_names[feature[node]], threshold[node])
    # We insert the & at the right places
    mask = mask.replace("\t", "&", mask.count("\t") - 1)
    mask = mask.replace("\t", "")
    return mask
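As an alternative sketch (not the answer's method), the path search and the rule writing can be folded into one depth-first traversal that yields a rule for every leaf. The arrays below are hand-written to mimic `clf.tree_` for a tiny 5-node tree, and the column names are assumed for illustration; none of it is real model output.

```python
# Hand-written arrays mimicking clf.tree_ (assumption: illustrative values only).
# -1 marks "no child"; -2 marks "no feature/threshold" at a leaf.
children_left = [1, -1, 3, -1, -1]
children_right = [2, -1, 4, -1, -1]
feature = [0, -2, 1, -2, -2]
threshold = [127.5, -2.0, 28.15, -2.0, -2.0]
column_names = ["glucose", "bmi"]  # must follow the training feature order


def leaf_rules(node=0, conditions=()):
    """Yield (leaf_id, rule_string) for every leaf below `node`."""
    if children_left[node] == -1:  # reached a leaf: emit the accumulated rule
        yield node, " & ".join(conditions)
        return
    name = column_names[feature[node]]
    thr = threshold[node]
    # Left child takes the "<=" branch, right child the ">" branch
    yield from leaf_rules(children_left[node],
                          conditions + (f"(df['{name}'] <= {thr})",))
    yield from leaf_rules(children_right[node],
                          conditions + (f"(df['{name}'] > {thr})",))


rules = dict(leaf_rules())
for leaf, rule in rules.items():
    print(leaf, rule)
```

With a fitted model, the same function would read the arrays from `clf.tree_` instead of the hand-written lists.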
Finally, we use these two functions to first store the path to each leaf, and then store the rule that leads to each leaf:
import numpy as np

# Leaves reached by the test samples (a leaf that no test sample reaches is skipped)
leave_id = clf.apply(X_test)

paths = {}
for leaf in np.unique(leave_id):
    path_leaf = []
    find_path(0, path_leaf, leaf)
    paths[leaf] = np.unique(np.sort(path_leaf))

rules = {}
for key in paths:
    rules[key] = get_rule(paths[key], pima.columns)
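With the data you provided, the output is shown below. The thresholds match the rules expected in the question, but the labels do not (insulin and bp here versus glucose and bmi there), so check that the column names passed to get_rule are in the order the model was trained on.
rules =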
{3: "(df['insulin']<= 127.5) & (df['bp']<= 26.450000762939453) & (df['bp']<= 9.100000381469727) ",
4: "(df['insulin']<= 127.5) & (df['bp']<= 26.450000762939453) & (df['bp']> 9.100000381469727) ",
6: "(df['insulin']<= 127.5) & (df['bp']> 26.450000762939453) & (df['skin']<= 27.5) ",
7: "(df['insulin']<= 127.5) & (df['bp']> 26.450000762939453) & (df['skin']> 27.5) ",
10: "(df['insulin']> 127.5) & (df['bp']<= 28.149999618530273) & (df['insulin']<= 145.5) ",
11: "(df['insulin']> 127.5) & (df['bp']<= 28.149999618530273) & (df['insulin']> 145.5) ",
13: "(df['insulin']> 127.5) & (df['bp']> 28.149999618530273) & (df['insulin']<= 158.5) ",
14: "(df['insulin']> 127.5) & (df['bp']> 28.149999618530273) & (df['insulin']> 158.5) "}
Because the rules are strings, you can't use them directly as df[rules[3]]; you have to evaluate them with eval, like this: df[eval(rules[3])]
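Since the goal in the question is to count the samples behind each rule, here is a minimal sketch of evaluating such rule strings and counting the matches. The DataFrame and the three rules are made up for illustration; only the string format follows the output above. Passing an explicit namespace to eval restricts which variables a rule can refer to, but eval should still only be used on rule strings you generated yourself.

```python
import pandas as pd

# Made-up data and rules (assumption: illustrative only)
df = pd.DataFrame({
    "glucose": [100, 120, 130, 160, 90, 150],
    "bmi": [20.0, 30.0, 25.0, 35.0, 27.0, 29.0],
})
rules = {
    1: "(df['glucose']<= 127.5) & (df['bmi']<= 26.45) ",
    2: "(df['glucose']<= 127.5) & (df['bmi']> 26.45) ",
    3: "(df['glucose']> 127.5) ",
}

# Turn each rule string into a boolean Series, exposing only `df` to the rule
masks = {leaf: eval(rule, {"df": df}) for leaf, rule in rules.items()}

# Number of rows satisfying each rule
counts = {leaf: int(mask.sum()) for leaf, mask in masks.items()}
print(counts)

# The rows behind one rule
print(df[masks[2]])
```

Because the leaves of a tree partition the data, the counts over all leaf rules should add up to the number of rows.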
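I found a further solution to this problem (building on the second part of what vlemaistre posted) that lets the user go through any node and subset the data based on the pandas boolean conditions.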
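node_id = 3

def datatree_path_summarystats(node_id):
    # Find a leaf whose root-to-leaf path contains node_id
    for k, v in paths.items():
        if node_id in v:
            d = k, v
    ruleskey = d[0]
    # Count the path nodes with an id below node_id; each contributes one condition
    numberofsteps = sum(map(lambda x: x < node_id, d[1]))
    # Look up the full rule of that leaf
    for k, v in rules.items():
        if k == ruleskey:
            b = k, v
    stringsubset = b[1]
    # Keep only the conditions that lead to node_id
    datasubset = "&".join(stringsubset.split('&')[:numberofsteps])
    return datasubset

datasubset = datatree_path_summarystats(node_id)
df[eval(datasubset)]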
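This function goes through the paths that contain the node ID you are looking for. It then truncates the rule at that node, producing the condition that subsets the dataframe to the samples reaching that specific node.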
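If only the subset is needed and not the rule text, scikit-learn's `decision_path` method may be a shorter route: it returns a sparse indicator matrix marking which nodes each sample passes through. The sketch below assumes the iris dataset and a depth-2 tree for illustration, not the data from the question.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Illustrative data and model (assumption: not the pima data from the question)
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0, max_depth=2).fit(X, y)

node_id = 2
# Sparse matrix of shape (n_samples, n_nodes); entry (i, j) is 1 if sample i visits node j
indicator = clf.decision_path(X)
passes_node = indicator[:, node_id].toarray().ravel().astype(bool)

# Rows that reach node_id, for internal nodes as well as leaves
X_subset = X[passes_node]
print(X_subset.shape[0], "samples reach node", node_id)
```

The resulting boolean array can index a DataFrame with the same row order, for example with `df[passes_node]`.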
You can now use export_text.
from sklearn.tree import export_text
r = export_text(loan_tree, feature_names=(list(X_train.columns)))
print(r)
A complete example from sklearn:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_text
iris = load_iris()
X = iris['data']
y = iris['target']
decision_tree = DecisionTreeClassifier(random_state=0, max_depth=2)
decision_tree = decision_tree.fit(X, y)
r = export_text(decision_tree, feature_names=iris['feature_names'])
print(r)
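export_text returns a plain string, so it would still need parsing before it can index a DataFrame. As a sketch of going straight to pandas instead, the fitted tree can be walked while combining boolean Series along the way, which avoids both strings and eval. The helper name `leaf_masks` is hypothetical, and the example assumes the model was fitted on a DataFrame so that column order matches the feature indices. scikit-learn casts inputs to float32 internally, so a value extremely close to a threshold could in principle be routed differently.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
df = pd.DataFrame(iris["data"], columns=iris["feature_names"])
clf = DecisionTreeClassifier(random_state=0, max_depth=2).fit(df, iris["target"])


def leaf_masks(tree, df):
    """Return {leaf_id: boolean Series} by walking the tree from the root."""
    masks = {}
    stack = [(0, pd.Series(True, index=df.index))]  # (node, rows reaching it)
    while stack:
        node, mask = stack.pop()
        if tree.children_left[node] == -1:  # leaf: store the accumulated mask
            masks[int(node)] = mask
            continue
        col = df.columns[tree.feature[node]]
        go_left = df[col] <= tree.threshold[node]
        stack.append((tree.children_left[node], mask & go_left))
        stack.append((tree.children_right[node], mask & ~go_left))
    return masks


masks = leaf_masks(clf.tree_, df)
counts = {leaf: int(m.sum()) for leaf, m in masks.items()}
print(counts)
```

Each mask can be used directly as `df[masks[leaf]]`, and the counts give the number of samples per rule that the question asks for.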