How to read a graphviz decision tree?
I have a decision tree in a graphviz file, obtained with scikit-learn's export_graphviz function. To keep things simple I limited the depth to 3, and got this output:
digraph Tree {
node [shape=box] ;
0 [label="userAcceleration-magnitude-mean <= 0.973\ngini = 0.875\nsamples = 3878\nvalue = [471, 467, 485, 484, 486, 486, 513, 486]\nclass = Walking"] ;
1 [label="userAcceleration-x-IQR <= 0.073\ngini = 0.834\nsamples = 2881\nvalue = [471, 443, 476, 484, 9, 486, 512, 0]\nclass = Walking"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="rotationRate-z-IQR <= 0.396\ngini = 0.606\nsamples = 1020\nvalue = [466, 80, 43, 2, 0, 429, 0, 0]\nclass = Push-ups"] ;
1 -> 2 ;
3 [label="gini = 0.355\nsamples = 515\nvalue = [5, 74, 28, 2, 0, 406, 0, 0]\nclass = Resting"] ;
2 -> 3 ;
4 [label="gini = 0.164\nsamples = 505\nvalue = [461, 6, 15, 0, 0, 23, 0, 0]\nclass = Push-ups"] ;
2 -> 4 ;
5 [label="rotationRate-magnitude-median <= 0.844\ngini = 0.764\nsamples = 1861\nvalue = [5, 363, 433, 482, 9, 57, 512, 0]\nclass = Walking"] ;
1 -> 5 ;
6 [label="gini = 0.596\nsamples = 974\nvalue = [2, 73, 388, 476, 0, 23, 12, 0]\nclass = Lunges"] ;
5 -> 6 ;
7 [label="gini = 0.571\nsamples = 887\nvalue = [3, 290, 45, 6, 9, 34, 500, 0]\nclass = Walking"] ;
5 -> 7 ;
8 [label="userAcceleration-y-max <= 2.702\ngini = 0.533\nsamples = 997\nvalue = [0, 24, 9, 0, 477, 0, 1, 486]\nclass = Running"] ;
0 -> 8 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
9 [label="rotationRate-z-IQR <= 2.4\ngini = 0.236\nsamples = 536\nvalue = [0, 22, 6, 0, 466, 0, 1, 41]\nclass = Jump Rope"] ;
8 -> 9 ;
10 [label="gini = 0.553\nsamples = 53\nvalue = [0, 11, 6, 0, 3, 0, 0, 33]\nclass = Running"] ;
9 -> 10 ;
11 [label="gini = 0.08\nsamples = 483\nvalue = [0, 11, 0, 0, 463, 0, 1, 8]\nclass = Jump Rope"] ;
9 -> 11 ;
12 [label="altitude-median <= 5.0\ngini = 0.068\nsamples = 461\nvalue = [0, 2, 3, 0, 11, 0, 0, 445]\nclass = Running"] ;
8 -> 12 ;
13 [label="gini = 0.0\nsamples = 445\nvalue = [0, 0, 0, 0, 0, 0, 0, 445]\nclass = Running"] ;
12 -> 13 ;
14 [label="gini = 0.477\nsamples = 16\nvalue = [0, 2, 3, 0, 11, 0, 0, 0]\nclass = Jump Rope"] ;
12 -> 14 ;
}
Let's focus on the first few nodes:
0 [label="userAcceleration-magnitude-mean <= 0.973\ngini = 0.875\nsamples = 3878\nvalue = [471, 467, 485, 484, 486, 486, 513, 486]\nclass = Walking"] ;
1 [label="userAcceleration-x-IQR <= 0.073\ngini = 0.834\nsamples = 2881\nvalue = [471, 443, 476, 484, 9, 486, 512, 0]\nclass = Walking"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="rotationRate-z-IQR <= 0.396\ngini = 0.606\nsamples = 1020\nvalue = [466, 80, 43, 2, 0, 429, 0, 0]\nclass = Push-ups"] ;
1 -> 2 ;
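Here is what I don't understand:
- If userAcceleration-magnitude-mean is less than or equal to 0.973, then the class is "Walking", and otherwise I jump to node 1, right? Or is it the other way around?
- How do I read the labels that start with "gini = 0.596"? gini is not one of my decision tree's features, so what does it mean?
- What about the other values, such as nsamples and nvalue? What do they represent?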
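1: If userAcceleration-magnitude-mean is less than or equal to 0.973, follow the edge labeled True, which leads to node 1 and continues further down the tree. Otherwise follow the edge labeled False to node 8. The class line on a node only names the majority class among the samples at that node; a prediction is made once a leaf is reached.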
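2: I googled it and found "gini coefficient: a statistical measure of the degree of variation represented in a set of values, used especially in analysing income inequality". That is the economics usage; here the term is used outside that context. In a scikit-learn tree, gini is the Gini impurity of the node, 1 - sum(p_k^2) over the class proportions p_k. It is 0 when every sample at the node belongs to one class and grows as the classes become more evenly mixed.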
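3: There is an underlying structure here. "nsamples" and "nvalue" are just samples and value, each preceded by the \n line break inside the label. samples is the number of training samples that reach that node. The root has 3878 samples, its left child has 2881 and its right child has 997. Since 2881 + 997 = 3878, I believe userAcceleration-magnitude-mean <= 0.973 is True for 2881 samples and False for the other 997. The value list has a similar structure: it holds the number of samples per class at that node, so its entries sum to that node's samples count.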
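To make point 1 concrete, here is a minimal sketch that follows the True/False edges of the tree above by hand. The split features, thresholds and leaf classes are copied from the DOT output; the sample values are hypothetical. It assumes that below the root, where edges carry no label, the first child listed is the True branch, as it is for the labeled root edges. The thresholds in the labels are rounded, so this is an illustration rather than a replacement for the fitted model's predict method.

```python
# Internal nodes: id -> (feature, threshold, child_if_true, child_if_false).
# Assumption: the first child listed in the DOT file is the True branch.
SPLITS = {
    0: ("userAcceleration-magnitude-mean", 0.973, 1, 8),
    1: ("userAcceleration-x-IQR", 0.073, 2, 5),
    2: ("rotationRate-z-IQR", 0.396, 3, 4),
    5: ("rotationRate-magnitude-median", 0.844, 6, 7),
    8: ("userAcceleration-y-max", 2.702, 9, 12),
    9: ("rotationRate-z-IQR", 2.4, 10, 11),
    12: ("altitude-median", 5.0, 13, 14),
}

# Leaf nodes: id -> class shown in the leaf's label.
LEAVES = {
    3: "Resting", 4: "Push-ups", 6: "Lunges", 7: "Walking",
    10: "Running", 11: "Jump Rope", 13: "Running", 14: "Jump Rope",
}


def walk(sample):
    """Follow the splits from the root and return (leaf class, visited node ids)."""
    node, path = 0, [0]
    while node in SPLITS:
        feature, threshold, if_true, if_false = SPLITS[node]
        node = if_true if sample[feature] <= threshold else if_false
        path.append(node)
    return LEAVES[node], path


# Hypothetical feature values, chosen only to exercise one path.
sample = {
    "userAcceleration-magnitude-mean": 1.5,  # > 0.973, so False: go to node 8
    "userAcceleration-y-max": 3.0,           # > 2.702, so False: go to node 12
    "altitude-median": 2.0,                  # <= 5.0, so True: go to node 13
}
print(walk(sample))  # ('Running', [0, 8, 12, 13])
```

Only the features on the visited path are looked up, which is why the hypothetical sample lists three of them.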
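For point 2, the gini line can be checked against the value list of the same node. The sketch below computes the Gini impurity, 1 - sum(p_k^2), from class counts copied out of the DOT output; the function name gini_impurity is just an illustrative choice.

```python
def gini_impurity(counts):
    """Gini impurity 1 - sum(p_k^2), with p_k the class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# value lists copied from nodes 0, 11 and 13 in the DOT output.
root = [471, 467, 485, 484, 486, 486, 513, 486]
leaf_11 = [0, 11, 0, 0, 463, 0, 1, 8]
leaf_13 = [0, 0, 0, 0, 0, 0, 0, 445]

print(round(gini_impurity(root), 3))     # 0.875
print(round(gini_impurity(leaf_11), 3))  # 0.08
print(round(gini_impurity(leaf_13), 3))  # 0.0
```

The results agree with the gini values printed in those three labels: the root is an almost even mix of eight classes, while node 13 contains a single class.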
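The bookkeeping in point 3 can also be verified mechanically. This sketch parses three node lines copied from the DOT output using only the standard library (parse_node is a hypothetical helper, not part of scikit-learn or Graphviz) and checks that value sums to samples and that the two children of the root add up to the root.

```python
import re

# Node lines for the root (0) and its two children (1 and 8). Raw strings keep
# the literal backslash-n that Graphviz uses as a line break inside a label.
DOT_LINES = [
    r'0 [label="userAcceleration-magnitude-mean <= 0.973\ngini = 0.875\nsamples = 3878\nvalue = [471, 467, 485, 484, 486, 486, 513, 486]\nclass = Walking"] ;',
    r'1 [label="userAcceleration-x-IQR <= 0.073\ngini = 0.834\nsamples = 2881\nvalue = [471, 443, 476, 484, 9, 486, 512, 0]\nclass = Walking"] ;',
    r'8 [label="userAcceleration-y-max <= 2.702\ngini = 0.533\nsamples = 997\nvalue = [0, 24, 9, 0, 477, 0, 1, 486]\nclass = Running"] ;',
]


def parse_node(line):
    """Return (node id, samples, value list) parsed from one DOT node line."""
    node_id = int(line.split()[0])
    label = re.search(r'label="(.*)"', line).group(1)
    fields = {}
    for part in label.split(r"\n"):  # split on the two characters \ and n
        if " = " in part:            # skips the "feature <= threshold" line
            key, val = part.split(" = ", 1)
            fields[key] = val
    samples = int(fields["samples"])
    value = [int(v) for v in fields["value"].strip("[]").split(",")]
    return node_id, samples, value


nodes = {}
for line in DOT_LINES:
    node_id, samples, value = parse_node(line)
    nodes[node_id] = (samples, value)

# Each value list sums to the node's samples count.
for node_id, (samples, value) in nodes.items():
    print(node_id, samples, sum(value) == samples)

# The children of the root partition its samples, class by class.
child_sum = [a + b for a, b in zip(nodes[1][1], nodes[8][1])]
print(child_sum == nodes[0][1])
```

For this output every check holds, which matches the reading that value holds per-class sample counts at each node.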