What will be the Information Gain for the variable that will be placed in the root of the tree?

I am trying to solve this problem from Stepic:

Download a dataset with three variables: sex, exang, num. Imagine that we want to use a decision tree to classify whether or not a patient has heart disease (variable num) based on two criteria: sex and the presence/absence of angina pectoris (exang). Train a decision tree on this data, using entropy as the criterion. Specify what the Information Gain value will be for the variable that is placed in the root of the tree. The answer must be a number with a precision of 3 decimal places.

Here is what I did:

from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf.fit(X, y)
tree.plot_tree(clf, filled=True)

l_node = clf.tree_.children_left[0]
r_node = clf.tree_.children_left[1]
n1 = clf.tree_.n_node_samples[l_node]
n2 = clf.tree_.n_node_samples[r_node]
e1 = clf.tree_.impurity[l_node]
e2 = clf.tree_.impurity[r_node]
n = n1 + n2

# 0.996 is the entropy I computed for the root node
ig = 0.996 - (n1 * e1 + n2 * e2) / n
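
My understanding is that this implements the usual definition of information gain for a binary split:

IG = H(parent) − (n1/n) · H(left) − (n2/n) · H(right), where H(S) = −Σ p_k · log2(p_k)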

The Information Gain comes out to 0.607, but when I submit it, the answer is marked as incorrect. What am I doing wrong?

You are not creating the decision tree with the required criterion: entropy. If you don't specify anything, the algorithm uses the gini criterion by default (as you can see in your plot).

The code should be:

clf = tree.DecisionTreeClassifier(criterion="entropy")
clf.fit(X, y)

Sometimes changing the criterion changes the tree itself (which I doubt happens in this case), but you will be able to see the entropy of each split, which is very helpful for your task.
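
For example, once the tree is fitted with the entropy criterion, the information gain of the root split can be read straight from the tree_ arrays. This is just a rough sketch reusing your X and y; the attribute names (children_left, children_right, n_node_samples, impurity) are standard scikit-learn ones:

from sklearn import tree

clf = tree.DecisionTreeClassifier(criterion="entropy")
clf.fit(X, y)

t = clf.tree_
root = 0
left = t.children_left[root]    # left child of the root split
right = t.children_right[root]  # right child of the root split

n = t.n_node_samples[root]
n_left = t.n_node_samples[left]
n_right = t.n_node_samples[right]

# With criterion="entropy", impurity[] holds entropy values, so this is
# IG = H(root) - weighted average entropy of the two children
ig = t.impurity[root] - (n_left * t.impurity[left] + n_right * t.impurity[right]) / n
print(round(ig, 3))

Reading the parent entropy from t.impurity[root] also avoids hard-coding the 0.996 value.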