What will be the Information Gain for the variable that will be placed in the root of the tree?

I am trying to solve this problem from Stepic:

Download a dataset with three variables: sex, exang, num. Imagine that we want to use a decision tree to classify whether or not a patient has heart disease (variable num) based on two criteria: sex and the presence/absence of angina pectoris (exang). Train a decision tree on this data, using entropy as the criterion. Specify what the Information Gain value will be for the variable that is placed in the root of the tree. The answer must be a number with a precision of 3 decimal places.

Here is what I did:

from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf.fit(X, y)
tree.plot_tree(clf, filled=True)

l_node = clf.tree_.children_left[0]
r_node = clf.tree_.children_left[1]
n1 = clf.tree_.n_node_samples[l_node]
n2 = clf.tree_.n_node_samples[r_node]
e1 = clf.tree_.impurity[l_node]
e2 = clf.tree_.impurity[r_node]
n = n1 + n2

# 0.996 is the entropy I computed for the root node
ig = 0.996 - (n1 * e1 + n2 * e2) / n
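
My understanding is that this implements the usual definition of information gain for a binary split:

IG = H(parent) − (n1/n) · H(left) − (n2/n) · H(right), where H(S) = −Σ p_k · log2(p_k)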

The Information Gain comes out to 0.607, but when I submit it, the answer is marked as incorrect. What am I doing wrong?

You are not creating the decision tree with the required criterion: entropy. If you don't specify anything, the algorithm uses the gini criterion by default (as you can see in your plot).

The code should be:

clf = tree.DecisionTreeClassifier(criterion="entropy")
clf.fit(X, y)

Sometimes changing the criterion changes the tree itself (which I doubt happens in this case), but you will be able to see the entropy of each split, which is very helpful for your task.
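
For example, once the tree is fitted with the entropy criterion, the information gain of the root split can be read straight from the tree_ arrays. This is just a rough sketch reusing your X and y; the attribute names (children_left, children_right, n_node_samples, impurity) are standard scikit-learn ones:

from sklearn import tree

clf = tree.DecisionTreeClassifier(criterion="entropy")
clf.fit(X, y)

t = clf.tree_
root = 0
left = t.children_left[root]    # left child of the root split
right = t.children_right[root]  # right child of the root split

n = t.n_node_samples[root]
n_left = t.n_node_samples[left]
n_right = t.n_node_samples[right]

# With criterion="entropy", impurity[] holds entropy values, so this is
# IG = H(root) - weighted average entropy of the two children
ig = t.impurity[root] - (n_left * t.impurity[left] + n_right * t.impurity[right]) / n
print(round(ig, 3))

Reading the parent entropy from t.impurity[root] also avoids hard-coding the 0.996 value.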