Shape of input to calculate information gain

Question

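I want to calculate information gain on the 20_newsgroup dataset.

I am using the code below (a copy of it is included further down in this question).

As you can see, the input to the algorithm is X, y. My confusion is that X is going to be a matrix with documents in rows and features as columns (for 20_newsgroup it is 11314 x 1000 if I only consider 1000 features).

But according to the concept of information gain, it should be calculated for each feature.

(So I expected to see the code loop over each feature in some way, so that the input to the function would be a matrix where rows are features and columns are classes.)

But here X is not a feature; X represents documents, and I cannot see the part of the code that handles this! (I mean considering each document and then going through each feature of that document, like looping over the rows while also looping over the columns, since the features are stored in columns.)

I have read this and this and many similar questions, but they are not clear about the shape of the input matrix.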

This is the code for reading 20_newsgroup:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

newsgroup_train = fetch_20newsgroups(subset='train')
X, y = newsgroup_train.data, newsgroup_train.target

cv = CountVectorizer(max_df=0.99, min_df=0.001, max_features=1000, stop_words='english', lowercase=True, analyzer='word')
X_vec = cv.fit_transform(X)

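X_vec.shape is (11314, 1000), so its rows are documents, not features, of the 20_newsgroup dataset. I am wondering whether I am calculating information gain the wrong way?

This is the code for information gain: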

import numpy as np

def information_gain(X, y):
    # X: (n_documents, n_features) count matrix; y: class label of each document.

    def _calIg():
        entropy_x_set = 0
        entropy_x_not_set = 0
        for c in classCnt:
            probs = classCnt[c] / float(featureTot)
            entropy_x_set = entropy_x_set - probs * np.log(probs)
            probs = (classTotCnt[c] - classCnt[c]) / float(tot - featureTot)
            if probs > 0:  # skip classes with no documents left: 0 * log(0) is treated as 0
                entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
        for c in classTotCnt:
            if c not in classCnt:
                probs = classTotCnt[c] / float(tot - featureTot)
                entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
        return entropy_before - ((featureTot / float(tot)) * entropy_x_set
                             +  ((tot - featureTot) / float(tot)) * entropy_x_not_set)

    tot = X.shape[0]
    classTotCnt = {}
    entropy_before = 0
    for i in y:
        if i not in classTotCnt:
            classTotCnt[i] = 1
        else:
            classTotCnt[i] = classTotCnt[i] + 1
    for c in classTotCnt:
        probs = classTotCnt[c] / float(tot)
        entropy_before = entropy_before - probs * np.log(probs)

    nz = X.T.nonzero()
    pre = 0
    classCnt = {}
    featureTot = 0
    information_gain = []
    for i in range(0, len(nz[0])):
        if (i != 0 and nz[0][i] != pre):
            for notappear in range(pre+1, nz[0][i]):
                information_gain.append(0)
            ig = _calIg()
            information_gain.append(ig)
            pre = nz[0][i]
            classCnt = {}
            featureTot = 0
        featureTot = featureTot + 1
        yclass = y[nz[1][i]]
        if yclass not in classCnt:
            classCnt[yclass] = 1
        else:
            classCnt[yclass] = classCnt[yclass] + 1
    ig = _calIg()
    information_gain.append(ig)

    return np.asarray(information_gain)

Answer 1

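Well, after going through the code in detail, I learned more about X.T.nonzero().

It is indeed correct that information gain needs to loop over features. It is also true that the matrix scikit-learn gives us here is document-by-feature.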
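To make the per-feature idea concrete, here is a minimal sketch of the straightforward version: take a document-by-feature matrix and loop over its columns, splitting the documents into "feature present" and "feature absent" for each column. It is not the code from the question. The names entropy, information_gain_per_feature, X_toy and y_toy are made up for illustration, and the matrix is a plain list of lists rather than a scikit-learn sparse matrix.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0  # an empty split contributes nothing
    return -sum((n / total) * math.log(n / total) for n in Counter(labels).values())


def information_gain_per_feature(X, y):
    """X: list of documents (rows), each a list of counts per feature (columns)."""
    n_docs = len(X)
    n_features = len(X[0])
    h_before = entropy(y)
    gains = []
    for j in range(n_features):  # loop over columns, i.e. over features
        present = [y[i] for i in range(n_docs) if X[i][j] != 0]
        absent = [y[i] for i in range(n_docs) if X[i][j] == 0]
        h_after = (len(present) / n_docs) * entropy(present) \
            + (len(absent) / n_docs) * entropy(absent)
        gains.append(h_before - h_after)
    return gains


# Toy data (hypothetical): 4 documents x 3 features, two classes.
X_toy = [
    [2, 1, 0],
    [1, 0, 0],
    [0, 3, 0],
    [0, 1, 0],
]
y_toy = [0, 0, 1, 1]

gains = information_gain_per_feature(X_toy, y_toy)
print(gains)  # one value per feature (column), not per document
```

The input stays documents in rows, features in columns; the result has one entry per column. Feature 0 separates the two classes perfectly, so its gain equals the entropy of y, while feature 2 never occurs and gets a gain of 0.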

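But:

In the code, X.T.nonzero() is used, which technically returns the positions of all nonzero values as arrays. The next line then loops over the range of that array's length, range(0, len(X.T.nonzero()[0])). Because X is transposed first, the first array holds feature indices and the second holds document indices, which is why the class is looked up with y[nz[1][i]].

Overall, X.T.nonzero()[0] gives us the feature index of every nonzero entry :)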
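A small sketch of what X.T.nonzero() returns, using a dense NumPy array as a stand-in for the real matrix (X_toy, y_toy, feat_idx and doc_idx are made-up names). For a dense array, nonzero() reports entries in row-major order, so after the transpose they come out grouped by feature, which is what the loop in the question relies on. Whether a SciPy sparse matrix returns its entries in the same order is an assumption worth checking before relying on it.

```python
import numpy as np

# Toy document-by-feature matrix (hypothetical): 3 documents x 4 features.
X_toy = np.array([
    [1, 0, 2, 0],
    [0, 0, 1, 0],
    [3, 0, 0, 0],
])
y_toy = np.array([0, 1, 1])  # class label of each document

# After the transpose, rows are features and columns are documents.
feat_idx, doc_idx = X_toy.T.nonzero()

print(feat_idx)        # feature index of each nonzero entry: [0 0 2 2]
print(doc_idx)         # document index of each nonzero entry: [0 2 0 1]
print(y_toy[doc_idx])  # class of the document behind each entry: [0 1 0 1]
```

Features 1 and 3 have no nonzero entries, so they never appear in feat_idx. This is why the code in the question appends 0 for the indices it skips between two consecutive features.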

machine-learning

entropy

feature-extraction

feature-selection

scikit-learn