Does my code calculate the Entropy/Conditional Entropy of a data set correctly?

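I am writing a Java program that computes the entropy, joint entropy, conditional entropy, and so on of a given data set. The class in question is below: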

public class Entropy {

private Frequency<String> iFrequency = new Frequency<String>();
private Frequency<String> rFrequency = new Frequency<String>();

Entropy(){
    super();
}

public void setInterestedFrequency(List<String> interestedFrequency){
    for(String s: interestedFrequency){
        this.iFrequency.addValue(s);
    }
}

public void setReducingFrequency(List<String> reducingFrequency){
    for(String s:reducingFrequency){
        this.rFrequency.addValue(s);
    }
}

private double log(double num, int base){
   return Math.log(num)/Math.log(base);
}

public double entropy(List<String> data){

    double entropy = 0.0;
    double prob = 0.0;
    Frequency<String> frequency = new Frequency<String>();

    for(String s:data){
        frequency.addValue(s);
    }

    String[] keys = frequency.getKeys();

    for(int i=0;i<keys.length;i++){

        prob = frequency.getPct(keys[i]);
        entropy = entropy - prob * log(prob,2);
    }

    return entropy;
}

/*
* return conditional probability of P(interestedClass|reducingClass)
* */
public double conditionalProbability(List<String> interestedSet,
                                     List<String> reducingSet,
                                     String interestedClass,
                                     String reducingClass){
    List<Integer> conditionalData = new LinkedList<Integer>();

    if(iFrequency.getKeys().length==0){
        this.setInterestedFrequency(interestedSet);
    }

    if(rFrequency.getKeys().length==0){
        this.setReducingFrequency(reducingSet);
    }

    for(int i = 0;i<reducingSet.size();i++){
        if(reducingSet.get(i).equalsIgnoreCase(reducingClass)){
            if(interestedSet.get(i).equalsIgnoreCase(interestedClass)){
                conditionalData.add(i);
            }
        }
    }

    int numerator = conditionalData.size();
    int denominator = this.rFrequency.getNum(reducingClass);

    return (double)numerator/denominator;
}

public double jointEntropy(List<String> set1, List<String> set2){

    String[] set1Keys;
    String[] set2Keys;
    Double prob1;
    Double prob2;
    Double entropy = 0.0;

    if(this.iFrequency.getKeys().length==0){
        this.setInterestedFrequency(set1);
    }

    if(this.rFrequency.getKeys().length==0){
        this.setReducingFrequency(set2);
    }

    set1Keys = this.iFrequency.getKeys();
    set2Keys = this.rFrequency.getKeys();

    for(int i=0;i<set1Keys.length;i++){
        for(int j=0;j<set2Keys.length;j++){
            prob1 = iFrequency.getPct(set1Keys[i]);
            prob2 = rFrequency.getPct(set2Keys[j]);

            entropy = entropy - (prob1*prob2)*log((prob1*prob2),2);
        }
    }

    return entropy;
}

public double conditionalEntropy(List<String> interestedSet, List<String> reducingSet){

    double jointEntropy = jointEntropy(interestedSet,reducingSet);
    double reducingEntropyX = entropy(reducingSet);
    double conEntYgivenX = jointEntropy - reducingEntropyX;

    return conEntYgivenX;
}

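For the past few days I have been trying to figure out why my entropy calculation is almost always exactly the same as my conditional entropy calculation.

I am using the following formulas:

H(X) = - Sigma from x=1 to x=n of p(x)*log(p(x))

H(XY) = - Sigma from x=1 to x=n, y=1 to y=m of (p(x)*p(y)) * log(p(x)*p(y))

H(X|Y) = H(XY) - H(X)

The entropy and conditional entropy values I get are nearly identical.

With the data set I use for testing, I get the following values: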

@Test
public void testEntropy(){
    FileHelper fileHelper = new FileHelper();
    List<String> lines = fileHelper.readFileToMemory("");
    Data freshData = fileHelper.parseCSVData(lines);

    LinkedList<String> headersToChange = new LinkedList<String>();
    headersToChange.add("lwt");

    Data discreteData = freshData.discretize(freshData.getData(),headersToChange,1,10);

    Entropy entropy = new Entropy();
    Double result = entropy.entropy(discreteData.getData().get("lwt"));
    assertEquals(2.48,result,.006);
}

@Test
public void testConditionalProbability(){

    FileHelper fileHelper = new FileHelper();
    List<String> lines = fileHelper.readFileToMemory("");
    Data freshData = fileHelper.parseCSVData(lines);

    LinkedList<String> headersToChange = new LinkedList<String>();
    headersToChange.add("age");
    headersToChange.add("lwt");


    Data discreteData = freshData.discretize(freshData.getData(), headersToChange, 1, 10);

    Entropy entropy = new Entropy();
    double conditionalProb = entropy.conditionalProbability(discreteData.getData().get("lwt"),discreteData.getData().get("age"),"4","6");
    assertEquals(.1,conditionalProb,.005);
}

@Test
public void testJointEntropy(){


    FileHelper fileHelper = new FileHelper();
    List<String> lines = fileHelper.readFileToMemory("");
    Data freshData = fileHelper.parseCSVData(lines);

    LinkedList<String> headersToChange = new LinkedList<String>();
    headersToChange.add("age");
    headersToChange.add("lwt");

    Data discreteData = freshData.discretize(freshData.getData(), headersToChange, 1, 10);

    Entropy entropy = new Entropy();
    double jointEntropy = entropy.jointEntropy(discreteData.getData().get("lwt"),discreteData.getData().get("age"));
    assertEquals(5.05,jointEntropy,.006);
}

@Test
public void testSpecifiedConditionalEntropy(){

    FileHelper fileHelper = new FileHelper();
    List<String> lines = fileHelper.readFileToMemory("");
    Data freshData = fileHelper.parseCSVData(lines);

    LinkedList<String> headersToChange = new LinkedList<String>();
    headersToChange.add("age");
    headersToChange.add("lwt");

    Data discreteData = freshData.discretize(freshData.getData(), headersToChange, 1, 10);

    Entropy entropy = new Entropy();
    double specCondiEntropy = entropy.specifiedConditionalEntropy(discreteData.getData().get("lwt"),discreteData.getData().get("age"),"4","6");
    assertEquals(.332,specCondiEntropy,.005);

}

@Test
public void testConditionalEntropy(){

    FileHelper fileHelper = new FileHelper();
    List<String> lines = fileHelper.readFileToMemory("");
    Data freshData = fileHelper.parseCSVData(lines);

    LinkedList<String> headersToChange = new LinkedList<String>();
    headersToChange.add("age");
    headersToChange.add("lwt");

    Data discreteData = freshData.discretize(freshData.getData(), headersToChange, 1, 10);

    Entropy entropy = new Entropy();
    Double result = entropy.conditionalEntropy(discreteData.getData().get("lwt"),discreteData.getData().get("age"));
    assertEquals(2.47,result,.006);
}

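Everything compiles correctly, but I am fairly sure my conditional entropy calculation is wrong, and I cannot tell where I made the mistake.

The values in the unit tests are the values I currently get. They are identical to the output of the functions above.

At one point I also tested with the following: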

List<String> survived = Arrays.asList("1","0","1","1","0","1","0","0","0","1","0","1","0","0","1");
List<String> sex = Arrays.asList("0","1","0","1","1","0","0","1","1","0","1","0","0","1","1");

where male = 1 and survived = 1. I then used this to compute

double result = entropy.entropy(survived);
assertEquals(.996,result,.006);

and

double jointEntropy = entropy.jointEntropy(survived,sex);
assertEquals(1.99,jointEntropy,.006);

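I also checked my work by computing the values by hand. You can see a picture here. Since my code gave the same values I got when I worked the problem by hand, and since the other functions are very simple and only use the entropy and joint entropy functions, I assumed everything was fine.

However, something is going wrong. Below are two more functions I wrote to compute the information gain and the symmetrical uncertainty of a set.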

public double informationGain(List<String> interestedSet, List<String> reducingSet){
    double entropy = entropy(interestedSet);
    double conditionalEntropy = conditionalEntropy(interestedSet,reducingSet);
    double infoGain = entropy - conditionalEntropy;
    return infoGain;
}

public double symmetricalUncertainty(List<String> interestedSet, List<String> reducingSet){
    double infoGain = informationGain(interestedSet,reducingSet);
    double intSet = entropy(interestedSet);
    double redSet = entropy(reducingSet);
    double symUnc = 2 * ( infoGain/ (intSet+redSet) );
    return symUnc;
}

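The original survived/sex set I used gave me a slightly negative answer. Since it was negative by only .000000000000002, I assumed it was a rounding error. When I tried to run my program, none of the symmetrical uncertainty values I got made sense.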

tl;dr: Your computation of H(X,Y) evidently assumes that X and Y are independent, which leads to H(X,Y) = H(X) + H(Y), which in turn makes your H(X|Y) equal to H(X).

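Is this your problem? If so, use the correct formula for the joint entropy of X and Y (taken from Wikipedia):

H(X,Y) = - Sigma over x, Sigma over y of P(x,y) * log2(P(x,y))

Your mistake was substituting P(X,Y) = P(X)P(Y), which assumes that the two variables are independent.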
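To see why the product substitution collapses the result, the double sum can be expanded as below, writing p(x) and p(y) for the marginal frequencies that the question's jointEntropy uses. This is a sketch of the standard argument rather than something spelled out in the answer:

```latex
\begin{aligned}
-\sum_{x}\sum_{y} p(x)\,p(y)\,\log\bigl(p(x)\,p(y)\bigr)
  &= -\sum_{x}\sum_{y} p(x)\,p(y)\,\bigl[\log p(x) + \log p(y)\bigr] \\
  &= -\Bigl(\sum_{y} p(y)\Bigr)\sum_{x} p(x)\log p(x)
     \;-\; \Bigl(\sum_{x} p(x)\Bigr)\sum_{y} p(y)\log p(y) \\
  &= H(X) + H(Y)
\end{aligned}
```

The last step uses the fact that each marginal sums to 1. Subtracting the entropy of the conditioning variable from this sum leaves exactly the entropy of the other variable, so the computed information gain is zero up to floating-point rounding, which is consistent with the tiny negative value reported in the question.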

If the two variables are independent, then H(X|Y) = H(X) does indeed hold, because Y tells you nothing about X, so knowing Y does not reduce the entropy of X.
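One way to avoid the independence assumption is to count how often each (x, y) pair occurs at the same index and take the entropy of those pair frequencies. The following is a minimal sketch under stated assumptions, not the asker's actual class: the name JointEntropySketch and its helpers are hypothetical, a plain HashMap stands in for the question's Frequency type, and values are compared case-sensitively.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration; not part of the question's code.
final class JointEntropySketch {

    // Shannon entropy in bits of the empirical distribution described by a count table.
    static double entropyOfCounts(Map<String, Integer> counts, int total) {
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // H(X): entropy of a single column.
    static double entropy(List<String> xs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String x : xs) {
            counts.merge(x, 1, Integer::sum);
        }
        return entropyOfCounts(counts, xs.size());
    }

    // H(X,Y): entropy of the observed pairs, so P(x,y) comes from co-occurrence counts
    // instead of the product P(x) * P(y).
    static double jointEntropy(List<String> xs, List<String> ys) {
        if (xs.size() != ys.size()) {
            throw new IllegalArgumentException("columns must have the same length");
        }
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < xs.size(); i++) {
            // Assumption: the NUL character never appears in the data, so it is a safe separator.
            counts.merge(xs.get(i) + "\u0000" + ys.get(i), 1, Integer::sum);
        }
        return entropyOfCounts(counts, xs.size());
    }

    // H(Y|X) = H(X,Y) - H(X), where xs is the conditioning column.
    static double conditionalEntropy(List<String> ys, List<String> xs) {
        return jointEntropy(ys, xs) - entropy(xs);
    }

    public static void main(String[] args) {
        List<String> survived = Arrays.asList("1","0","1","1","0","1","0","0","0","1","0","1","0","0","1");
        List<String> sex = Arrays.asList("0","1","0","1","1","0","0","1","1","0","1","0","0","1","1");
        System.out.println("H(survived)       = " + entropy(survived));
        System.out.println("H(survived, sex)  = " + jointEntropy(survived, sex));
        System.out.println("H(survived | sex) = " + conditionalEntropy(survived, sex));
    }
}
```

On the survived/sex lists from the question, this gives a joint entropy of roughly 1.83 bits instead of 1.99, and a conditional entropy of roughly 0.84 bits, so the information gain is no longer stuck at zero.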

To compute the entropy of a single vector, you can use the following function:

Function<List<Double>, Double> entropy = 
    x-> {
        double sum= x.stream().mapToDouble(Double::doubleValue).sum();
        return - x.stream()
                    .map(y->y/sum)
                    .map(y->y*Math.log(y))
                    .mapToDouble(Double::doubleValue)
                    .sum();
    };

For example, the vector [1 2 3] gives a result of 1.0114 (in nats, since the function uses the natural logarithm):

double H = new Entropy().entropy.apply(Arrays.asList(new Double[] { 1.0, 2.0, 3.0 }));
System.out.println("Entropy H = "+ H);
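Because the question works in base 2 while the function above uses Math.log, the two results differ by a factor of ln 2. The sketch below shows one way to convert, assuming the same normalize-then-sum approach; the class name EntropyUnitsSketch and its method names are hypothetical, and the zero-weight filter is an addition that the function above does not have.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration only.
final class EntropyUnitsSketch {

    // Entropy in nats of non-negative weights, normalized so they sum to 1.
    static double entropyNats(List<Double> weights) {
        double sum = weights.stream().mapToDouble(Double::doubleValue).sum();
        return -weights.stream()
                .mapToDouble(w -> w / sum)
                .filter(p -> p > 0.0)          // skip zero weights: 0 * log(0) is treated as 0
                .map(p -> p * Math.log(p))
                .sum();
    }

    // Same quantity in bits: divide the natural-log result by ln 2.
    static double entropyBits(List<Double> weights) {
        return entropyNats(weights) / Math.log(2.0);
    }

    public static void main(String[] args) {
        List<Double> v = Arrays.asList(1.0, 2.0, 3.0);
        System.out.println("nats = " + entropyNats(v));
        System.out.println("bits = " + entropyBits(v));
    }
}
```

For the vector [1 2 3], the value in bits comes out at about 1.459, which is the 1.0114 above divided by ln 2.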