计算调整后的兰特指数

Question

我正在尝试计算两组集群之间的 ARI，使用以下代码：

#computes ARI for this type of clustering
def ARI(table,n):

index = 0
sum_a = 0
sum_b = 0
for i in range(len(table)-1):
    for j in range(len(table)-1):
        sum_a += choose(table[i][len(table)-1],2)
        sum_b += choose(table[len(table)-1][j],2)
        index += choose(table[i][j],2)


expected_index = (sum_a*sum_b)
expected_index = expected_index/choose(n,2)
max_index = (sum_a+sum_b)
max_index = max_index/2

return (index - expected_index)/(max_index-expected_index)


#choose to compute rand
def choose(n,r):

f = math.factorial
if (n-r)>=0:
    return f(n) // f(r) // f(n-r)
else:
    return 0

假设我已经正确创建了意外事件 table，我仍然得到 (-1,1) 范围之外的值。

例如：

应急费用table：

[1, 0, 0, 0, 0, 0, 0, 1]
[1, 0, 0, 0, 0, 0, 0, 1]
[0, 0, 0, 1, 0, 0, 0, 1]
[0, 1, 0, 0, 0, 0, 0, 1]
[0, 0, 0, 0, 0, 1, 1, 2]
[1, 0, 1, 0, 1, 0, 0, 3]
[0, 0, 0, 0, 0, 0, 1, 1]
[3, 1, 1, 1, 1, 1, 2, 0]

当我运行我的代码时，

产生 -1.6470588235294115 的 ARI。这段代码有错误吗？

此外，这是我计算列联矩阵的方式：

table = [[0 for _ in range(len(subjects)+1)]for _ in range(len(subjects)+1)]
#comparing all clusters
for i in range(len(clusters)):
    index_count = 0
    for subject, orgininsts in orig_clusters.items():
        madeinsts = clusters[i].instances
        intersect_count = 0
        #comparing all instances between the 2 clusters
        for orginst in orgininsts:
            for madeinst in makeinsts:
                if orginst == madeinst:
                    intersect_count += 1

        table[index_count][i] = intersect_count
        index_count += 1


for i in range(len(table)-1):
    a = 0
    b = 0
    for j in range(len(table)-1):
        a += table[i][j]
        b += table[j][i]

    table[i][len(table)-1] = a
    table[len(table)-1][i] = b

clusters 是具有属性 instances 的集群对象列表，它是包含在该集群中的实例列表。 orig_clusters 是一个字典，键代表集群标签，值是该集群中包含的实例列表。这段代码有错误吗？

Answer 1

您在计算代码中的 ARI 时犯了一些错误 -- 您计算 a 和 b 的次数太频繁了，因为您将 table 循环了两次而不是一次。

此外，您将 n 作为参数传递，但显然它设置为 10（这就是我得到结果的方式）。只传递 table 然后从那里计算 n 会更容易。我稍微修正了你的代码：

def ARI(table):
    index = 0
    sum_a = 0
    sum_b = 0
    n = sum([sum(subrow) for subrow in table]) #all items summed

    for i in range(len(table)):
        b_row = 0#this is to hold the col sums
        for j in range(len(table)):
            index += choose(table[i][j], 2)
            b_row += table[j][i]
        #outside of j-loop b.c. we want to use a=rowsums, b=colsums
        sum_a += choose(sum(table[i]), 2)
        sum_b += choose(b_row, 2)

    expected_index = (sum_a*sum_b)
    expected_index = expected_index/choose(n,2)
    max_index = (sum_a+sum_b)
    max_index = max_index/2

    return (index - expected_index)/(max_index-expected_index)

或者，如果您传递带有行和列总和的 table：

def ARI(table):

    index = 0
    sum_a = 0
    sum_b = 0
    n = sum(table[len(table)-1]) + sum([table[i][len(table)-1] for i in range(len(table)-1)])
    for i in range(len(table)-1):
        sum_a += choose(table[i][len(table)-1],2)
        sum_b += choose(table[len(table)-1][i],2)
        for j in range(len(table)-1):
            index += choose(table[i][j],2)

    expected_index = (sum_a*sum_b)
    expected_index = expected_index/choose(n,2)
    max_index = (sum_a+sum_b)
    max_index = max_index/2

    return (index - expected_index)/(max_index-expected_index)

然后

def choose(n,r):
    f = math.factorial
    if (n-r)>=0:
        return f(n) // f(r) // f(n-r)
    else:
        return 0

table = [[1, 0, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 1, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 1, 1, 2],
[1, 0, 1, 0, 1, 0, 0, 3],
[0, 0, 0, 0, 0, 0, 1, 1],
[3, 1, 1, 1, 1, 1, 2, 0]]

ARI(table)

ARI(table)
Out[56]: -0.0604008667388949

正确结果！

计算调整后的兰特指数

Computing Adjusted Rand Index

python

debugging

k-means