Create Similarity Matrix
I have a matrix that looks like this:
col_1 col_2 value
A B 2.1
A C 1.3
B C 4.6
A D 1.4
....
I would like to get a similarity matrix:
A B C D
A X 2.1 1.3 1.4
B 2.1 X 4.6 ...
C ... ... X ...
D ... ... ... X
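So the row and column names are A, B, C, D, and the matrix is filled with the values from the third column.
A further issue is that the original matrix is about 10,000 rows long.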
You can do it the following way.
I wrote the code in Python, since no language was specified.
# Assumes the data is in a pandas DataFrame called df.
import numpy as np
import pandas as pd

df = ...  # load your data here
list_of_labels = ['A', 'B', 'C', 'D']
nb_labels = len(list_of_labels)
# Look up each label's position once instead of calling list.index() per row.
position = {label: k for k, label in enumerate(list_of_labels)}
similarity = np.zeros((nb_labels, nb_labels))
for l1, l2, val in zip(df['col_1'], df['col_2'], df['value']):
    i = position[l1]
    j = position[l2]
    similarity[i, j] = val
    similarity[j, i] = val  # mirror the value so the matrix is symmetric
similarity_df = pd.DataFrame(data=similarity, index=list_of_labels, columns=list_of_labels)
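For a table of roughly 10,000 rows the loop can also be replaced by NumPy index assignment. The following is only a sketch under a few assumptions that are not in the original answer: the label list is derived from the data rather than typed by hand, each pair appears once, and cells with no known value (including the diagonal) are left as NaN instead of 0.

```python
import numpy as np
import pandas as pd

# Toy data copied from the question; the real table would be much longer.
df = pd.DataFrame({
    "col_1": ["A", "A", "B", "A"],
    "col_2": ["B", "C", "C", "D"],
    "value": [2.1, 1.3, 4.6, 1.4],
})

# Every label that appears in either column, in sorted order.
labels = sorted(set(df["col_1"]) | set(df["col_2"]))
index = pd.Index(labels)

# Integer position of each row's two labels along the matrix axes.
rows = index.get_indexer(df["col_1"])
cols = index.get_indexer(df["col_2"])

# Start from NaN (an assumption: "unknown" rather than "zero similarity"),
# then write every value at (i, j) and at its mirror (j, i).
similarity = np.full((len(labels), len(labels)), np.nan)
similarity[rows, cols] = df["value"].to_numpy()
similarity[cols, rows] = df["value"].to_numpy()

similarity_df = pd.DataFrame(similarity, index=labels, columns=labels)
```

If the same pair occurs more than once, the last occurrence wins, which matches the behaviour of the loop above.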
As Roland suggested, you can use dcast():
library(data.table)
dcast(df, col_1 ~ col_2)
## col_1 B C D
## 1 A 2.1 1.3 1.4
## 2 B NA 4.6 NA
where:
df <- data.frame(
col_1 = c("A", "A", "B", "A"),
col_2 = c("B","C", "C", "D"),
value = c(2.1, 1.3, 4.6, 1.4)
)
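Note that the dcast() output above is a 2 x 3 table, not yet a square, symmetric matrix. For comparison with the Python answer, here is a rough pandas sketch of the same reshape followed by the two missing steps. The variable names (wide, square, symmetric) are illustrative only, and it assumes each pair of labels occurs at most once, since DataFrame.pivot raises an error on duplicate pairs.

```python
import pandas as pd

# Same toy data as in the R example.
df = pd.DataFrame({
    "col_1": ["A", "A", "B", "A"],
    "col_2": ["B", "C", "C", "D"],
    "value": [2.1, 1.3, 4.6, 1.4],
})

# Long-to-wide reshape, analogous to dcast(df, col_1 ~ col_2):
# rows A, B and columns B, C, D.
wide = df.pivot(index="col_1", columns="col_2", values="value")

# Reindex both axes onto the full label set so the table becomes square.
labels = sorted(set(df["col_1"]) | set(df["col_2"]))
square = wide.reindex(index=labels, columns=labels)

# Fill each missing cell from its mirror image across the diagonal.
symmetric = square.combine_first(square.T)
```

Cells for pairs that never appear in the data, and the diagonal, stay NaN.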
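With xtabs and mutate_at. sparse = TRUE makes xtabs return a sparse matrix:
library(dplyr)
library(Matrix)
mat <- df %>%
  # as.character() is needed because data.frame() no longer creates factors by default (R >= 4.0).
  mutate_at(1:2, factor, levels = unique(c(as.character(.$col_1), as.character(.$col_2)))) %>%
  xtabs(value ~ col_1 + col_2, data = ., sparse = TRUE)
# Copy the upper triangle into the lower one; indexing t(mat) keeps mirrored cells aligned.
mat[lower.tri(mat)] <- t(mat)[lower.tri(mat)]
Result:
4 x 4 sparse Matrix of class "dgCMatrix"
     col_2
col_1 A   B   C   D
    A .   2.1 1.3 1.4
    B 2.1 .   4.6 .
    C 1.3 4.6 .   .
    D 1.4 .   .   .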
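A sparse result can be built in Python as well. This is a minimal sketch using scipy.sparse, which is an extra dependency not mentioned in the thread. It assumes each pair is listed once, in one orientation only, with no diagonal entries; otherwise adding the transpose would double-count values.

```python
import pandas as pd
from scipy.sparse import coo_matrix

# Same toy data as above.
df = pd.DataFrame({
    "col_1": ["A", "A", "B", "A"],
    "col_2": ["B", "C", "C", "D"],
    "value": [2.1, 1.3, 4.6, 1.4],
})

# Map labels to integer positions along the matrix axes.
labels = sorted(set(df["col_1"]) | set(df["col_2"]))
index = pd.Index(labels)
rows = index.get_indexer(df["col_1"])
cols = index.get_indexer(df["col_2"])

# Store only the pairs that are actually present.
n = len(labels)
upper = coo_matrix((df["value"].to_numpy(), (rows, cols)), shape=(n, n))

# Adding the transpose mirrors every stored pair across the diagonal.
sym = (upper + upper.T).tocsr()
```

sym.toarray() gives the dense version when the label set is small enough to fit in memory.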